Following the resurrection of Dolby’s Dialog Intelligence in the new Netflix delivery specs and best practice recommendations, in this article we ask: is Dolby’s algorithm still fit for purpose as a speech-gating solution?
Why Do We Need To Be Able To Measure The Dialog Loudness In A Full Mix?
As I suggested in my article Has Netflix Turned The Clock Back 10 Years Or Is Their New Loudness Delivery Spec A Stroke Of Genius?, when working with content with a wide loudness range...
We need a way to measure the dialog level. We could just measure the centre channel, but that is only possible with 5.1 mixes, and even in a 5.1 mix dialog can appear in other channels. I am not a fan of this, as divergence can result in comb filtering, especially when it comes to the downmix, and can have a negative impact on intelligibility.
What we need is an algorithm that can analyse a mixed track, extract the dialog, and then use that to measure the dialog component of the mix.
Netflix looked around and, very understandably, chose one that already had universal acceptance, albeit one that hadn’t been used for quite a while and is a proprietary algorithm. The developers I have spoken to were surprised at how good it is, especially when you consider how old the code must be.
Netflix Chose To Use The Dolby Dialog Intelligence Algorithm
Following my article, and the follow-up article Netflix Respond To Our Detailed Article On Their New Loudness Delivery Specifications, in which Netflix’s Scott Kramer said…
“We found that under the previous measurement method, a title could pass the spec with subaudible dialog on many devices, so long as it contained loud FX and Music content to balance it out. This is the downside of the relative-level gate in 1770-4 (-2/-3): it tends to over-estimate the loudness of wide-dynamic-range content. We believe that dialog is the anchor element around which FX and Music are mixed. Just as the viewer sets their level based on dialog, mixers most often set a room level based on where the dialog “feels right.””
Before we move on, there is one concern I have heard voiced by a number of key individuals following the announcement of the new Netflix delivery specs and recommendations: shouldn’t Netflix be measuring the speech-gated output with a BS 1770-4 meter, i.e. with relative gating, rather than BS 1770-1? Otherwise, they are rewinding the clock and breaking the generally accepted rule of standards that a new version overrides and supersedes any previous versions.
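To make the distinction concrete, here is a minimal sketch in Python of the BS 1770-4 two-stage gate, applied to pre-computed 400 ms block loudness values rather than to audio (the real meter derives these from K-weighted, 75%-overlapped blocks; the function name and example block values are my own illustration, not from any spec). It also shows how the relative gate can over-read wide-loudness-range content: quiet dialog-level blocks fall below the gate and the loud FX blocks dominate.

```python
import math

def gated_loudness(block_loudness):
    """Integrated loudness per the BS 1770-4 two-stage gate, applied to
    pre-computed 400 ms block loudness values (in LKFS)."""

    def mean_lkfs(blocks):
        # Average in the power domain, then convert back to LKFS.
        power = sum(10.0 ** ((l + 0.691) / 10.0) for l in blocks) / len(blocks)
        return -0.691 + 10.0 * math.log10(power)

    # Stage 1: absolute gate at -70 LKFS.
    blocks = [l for l in block_loudness if l > -70.0]
    if not blocks:
        return float("-inf")

    # Stage 2: relative gate, 10 LU below the absolutely-gated mean.
    threshold = mean_lkfs(blocks) - 10.0
    return mean_lkfs([l for l in blocks if l > threshold])
```

For example, with two blocks at -20 LKFS and eight at -40 LKFS, the absolutely-gated mean is about -26.8 LKFS, so the relative threshold sits at about -36.8 LKFS; all the -40 LKFS blocks are gated out and the measurement comes back as -20 LKFS, which is exactly the behaviour Scott Kramer describes.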
Maybe Integrated Loudness Measurement Isn’t Ideal For Content With A Wide Loudness Range?
I decided to look into this, and my research led me to an AES paper, Comparative analysis of different loudness meters based on voice detection and gating by Alessandro Travaglini, delivered at the 134th AES Convention in Rome in May 2013. The abstract says…
“Although all standards rely on the same algorithm as described in ITU-R BS1770, there are still two possible ways to implement such metering, including voice detection and gating. These two different implementations might, in some cases, provide measurements that significantly differ from each other. Furthermore, while the gating feature is uniquely defined in the updated version of BS1770-3, voice detection is not currently specified in any standard and its implementation is the independent choice of manufacturers. This paper analyses this scenario by comparing the results and robustness provided by three different loudness meters based on voice detection. In addition, those values are compared with measurements obtained by using BS1770-3 compliant loudness meters, including tables, comments, and conclusions.”
One of the conclusions Alessandro came to is that voice detection (Dolby Dialog Intelligence) and gating (BS 1770 relative gating) can differ significantly, and by how much depends on the origin and loudness range of the programme being measured. Whilst the difference is still acceptable in the case of programmes natively produced for broadcast with a narrow loudness range, it is no longer tolerable when measuring mixes originally crafted for theatrical presentation or content from the likes of Netflix. In these cases, the tests showed differences ranging from a minimum of 4.7 LU up to 13.6 LU, with median values of -4.9 LU and -9.0 LU respectively.
This confirms Scott’s research and testing of Netflix content that resulted in their new delivery specification. There is further confirmation of this in the EBU Tech 3343 document Practical Guidelines for Production & Implementation of R 128. In the summary in paragraph 5.1 it says...
The difference between measuring ‘all’ and measuring an anchor signal (such as voice, music or sound effects) is small for programmes with a narrow Loudness Range;
The difference between ‘all’ and ‘anchor’ measurements depends strongly on the content of the programme, but can be expected to be bigger if the Loudness Range is bigger;
Automatic anchor signal discrimination may perform well for a majority of programmes, but may be tricked by similar signals or may not trigger at all, thus not giving 100% consistent results;
Earlier on in paragraph 5.1 the EBU confirms the policy of measuring the integrated loudness of the whole programme but then goes on to say…
“For programmes with an increasingly wide loudness range (>12 LU, approximately) one may optionally use a so-called anchor signal for loudness normalisation, thus performing an individual gating method, so to speak.
There also exists an automatic measurement of one specific anchor signal in the form of ‘Dialogue Intelligence’, a proprietary algorithm of Dolby Laboratories, anticipating that speech is a common and important signal in broadcasting. The algorithm detects if speech is present in a programme and, when activated, only measures the loudness during the speech intervals. For programmes with a narrow loudness range the difference between a measurement restricted to speech and one performed on the whole programme is small, usually <1 LU. For programmes with a wide loudness range, such as action movies, this difference gets potentially bigger, sometimes exceeding 4 LU.”
But what about other types of content with a wide loudness range that don’t have a speech anchor, like classical music, live concerts, or nature sounds? Well, the Netflix Audio Specifications & Best Practices v1.0 covers this in paragraph 3.1...
“When mixes measure at less than 15% dialog, program-gated measurement will be used instead (-24 LKFS +/- 2 LU ITU BS 1770-3)”.
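As a sketch of how that fallback rule might be applied in a QC tool (the function name and return shape are my own; the 15% threshold and the -24 LKFS and -27 LKFS targets are taken from the Netflix spec as discussed in this article):

```python
def choose_measurement(dialog_pct):
    """Sketch of the Netflix v1.0 fallback rule (paragraph 3.1).
    dialog_pct: fraction of the programme flagged as speech,
    e.g. 0.10 for 10%."""
    if dialog_pct < 0.15:
        # Too little speech: fall back to program-gated BS 1770-3,
        # targeting -24 LKFS +/- 2 LU.
        return ("program-gated", -24.0)
    # Otherwise: speech-gated (Dialog Intelligence) measurement,
    # targeting -27 LKFS.
    return ("dialog-gated", -27.0)
```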
Maybe The Dolby Dialog Intelligence Algorithm Isn’t All It’s Cracked Up To Be?
Continuing with my research, I discovered that there is some question as to how well the Dialog Intelligence algorithm actually works. In AES Convention Paper 8983, Level-normalization of Feature Films using Loudness vs Speech, presented at the 135th AES Convention in October 2013, Thomas Lund and Esben Skovenberg, who were both working at TC Electronic at the time, said this in the abstract…
“We present an empirical study of the differences between level-normalization of feature films using the two dominant methods: loudness normalization and speech (“dialog”) normalization. Comparison of automatic speech measurement to manual measurement of dialog anchors shows a typical difference of 4.5 dB, with the automatic measurement producing the highest level. Employing the speech-classifier to process rather than measure the films, a listening test suggested that the automatic measure is positively biased because it sometimes fails to distinguish between “normal speech” and speech combined with “action” sounds.”
One of the main conclusions from their research was that the automatic measurement, which used the Dolby Dialog Intelligence algorithm, on average read 4.5 LU higher than the actual dialog level. They determined the actual dialog level by editing the 5.1 film mix of each of 10 ‘blockbuster’ films: regular speech segments, typically between 15 and 60 seconds long, were identified by ear and copied to a 5.1 assembly track. The segments were butt-edited to form one continuous speech assembly, which could then be measured with a BS 1770-3 meter. They compared these results with the speech-gated measurements from the Dolby Dialog Intelligence meter and found a typical difference of 4.5 LU, with the automatic measurement producing the higher result in all cases.
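Their manual reference method is straightforward to sketch. Assuming the speech segments have already been identified by ear as (start, end) times in seconds, butt-editing them into one continuous assembly for a BS 1770-3 meter is just concatenation (the function below is my own illustration; the metering step itself is omitted):

```python
def assemble_speech(mix, segments, sample_rate):
    """Butt-edit hand-identified speech segments into one continuous
    assembly for measurement. `mix` is a sequence of per-sample frames
    (any channel count); `segments` is a list of (start_s, end_s)."""
    assembly = []
    for start_s, end_s in segments:
        assembly.extend(mix[int(start_s * sample_rate):int(end_s * sample_rate)])
    return assembly
```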
This means that the Netflix spec of -27 LKFS for dialog-gated dialog loudness, which some, including me, believe is already too low for domestic consumption, could correspond to an actual dialog loudness of more like -31 LKFS. That is a full 7 to 8 LU lower than the -24 LKFS and -23 LUFS full-mix targets for broadcast delivery for domestic consumption under the ATSC A/85 and EBU R128 specifications respectively.
Studying the ATSC A/85-2013 standard, when it comes to measuring loudness in section 5, it is interesting that although they refer to measuring the dialog anchor where possible, they do not advocate the use of Dolby Dialog Intelligence, and that was also the case in the previous version of the ATSC standard, A/85-2011. However, paragraph 5.2 on Making Loudness Measurements was substantially rewritten between the 2011 and 2013 versions.
In paragraph 5.2.1 of A/85-2013 on “Measurement Of Long Form Content During Production or Post Production” the ATSC recommend putting a BS 1770-3 meter (1770-3 uses relative-gating, NOT speech-gating) on the Anchor Element (typically dialog). Then they go on to say…
“If the Anchor Element cannot be identified and measured on its own, then the long term loudness of all elements of the soundtrack, over the entire duration, should be measured and reported as the Dialog Level.”
Similarly, in paragraph 5.2.3 “Measurement of Finished Long Form Content”, they say…
“It may be difficult to identify and measure the loudness of the Anchor Element within a finished program. If possible, a section of the content that is representative of the Anchor Element should be isolated and measured. In the absence of a specific Anchor Element, the loudness of the element of the content that a reasonable viewer would focus on when setting the volume control should be measured. If neither technique is possible or practical, the loudness of all the elements of the content should be measured.”
My interpretation of this is that the ATSC are saying if you cannot put a BS 1770-3 meter on a dialog stem then don’t use the Dolby Dialog Intelligence algorithm, instead use a BS 1770 meter to measure the full program Integrated Loudness and use that as the Dialnorm metadata figure.
The ATSC advice here would appear to reinforce the findings of Thomas Lund and Esben Skovenberg. As the algorithms for Dolby Dialog Intelligence (speech-gating) and BS 1770-3 (relative gating) have not changed since this research was undertaken, it is not unreasonable to suggest that the errors Thomas, Esben and others found are still happening now, and will affect anyone working to the new Netflix delivery spec, because it uses the Dolby Dialog Intelligence algorithm.
Surely we need someone to take another swing at an algorithm that can extract the dialog loudness from a full mix?
How Could The Speech-Gating Algorithm Be Improved?
With the recent developments in technology by the likes of iZotope with Dialog Isolate and Music Rebalance, Audionamix with Xtract Stems 2, AudioSourceRE with DeMix Pro, and the speech separation technology Nugen Audio use in Halo Upmix, the technology is surely available to improve on Dolby’s 15-year-old dialog-gating algorithm.
But the technology is only part of the challenge. Another question is who is going to pay for the development when there is no direct financial return on the investment: the technology would need to be made freely available to all manufacturers if it is to become a standard used by everyone.
Let’s take a look at how this could be achieved as there is a precedent for this already. At the outset of the developments that would lead to the International Telecommunications Union (ITU) BS 1770 standard, there was an open contest between a number of manufacturers and universities including the Communications Research Centre (CRC - a federal research institute in Ottawa, Canada) and McGill University in Canada to produce the Leq measurement (K-weighting) system, which is the core of BS 1770.
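For context, the K-weighting that came out of that work consists of just two fixed filter stages: a high-frequency shelf of roughly +4 dB modelling the acoustic effect of the head, followed by the RLB high-pass. Here is a small sketch evaluating the biquad coefficients published in BS 1770 for 48 kHz sample rates (the helper function is my own):

```python
import cmath
import math

# BS 1770 K-weighting biquad coefficients at 48 kHz.
PRE_B = (1.53512485958697, -2.69169618940638, 1.19839281085285)  # head shelf
PRE_A = (1.0, -1.69065929318241, 0.73248077421585)
RLB_B = (1.0, -2.0, 1.0)                                         # RLB high-pass
RLB_A = (1.0, -1.99004745483398, 0.99007225036621)

def biquad_gain_db(b, a, freq_hz, fs=48000.0):
    """Magnitude response of a biquad, in dB, at freq_hz."""
    z = cmath.exp(-2j * math.pi * freq_hz / fs)
    h = (b[0] + b[1] * z + b[2] * z * z) / (a[0] + a[1] * z + a[2] * z * z)
    return 20.0 * math.log10(abs(h))
```

Evaluating these shows the shelf sitting at about +4 dB at the top of the band and flat at DC, while the RLB stage strongly attenuates the low end.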
Later in the development of BS 1770, when Thomas Lund was head of Research and Development at TC Electronic, the company decided to invest time in this, and the outcome was that they donated the genre-agnostic Loudness Range (LRA) measurement to the EBU. This generosity has made LRA, unlike Dolby Dialog Intelligence, both open source and free to use from the outset.
Ideally, with any standards, it is better not to depend on technologies created by one commercial organisation. To get around this commercial challenge, perhaps the brands involved in loudness measurement could pool their resources and develop the algorithm together. Alternatively, perhaps an educational or research-based organisation could undertake the work for the ‘common-good’?
What if Dolby donated the source code to the ITU and then had the community improve it?
What if the Fraunhofer Institute got on board, or perhaps the Communications Research Centre got involved again, and helped with the development of a new speech-gating algorithm?
Maybe a test could be devised to find the design that does the best job compared to hand-isolated speech, and that design could then become the international baseline for automatic speech measurement.
However, there may be one more hurdle to overcome with the development of a new speech-gating algorithm. We understand that Dolby holds a patent that may prevent anyone else from developing a speech-gating algorithm for use in loudness measurement. If this is the case, then there would need to be an agreement that Dolby would not invoke that patent. Or maybe the best way forward is for Dolby to donate the source code to the ITU. Perhaps surprisingly, my understanding is that this may not be as far-fetched as it might have been a while back.
It is my understanding that Dolby’s position has changed and that alongside the likes of Fraunhofer Institute and DTS they are more interested in supplying end-listener processing, so there is hope here. At AES this year I learnt that there are even plans for a unified delivery system for immersive audio that is seeing the likes of Dolby and DTS coming together and working alongside the Fraunhofer Institute to develop infrastructure to deliver the best content possible as easily as possible to the consumer.
A Window Of Opportunity
As you can see, developing an improved dialog-gating algorithm is not without complications. But we have seen the necessary collaboration before: the Communications Research Centre and McGill University working together to produce BS 1770, and TC Electronic donating the LRA code. Now, with Dolby, DTS and Fraunhofer working together on improving the very consumer delivery systems that Netflix and others depend on, and with the resurgence of speech-gated loudness measurement, we have a window of opportunity before the new Netflix delivery specification takes hold. Interested parties could come together to develop and implement a new universal, more accurate speech-gating algorithm, for use both in dialog loudness measurement and in measuring dialog LRA, both of which are crucial for content with a wide loudness range delivered to consumers, as the likes of Netflix are choosing to do over OTT platforms.