Digital audio level normalization and dialnorm revisited
As the cable industry continues its transition to an all-digital offering, it is more important than ever to review a few key elements about the digital audio system utilized in North American digital cable. This article focuses on the elements that directly impact a cable operator’s ability to reproduce a more consistent experience (with regard to loudness) across digital simulcast channels and locally-encoded ad-insertion spots. This article is not meant to cover the entire subject of the Dolby audio coding system or the science behind loudness estimation.
A central issue is the particular numeric value carried in every AC-3 digital audio bitstream that is sent to your digital subscribers (actually it is carried in every coded audio frame of AC-3). This value is called the Dialogue Normalization (dialnorm) value. The dialnorm value has direct control over the reproduced output level of either the audio decoder in the digital set-top box or in a home theatre system that may also be connected to the digital audio output of the digital set-top box. Many cable engineers have a very good understanding of this parameter already; however, I would like to further “demystify” the dialnorm value and how it interacts with the decoder.
The dialnorm parameter within the Dolby Digital (AC-3) stream is defined to indicate (i.e., signal) the long-term average level of spoken dialogue to the audio decoder (within the digital set-top box or home theatre). Hence, the audio decoder utilizes this information to normalize (scale in a linear manner) the reproduced audio to a consistent output level. It is important to note that only a simple gain change is made in the audio decoder, and the amount of change (to the decoded audio level) is directly controlled by the dialnorm value that was provisioned in the encoder. So, in order for this “simple” and automated gain adjustment to work in the decoder, a target decode level had to be designed into the system. The decoder can then read the dialnorm value from the incoming bitstream, compute the difference between it and the target decode level, and apply this difference to the output levels.
So what is the target decode level? It turns out that for television applications, the system required two target decode levels, each of which has a specific application. These target decode levels acknowledge the fact that the audio decoder in your set-top is required to feed both line level outputs and a channel 3/4 remodulator to your subscriber’s reproduction system. Each of these outputs has very different capabilities with respect to the amount of dynamic range available and amount of headroom available above speech peaks, to name a few.
Dialnorm is an audio metadata parameter (carried within the AC-3 bitstream) that indicates the long-term average spoken dialogue level of the audio program itself; it matches that level only when it has been set correctly.
People throughout the cable industry often ask, “Why isn’t there a standard for digital audio levels?” Many don’t realize that there is. Consider the following. For DTV, the FCC mandates the use of ATSC A/53D, which in Section 5.5 states: “The value of the dialnorm parameter in the AC-3 elementary bit stream shall indicate the level of average spoken dialogue within the encoded audio program.” The word “shall” denotes a mandatory provision of the ATSC A/53D standard. With respect to North American Digital Cable, it is important to note that both ANSI/SCTE 43 and ANSI/SCTE 54 standards call out ATSC A/53 in their normative reference sections as well.
The dialnorm value is placed into the bitstream every 32 msec by the encoder, under either manual or external metadata control; hence, this value is present in every AC-3 frame.
The dialnorm value is only validated by measuring the long-term average level of dialogue (within the program) to see if it agrees with the transmitted and/or encoded dialnorm value.
The dialogue level can be measured by utilizing integrating-averaging measurement devices that conform to IEC 60804 or ITU Rec. BS.1770. It is up to the operator of these devices to ensure that the results are in agreement with the actual long-term dialogue levels for the program. A commercial device is available that can automatically detect and measure only the dialogue segments of the measured program.
The dialnorm value has a finite range, defined in ATSC A/52B, of -1 to -31 relative to 0 dBFS.
The dialnorm value is provisioned in the encoder and always applied in the set-top box or home theatre decoder.
The dialnorm value within the AC-3 bitstream does not (typically) change on a frame-by-frame basis throughout the program to achieve normalization. Hence, it is only a single (scalar) value that represents the overall dialogue level of a program.
The dialnorm value is also utilized by the encoder Dynamic Range Control (DRC) subsystem to “calibrate” the position of the “null band” within the chosen compression profile. An improper dialnorm setting in the encoder could produce adverse effects, including an unintended shift in audio levels, on the decoded audio itself. –JR
Note that RF mode operation introduces a +11 dB gain shift relative to Line mode, and the maximum possible peak-to-dialogue level ratio is reduced by 11 dB. This mode was specifically designed to match the average reproduced dialogue levels and dynamic range of digital sources to those of existing analog sources such as NTSC and analog cable TV broadcasts. This is achieved by compression and limiting applied within the AC-3 decoder (the amount of which is calculated in the encoder). On the other hand, Line mode was designed to allow wide dynamic range programming to be reproduced without any peak limiting and/or compression applied, as may be intended by the original program producers.
If your set-top box has an RF output and is connected to the tuner input of a subscriber’s TV set, the audio decoder must be in RF mode. As stated above, this mode allows programming that has had its average dialogue levels produced at a much lower level (to make headroom for music and sound effects for dramatic effect) to match the average dialogue levels of your analog tiered (NTSC) channels.
Nomenclature for Line and RF modes utilized throughout the set-top industry can vary. User access (if available) to these decoder operating modes is typically available via the Guide or other resident application. Table 1 gives an overview of the nomenclature for two popular set-top manufacturers. I strongly recommend that you default all set-tops to RF mode! (If you are concerned about your high-end subscriber base, they will utilize the digital audio output on the STB to carry the AC-3 bitstream directly to their home theatre decoder. Hence, the decoder operating mode in the set-top has no effect on their playback system.)
Now that the target decode levels for each decoder operating mode are defined, we can complete our explanation of how the dialnorm value and the actual dialogue level within a program interact [8] with them. Because the dialnorm value by definition relates to the level of speech, the target decoder levels can also be thought of as “normalized speech levels” with the assumption that the dialnorm value carried in the bitstream accurately represents the program’s actual speech level. For example, if we encode a 30-second commercial spot that has its actual average speech level measured at -22 dBFS and we set the dialnorm value to -22 in the audio encoder (which is correct for this spot), the speech will emerge correctly from a decoder operating in Line mode at -31 dBFS (-34 dBFS in each channel of a two-channel decoder) and emerge at -20 dBFS from a decoder operating in RF mode.
If we take the same spot and ONLY change the dialnorm value in the encoder to -31, then re-encode it, the speech will now emerge incorrectly from a decoder operating in Line mode at -22 dBFS (-25 dBFS in each channel of a two-channel decoder) and emerge at -11 dBFS from a decoder operating in RF mode (-14 dBFS in each channel of a two-channel decoder).
From this example, you can see that a simple change to the dialnorm value in your audio encoder has a direct impact on the decoded audio level, even though you made no change to the input levels feeding the encoder!
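The arithmetic behind these two examples can be sketched in a few lines of Python. This is only an illustration of the gain relationship described above, not decoder code: the function names are hypothetical, the target levels of -31 dBFS (Line mode) and -20 dBFS (RF mode) are taken from the example figures in this article, and dynamic range control is ignored entirely.

```python
# Target dialogue levels in dBFS, taken from the worked examples in the text.
TARGET_LEVEL_DBFS = {"line": -31.0, "rf": -20.0}


def decoder_gain_db(dialnorm: int, mode: str = "line") -> float:
    """Gain applied by the decoder: target level minus the signaled dialnorm."""
    if not -31 <= dialnorm <= -1:
        raise ValueError("dialnorm must be between -31 and -1")
    return TARGET_LEVEL_DBFS[mode] - dialnorm


def decoded_speech_dbfs(actual_speech_dbfs: float, dialnorm: int,
                        mode: str = "line") -> float:
    """Level at which speech emerges, given its true level and the dialnorm."""
    return actual_speech_dbfs + decoder_gain_db(dialnorm, mode)


if __name__ == "__main__":
    # Same spot (speech at -22 dBFS) encoded with a correct and an incorrect dialnorm.
    for dialnorm in (-22, -31):
        for mode in ("line", "rf"):
            level = decoded_speech_dbfs(-22.0, dialnorm, mode)
            print(f"dialnorm {dialnorm}, {mode} mode: speech at {level} dBFS")
```

With dialnorm set to -22 the speech lands on the target level in both modes; with dialnorm set to -31 it lands 9 dB above the target in both modes, which is exactly the 9 dB error in the metadata.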
In other words, in order for the decoder to properly “normalize” to the target level (which is centered on dialogue), the user responsible for encoding the audio must ensure that the measured dialogue levels match the dialnorm value in the encoder. If there is a discrepancy, efforts to converge the two values must be undertaken. For cable systems that have implemented digital-simulcast and digital ad-insertion encoding suites, this can be accomplished by simply correcting the dialnorm value or making an audio gain adjustment upstream from the encoder to make the actual dialogue level agree with the encoder’s dialnorm value. Therefore, someone trying to validate a program’s dialnorm value can only do so by measuring the long-term average level of speech (as per IEC 60804 Leq(A) or ITU-R BS.1770 specifications) to see if it agrees with the encoded dialnorm value carried in the stream.
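As a rough sketch of the two corrective options just described, the following Python compares a measured dialogue level against the encoded dialnorm value and reports either a corrected dialnorm value or the upstream gain trim that would bring the two into agreement. The function names are hypothetical, and rounding to the nearest whole dB and clamping to the -1 to -31 range are my assumptions about how an operator would pick the value.

```python
DIALNORM_MIN, DIALNORM_MAX = -31, -1  # valid dialnorm range


def playback_offset_db(measured_dbfs: float, dialnorm: int) -> float:
    """How far from the target level the dialogue will play back.

    Negative means quieter than intended, positive means louder.
    """
    return measured_dbfs - dialnorm


def corrected_dialnorm(measured_dbfs: float) -> int:
    """Option 1: set dialnorm to the measured level, limited to the valid range."""
    return max(DIALNORM_MIN, min(DIALNORM_MAX, round(measured_dbfs)))


def upstream_gain_trim_db(measured_dbfs: float, dialnorm: int) -> float:
    """Option 2: gain change ahead of the encoder so dialogue matches dialnorm."""
    return dialnorm - measured_dbfs


if __name__ == "__main__":
    measured, encoded = -32.0, -27  # illustrative values only
    print("offset before correction:", playback_offset_db(measured, encoded), "dB")
    new_value = corrected_dialnorm(measured)
    print("corrected dialnorm:", new_value)
    print("residual offset:", playback_offset_db(measured, new_value), "dB")
    print("or trim upstream by:", upstream_gain_trim_db(measured, encoded), "dB")
```

Note that when the measured dialogue level falls below -31 dBFS, correcting the dialnorm value alone leaves a residual error, and an upstream gain adjustment is the only way to remove it completely.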
“Why do we concentrate on the measurement of speech?” Industry and Dolby studies have found that listeners are generally more satisfied when program leveling is based on the dialogue segments within programs. Tests have shown that television viewers typically adjust playback volume to create consistent speech levels; that is, normalize the dialogue level for each program. In one of our studies we concluded that television viewers in a living room environment preferred dialogue to play back at a level consistent with actual speech. In another test, most viewers agreed within one dB of each other when asked to match dialogue playback level to a reference, yet varied widely when asked to do the same on other types of sounds [9]. Therefore, normalizing television levels based on dialogue will yield more consistency and agreement among subscribers.
The largest contributor to the gross level discrepancies found today among digital programming is that many of the broadcasters, programmers, and facilities that encode content locally or nationally do not set the dialnorm value (in the audio encoder) to agree with the actual measured level of the content’s dialogue. Given this, it is important to note that for digital television in the United States, the FCC mandates the use of ATSC A/53D, which in Section 5.5 states: “The value of the dialnorm parameter in the AC-3 elementary bitstream shall indicate the level of average spoken dialogue within the encoded audio program.”
The word “shall” in the previous statement denotes a mandatory provision of the ATSC A/53D standard. And with respect to North American digital cable, it is also important to note that both the ANSI/SCTE 43 and ANSI/SCTE 54 standards call out ATSC A/53 in their normative references sections. This clearly supports that there is a standard for provisioning audio levels. However, many don’t realize that this is made possible through metadata, specifically the dialnorm value.
In summary, by properly utilizing the dialnorm parameter, broadcast channels, ad-insertion spots, and even VOD assets with dialogue level produced at differing levels can be transmitted without any change to their actual levels or dynamic range, letting the decoder in the digital set-top box or home theatre take care of normalizing the level under control from the dialnorm value carried in the bitstream. As a result, the cable operator today can give the listener a seamless experience, whether the content is a Hollywood film with lots of headroom above dialogue for special effects, or a talk show with average dialogue level closer to maximum levels. The perceived loudness consistency that results means less aggravation for the subscriber, and fewer complaints to the operator. And that’s truly a win-win situation. A sidebar to this article provides level and dialnorm provisioning tips for your digital simulcast encoders, ad-insertion encoders, and remaining analog modulators.
However, a complete guide to measuring speech, setting dialnorm, and decoder operating mode details for cable television systems is available to registered users at www.dolbysupport.com (registration is free).
E-mail: Jeffrey Riedmiller 
1. Dolby Digital (AC-3) is the standard audio compression system utilized for North American DTV and Digital Cable. See ATSC A/52B for more details.
2. The term “dBFS” is the absolute digital signal level with respect to full-scale where 0 dBFS is maximum.
3. See footnote number 2 above.
4. Scientific Atlanta Resident Application.
5. Some systems may choose to remove the ability for the subscriber to access/change decoder operating modes via the DNCS. However, each set-top should be defaulted to “Narrow” mode.
6. Wide mode is being deprecated and must not be used.
7. With the exception of overload protection, dynamic range control metadata (if present within the audio bitstream) is not applied in this mode.
8. Note: The dialnorm value also interacts with the Dynamic Range Control (DRC) subsystem within AC-3. However, due to space constraints and to simplify our discussion we will not describe (in detail) this subsystem and how dialnorm interacts with its behavior.
9. Riedmiller, Lyman & Robinson–115th AES Convention Paper 5900–“Intelligent Program Loudness Measurement and Control: What Satisfies Listeners?”
Digital simulcast encoder, ad-insertion encoder, analog modulator level measurement, and dialnorm provisioning guidance: Here we provide level and dialnorm provisioning tips for digital simulcast encoders, ad-insertion encoders, and remaining analog modulators, assuming the operator is utilizing an integrating-averaging measurement device that conforms to IEC 60804 or ITU Rec. BS.1770. A commercial device is available that conforms to these standards and that can automatically detect and measure only on the dialogue segments of the measured program.
The figure above shows the measurement results taken from a digital simulcast channel that has dialnorm set incorrectly, showing a -32 dBFS measured speech level vs. a -27 dBFS dialnorm value. Given this condition, this program/channel/service will play back 5 dB quieter on average than it should. This problem is corrected by simply changing the dialnorm value (in the simulcast encoder) to -31, the lowest value available. This program/service will then play back within 1 dB of the correct level.
DIGITAL SIMULCAST ENCODER PROVISIONING:
Step 1. Set up the audio encoder.
- Provision encoder for the number of encoded audio channels you are providing for this service/channel. (2/0, 1/0, etc.)
- Make sure that the RF Overmod Protect flag in the advanced encoder settings is set to disabled.
- Select a dynamic range compression (DRC) profile: Film Standard, etc.
(Note: Film Standard is the most common profile used in practice.)
Step 2. Perform dialogue-based level measurements on each simulcast service.
- It is important that all digital simulcast measurements are taken from the encoded AC-3 stream at the S/PDIF output on the set-top box.
- Use an Infinite* measurement mode to determine average speech level over several extended periods to find the average speech level for each service. (Because a unique dialnorm value can’t be found and set for each program [in this application], we are only trying to find a reasonably accurate “static” dialnorm value for each service that represents that service’s average dialogue levels throughout the day.)
Step 3. Adjust the dialnorm value in your encoder to match long-term average dialogue level found in step 2.
Step 4. Set the dialnorm value for each simulcast channel/service based on measured speech levels from step 2.
Beware: Do not trust that the simulcast encoder’s default dialnorm value is correct for each of your services. And note that audio encoder input level adjustments may not be necessary–and only adjustments to the dialnorm value are needed.
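One way to turn several long-term measurements into a single “static” dialnorm value for a service is sketched below. The steps above do not say how the individual readings should be combined; here I assume a duration-weighted average in the power domain (a plain arithmetic mean of the dB readings is another reasonable choice), followed by rounding and limiting to the valid dialnorm range. The function names and sample readings are hypothetical.

```python
import math


def combined_level_dbfs(measurements):
    """Duration-weighted power average of (level_dbfs, duration_seconds) pairs."""
    total_time = sum(duration for _, duration in measurements)
    if total_time <= 0:
        raise ValueError("total measurement time must be positive")
    mean_power = sum(10 ** (level / 10) * duration
                     for level, duration in measurements) / total_time
    return 10 * math.log10(mean_power)


def static_dialnorm(measurements):
    """Nearest whole-dB dialnorm value, limited to the -31 to -1 range."""
    return max(-31, min(-1, round(combined_level_dbfs(measurements))))


if __name__ == "__main__":
    # Illustrative readings for one service: (speech level in dBFS, seconds measured).
    readings = [(-24.0, 3600), (-26.0, 3600), (-25.0, 7200)]
    print("combined level:", round(combined_level_dbfs(readings), 2), "dBFS")
    print("static dialnorm:", static_dialnorm(readings))
```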
AD-INSERTION ENCODER PROVISIONING:
Step 1. Always measure the speech level in the digital domain only.
- Okay to measure PCM audio at encoder input if available.
- Dolby Digital (AC-3) stream at STB digital output. (Preferred**)
- Use “infinite” mode for all measurements (make sure you “reset” the measurement device for each program being encoded).
- For advertisements, measure (integrate) the speech level over the entire program.
- For long-form content (VOD programs), measure (integrate) the speech level over the entire program. However, shorter measurement periods can be used (as long as the section(s) you are measuring are representative of the entire program).
Step 2. Adjust dialnorm value in encoder to match long-term dialogue level.
- Remember, encoder input level adjustments may not be necessary–and only adjustments to the dialnorm value are needed.
Step 3. Utilize a dynamic range compression profile.
- Select a dynamic range compression (DRC) profile: Film Standard, etc. (Note: Film Standard is the most common profile used in practice.)
Beware: Do not trust that the default dialnorm value (in your encoder) is correct for each spot you encode.
ANALOG MODULATOR PROVISIONING:
Step 1. Set all STBs to default to RF mode since most, if not all, have an RF remodulated output.
Step 2. Adjust all analog modulators such that the level of dialogue averages approximately -17 dBr Leq(A) on the plant.
- Where 0 dBr = 100% modulation (25 kHz peak deviation).
- All measurements must be taken directly off the combined RF plant, not the channel 3/4 remod output or baseband audio outputs.
- Note: I have found that some AGCs are very aggressive and can pull the speech level up too much (i.e. well above -17 dBr). Therefore many systems have disabled the AGC and have gone to manual mode.
- Use a short-term measurement mode while making adjustments.
• If your measurement device allows you to automatically “Surf” and log speech loudness values across channels, I suggest “surfing” in approximately six-channel blocks, dwelling for 1 minute on each channel in the six-channel block. This yields approximately 10 measurements per hour and will provide you with good data density.
• Once your analysis is complete, analyze the data to determine the best corrective action for the channel(s) in question. This could be a simple gain change in the modulator, if the channel is consistent from measurement to measurement.
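The surfing arithmetic and the “is this channel consistent?” decision can be sketched as follows. This is a simplified illustration: the 2 dB spread used to call a channel consistent is a hypothetical threshold of my own choosing, the logged readings are averaged arithmetically in dB, and the function names do not correspond to any measurement product.

```python
from statistics import mean

TARGET_DBR = -17.0    # dialogue target on the plant, from Step 2
MAX_SPREAD_DB = 2.0   # hypothetical limit for calling a channel "consistent"


def visits_per_hour(block_size: int = 6, dwell_minutes: float = 1.0) -> float:
    """Measurements logged per channel per hour when surfing one block."""
    return 60.0 / (block_size * dwell_minutes)


def recommend(levels_dbr):
    """Suggest a modulator gain change if the logged speech levels agree."""
    spread = max(levels_dbr) - min(levels_dbr)
    if spread <= MAX_SPREAD_DB:
        return ("gain_change", round(TARGET_DBR - mean(levels_dbr), 1))
    return ("investigate", None)


if __name__ == "__main__":
    print("measurements per channel per hour:", visits_per_hour())
    print(recommend([-14.0, -14.5, -13.5]))   # consistent but too loud
    print(recommend([-10.0, -20.0, -17.0]))   # inconsistent, needs a closer look
```

A channel whose readings vary widely from visit to visit is unlikely to be fixed by a single gain change, which is why the sketch flags it for investigation instead.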
* Infinite measurement mode computes the loudness over the entire measurement period– which is controlled by the user via the measurement “reset” and “pause” controls.
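To make the idea of an “infinite” integrating measurement concrete, here is a minimal Python sketch of a meter that accumulates signal power until it is reset and ignores audio while paused. It is not a conforming meter: it applies no A-weighting (IEC 60804) or K-weighting (ITU-R BS.1770), has no dialogue detection, and uses the assumption that 0 dB corresponds to an RMS value of 1.0, so a full-scale sine wave would read about -3 dB. The class and method names are hypothetical.

```python
import math


class InfiniteLeqMeter:
    """Unweighted long-term average level with reset and pause controls."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Start a new measurement period."""
        self._sum_squares = 0.0
        self._count = 0
        self._paused = False

    def pause(self, paused: bool = True):
        """While paused, incoming samples are ignored."""
        self._paused = paused

    def add_samples(self, samples):
        """Accumulate the power of PCM samples scaled to the range -1.0 to 1.0."""
        if self._paused:
            return
        for sample in samples:
            self._sum_squares += sample * sample
            self._count += 1

    def level_db(self) -> float:
        """Average level over everything measured since the last reset."""
        if self._count == 0 or self._sum_squares == 0.0:
            return float("-inf")
        return 10 * math.log10(self._sum_squares / self._count)


if __name__ == "__main__":
    meter = InfiniteLeqMeter()
    meter.add_samples([0.5] * 48000)      # one second of a constant 0.5 signal
    print(round(meter.level_db(), 2))
    meter.pause()
    meter.add_samples([1.0] * 48000)      # ignored while paused
    print(round(meter.level_db(), 2))
```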