Ever since the earliest days of digital television, broadcasters have struggled to deliver a television signal with the sound and picture synchronized. That’s because the end-to-end chain from production to compression to storage to distribution to decompression consists of devices that each introduce some latency.

Plus, the audio part of the program and the video part of the program go through different devices in the chain, and are united finally in the TV receiver.

Early on, the worst offenders (in my experience) were the live news feeds from reporters in the field. But the problem wasn’t limited to broadcast signals. I recall watching a cable video-on-demand movie that had a noticeable lip synch problem. Anyway, things got better as the problem became better understood.

More than ten years ago, the Advanced Television Systems Committee issued the following recommendation for synchronization at the input to the broadcast encoding device: “The sound program should never lead the video program by more than 15 milliseconds, and should never lag the video program by more than 45 milliseconds.”
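As a rough illustration, the recommended window can be written as a simple range check. This is only a sketch: the function name and the sign convention (positive means the sound lags the picture) are my own assumptions, not part of the ATSC text.

```python
# Limits quoted from the ATSC recommendation above, in milliseconds.
MAX_AUDIO_LEAD_MS = 15.0
MAX_AUDIO_LAG_MS = 45.0


def within_atsc_window(audio_offset_ms: float) -> bool:
    """Return True if the offset falls inside the recommended window.

    Sign convention (an assumption of this sketch): a positive value means
    the sound lags the picture, a negative value means the sound leads it.
    """
    return -MAX_AUDIO_LEAD_MS <= audio_offset_ms <= MAX_AUDIO_LAG_MS
```

Note that the window is asymmetric: by this check a 30 millisecond lag passes, while a 30 millisecond lead does not.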

But that recommendation does not cover the time delays inherent in the decoding process at the digital TV receiver. Those decoder delays introduced additional synchronization errors, because some of the earliest DTV receivers did not check the Presentation Time Stamps periodically while a program was being viewed. The audio and video streams drifted relative to each other, but the receivers re-synchronized them only at a channel change.

Consequently, the Consumer Electronics Association adopted CEB-20 (“A/V Synchronization Processing Recommended Practice”) in 2009. Basically, it says that receivers should constantly monitor the Presentation Time Stamps for the audio and video, and should keep them synchronized. CEB-20 is used for TV programs delivered to TV receivers and cable set top boxes, but not (so far as I know) for over-the-top video such as Netflix.
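To make the idea of monitoring time stamps concrete, here is a minimal sketch of how a receiver might measure the skew between the video frame on screen and the audio being played. It is not taken from CEB-20. It relies only on the fact that MPEG Presentation Time Stamps are 33-bit counts of a 90 kHz clock; the function name and the idea of sampling both values at the same instant are assumptions for illustration.

```python
PTS_CLOCK_HZ = 90_000   # MPEG PTS values count ticks of a 90 kHz clock
PTS_MODULUS = 1 << 33   # PTS is a 33-bit field, so it wraps around


def av_skew_ms(video_pts: int, audio_pts: int) -> float:
    """Signed skew in milliseconds between the video and audio being rendered.

    A positive result means the audio being played is behind the picture
    being shown. The subtraction is done modulo 2**33 so that a wrap of the
    PTS counter does not produce a huge bogus skew.
    """
    diff = (video_pts - audio_pts) % PTS_MODULUS
    if diff >= PTS_MODULUS // 2:
        diff -= PTS_MODULUS  # interpret the upper half as a negative difference
    return diff * 1000.0 / PTS_CLOCK_HZ
```

A receiver that runs a check like this periodically, rather than only at a channel change, can notice drift and correct it while the program is playing.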

But those efforts only dealt with specified pieces of the complete end-to-end TV delivery chain.

Ideally, information should be incorporated into the stream at the time the audio and video are captured, and used by the receiver to synchronize the audio and video at the time they are rendered. The Society of Motion Picture and Television Engineers (SMPTE) has been working on just such a project for a number of years, and they recently reported on the status of that work.

SMPTE’s approach derives certain data from the video and audio at a point in the chain where they are known to be in synch and carries that data to the receiver. The receiver compares the known good data with the same data derived from the video and audio at the receiver and correlates that information to re-synchronize the video and audio.
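The comparison step can be pictured as sliding the receiver's fingerprint sequence against the known good one until the two line up. The sketch below uses a plain mean-absolute-difference search. That is a generic technique I am substituting for illustration, not necessarily the correlation SMPTE uses, and the name `best_lag` is hypothetical.

```python
def best_lag(reference, received, max_lag):
    """Return the lag, in samples, at which received best matches reference.

    A positive lag means received[i + lag] lines up with reference[i],
    i.e. the received sequence is delayed by that many samples.
    """
    best = None
    for lag in range(-max_lag, max_lag + 1):
        # Collect the overlapping pairs for this candidate lag.
        pairs = [
            (reference[i], received[i + lag])
            for i in range(len(reference))
            if 0 <= i + lag < len(received)
        ]
        if not pairs:
            continue
        cost = sum(abs(a - b) for a, b in pairs) / len(pairs)
        if best is None or cost < best[0]:
            best = (cost, lag)
    if best is None:
        raise ValueError("sequences do not overlap at any tested lag")
    return best[1]
```

Running the search once on the video fingerprints and once on the audio fingerprints gives two delays. Once both are converted to time, the difference between them is the lip synch error the receiver would need to remove.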

SMPTE considered two approaches for carrying the data: fingerprinting and watermarking.

Watermarking hides the data invisibly within the video and inaudibly within the audio. Fingerprinting carries the data separately. SMPTE chose fingerprinting, for several reasons. Watermarking modifies the content in ways that might not be invisible or inaudible to all users or might impair the content, might not survive all signal processing, and might not coexist well with other watermarks.

The video fingerprint is derived by downsampling a frame to SD resolution, then sampling the luminance of 960 specified points. The data is truncated to 8 bits, and compared with the samples from the previous frame. The number of samples that have changed by more than 32 is divided by 4, and the result is the video fingerprint.
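The video fingerprint steps can be sketched in a few lines. Several details here are my assumptions, because the description above does not give them: the 720x480 frame size, the evenly spaced grid standing in for the 960 specified points, reading "truncated to 8 bits" as keeping the 8 most significant bits, and using integer division for the final step. Treat it as an illustration of the arithmetic, not as the SMPTE algorithm.

```python
def placeholder_points(width=720, height=480, cols=40, rows=24):
    """Return 960 (row, col) sample positions on an evenly spaced grid.

    Hypothetical stand-in: the real sample locations are defined by SMPTE
    and are not reproduced here.
    """
    return [
        (r * height // rows + height // (2 * rows),
         c * width // cols + width // (2 * cols))
        for r in range(rows)
        for c in range(cols)
    ]


def video_fingerprint(prev_luma, curr_luma, points, bit_depth=8, threshold=32):
    """Compute one fingerprint value from two already downsampled luma frames.

    Frames are lists of rows of integer luminance samples.
    """
    shift = bit_depth - 8  # assumption: truncation keeps the top 8 bits
    changed = 0
    for row, col in points:
        before = prev_luma[row][col] >> shift
        after = curr_luma[row][col] >> shift
        if abs(after - before) > threshold:  # "changed by more than 32"
            changed += 1
    return changed // 4  # with 960 points the result is at most 240


# Usage: a full-frame brightness jump changes every sampled point.
points = placeholder_points()
dark = [[0] * 720 for _ in range(480)]
bright = [[100] * 720 for _ in range(480)]
value = video_fingerprint(dark, bright, points)
```

One consequence of dividing by 4 is that the result for 960 points never exceeds 240, so each frame's fingerprint fits in a single byte.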

Another sampling and truncation scheme is used to derive the audio fingerprint.

That’s all well and good, but in order for the concept to work, the video and audio fingerprints need to be derived at the receiving device (TV receiver or set top box) so that they can be compared with the values derived at the beginning of the chain, and used by the set top box or receiver to correct the synchronization.

How many boxes or receivers today have the circuitry needed to do this? None.

Can that capability be added to existing, deployed devices? Probably not. What is the incremental cost to add the circuitry to new devices? Unknown.

There’s another issue. SMPTE has defined the fingerprinting method, but not the methods for delivering it down the chain.

Broadcasters and cable systems today use MPEG Transport Streams to deliver the video and audio elementary streams and the related metadata. ATSC standard A/53 and SCTE standard ANSI/SCTE 54 define the method.

No work has begun to modify those standards to define the carriage of fingerprints.

Similarly, Over-The-Top programmers like Netflix use IP Transport. The next generation television system known as ATSC 3.0 will transport video and audio as files, using a different system than MPEG Transport.

Methods for carrying the fingerprint data need to be defined and standardized for those transport methods.

Finally, a recent experience. I was watching a cable news channel that had a mosaic on the screen consisting of a base picture showing the moderator and six windows with individual commentators. One of the commentator windows had a noticeable lip synch problem.

So far as I can tell, the SMPTE approach cannot correct that problem in the viewer’s TV or set top box. There will still be a need for good lip synch “hygiene” at every point in the program distribution chain.