A modeling approach to perceptual video quality

Poor-quality video can lead to high operational and support costs, customer dissatisfaction, and revenue loss. If operators want to avoid those risks, they must take proper steps during the launch and lifecycle of video services to ensure high service quality. Those steps include adopting new methods of gauging video quality that take viewer perception into account.

The streaming video market is in full swing – high-definition (HD), IPTV, video-on-demand (VOD), TV to mobile and more. In the video services business, operators understand that there is little tolerance among consumers for poor video quality. A high quality of experience (QoE) for video will be a key component to successful customer acquisition and retention efforts.

Ensuring the highest QoE requires an investment, but measuring QoE is vital to delivering excellent-quality HD, IPTV, mobile video, or any other video service.

Figure 1: The Hermann grid is a visual illusion.

Quality may even be used as a service differentiator from competitors, much like it is now for mobile voice communication.

Good QoE cannot be accurately monitored with traditional QoS parameters or signal quality measurements alone. As a result, specific QoE practices must be put into play.

Impairments to video quality and QoE come from a wide range of sources, including source video, encoders/transcoders, the core and access IP network, home wiring and CPE devices. While impairments in the home and access networks represent the largest number of problems, each impairment impacts relatively few customers.

Headend problems represent the second-largest problem area and are especially significant because troubles at the headend affect a large number of customers and often amount to a large-scale service issue. QoE practices therefore need to span the entire delivery infrastructure so that the full range of issues is detected and prevented.

Video QoE is heavily dependent on viewer perception, a difficult parameter to measure. For reliable QoE, we therefore need not only a detailed understanding of video delivery, but also of the human visual and auditory systems – how the typical person sees or hears a video program.

So, what is an operator to do to come to grips with the perceptual quality of its video services as a real viewer sees it?

Metrics, metrics and more metrics
The image and video processing community has long been using mean squared error (MSE) and peak signal-to-noise ratio (PSNR) as fidelity metrics, and there are some sensible reasons for that. The formulas for computing them are as simple to understand and implement as they are easy and fast to compute.
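To show how little is involved, here is a minimal sketch of both metrics in plain Python. It assumes frames are flat lists of 8-bit pixel values (so the peak value is 255); the function names and sample values are illustrative only.

```python
import math

def mse(reference, test):
    """Mean squared error between two equal-sized frames (flat lists of pixel values)."""
    if len(reference) != len(test) or not reference:
        raise ValueError("frames must be non-empty and the same size")
    return sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)

def psnr(reference, test, peak=255.0):
    """PSNR in dB; peak is the maximum pixel value (255 assumed for 8-bit video)."""
    error = mse(reference, test)
    if error == 0:
        return math.inf  # identical frames: no noise at all
    return 10.0 * math.log10(peak ** 2 / error)

# Made-up 4-pixel "frames" for illustration.
reference_frame = [50, 60, 70, 80]
degraded_frame = [52, 58, 71, 79]

print(mse(reference_frame, degraded_frame))             # 2.5
print(round(psnr(reference_frame, degraded_frame), 2))  # 44.15
```

Note that nothing in either function knows what the pixels depict, which is exactly the limitation discussed next.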

Despite their popularity, MSE and PSNR have only an approximate relationship with the video quality perceived by human observers, simply because they are based on a pixel-by-pixel comparison of data without knowing what the data actually represents. They report how much the data has changed, not how visible that change is to a viewer. Any conclusion about perceived quality drawn from them is therefore an inference, and at best an educated guess.

The network quality-of-service community has equally simple metrics to quantify transmission error effects, such as packet loss rate or bit error rate. They are popular for much the same reasons as PSNR, and they share its shortcoming when related to perceived video quality. They were designed to characterize data fidelity and do not take the content into account. This is a critical shortfall, because the content determines the meaning, the purpose, and thus the importance of the packets and bits concerned. The same number of lost packets can have drastically different effects depending on which parts of the bitstream are affected.
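A toy sketch can make the last point concrete. It rests on assumptions that are not in the text above: an MPEG-style group of pictures listed in decoding order, B-frames that are never used as references, and an error in an I- or P-frame that persists until the next I-frame. The function name and the frame pattern are hypothetical.

```python
def frames_affected(gop, lost_index):
    """Count frames visibly damaged when the frame at lost_index loses a packet.

    Simplified model (an assumption, not a codec specification):
    - a B-frame is not a reference, so only that frame is damaged;
    - an I- or P-frame is a reference, so the damage carries through
      every later frame in the group of pictures.
    """
    if gop[lost_index] == "B":
        return 1
    return len(gop) - lost_index

# One group of pictures in decoding order (hypothetical pattern).
gop = list("IPBBPBBPBB")

# Each case is a single lost packet, so the packet loss rate is identical.
print(frames_affected(gop, 0))  # loss in the I-frame: 10 frames
print(frames_affected(gop, 1))  # loss in the first P-frame: 9 frames
print(frames_affected(gop, 2))  # loss in a B-frame: 1 frame
```

The loss rate is the same in all three cases, yet the visible damage ranges from a single frame to the whole group of pictures.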

Perception is reality
Human vision is a complex system. This complexity can be summed up with one popular illusion, the so-called Hermann grid (see Figure 1).

The image shown is a static picture of large black squares separated by white bars, yet as we let our eyes wander around, dark spots seem to appear at the white intersections. This illusion is created by our visual system; the dark spots are not present in the image on paper or screen. Complexity of this kind in the human vision system is why any evaluation of viewer perception requires vision modeling.

Traditional quality measurements do not capture visual perception, as they are based on pure data and are not application-driven. They are distortion-agnostic. Depending upon the type and properties of a particular distortion, it may be more or less apparent or annoying to the viewer. For example, the human eye is not very sensitive to high-frequency noise, whereas a lower-frequency pattern is much more easily visible to our eyes.

Traditional quality measurements are also content-agnostic. For example, if a distortion in a video exists in just a part or region where a lot of image activity is occurring (such as lots of edges or motion), the image activity makes the distortion much harder to see there (a property referred to as “masking”). It helps “camouflage” it. In contrast, when an artifact occurs in a part or region of a video with little activity (such as a smooth region or low motion), where there is hardly any masking, the distortion stands out immediately. In this case, the content helps highlight it.
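The masking effect can be illustrated with a deliberately crude sketch. The same distortion is added to a smooth region and to a busy region; MSE is identical, but a measure that divides the error by local activity is not. The weighting used here (error divided by one plus the variance of the reference pixels) is a hypothetical stand-in chosen for illustration, not a validated vision model.

```python
from statistics import pvariance

def mse(reference, test):
    """Mean squared error between two equal-sized pixel lists."""
    return sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)

def masked_visibility(reference, test):
    """Toy visibility score: error scaled down by local image activity.

    Activity is approximated by the variance of the reference pixels.
    This weighting is an assumption made for illustration only.
    """
    activity = pvariance(reference)
    return mse(reference, test) / (1.0 + activity)

distortion = [6, -6, 6, -6, 6, -6, 6, -6]
smooth = [100] * 8                             # flat region, no activity
busy = [20, 200, 40, 180, 30, 220, 10, 190]    # strong edges, high activity

smooth_test = [p + d for p, d in zip(smooth, distortion)]
busy_test = [p + d for p, d in zip(busy, distortion)]

# MSE cannot tell the two cases apart.
print(mse(smooth, smooth_test), mse(busy, busy_test))

# The activity-weighted score is far lower in the busy region.
print(masked_visibility(smooth, smooth_test))
print(masked_visibility(busy, busy_test))
```

A real HVS-based metric would model masking far more carefully, but the sketch shows why a content-agnostic measure rates both regions the same while a viewer would not.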

To combat such complexities, one must consider a vision modeling approach to perceptual video quality. As the name implies, this is done by modeling various components of the human vision system (HVS). The end goal of these HVS-based metrics is to try to add in aspects of human vision deemed relevant to picture quality. These include color perception, contrast sensitivity and pattern masking and are derived from models and data from psychophysical experiments.

The inclusion of human vision system modeling is paramount for a comprehensive and true QoE assessment. Consequently, subjective experiments play a central role in developing and evaluating automated measurement methods. In these subjective experiments, a set of test clips is shown to a large panel of viewers, and their quality ratings of the clips are collected. The individual quality ratings are then averaged into a Mean Opinion Score (MOS) for each clip. The procedures for subjective experiments are standardized by the International Telecommunication Union (ITU).
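The averaging step is simple enough to sketch. The example assumes a five-point rating scale (1 = bad, 5 = excellent), which the text above does not specify; the clip names and ratings are made up.

```python
def mean_opinion_scores(ratings_by_clip):
    """Average each clip's individual viewer ratings into one MOS per clip."""
    return {
        clip: sum(ratings) / len(ratings)
        for clip, ratings in ratings_by_clip.items()
        if ratings  # skip clips nobody rated
    }

# Hypothetical panel ratings on an assumed 1-5 scale.
panel_ratings = {
    "clip_a": [5, 4, 4, 5, 4],
    "clip_b": [2, 3, 2, 1, 2],
}

for clip, mos in mean_opinion_scores(panel_ratings).items():
    print(clip, mos)
```

In practice a panel is much larger than five viewers, and the standardized procedures also cover viewing conditions and the screening of outlier ratings, none of which is modeled here.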

Figure 2: Full-reference (FR) versus no-reference (NR) measurement scenarios.

Applying QoE measurements
Scoring QoE and remedying problems in content integrity can then take place in two real-world settings: in a lab environment or during live service.

The most common approach for lab-based quality measurement is known as full-reference, where the test video is directly compared with the reference. This comparison is done on a frame-by-frame basis and thus requires precise alignment of the two video sequences, which can be an issue if there is variable delay in the system. As a result, full-reference measurements are usually limited to short clips (5-10 seconds) in order to minimize alignment issues and obtain meaningful results. Due to the nature of this method, the video comparison is carried out in the decoded picture domain rather than the bitstream or transport stream.
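The alignment step can be sketched as a search for the frame delay that minimizes the error between the two sequences. This is a minimal illustration that assumes a constant delay and uses plain MSE as the matching criterion; handling variable delay, the harder case mentioned above, is beyond it. All names and values are hypothetical.

```python
def frame_mse(a, b):
    """Mean squared error between two equal-sized frames (flat pixel lists)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def find_delay(reference, test, max_delay):
    """Return (delay, error) for the constant frame delay that best aligns
    the test sequence with the reference sequence."""
    best_delay, best_error = 0, float("inf")
    for delay in range(max_delay + 1):
        # Pair reference frame k with test frame k + delay.
        pairs = list(zip(reference, test[delay:]))
        if not pairs:
            break  # no overlap left to compare
        error = sum(frame_mse(r, t) for r, t in pairs) / len(pairs)
        if error < best_error:
            best_delay, best_error = delay, error
    return best_delay, best_error

# Six made-up reference frames of four pixels each.
reference = [[v] * 4 for v in (10, 20, 30, 40, 50, 60)]
# Test sequence: two blank frames of delay, then slightly degraded content.
test = [[0] * 4, [0] * 4] + [[v + 1] * 4 for v in (10, 20, 30, 40)]

print(find_delay(reference, test, max_delay=3))  # (2, 1.0)
```

Once the delay is known, the frame-by-frame quality comparison is run on the aligned pairs only.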

The full-reference approach permits a very detailed analysis of video distortions and problems. However, because of the alignment issues and the need for a reference, it is usually restricted to offline lab use, such as codec comparison, tuning, or acceptance testing. ITU-T Rec. J.144 recommends a number of metrics using the full-reference approach. Unfortunately, the recommendation does not apply to packet networks or to modern codecs such as H.264 or VC-1.

For in-service monitoring of video quality, a no-reference or reduced-reference approach is generally more suitable. Because it needs little or no access to the original video, QoE measurements can be performed at any point in the system or network. This makes the approach well suited to monitoring or alarm generation in an operational video delivery system.

To take advantage of the information produced by no-reference probes at various points in the system, the measurements from the different points can be collected at a central location for further analysis. This is where measurements can be correlated – considering network impairments and content impairments against the HVS – for added accuracy. Having data available from different points in the system also permits a better root cause analysis and localization of problems that may arise.
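As a rough illustration of localization, suppose each probe reports one quality score and the probes are listed in delivery order. The first pair of neighboring probes between which the score drops sharply brackets the likely problem segment. The probe names, the scores, and the 0.5 threshold are all hypothetical.

```python
def locate_degradation(measurements, max_drop=0.5):
    """Find the first network segment where quality falls sharply.

    measurements: list of (probe_name, score) tuples in delivery order.
    Returns the (upstream, downstream) probe names that bracket the drop,
    or None if no drop exceeds max_drop.
    """
    for (prev_name, prev_score), (name, score) in zip(measurements, measurements[1:]):
        if prev_score - score > max_drop:
            return prev_name, name
    return None

# Made-up scores collected centrally from four probes.
collected = [
    ("encoder output", 4.5),
    ("core network", 4.4),
    ("access network", 3.1),
    ("set-top box", 3.0),
]

print(locate_degradation(collected))  # ('core network', 'access network')
```

A production system would correlate many measurements over time rather than a single score per probe, but the principle of comparing adjacent measurement points is the same.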

The no-reference approach can look at the video in a variety of forms, depending upon where the measurement actually takes place. Before or after the encoder/transcoder, the probe should look at either the decoded video or the encoded bitstream. In the network (after the streamer), the probe should look at the transport stream, in particular the PCR jitter and network losses, and also at the bitstream. At this point, however, the content is often encrypted and can no longer be accessed. If this is the case, the centralized correlation can connect bitstream measurements from after the encoder with transport stream measurements from downstream in the network to establish the actual QoE impact of network problems on the video content.

QoE based on the human vision system is a critical element in ensuring video content integrity as it is delivered to customers. Effective monitoring using such models and approaches is vital not just to network operations, but also to sustaining larger business objectives such as revenue, customer retention, and growth.

As service providers push video into the home or onto mobile devices, and as cable operators grapple with the advent of HD programming and video-on-demand, the successful ones will be those who understand these complexities and recognize that proactive monitoring is critical.

Monitoring must be repeated continually throughout the life of a service, from the moment one decides to assess network capabilities to deliver new video services, through deployment and commissioning, and on to supporting customer service level agreements (SLAs). Continuously refined assessment requires tools that measure deep into the network, including the video content itself, to ensure that measured service quality is in line with what a real viewer perceives and expects.