High reliability: Pulling back the datacom curtain

Tue, 10/31/2000 - 7:00pm
Tom Cloonan, CTO; and Don Brown, Director of Software Engineering, CADANT

In order for the Internet to become the communications network, datacom infrastructure suppliers will have to stop talking about "carrier class" and actually deliver telecom-grade products.

Why carrier-class reliability?

As Multiple System Operators (MSOs) begin to deliver enhanced services made possible by DOCSIS 1.1 and PacketCable 1.0, the long-term viability of those services, such as telephony and streaming video, depends on meeting and exceeding consumers' current reliability and quality expectations.

In fact, in most industries, customer expectations on reliability and service have never been higher. In 2001, DOCSIS 1.1 products will be delivered to the cable industry with features that will enable MSOs to deliver a broad range of new services to a much larger customer base. However, MSOs must demonstrate that they can meet a new standard of high reliability and high availability when providing these new services. What is the standard? Simple, say consumers: they want it to be just like their current phone service, with dial tone, availability, and acceptable quality virtually 100 percent of the time. How many people could make the same claim for their datacom services either at home or at work? As a result, it is the telecom definition of reliability and availability that has the datacom industry striving for similar "carrier-class" solutions. But what does "carrier-class" service really mean?

Carrier-class operation is not provided by simple duplication or redundancy of system-level components. These are necessary, but not sufficient, conditions to achieve high reliability and high availability. To be a true carrier-class system, the system architecture and design must provide a complete fault-tolerant solution to the customer. Carrier-class products such as next-generation CMTSs can almost always be sub-divided into two functional systems: the hardware system and the software system. A quantitative reliability analysis of a product will typically treat the hardware system and software system separately.

Analyzing reliability and availability in hardware systems

To analyze the reliability of any hardware system, it is helpful to understand several definitions, and it is also beneficial to examine a few simple mathematical formulae that are regularly used in hardware reliability analysis [1].

Failure within a hardware entity is defined as the cessation of the ability to perform a specified function. In hardware systems, two types of failures can occur—failures resulting from uncorrected design errors, and those caused by degrading components. The latter failures should represent the highest percentage of hardware failures seen in the field in a well-designed system.

Reliability, R(t), is defined as the probability that an entity (component, sub-system, or system) operates without failure until time t, where t=0 typically marks the beginning of life for the entity. One can estimate the reliability of an entity by monitoring the operation of N entities and keeping track of N_o(t), the number of operational entities at time t. The reliability estimate, R_e(t), can then be approximated by R_e(t) = N_o(t)/N.
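As a minimal sketch of this estimate, the following Python function counts how many monitored entities are still operating at time t and divides by the population size. The function name and the failure data are hypothetical and serve only as an illustration.

```python
def estimate_reliability(failure_times, n_entities, t):
    """Estimate R(t) as the fraction of monitored entities still operating at time t."""
    failed = sum(1 for ft in failure_times if ft <= t)
    return (n_entities - failed) / n_entities

# Hypothetical data: 10 monitored units, 3 of which failed at these hours.
failure_hours = [1200.0, 5300.0, 9100.0]
r_est = estimate_reliability(failure_hours, 10, 6000.0)
print(r_est)  # 0.8 (two of the ten units had failed by hour 6000)
```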

Unreliability, F(t), is defined as the probability that an entity has experienced a failure prior to time t. It is easily derived from the reliability, because F(t) = 1-R(t).

The Failure Probability Density Function, f(t), is defined as the probability that an entity will fail at time t. It is equal to the derivative of the unreliability: f(t) = F'(t).

The Hazard Rate, h(t), is defined as the conditional probability that an entity will fail at time t given that it has already operated without failure until time t. This conditional probability is defined as h(t) = f(t)/R(t).

Figure 1: R(t), F(t), f(t) and h(t).

Figure 1 illustrates these four functions [R(t), F(t), f(t), and h(t)] for a typical hardware entity. There are three identifiable regions: the initial failure region, the random failure region, and the wear-out failure region. Most parts fail within the initial failure region (due to poor manufacturing) and within the wear-out failure region (due to material degradation), but these failures can be screened with burn-in testing at the factory and with scheduled maintenance (and replacement) in the field. Thus, carrier-class system design is primarily concerned with the effects of failures that occur during the random failure region.

For most hardware entities, it is acceptable to approximate the hazard rate, h(t), as a constant value throughout most of the random failure region. This constant value is often referred to as the failure rate, λ, and is measured in units of failures per unit time. Failure rates are often described in units of FITs, which is a measure of the number of failures in 10^9 hours. The failure rate of an entity and the reliability of an entity can be shown to have the following important relationship: R(t) = exp[-λt].
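The constant-failure-rate model can be checked numerically. The sketch below converts a FIT value to failures per hour, evaluates the reliability and the failure density, and confirms that their ratio (the hazard rate) stays equal to the failure rate. The 5,000-FIT component and the function names are assumptions made for illustration only.

```python
import math

def fits_to_rate_per_hour(fits):
    """Convert FITs (failures per 10^9 hours) to failures per hour."""
    return fits / 1e9

def reliability(rate, t):
    """R(t) = exp(-rate * t) for a constant failure rate."""
    return math.exp(-rate * t)

def failure_density(rate, t):
    """f(t) = F'(t) = rate * exp(-rate * t), since F(t) = 1 - R(t)."""
    return rate * math.exp(-rate * t)

def hazard_rate(rate, t):
    """h(t) = f(t) / R(t); constant for this model."""
    return failure_density(rate, t) / reliability(rate, t)

rate = fits_to_rate_per_hour(5000.0)  # hypothetical 5,000-FIT component
one_year = 8760.0                     # hours
print(reliability(rate, one_year))    # roughly 0.957
print(hazard_rate(rate, one_year))    # equals rate (5e-06) up to rounding
```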

There are several ways to obtain the failure rate for a component: 1) perform life cycle testing, 2) acquire the data from outside sources, and 3) estimate the value using values for similar parts. Real-world constraints often force a reliability analysis to rely on many estimates.

The failure rate of a subsystem can be determined from the failure rates of its components, assuming that every component is required for the subsystem to work and that components fail independently, because:

  • R_subsystem(t) = P(subsystem is reliable at time t)
    = P(all components in subsystem are reliable at time t)
    = R_1(t) * R_2(t) * ... * R_n(t)
    = exp[-λ_1 t] * exp[-λ_2 t] * ... * exp[-λ_n t]
    = exp[-(λ_1 + λ_2 + ... + λ_n) t].
  • Since R_subsystem(t) = exp[-λ_subsystem t], it follows that λ_subsystem = λ_1 + λ_2 + ... + λ_n.
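The derivation above can be sanity-checked in a few lines of Python: the product of the component reliabilities should match the reliability computed from the summed failure rate. The component FIT values below are hypothetical, and the check assumes independent components that are all required for operation.

```python
import math

def subsystem_failure_rate(component_rates):
    """Failure rate of a subsystem that needs every component: the sum of the parts."""
    return sum(component_rates)

def reliability(rate, t):
    """R(t) = exp(-rate * t) for a constant failure rate."""
    return math.exp(-rate * t)

component_fits = [120.0, 450.0, 30.0]           # hypothetical parts, in FITs
rates = [fits / 1e9 for fits in component_fits]  # failures per hour
total_rate = subsystem_failure_rate(rates)

t = 50_000.0  # hours
product_of_parts = math.prod(reliability(r, t) for r in rates)
from_summed_rate = reliability(total_rate, t)
print(math.isclose(product_of_parts, from_summed_rate))  # True
```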

With the failure rate of each hardware subsystem in hand, some interesting design trade-offs can now be analyzed. The mean time before failure (MTBF) is the average time that an entity will operate before a failure occurs, and it can be shown to be MTBF = 1/λ, where λ is the failure rate. A related parameter is the mean time to repair (MTTR), which is the average time needed to detect and repair a failed entity. Its reciprocal is the repair rate, μ, so that MTTR = 1/μ. A very useful parameter is the ratio λ/μ, which is given the label ρ.

System availability, A, is defined as the probability that the system is operational. When vendors talk about "five nines" availability, they are implying that their system will be operational for 99.999 percent of the time and out of service (in a failed mode of operation) for only 0.001 percent of the time. System availability for a simplex system is given by the simple formula:

A = MTBF/(MTBF + MTTR) = 1/(1 + ρ), where ρ = MTTR/MTBF.

As an example, consider a single circuit card system in which the circuit card has an MTBF of two years (17,520 hours) and an MTTR of eight hours. The resulting system availability is A = 17,520/17,528, or approximately 0.9995. Thus, the system will be operational for about 99.95 percent of the time, which only yields three nines availability (roughly four hours of downtime per year).
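The simplex example can be reproduced with a short calculation. This is a sketch using hypothetical function names; it assumes a 365-day year (8,760 hours) when converting unavailability to downtime.

```python
HOURS_PER_YEAR = 8760.0  # assumes a 365-day year

def simplex_availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR) for a single, unduplicated unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_hours_per_year(availability):
    """Expected out-of-service time per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

a = simplex_availability(17520.0, 8.0)
print(round(a, 5))                           # 0.99954
# Rounding A to 0.9995 before this step would give 4.38 hours instead.
print(round(downtime_hours_per_year(a), 2))  # 4.0
```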

Using a Markov model which assumes that repairs are allowed on only one circuit card at a time, it can be shown [2] that a duplicated version of the above system (with one active and one stand-by circuit card) will result in an overall system availability given by: A = (1 - ρ^2)/(1 - ρ^3), where ρ = MTTR/MTBF is the ratio of the failure rate to the repair rate.

Thus, the resulting system availability for the duplicated version of the above system is approximately A = 0.9999998. This duplicated system will be operational for about 99.99998 percent of the time, which yields six nines availability (about 6.6 seconds of downtime per year). This illustrates the power of duplication and redundancy in a high-reliability system hardware design.
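One way to see where the duplicated-system formula comes from is to write out the steady-state probabilities of a three-state chain (zero, one, or two cards failed). The sketch below assumes that the stand-by card does not fail while idle and that only one card is repaired at a time; the first assumption is ours rather than something the article states, and the function names are hypothetical. Under those assumptions the state probabilities reproduce the closed-form expression.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year

def duplex_state_probabilities(rho):
    """Steady-state probabilities of 0, 1, or 2 failed cards.

    Assumes one failure process and one repair at a time, so the balance
    equations give p1 = rho * p0 and p2 = rho * p1.
    """
    p0 = 1.0 / (1.0 + rho + rho**2)
    return p0, rho * p0, rho**2 * p0

def duplex_availability(rho):
    """The system is up whenever at least one card works (states 0 and 1)."""
    p0, p1, _ = duplex_state_probabilities(rho)
    return p0 + p1

rho = 8.0 / 17520.0  # MTTR / MTBF from the single-card example
closed_form = (1 - rho**2) / (1 - rho**3)
p_down = duplex_state_probabilities(rho)[2]

print(abs(duplex_availability(rho) - closed_form) < 1e-12)  # True
print(round(p_down * SECONDS_PER_YEAR, 1))                  # 6.6 seconds per year
```

Computing the downtime from the probability of the two-failures state, rather than from 1 - A, avoids losing precision when A is very close to 1.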

Analyzing reliability and availability in software systems

In a well-designed system, proper use of duplication and redundancy within the hardware can almost always push the hardware system availability numbers to very acceptable values (six nines or better). Thus, in a well-designed system, hardware downtime will typically be a negligible portion of the total system downtime. In practice, the availability of the entire system will usually be determined by the availability of the software design.

In software systems, only one type of failure can occur. These are failures resulting from uncorrected coding errors (because the degradation of a processor is classified as a hardware error). The high levels of complexity in most present-day software architectures practically guarantee that some uncorrected software errors will make it into a released product.

In software reliability calculations, it is assumed that a certain fraction of the coded lines within the software design will likely lead to uncorrected errors that will cause catastrophic system failures. The fraction of lines that generate catastrophic errors within program module i is known as the catastrophic defect density, d_i. This is often described in units of defects per KNCSL (1000 Non-Commentary Source Lines of code). In practice, catastrophic defect densities are usually estimated based on past experience. The fraction of time that the processor spends within program module i is defined to be the occupancy, q_i. The frequency of execution of lines of code is defined to be the code line execution rate, E, which is measured in units of code lines per unit time. A typical system with n program modules will have a software failure rate λ given by λ = [(d_1 * q_1) + (d_2 * q_2) + ... + (d_n * q_n)] * E, with each d_i expressed in defects per line of code.

As an example, a software design with one program module with d = 10^-8 defects per KNCSL (10^-11 defects per line), q = 1.0, and E = 10^4 lines per second will experience a failure rate of λ = 10^-7 failures per second. This results in an MTBF of 10^7 seconds, or about 2,778 hours. If the MTTR is four minutes (the time to re-boot the processor), then the resulting availability is about 0.999976 (four nines availability).
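The software example can be reproduced as follows. This is a sketch with hypothetical names; note that the defect density is divided by 1,000 to convert from defects per KNCSL to defects per line before it is multiplied by the execution rate.

```python
def software_failure_rate(modules, lines_per_second):
    """Failures per second for modules given as (defects_per_kncsl, occupancy) pairs."""
    defects_per_executed_line = sum((d / 1000.0) * q for d, q in modules)
    return defects_per_executed_line * lines_per_second

# One module: 1e-8 defects per KNCSL, occupancy 1.0, 1e4 lines per second.
rate = software_failure_rate([(1e-8, 1.0)], 1e4)  # failures per second
mtbf_seconds = 1.0 / rate
mttr_seconds = 4 * 60.0                           # four-minute re-boot
availability = mtbf_seconds / (mtbf_seconds + mttr_seconds)

print(round(mtbf_seconds / 3600.0, 1))  # 2777.8 hours
print(round(availability, 6))           # 0.999976
```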

High reliability: Getting the job done in CMTS designs

A true carrier-class CMTS design must provide for fault prevention, fault detection, fault avoidance, fault isolation, fault location, and fault recovery.

Fault prevention requires that designers adhere to good engineering practices such as commenting all source code, terminating transmission lines, isolating signal traces, and providing adequate air-flow for cooling.

Fault detection in hardware requires special-purpose hardware functions such as bit-level parity checkers and/or cyclic redundancy check (CRC) checkers on all data lines between ICs and on all memory data transfers. Hardware should hunt for and identify any odd signal behavior in the system. Fault detection in software requires the addition of checking functions such as continual memory content audits, processor sanity pings and process watch-dog timers to ensure overall system integrity. All of these fault detection processes must be given high priority, because rapid fault detection will decrease the MTTR and increase the overall system availability.
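A process watch-dog timer of the kind mentioned above can be sketched in a few lines. The class below is a hypothetical illustration, not a description of any particular CMTS implementation: the monitored process must check in periodically, and the supervisor treats a missed deadline as a detected fault. Timestamps are passed in explicitly so the behavior is deterministic.

```python
class Watchdog:
    """Flags a monitored process as suspect if it stops checking in."""

    def __init__(self, timeout_s, now_s):
        self.timeout_s = timeout_s
        self.last_kick_s = now_s

    def kick(self, now_s):
        """Called by the monitored process to prove it is still alive."""
        self.last_kick_s = now_s

    def expired(self, now_s):
        """Called by the supervisor; True means the process missed its deadline."""
        return now_s - self.last_kick_s > self.timeout_s

wd = Watchdog(timeout_s=2.0, now_s=0.0)
wd.kick(now_s=1.5)
print(wd.expired(now_s=3.0))  # False: last check-in was 1.5 s ago
print(wd.expired(now_s=4.0))  # True: 2.5 s of silence exceeds the 2 s timeout
```

A shorter timeout detects faults sooner, which lowers the MTTR, at the cost of more false alarms when the monitored process is merely busy.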

Fault avoidance requires that active traffic flows be migrated from a suspicious subsystem to a redundant subsystem so that customer service is not affected. Different levels of redundancy can be used, ranging from full duplication (one spare subsystem for each active subsystem) to N+1 sparing (one spare subsystem shared by N active subsystems). Protection switching between active and spare subsystems can result in brief service interruption. The duration of such an interruption must be minimized. If the duration is zero, the switching function is said to be hitless. Redundancy should be included on all CMTS hardware subsystems (control units, Internet interfaces, power supplies, fans, and RF cable interfaces).
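The N+1 sparing scheme can be illustrated with a small bookkeeping sketch. The class and unit names below are hypothetical; the sketch only tracks which unit carries each traffic group and shows that a single shared spare protects against one fault but not a second.

```python
class ProtectionGroup:
    """N active subsystems sharing one spare (N+1 sparing)."""

    def __init__(self, active_ids, spare_id):
        # Map each traffic group to the unit currently carrying it.
        self.serving = {unit: unit for unit in active_ids}
        self.spare_id = spare_id
        self.spare_in_use = False

    def report_fault(self, unit_id):
        """Move the traffic of a suspect unit to the spare, if it is still free."""
        if self.spare_in_use:
            return False  # spare already consumed; this fault is unprotected
        for group, unit in self.serving.items():
            if unit == unit_id:
                self.serving[group] = self.spare_id
                self.spare_in_use = True
                return True
        return False  # unknown unit

group = ProtectionGroup(["rf1", "rf2", "rf3"], spare_id="spare")
print(group.report_fault("rf2"))  # True: traffic for rf2 now rides on the spare
print(group.serving["rf2"])       # spare
print(group.report_fault("rf3"))  # False: the single spare is already in use
```

Full duplication corresponds to one such group per active unit, which is why it tolerates more simultaneous faults at a higher equipment cost.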

Once fault avoidance is completed, the system can initiate fault isolation actions that take the suspicious subsystem out of service so that it will not send confusing, error-propagating signals to other subsystems. In addition to isolation of hardware failures, the scope of software failures should also be isolated to limit the effect to a small portion of the system. Small software failure groups should be isolated (to stop the propagation of failures) and should be independently recoverable.

Fault location processes pinpoint the exact location and nature of the faulty component. This requires that diagnostics be run across all components within the suspicious subsystem. Once a fault is correctly located, fault recovery actions must send real-time alarms to system administrators. A good fault recovery system will guide administrators directly to the faulty subsystem and allow them to replace the faulty subsystem with a functioning subsystem.

What to look for

MSOs that are planning their next-generation networks may have to seriously consider reliability and fault tolerance in their CMTS equipment. A few well-planned questions can help expose the strengths and weaknesses associated with each of the vendor solutions:

  1. Which subsystems have redundant back-ups and which subsystems do not?
  2. What is the scheme for fault detection? How does the system search for and detect faults?
  3. Is there flexibility to build the level of redundancy that is required for different business applications?
  4. What is the service interruption when spares are switched for active subsystems? For example, when this occurs, do cable modems re-range and re-register?
  5. What failure rates were used for availability calculations?

By addressing these important reliability issues, MSOs can be assured of getting a carrier-class solution that will satisfy their subscriber demands for years to come.

1. Edward C. Jordan, Editor-in-Chief, Reference Data for Engineers: Radio, Electronics, Computer, and Communications (Seventh Edition), Howard W. Sams & Co. Inc., Indianapolis, Ind., 1985, pp. 45-1–45-26.
2. Kishor S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall Inc., Englewood Cliffs, N.J., 1982, pp. 365-388.
