In the early 1950s a trio of researchers at Bell Laboratories took great pains to hear and understand the varying acoustical patterns associated with the human expression of vowels. They paid attention to subtle differences in the phonetics associated with a soft “o” versus those of an elongated “e.” They listened to the way air propagated through the vocal tract when a speaker lingered over the sound of a hard “i,” hearing all the way down to the throaty vibration of the vocal cords. They appreciated the curtness of the soft “e” and the soft “o,” and how those sounds differed in energy concentration from the more elaborate, more soothing intonations of the “eye” sound in “five” or the “ooh” that lingered when a speaker said “two.”
By listening carefully to these natural modes of resonance, and by understanding their place in the speech spectrum, the Bell Labs researchers were able to construct frequency maps that aligned vowel expressions with spoken renditions of the Arabic numeral system. The soft “e” in “ten,” for instance, could be recognized as distinct from the long, hard “e” sound in “three.” By mapping the resonance of these vowel sounds, or formants, to the spoken numbers, the researchers could isolate the digits being spoken by a human voice.
Thus was born the first voice-recognition system for use in the telecommunications industry. The Bell Labs invention was crude, in that it recognized only a single speaker’s voice. It wasn’t perfect, but with a 97 percent accuracy mark, it was close. Its creators thought enough of it to give it a sweet-sounding name. They called it “Audrey.”
Speech recognition has come a long way since then. In the 1970s, the U.S. Department of Defense funded the development of the DARPA Speech Understanding Research program, which widened the number of machine-recognizable words to 1,011, or the rough equivalent of a 3-year-old’s vocabulary. In the same decade, Bell Labs broke new ground by inventing a system that could discern words spoken by more than one person. But the big breakthrough came in the 1980s when a statistical modeling approach known as the hidden Markov model vastly expanded the vocabulary that machines could recognize, resulting in the first commercial-scale applications of speech recognition.
Into the 1990s, the technology was far from perfect, as anyone who endured a session of nonsensical banter with a robotic telephone attendant would quickly discover. (BellSouth’s VAL portal was among the first, if not the first, to employ speech recognition for customer care purposes.)
The seismic moment, though, came in 2010, when Google coupled foundational approaches for statistical speech modeling with massive computing and data analysis capabilities to stretch the boundaries of the technology and make it more palatable for average consumers to use. Although imperfect, Google’s Voice Search application for the iPhone was a revelation in that, in its finest moments, it seemed to recognize and respond not just to words as isolated data inputs, but to the context in which they were spoken. Apple itself extended and improved on the formula in 2011 with its implementation of Siri, the cloud-based, contextually adroit app whose name quickly became synonymous with speech recognition itself.
It hasn’t taken long for television industry participants to catch up to the speech recognition wave. Sensing that the technology may have utility for replacing the myriad buttons and physical commands associated with modern-day remote control inputs, cable and video providers are starting to work with technology developers to find ways to integrate speech recognition and voice commands into their platforms. At last month’s Cable Show in Washington, DC, speech recognition was a major theme, as providers including Nuance Communications and Veveo showed off platforms that can not only change the TV channel upon spoken command, but also interpret conversational snippets and queries to come up with intelligent responses. One example from Nuance: the ability to parse different meanings from the word “play” when it’s used in one instance to instruct a set-top to “play” a movie and in another to find out who the Philadelphia Phillies “play” today on the baseball field.
That’s a far cry from the work done by Bell Labs in 1952 to determine which of 10 digits a single individual was referring to by analyzing the acoustic qualities of vowel sounds. But every technology progression has to start somewhere. And in this case, what launching point could possibly have been better than starting with the number “one”?