University of Wisconsin - Madison
CS 540 Lecture Notes
C. R. Dyer

Speech Recognition (Chapter 15.1 - 15.3, 23.5)


What is Speech Recognition?

Speech recognition is the diagnostic task of recovering the words that produced a given acoustic signal. In other words, it is the problem of transforming a digitally-encoded acoustic signal of a speaker talking in a natural language (e.g., English) into text in that language. Given the uncertainty at many levels of this problem (e.g., uncertainty introduced by background noise, digitization noise, and the speaker's accent), we can formally specify the problem as

argmax_words P(words | signal)

where words is a string of words in a given natural language like English, and signal is a sequence of observed acoustic data that has been digitized and pre-processed.

Information is available on the web about commercial speech recognition systems.

Signal Processing

The raw acoustic signal captured by a microphone is first digitized and pre-processed. Digitization consists of sampling the analog signal, usually at a rate between 8 and 16 kHz, and then quantizing each sample point, usually as an 8 to 12 bit value per point.
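
As a concrete illustration, here is a minimal Python sketch of the digitization step, assuming an 8 kHz sampling rate and 8-bit quantization; the synthetic 440 Hz "analog" signal is invented for the example.

```python
import numpy as np

fs = 8000                                    # assumed sampling rate: 8 kHz
t = np.arange(fs) / fs                       # one second of sample times
analog = 0.6 * np.sin(2 * np.pi * 440 * t)   # made-up 440 Hz tone in [-1, 1]

# Quantize each sample to one of 2^8 = 256 levels (8 bits per point).
levels = 2 ** 8
digitized = np.round((analog + 1.0) / 2.0 * (levels - 1)).astype(np.uint8)
```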

Next, overlapping subsequences of the digitized data are processed. Subsequences about 10 msec long (i.e., containing about 80 - 160 data points) are usually used, defining a sequence of frames. Within each frame a set of features is then computed, for example, the total energy in the frame and the difference in energy between the current frame and the previous frame. About 8 to 40 features are usually computed for each frame. This n-dimensional vector of feature values is then itself quantized, using a process called vector quantization, into, e.g., 256 "bins," so that each frame is described by one of 256 possible "labels." The result of this whole process is a compact description of overlapping regions of the acoustic signal that should be sufficient for word recognition.
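
Continuing the sketch above, the following code frames the digitized signal, computes two toy features per frame, and vector-quantizes the feature vectors. The 10 msec frames and energy features follow the text; the 50% frame overlap, the 16-entry codebook (rather than 256, since the toy signal has few frames), and the use of k-means to build the codebook are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

frame_len, hop = 80, 40              # 10 msec frames at 8 kHz, 50% overlap
samples = digitized.astype(float)    # from the previous sketch

# Slice the signal into overlapping frames.
starts = range(0, len(samples) - frame_len + 1, hop)
frames = np.array([samples[s:s + frame_len] for s in starts])

# Two toy features per frame: total energy, and the change in energy
# from the previous frame (a real system would use 8 to 40 features).
energy = (frames ** 2).sum(axis=1)
delta = np.diff(energy, prepend=energy[0])
features = np.column_stack([energy, delta])

# Vector quantization: map each feature vector to one of k codebook
# labels; k = 16 here, where the text describes 256 bins.
codebook, labels = kmeans2(features, 16, minit='points')
```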

How to Wreck a Nice Beach

The basic approach to speech recognition is to use Bayes's Rule to break up the problem into manageable parts:

P(words | signal) = P(words)P(signal | words)/P(signal)

Since we are given a digital signal of the type described above and our goal is to find the sequence of words that maximizes P(words | signal), P(signal) is a constant for a given acoustic input and therefore we can simply drop this term. Thus, our new goal is to compute

argmax_words P(signal | words) P(words)

P(words) represents our language model in that it specifies the prior probability of a particular word string. Thus it has to quantify the likelihood of a sequence of words occurring in English. P(signal | words) is the acoustic model, which specifies the probability of the acoustics given that a sequence of words was spoken. This part is complicated by the fact that often there are many ways of pronouncing a given word. The next sections describe in more detail the language model and the acoustic model.
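
To make the decomposition concrete, here is a toy sketch of the argmax, assuming acoustic and language model scores are already available for a handful of candidate word strings; all of the numbers are invented, and log-probabilities are used (a standard practical choice) so that small probabilities do not underflow.

```python
# Hypothetical log-probabilities for two candidate transcriptions of the
# same signal; in a real system these come from the acoustic model and
# the language model described below.
candidates = {
    "how to recognize speech":   {"log_p_signal": -12.1, "log_p_words": -9.3},
    "how to wreck a nice beach": {"log_p_signal": -11.8, "log_p_words": -14.6},
}

# argmax_words P(signal | words) P(words), computed in log space;
# P(signal) is dropped because it is the same for every candidate.
best = max(candidates,
           key=lambda w: candidates[w]["log_p_signal"] +
                         candidates[w]["log_p_words"])
print(best)    # -> "how to recognize speech"
```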

The Language Model

P(words) is the prior probability that the sequence of words words = w_1 w_2 ... w_n occurs in the given natural language. For example, "I have a gun" is more likely than "I have a gub." Or, "how to recognize speech" is more likely than "how to wreck a nice beach." One way to express this joint probability is to use the chain rule as follows:

P(w_1 w_2 ... w_n) = P(w_1) P(w_2 | w_1) ... P(w_{n-1} | w_1, ..., w_{n-2}) P(w_n | w_1, ..., w_{n-1})

Now we need to compute the probability that the first word in the sentence is w_1, the probability that the second word in the string is w_2 given that the first word is w_1, and so on. The final factor is the probability that the last word in the string is w_n given that the sequence of n-1 words before it was w_1, ..., w_{n-1}. This expression is very complex because it requires that we determine conditional probabilities of long sequences of possible words. If words is a sequence of n words and our language contains m words, then computing P(w_n | w_1, ..., w_{n-1}) requires collecting statistics for m^{n-1} possible starting sequences of words.

Instead, we will make a simplifying assumption, called the First-order Markov Assumption, which says that the probability of a word depends (approximately) only on the previous word:

P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1})

Using this assumption we now have a much simpler expression for computing the joint probability:

P(w_1 w_2 ... w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_{n-1})

This simplified model is called a bigram model because it relates consecutive pairs of words. This provides a minimum amount of context for determining the probability of each word in a sentence. Obviously, using more context (for example, a second-order Markov assumption so that we use a trigram grammar) would be more accurate but more expensive to compute.

We can construct a table representing the bigram model by computing statistics of the frequency of all possible pairs of words in a (large) training set of word strings. For example, if the word "a" appears 10,000 times in the training set and "a" is followed by "gun" 37 times, then P(gun | a) = 37/10,000 = 0.0037.
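
A minimal sketch of this counting procedure, using a tiny invented training corpus (real systems train on millions of words and need smoothing to handle word pairs that never appear in training, which this sketch ignores):

```python
from collections import Counter

# Tiny made-up training corpus; <s> marks the start of each sentence.
corpus = [
    "<s> i have a gun".split(),
    "<s> i have a dog".split(),
    "<s> a dog i have".split(),
]

context_count = Counter()   # how often each word appears as the left word
pair_count = Counter()      # how often each ordered word pair appears
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        context_count[prev] += 1
        pair_count[(prev, word)] += 1

def p_bigram(word, prev):
    """P(word | prev) = count(prev word) / count(prev)."""
    if context_count[prev] == 0:
        return 0.0
    return pair_count[(prev, word)] / context_count[prev]

print(p_bigram("gun", "a"))   # 1/3: "a" is a left word 3 times, "a gun" once
```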

Instead of representing the bigram model in a table, we can alternatively represent it as a probabilistic finite state machine. Create a node (state) for each possible word, and draw an arc from each node to every other node. Label each arc with the probability that the word associated with the source node is followed by the word associated with the destination node. Finally, add a node called START and an arc from it to all of the other nodes. Label these arcs with the probability that the word associated with the destination node can start a sentence. We can use this FSM to determine the probability of a given sentence by starting at node START and making the transitions to the nodes corresponding to successive words in the given sentence. Multiplying the probabilities of the arcs that are traversed gives the estimated joint probability associated with the First-order Markov assumption given above.
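
Continuing the sketch above, walking the FSM and multiplying arc probabilities looks like the following; the <s> state plays the role of the START node, and log-probabilities are summed rather than multiplied to avoid underflow on long sentences.

```python
import math

def sentence_log_prob(words):
    """Walk the bigram 'FSM' from the start state, summing log arc probs."""
    path = ["<s>"] + words
    total = 0.0
    for prev, word in zip(path, path[1:]):
        p = p_bigram(word, prev)
        if p == 0.0:
            return float("-inf")   # unseen transition; smoothing would fix this
        total += math.log(p)
    return total

# log[ P(i | <s>) P(have | i) P(a | have) P(gun | a) ]
print(sentence_log_prob("i have a gun".split()))   # log(2/9), about -1.504
```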

For more information on HMMs and speech recognition see

  1. Markov Models and Hidden Markov Models: A Brief Tutorial, E. Fosler-Lussier, Technical Report TR-98-041, International Computer Science Institute, University of California, Berkeley, 1998.
  2. Introduction to Hidden Markov Models by Roger Boyle. Another version appears in M. Sonka, V. Hlavac and R. Boyle, Image Processing, Analysis, and Machine Vision, 2nd ed., Brooks/Cole Publishing, 1999, 417-423.
  3. Speech recognition by machines and humans, R. Lippmann, Speech Communication 22, 1997, 1-15.
  4. 2000 NIST evaluation of conversational speech recognition over the telephone: English and Mandarin performance results, J. Fiscus et al., 2000.
  5. Chapter 7: HMMs and speech recognition, in Speech and Language Processing, D. Jurafsky and J. Martin, Prentice Hall, 2000, 235-284.
  6. Special Issue on Speech Recognition, Computer 35(4), April 2002, 38-66.
  7. An introduction to hidden Markov models, L. Rabiner and B. Juang, IEEE ASSP Magazine 3, 1986, 4-16.
  8. A tutorial on hidden Markov models and selected applications in speech recognition, L. Rabiner, Proc. IEEE 77(2), 1989, 257-285.


Copyright © 1998-2003 by Charles R. Dyer. All rights reserved.