University of Wisconsin - Madison | CS 540 Lecture Notes | C. R. Dyer |
The goal of speech recognition is to find the most likely string of words given the observed acoustic data, that is, to compute

argmax_words P(words | signal)

where words is a string of words in a given natural language like English, and signal is a sequence of observed acoustic data that has been digitized and pre-processed.
Information is available on the web about commercial speech recognition systems.
Next, overlapping subsequences of the digitized data are processed. Subsequences about 10 msec long (i.e., containing about 80 - 160 data points) are usually used, defining a sequence of frames. Within each frame a set of features is then computed, for example, the total energy in the frame and the difference in energy between the current frame and the previous frame. About 8 to 40 features are usually computed for each frame. This n-D vector of feature values is then itself quantized using a process called vector quantization into, e.g., 256 "bins," so that each frame is now described by one of 256 possible "labels." The result of this whole process is a compact description of overlapping regions of the acoustic signal that should be sufficient for word recognition.
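To make this pipeline concrete, the sketch below frames a signal, computes a tiny per-frame feature vector (total energy and delta energy), and vector-quantizes each frame against a 256-entry codebook. The sampling rate, frame and hop sizes, the two-feature vector, and the random codebook are all illustrative assumptions; a real system would use many more features and a codebook trained on speech data (e.g., with k-means).

# A minimal sketch of framing, per-frame feature extraction, and vector
# quantization, assuming a 1-D signal sampled at 8 kHz and a given codebook.
import numpy as np

def frame_signal(signal, frame_len=80, hop=40):
    """Split the signal into overlapping frames (10 msec at 8 kHz, 50% overlap)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def frame_features(frames):
    """Compute a tiny 2-D feature vector per frame: total energy and delta energy."""
    energy = np.sum(frames.astype(float) ** 2, axis=1)
    delta = np.diff(energy, prepend=energy[0])
    return np.stack([energy, delta], axis=1)        # shape (n_frames, 2)

def vector_quantize(features, codebook):
    """Label each frame with the index of the nearest codebook vector (0..255)."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Random data stands in for a real signal and a codebook trained with k-means.
signal = np.random.randn(8000)                      # one second of audio at 8 kHz
codebook = np.random.randn(256, 2)                  # 256 "bins" in 2-D feature space
labels = vector_quantize(frame_features(frame_signal(signal)), codebook)
print(labels[:10])                                  # one label per 10 msec frame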
We are given a digital signal of the type described above, and our goal is to find the sequence of words that maximizes P(words | signal). By Bayes' rule,

P(words | signal) = P(words) P(signal | words) / P(signal)

Since P(signal) is a constant for a given acoustic input, we can simply drop this term. Thus, our new goal is to compute

argmax_words P(words) P(signal | words)
P(words) represents our language model in that it specifies the prior probability of a particular word string. Thus it has to quantify the likelihood of a sequence of words occurring in English. P(signal | words) is the acoustic model, which specifies the probability of the acoustics given that a sequence of words was spoken. This part is complicated by the fact that often there are many ways of pronouncing a given word. The next sections describe in more detail the language model and the acoustic model.
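As a toy illustration of how the two models interact, the sketch below scores two candidate word strings by multiplying an invented language-model probability with an invented acoustic-model probability and keeps the larger product; every number here is fabricated purely for illustration.

# A toy illustration of combining the language model P(words) with the
# acoustic model P(signal | words); both candidate strings and all of the
# probabilities below are invented purely for illustration.
candidates = {
    "recognize speech":   {"p_words": 1e-5, "p_signal_given_words": 1e-3},
    "wreck a nice beach": {"p_words": 1e-7, "p_signal_given_words": 2e-3},
}

best = max(candidates,
           key=lambda w: candidates[w]["p_words"] * candidates[w]["p_signal_given_words"])
print(best)   # "recognize speech": the language model term dominates here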
Consider the language model P(words) first. Using the chain rule, the joint probability of a word string w1 w2 ... wn can be written as

P(w1 w2 ... wn) = P(w1) P(w2 | w1) ... P(wn-1 | w1, ..., wn-2) P(wn | w1, ..., wn-1)
Now we need to compute the probability that the first word in the sentence is w1, and the probability that the second word in the string is w2 given that the first word is w1, etc. The final part is the probability that the last word in the string is wn given that the sequence of n-1 words before it was w1, ..., wn-1. This expression is very complex because it requires that we determine conditional probabilities of long sequences of possible words. If words is a sequence of n words and our language contains m words, then to compute P(wn | w1, ..., wn-1) requires collecting statistics for m^(n-1) possible starting sequences of words.
Instead, we will make a simplifying assumption, called the First-order Markov Assumption, which says that the probability of a word depends (approximately) only on the previous word:

P(wn | w1, ..., wn-1) ≈ P(wn | wn-1)
Using this assumption we now have a much simpler expression for computing the joint probability:

P(w1 w2 ... wn) ≈ P(w1) P(w2 | w1) P(w3 | w2) ... P(wn | wn-1)
This simplified model is called a bigram model because it relates consecutive pairs of words. This provides a minimum amount of context for determining the probability of each word in a sentence. Obviously, using more context (for example, a second-order Markov assumption so that we use a trigram grammar) would be more accurate but more expensive to compute.
We can construct a table representing the bigram model by computing statistics on the frequency of all possible pairs of words in a (large) training set of word strings. For example, if the word "a" appears 10,000 times in the training set and "a" is followed by "gun" 37 times, then P(gun | a) = 37/10,000 = 0.0037.
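A minimal sketch of this counting procedure, using an invented two-sentence corpus, is shown below; a real bigram table would of course be estimated from a much larger training set.

# A minimal sketch of estimating bigram probabilities by counting; the
# two-sentence corpus is invented, and a real table would need far more data.
from collections import Counter

corpus = [["he", "bought", "a", "gun"],
          ["a", "gun", "was", "found"]]

history_count = Counter()     # how often each word appears as the first word of a pair
pair_count = Counter()        # how often each ordered pair of words appears
for sentence in corpus:
    for first, second in zip(sentence, sentence[1:]):
        history_count[first] += 1
        pair_count[(first, second)] += 1

def p_bigram(second, first):
    """Relative-frequency estimate of P(second | first), as in the "gun" | "a" example."""
    return pair_count[(first, second)] / history_count[first] if history_count[first] else 0.0

print(p_bigram("gun", "a"))   # "a gun" occurs 2 times out of 2 occurrences of "a" -> 1.0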
Instead of representing the bigram model in a table, we can alternatively represent it as a probabilistic finite state machine. Create a node (state) for each possible word, and draw an arc from each node to every other node. Label each arc with the probability that the word associated with the source node is followed by the word associated with the destination node. Finally, add a node called START and an arc from it to all of the other nodes. Label these arcs with the probability that the word associated with the destination node can start a sentence. We can use this FSM to determine the probability of a given sentence by starting at node START and making the transitions to the nodes corresponding to successive words in the given sentence. Multiplying the probabilities of the arcs that are traversed gives the estimated joint probability associated with the First-order Markov assumption given above.
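The sketch below walks a sentence through such a machine: the START-arc and transition probabilities are invented (apart from the 0.0037 estimate above), and any arc not listed is treated as having probability zero.

# A minimal sketch of scoring a sentence with the bigram model viewed as a
# probabilistic finite state machine. The START-arc and transition
# probabilities are invented (apart from the 0.0037 estimate above), and a
# missing arc is treated as probability zero.
start_prob = {"he": 0.4, "a": 0.1}            # arcs leaving the START node
trans_prob = {("he", "bought"): 0.2,          # arcs between word nodes
              ("bought", "a"): 0.3,
              ("a", "gun"): 0.0037}

def sentence_probability(words):
    """Multiply the START arc and the successive bigram arcs along the path."""
    p = start_prob.get(words[0], 0.0)
    for first, second in zip(words, words[1:]):
        p *= trans_prob.get((first, second), 0.0)
    return p

print(sentence_probability(["he", "bought", "a", "gun"]))   # 0.4 * 0.2 * 0.3 * 0.0037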
For more information on HMMs and speech recognition see
Copyright © 1998-2003 by Charles R. Dyer. All rights reserved.