Dynamic features for visual speech- 
reading: A systematic comparison 
Michael S. Gray 's, Javier R. Movellnn , Terrence J. Sejnowski 2,s 
Departments of Cognitive Science I and Biology 2 
University of California, San Diego 
La Jolla, CA 92093 
and 
Howard Hughes Medical Institute s 
Computational Neurobiology Lab 
The Salk Institute, P. O. Box 85800 
San Diego, CA 92186-5800 
Email: mgray, jmovellan, tsejnowski@ucsd.edu 
Abstract 
Humans use visual as well as auditory speech signals to recognize 
spoken words. A variety of systems have been investigated for per- 
forming this task. The main purpose of this research was to sys- 
tematically compare the performance of a range of dynamic visual 
features on a speechreading task. We have found that normal- 
ization of images to eliminate variation due to translation, scale, 
and planar rotation yielded substantial improvements in general- 
ization performance regardless of the visual representation used. In 
addition, the dynamic information in the difference between suc- 
cessive frames yielded better performance than optical-flow based 
approaches, and compression by local low-pass filtering worked sur- 
prisingly better than global principal components analysis (PCA). 
These results are examined and possible explanations are explored. 
1 INTRODUCTION 
Visual speech recognition is a challenging task in sensory integration. Psychophys- 
ical work by McGurk and MacDonald [5] first showed the powerful influence of 
visual information on speech perception that has led to increased interest in this 
752 M. S. Gray, J. R. Movellan and T. J. Sejnowski 
area. A wide variety of techniques have been used to model speech-reading. Yuhas, 
Goldstein, Sejnowski, and Jenkins [8] used feedforward networks to combine gray 
scale images with acoustic representations of vowels. Wolff, Prasad, Stork, and Hen- 
necke [7] explicitly computed information about the position of the lips, the shape of 
the mouth, and motion. This approach has the advantage of dramatically reducing 
the dimensionality of the input, but critical information may be lost. The visual 
information (mouth shape, position, and motion) was the input to a time-delay neu- 
ral network (TDNN) that was trained to distinguish among consonant-vowel pairs. 
A separate TDNN was trained on the acoustic signal. The output probabilities for 
the visual and acoustic signals were then combined multiplicatively. Bregler and 
Konig [1] also utilized a TDNN architecture. In this work, the visual information 
was captured by the first 10 principal components of a contour model fit to the lips. 
This was enough to specify the full range of lip shapes ("eigenlips"). Bregler and 
Konig [1] combined the acoustic and visual information in the input representation, 
which gave improved performance in noisy environments, compared with acoustic 
information alone. 
Surprisingly, the visual signal alone carries a substantial amount of information 
about spoken words. Garcia, Goldschen, and Petajan [2] used a variety of visual 
features from the mouth region of a speaker's face to recognize test sentences using 
hidden Markov models (HMMs). Those features that were found to give the best 
discrimination tended to be dynamic in nature, rather than static. Mase and Pent- 
land [4] also explored the dynamic information present in lip images through the 
use of optical flow. They found that a template matching approach on the optical 
flow of 4 windows around the edges of the mouth yielded results similar to humans 
on a digit recognition task. Movellan [6] investigated the recognition of spoken dig- 
its using only visual information. The input representation for the hidden Markov 
model consisted of low-pass filtered pixel intensity information at each time step, as 
well as a delta image that showed the pixel by pixel difference between subsequent 
time steps. 
The motivation for the current work was succinctly stated by Bregler and Konig [1]: 
"The real information in lipreading lies in the temporal change of lip positions, 
rather than the absolute lip shape." Although different kinds of dynamic visual 
information have been explored, there has been no careful comparison of different 
methods. Here we present results for four different dynamic techniques that are 
based on general purpose processing at the pixel level. The first approach was to 
combine low-pass filtered gray scale pixel values with a delta image, defined as the 
difference between two successive gray level images. A PCA reduction of this gray- 
scale and delta information was investigated next. The final two approaches were 
motivated by the kinds of visual processing that are believed to occur in higher 
levels of the visual cortex. We first explored optical flow, which provides us with 
a representation analogous to that in primate visual area MT. Optical flow output 
was then combined with low-pass filtered gray-scale pixel values. Each of these four 
representations was tested on two different datasets: (1) the raw video images, and 
(2) the normalized video images. 
Dynamic Features for lsual Speechreading: A Systematic Comparison 753 
ttttltfttt 
.Itttttlffiltttf 
'-.. - . ...,:.:;:.:..;:.;..,j. 
Figure 1: Image processing techniques. Left column: Two successive video frames 
(frames I and 2) from a subject saying the digit "one". These images have been 
made symmetric by averaging left and right pixels relative to the vertical midline. 
Middle column: The top panel shows gray scale pixel intensity information of frame 
2 after low-pass filtering and down-sampling to a resolution of 15 x 20 pixels. The 
bottom panel shows the delta image (pixel-wise subtraction of frame I from frame 
2), after low-pass filtering and downsampling. Right column: The top panel shows 
the optical flow for the 2 video frames in the left column. The bottom panel shows 
the reconstructed optical flow representation learned by a 1-state HMM. This can 
be considered the canonical or prototypical representation for the digit "one" across 
our database of 12 individuals. 
2 METHODS AND MODELS 
2.1 TRAINING SAMPLE 
The training sample was the Tulipsl database (Movellan [6]): 96 digitized movies 
of 12 undergraduate students (9 males, 3 females) from the Cognitive Science De- 
partment at UC-San Diego. Video capturing was performed in a windowless room 
at the Center for Research in Language at UC-San Diego. Subjects were asked to 
talk into a video camera and to say the first four digits in English twice. Subjects 
could monitor the digitized images in a small display conveniently located in front 
of them. They were asked to position themselves so that their lips were roughly 
centered in the feed-back display. Gray scale video images were digitized at 30 
frames per second, 100 x 75 pixels, 8 bits per pixel. The video tracks were hand 
segmented by selecting a few relevant frames before the beginning and after the 
end of activity in the acoustic track. There were an average of 9.7 frames for each 
movie. Two sample frames are shown in the left column of Figure 1. 
2.2 IMAGE PROCESSING 
We compared the performance of four different visual representations for the digit 
recognition task: low-pass + delta, PCA of (gray-scale + delta), flow, and low-pass 
754 M. S. Gray, J. R. Movellan and T. J. Sejnowski 
Figure 2: The first 9 principal components of the raw lip images, starting in upper 
left and proceeding in normal English reading order. The left half of each image 
contained pixel intensity information, and the right half represented the delta image. 
-I- flow. The video images were made symmetric by averaging pixels from the left and 
right side of the image (Figure 1, left column). The low-pass + delta representation 
(Movellan [6]) consisted of 2 parts (Figure 1, middle column). The images were 
low-pass filtered, and downsampled to a resolution of 15 x 20 pixels. Delta images 
were formed from the pixel-by-pixel difference between subsequent time frames, 
and then low-pass filtered and downsampled to 15 x 20 pixels. Because the images 
were symmetric, we used only half of the low-pass and delta images, resulting in a 
300-dimensional input vector. 
Our PCA of (gray-scale + delta) representation was derived from input images 
and delta images at the original resolution of 75 x 100 pixels. We examined the 
projections on to principal components (PCs) 1-300. Because the projection onto 
the first principal component accounted for 95% of the variance, differences between 
the remaining projections (2-300) were substantially reduced when the coefficients 
were normalized to the range [0 1] for input to the HMM. For this reason, results 
were also obtained for the projections on to PCs 2-301. This range of PCs gave 
much better performance, and it is the results on this set of PCs that are reported 
here. The first 9 PCs of the lip images are illustrated in Figure 2. 
The delta image representation captures information about changes in the lip image 
over time, but does not signal the direction in which lip features are moving. To get 
directional information, we computed optical flow. This computation was based on 
the standard brightness constraint equation, followed by thresholding to eliminate 
locations without any edges or with small temporal derivatives. The resulting flow 
field was then low-pass filtered and downsampled to obtain our flow representation: 
a 140-dimensional input vector (70 for left/right motion, and 70 for up/down mo- 
tion), which is illustrated in Figure I (right column, top panel). Finally, for the 
low-pass + flow inputs, low-pass filtered pixel intensities were combined with optical 
flow. 
All four visual representations were tested on two different datasets. The first 
dataset contained the raw video images described above. In the second dataset, 
images were normalized so that variations due to translation, scale, and planar 
Dynamic Features for Visual Speechreading: A Systematic Comparison 755 
'1' 
State 1 .............. 
State 2 .............. 
State 3 ............. 
Ste 4 "" ...... "" 
Ste5 : :':::::: ::::: 
Spoken Digits 
'2' 
...... ); ..... 
....... o ...... 
'3' 
 ;;.,, .... 
:.!!. : :  :. 
'4' 
..,..;..,... 
Figure 3: The optical flow representations learned by the 5-state HMMs. 
rotation were eliminated. The images were normalized using parameters obtained 
from contour modeling of the lips by Luettin, Thacker, and Beet [3]. 
2.3 RECOGNITION ENGINE 
The different visual representations described above formed the input to hidden 
Markov models which were separately trained for each word category. The images 
were modeled as mixtures of Gaussian distributions in pixel space. The initial state 
probabilities, transition probabilities, mixture coefficients, and mixture centroids 
were optimized using the EM algorithm. Because the probability of images rapidly 
approached zero when using the EM algorithm with Gaussian mixtures, we con- 
strained the variance parameters for all the states and mixtures to be equal. In 
addition, the centroids of the mixtures were initialized using a linear segmentation 
followed by k-means clustering. 
3 RESULTS 
Each input representation was tested with 9 different architectures generated by 
combining different numbers of states (5, 7, 9) with different numbers of Gaussians 
(3, 5, 7) to represent each state. Each set of simulations took approximately 20 hours 
on a 300 MHz DEC Alpha processor. The best performance (of the 9 architectures) 
for each input representation is shown in Table 1. Results from the low-pass + delta 
756 M. S. Gray, J. R. Movellan and T. J. Sejnowski 
representation closely matched, as expected, Movellan [6]. The small difference 
between Movellan [6] and the results reported here are likely due to differences in 
the low-pass filtering kernel, and different initializations of the parameters of the 
Ganssians during k-means clustering. 
The flow representations (with or without low-pass pixel intensity information) 
gave similar results. Both of the flow representations, however, yielded better per- 
formance when the normalized images were used as input. For the PCA represen- 
tation, performance was poor on the raw images, but improved markedly on the 
normalized dataset. The low-pass + delta representation gave excellent results for 
both raw and normalized images. Although the low-pass + delta representation 
matched the performance of normal humans (89.9%), none of these models has yet 
to reach the level of trained lipreaders (95.5%) on this same database (Movellan [6]). 
The states learned by the HMM for these flow inputs give us information about 
the dynamic movement of the lips through time. These learned states (for a 5- 
state HMM) are shown in Figure 3, and provide an intuitive notion to the kinds of 
muscular activity in the face that correspond to each digit. Figure I (right column, 
bottom panel) shows the flow learned by a 1-state HMM. 
Image Processing Performance: Performance: 
Raw Images Normalized Images 
Low-pass + Delta 85.4 90.6 
PCA of (Gray-scale + Delta) 67.7 85.4 
Low-pass + Flow 61.5 68.8 
Flow 63.5 66.7 
Table 1: Best generalization performance (% correct) for the different visual input 
representations using both raw and normalized video images. 
4 DISCUSSION 
The purpose of this research was to compare a range of image processing techniques 
on a visual digit recognition task. We found that normalization of the lip images 
(for translation, rotation, and scale) was a crucial factor in improving the perfor- 
mance of the different visual representations. Normalization is important because 
the HMM has no mechanism to account for the wide variations in lip position and 
size that are present in the original dataset. It was also found that the dynamic 
temporal information present in the delta image contributes more to generaliza- 
tion performance than optical flow. Finally, compression by local low-pass filtering 
yielded better results than global principal components analysis. 
One area of inquiry is the difference between the information provided by optical 
flow and the delta image. Both carry dynamic information that signals the difference 
between the lip images at subsequent time steps. Why should the delta image yield 
15% better performance as compared to optical flow, when combined with low-pass 
pixel intensity information? Part of the explanation may lie in the thresholding 
performed in the optic flow computation, which is designed to eliminate noisy flow 
estimates. Unfortunately, it also leads to an optical flow output that is very sparse, 
as illustrated in Figure 1. The delta image, on the other hand, contains information 
Dynamic Features for Visual Speechreading: A Systematic Comparison 757 
at all points in the image. Although the significance of the delta image is not well 
understood, it does contain local dynamic information at all locations. 
The work described here represents an exploration of the kinds of dynamic informa- 
tion that may be valuable for speechreading. In contrast to model-based approaches, 
we have sought to retain as much information as possible in the lip images by al- 
lowing the recognition engine to find relevant features of the input. This effort 
to combine sophisticated image processing techniques with machine learning algo- 
rithms is a valuable approach that will likely lead to new insights in a variety of 
applications. 
Acknowledgements 
We thank Dr. Juergen Luettin for the use of his parameters for normalization of the 
Tulips1 database. Michael S. Gray was supported by the McDonnell-Pew Center 
for Cognitive Neuroscience in San Diego. 
References 
[1] C. Bregler and Y. Konig. Eigenlips for robust speech recognition. In Proceedings 
of IEEE ICASSP, pages 669-672. Adelaide, Australia, 1991. 
[2] 
O.N. Garcia, A.J. Goldschen, and E.D. Petajan. Feature extraction for op- 
tical automatic speech recognition or automatic lipreading. Technical Report 
G WU-IIST-9232, Dept. of Electrical Engineering and Computer Science, George 
Washington University, 1992. 
[3] 
J. Luettin, N.A. Thacker, and S.W. Beet. Visual speech recognition using active 
shape models and hidden markov models. In Proceedings of the IEEE Interna- 
tional Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 
817-820, Atlanta, Ga, 1996. IEEE. 
[4] K. Mase and A. Pentland. Automatic lipreading by optical-flow analysis. Sys- 
tems and Computers in Japan, 22(6):67-76, 1991. 
[5] H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264:126- 
130, 1976. 
[6] 
J.R. Movellan. Visual speech recognition with stochastic networks. In 
G. Tesauro, D.S. Touretzky, and T. Leen, editors, Advances in Neural Infor- 
mation Processing Systems, volume 7, pages 851-858. MIT Press, Cambridge, 
MA, 1995. 
[?] 
G.J. Wolff, K.V. Prasad, D.G. Stork, and M. Hennecke. Lipreading by neu- 
ral networks: Visual preprocessing, learning and sensory integration. In J.D. 
Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Informa- 
tion Processing Systems, volume 6, pages 1027-1034. Morgan Kaufmann, San 
Francisco, CA, 1994. 
[8] 
B.P. Yuhas, Jr. Goldstein, M.H., T.J. Sejnowski, and R.E. Jenkins. Neural net- 
work models of sensory integration for improved vowel recognition. Proceedings 
of the IEEE, 78(10):1658-1668, 1990. 
