Bayesian Robustification for Audio Visual 
Fusion 
Javier Movellan * 
movellancogsci. ucsd. edu 
Department of Cognitive Science 
University of California, San Diego 
La Jolla, CA 92092-0515 
Paul Mineiro 
prnineirocogsci. ucsd. edu 
Department of Cognitive Science 
University of California, San Diego 
La Jolla, CA 92092-0515 
Abstract 
We discuss the problem of catastrophic fusion in multimodal recog- 
nition systems. This problem arises in systems that need to fuse 
different channels in non-stationary environments. Practice shows 
that when recognition modules within each modality are tested in 
contexts inconsistent with their assumptions, their influence on the 
fused product tends to increase, with catastrophic results. We ex- 
plore a principled solution to this problem based upon Bayesian 
ideas of competitive models and inference robustification: each 
sensory channel is provided with simple white-noise context mod- 
els, and the perceptual hypothesis and context are jointly esti- 
mated. Consequent]y, context deviations are interpreted as changes 
in white noise contamination strength, automatically adjusting the 
influence of the module. The approach is tested on a fixed lexicon 
automatic audiovisual speech recognition problem with very good 
results. 
i Introduction 
In this paper we address the problem of catastrophic fusion in automatic multimodal 
recognition systems. We explore a principled solution based on the Bayesian ideas of 
competitive models and inference robustification (Clark &: Yuille, 1990; Box, 1980; 
O'Hagan, 1994). For concreteness, consider an audiovisual car telephony task which 
we will simulate in later sections. The task is to recognize spoken phone numbers 
based on input from a camera and a microphone. We want the recognition system to 
work on environments with non-stationary statistical properties: at times the video 
signal (V) may be relatively clean and the audio signal (A) may be contaminated by 
sources like the radio, the engine, and friction with the road. At other times the A 
signal may be more reliable than the V signal, e.g., the radio is off, but the talker's 
mouth is partially occluded. Ideally we want the audio-visual system to combiae 
the A and V sources optimally given the conditions at hand, e.g., give more weight 
to whichever channel is more reliable at that time. At a minimum we expect that 
for a wide variety of contexts, the performance after fusion should not be worse than 
the independent unimodal systems (Bernstein &: Benoit, 1996). When component 
modules can significantly outperform the overall system after fusion, catastrophic 
fusion is said to have occurred. 
* To whom correspondence should be addressed. 
Bayesian Robustification for Audio Visual Fusion 743 
Fixed vocabulary audiovisual speech recognition (AVSR) systems typically con- 
sist of two independent modules, one dedicated to A signals and one to V signals 
(Bregler, Hild, Manke & Waibel, 1993; Wolff, Prasad, Stork & Hennecke, 1994; 
Adjondani & Benoit, 1996; Movellan & Chadderdon, 1996). From a Bayesian per- 
spective this modularity reflects an assumption of conditional independence of A 
and V signals (i.e., the likelihobd function factorizes) 
p(xaxvl:iAaAv) p(xal:iA)p(xvi:iAv), (1) 
where Xa and Xv are the audio and video data, :i is a perceptual interpretation 
of the data (e.g., the word "one") and (Aa, Av} are the audio and video models 
according to which these probabilities are calculated, e.g., a hidden Markov model, 
a neural network, or an exemplar model. Training is also typically modularized: the 
A module is trained to maximize the likelihood of a sample of A signals while the V 
module is trained on the corresponding sample of V signals. At test time new data 
are presented to the system and each module typically outputs the log probability 
of its input given each perceptual alternative. Assuming conditional independence, 
Bayes' rule calls for an affine combination of modules 
tb -- argmax {10gp(:ilXaXvhav)) 
= argmax{logp(xl:ih) + logp(xvl:ihv) + logp(:i)) (2) 
where tb is the interpretation chosen by the system, and p(wi) is the prior probability 
of each alternative. This fusion rule is optimal in the sense that it minimizes the 
expected error: no other fusion rule produces smaller error rates, provided the 
models {ha, hv) and the assumption of conditional independence are correct. 
Unfortunately a naive application of Bayes' rule to AVSR produces catastrophic 
fusion. The A and V modules make assumptions about the signals they receive, 
either explicitly, e.g., a well defined statistical model, or implicitly, e.g., a black- 
box trained with a particular data sample. In our notation these assumptions are 
reflected by the fact that the log-likelihoods are conditional on models: 
The fact that modules make assumptions implies that they will operate correctly 
only within a restricted context, i.e, the collection of situations that meet the as- 
sumptions. In practice one typically finds that Bayes' rule assigns more weight to 
modules operating outside their valid context, the opposite of what is desired. 
2 Competitive Models and Bayesian Robustification 
Clark and Yuille (1990) and Yuille and Bulthoff (1996) analyzed information inte- 
gration in sensory systems from a Bayesian perspective. Modularity is justified in 
their view by the need to make assumptions that disambiguate the data available to 
the perceptual system (Clark & Yuille, 1990, p. 5). However, this produces modules 
which are valid only within certain contexts. The solution proposed by Clark and 
Yuille (1990) is the creation of an ensemble of models each of which specializes on 
a restricted context and automatically checks whether the context is correct. The 
hope is that by working with such an ensemble of models, robustness under a variety 
of contexts can be achieved (Clark & Yuille, 1990, p. 13). 
Box (1980) investigated the problem of robust statistical inference from a Bayesian 
perspective. He proposed extending inference models with additional "nuisance" 
parameters a, a process he called Bayesian robustification. The idea is to replace 
an implicit assumption about the specific value of a with a prior distribution over 
a, representing uncertainty about that parameter. 
The approach here combines the ideas of competitive models and robustification. 
Each of the channels in the multimodal recognition system is provided with extra 
744 J. Movellan and P. Mineiro 
parameters that represent non-stationary properties of the environment, what we 
call a context model. By doing so we effectively work with an infinite ensemble of 
models each of which compete on-line to explain the data. As we will see later even 
unsophisticated context models provide superior performance when the environment 
is non-stationary. 
We redefine the estimation problem as simultaneously choosing the most probable 
A and V context parameters and the most probable perceptual interpretation 
tb = argmax {maxp(wirrarrv,XaXvav) } (3) 
/ayi O'a 
where aa and a. are the context parameters for the audio and visual channels 
and wi are the different perceptual interpretations. One way to think of this joint 
decision approach is that we let all context models compete and we let only the 
most probable context models have an influence on the fused percept. Hereafter we 
refer to this approach as competitive fusion. 
Assuming conditional independence of the audio and video data and uninformative 
priors for (era, a,, we have 
tb = argmax {logp(o:i) + 
(4) 
[axlogp(xalwi'aAa)] + [axlogp(x, lwi',A,)] }  
Thus conditional independence allows a modular implementation of competitive 
fusion, i.e., the A and V channels do not need to talk to each other until the time 
to make a joint decision, as follows. 
1. For each wi obtain conditional estimates of the context parameters for the 
audio and video signals: 
^2 A 
aais, = argmax { 10gp(XalWirraAa) }, (5) 
O' a 
and 
2 A 
vl, = argmax { logp(xvlWiO'vAv) ). 
O'v 
2. Find the best wi using the conditional context estimates. 
5 - argmax {logp(wi) + 10gp(XalWiS'alJka ) + 10gp(XvliS'vlJkv) 
(6) 
(7) 
3 Application to AVSR 
Competitive fusion can be easily applied to Hidden Markov Models (HMM), an 
architecture closely related to stochastic neural networks and arguably the most 
successful for AVSR. Typical hidden Markov models used in AVSR are defined by 
 Markovian state dynamics: P(qt+ IqO - p(qt+llqt), where qt is the state at 
time t and q-t = (ql,"' qt), 
 Conditionally independent sensor models linking observations to states 
f(xtlqt), typically a mixture of multivariate Gaussian densities 
f(xtlqt) = p(milqt)(27r) -N/2 ]']]--1/2 exp(d(xt,qt,lai,y.)) ' (8) 
i 
Bayesian Robustification for Audio Visual Fusion 745 
where N is the dimensionality of the data, mi is the mixture label, P(mi[qt) 
is the mixture distribution for state qt, i is the centroid for mixture 
Y, is a covariance matrix, and d is the Mahalanobis norm 
d(xt,qt,]gi,) = (xt - ]gi)t-l(xa - ]i). 
(9) 
The approach explored here consists on modeling contextual changes as variations 
on the variance parameters. This corresponds to modeling non-stationary proper- 
ties of the environments as variations in white noise power within each channel. 
Competitive fusion calls for on-line maximization of the variance parameters at the 
same time we optimize with respect to the response alternative. 
tb = argmax logp(wi) + 
[mE,axlogp(xalwiYa)a) ] 
(10) 
The maximization with respect to the variances can be easily integrated into stan- 
dard HMM packages by simply applying the EM learning algorithm (Dampster, 
Laird & Rubin, 1977) on the variance parameters at test time. Thus the only dif- 
ference between the standard approach and competitive fusion is that we retrain 
the variance parameters of each HMM at test time. In practice this training takes 
only one or two iterations of the EM algorithm and can be done on-line. We tested 
this approach on the following AVSR problem. 
Training database We used Tulipsl (Movellan, 1995) a database consisting of 
934 images of 9 male and 3 female undergraduate students from the Cognitive 
Science Department at the University of California, San Diego. For each of these, 
two samples were taken for each of the digits "one" through "four". Thus, the total 
database consists of 96 digit utterances. The specifics of this database are explained 
in (Movellan, 1995). The database is available at http://cogsci.ucsd.edu. 
Visual processing We have tried a wide variety of visual processing approaches 
on this database, including decomposition with local Gaussian templates (Movellan, 
1995), PCA-based templates (Gray, Movellan & Sejnowski, 1997), and Gabor energy 
templates (Movellan & Prayaga, 1996). To date, best performance was achieved 
with the local Gaussian approach. Each frame of the video track is soft-thresholded 
and symmetrized along the vertical axis, and a temporal difference frame is obtained 
by subtracting the previous symmetrized frame from the current symmetrized frame. 
We calculate the inner-products between the symmetrized images and a set of basis 
images. Our basis images were 10x15 shifted Gaussian kernels with a standard 
deviation of 3 pixels. The loadings of the symmetrized image and the differential 
image are combined to form the final observation frame. Each of these composite 
frames has 300 dimensions (2x10x15). The process is explained in more detail in 
Movellan (1995). 
Auditory processing LPC/cepstral analysis is used for the auditory front-end. 
First, the auditory signal is passed through a first-order emphasizer to spectrally 
flatten it. Then the signal is separated into non-overlapping frames at 30 frames 
per second. This is done so that there are an equal number of visual and auditory 
feature vectors for each utterance, which are then synchronized with each other. On 
each frame we perform the standard LPC/cepstral analysis. Each 30 msec auditory 
frame is characterized by 26 features: 12 cepstral coefficients, 12 delta-cepstrals, 
i log-power, and I delta-log-power. Each of the 26 features is encoded with 8-bit 
accuracy. 
746 J. Movellan and P. Mineiro 
Figure 1: Examples of the different occlusion levels, from left to right: 0%, 10%, 
20%, 40%, 60%, 80%. Percentages are in terms of area. 
Recognition Engine In previous work (Chadderdon & Movellan, 1995) a wide 
variety of HMM architectures were tested on this database including architectures 
that did not assume conditional independence. Optimal performance was found 
with independent A and V modules using variance matrices of the form aI, where 
a is a scalar and I the identity matrix. The best A models had 5 states and 7 
mixtures per state and the best V models had 3 states and 3 mixtures per state. 
We also determined the optimal weight of A and V modules. Optimal performance 
is obtained by weighting the output of V times 0.18. 
Factorial Contamination Experiment In this experiment we used the pre- 
viously optimized architecture and compared its performance under 64 different 
conditions using the standard and the competitive fusion approaches. We used a 
2 x 8 x 8 factoffal design, the first factor being the fusion rule, and the second and 
third factors the context in the audio and video channels. To our knowledge this is 
the first time an AVSR system is tested with a factoffal experimental design with 
both A and V contaminated at various levels. The independent variables were: 
1. Fusion rule: Classical, and competitive fusion. 
2. Audio Context: Inexistent, clean, or contaminated at one of the following 
signal to noise ratios: 12 Db, 6 Db, 0 Db, -6 Db, -12 Db and -100 Db. The 
contamination was done with audio digitally sampled from the interior of 
a car while running on a busy highway with the doors open and the radio 
on a talk-show station. 
3. Video Context: Inexistent, clean or occluded by a grey level patch. The 
percentages of visual area occupied by the patch were 10%, 20%, 40%, 60%, 
80% and 100% (see Figure 1). 
The dependent variable was performance on the digit recognition task evaluated in 
terms of generalization to new speakers. In all cases training was done with clean 
signals and testing was done with one of the 64 contexts under study. Since the 
training sample is small, generalization performance was estimated using a jackknife 
procedure (Efron, 1982). Models were trained with 11 subjects, leaving a different 
subject out for generalization testing. The entire procedure was repeated 12 times, 
each time leaving a different subject out for testing. Statistics of generalization 
performance are thus based on 96 generalization trials (4 digits x 12 subjects x 
2 observations per subject). Standard statistical tests were used to compare the 
classical and competitive context rules. 
The results of this experiment are displayed in Table 1. Note how the experiment 
replicates the phenomenon of catastrophic fusion. With the classic approach, when 
one of the channels is contaminated, performance after fusion can be significantly 
Bayesian Robustification for Audio Visual Fusion 747 
Performance wth Corapettive Fusion 
Audio 
Video None Clean 12 Db 6 Db 0 Db -6 Db -12 Db -100 Db 
None -- 95.83 95.83 90.62 80.20 67.70 42.70 19.80 
Clean 84.37 97.92 97.92 94.80 90.62 89.58 81.25  
10% 73.95 93.75 93.75 94.79 87.50 80.20 71.87 
20% 62.50 96.87 96.87 94.79 89.58 80.20 6- 41.66 
40% 37.50 93.75 89.58 87.50 83.30 70.83 43.75 30.20 
60% 34,37 93.75 91.66 88.54 82.29 65.62 42.70 26.04 
80% 27.00 95.83 90.62 86.45 79.16 64.58 46 25.00 
100% 25.00 93.75 92.71 84.37 78.12 63.54 44.79 26.04 
Performance wth Classic Fusion 
Audio 
Video None Clean 12 Db 6 Db 0 Db -6 Db -12 Db -100 Db 
None -- 95.83 94.79 89.58 79.16 65.62 40.62 20.83 
Clean 86.45 98.95 96.87 95.83 93.75 87.50 79.16  
10% 73.95 93.75 93.75 93.75 89.58 79.16 70.83 
20% 54.16 89.58 84.41 84.37 84.37 75.00 51- 43.00 
40% 29.16 81.25 78,12 78.12 67.20 52.08 38.54 34,37 
60% 32.29 77.08 77.08 72.91 62.50 47.91 37.50 29.16 
80% 29.16 70.83 72,91 68.75 54.16 44.79 3 28.12 
100% 25.00 61.46 61.45 58.33 51.04 42.70 38.54 29.16 
Table 1: Average generalization performance with standard and competitive fusion. 
Boxed cells indicate a statistically significant difference a -- 0.05 between the two 
fusion approaches. 
worse than performance with the clean channel alone. For example, when the audio 
is clean, the performance of the audio-only system is 95.83%. When combined with 
bad video (100% occlusion), this performance drops down to 61.46%, a statistically 
significant difference, F(1,11) - 132.0, p < 10 -6. Using competitive fusion, the 
performance of the joint system is 93.75%, which is not significantly different from 
the performance of the A system only, F(1,11) = 2.4, p- 0.15. The table shows 
in boxes the regions for which the classic and competitive fusion approaches were 
significantly different (a = 0.05). Contrary to the classic approach, the competitive 
approach behaves robustly in all tested conditions. 
4 Discussion 
Catastrophic fusion may occur when the environment is non-stationary forcing mod- 
ules to operate outside their assumed context. The reason for this problem is that in 
the absence of a context model, deviations from the expected context are interpreted 
as information about the different perceptual interpretations instead of information 
about contextual changes. We explored a principled solution to this problem in- 
spired by the Bayesian ideas of robustification (Box, 1980) and competitive models 
(Clark & Yuille, 1990). Each module was provided with simple white-noise context 
models and the most probable context and perceptual hypothesis were jointly es- 
timated. Consequently, context deviations are interpreted as changes in the white 
noise contamination strength, automatically adjusting the influence of the module. 
The approach worked very well on a fixed lexicon AVSR problem. 
References 
Adjondani, A. & Benoit, C. (1996). On the Integration of Auditory and Visual 
Parameters in an HMM-based ASR. In D. G. Stork & M. E. Hennecke (Eds.), 
Speechreading by Humans and Machines: Models, Systems, and Applications, 
pages 461-471. New York: NATO/Springer-Verlag. 
Bernstein, L. & Benoit, C. (1996). For Speech Perception Three Senses are Bettern 
748 J. Movellan and P. Mineiro 
than One. In Proc. of the Jth Int. Conf. on Spoken Language Processing, 
Philadelphia, PA., USA. 
Box, G. E. P. (1980). Sampling and Bayes inference in scientific modeling. J. Roy. 
Star. Soc., A., 1.43, 383-430. 
Bregler, C., Hild, H., Manke, S., & Waibel, A. (1993). Improving Connected Letter 
Recognition by Lipreading. In Proc. Int. Conf. on Acoust., Speech, and Signal 
Processing, volume 1, pages 557-560, Minneapolis. IEEE. 
Biilthoff, H. H. & Yuille, A. L. (1996). A Bayesian framework for the integration 
of visual modules. In T. Inui & J. L. McClelland (Eds.), Attention and perfor- 
mance XVI: Information integration in perception and communication, pages 
49-70. Cambridge, MA: MIT Press. 
Chadderdon, G. & Movellan, J. (1995). Testing for Channel Independence in Bi- 
modal Speech Recognition. In Proceedings of 2nd Joint Symposium on Neural 
Computation, pages 84-90. 
Clark, J. J. & Yuille, A. L. (1990). Data Fusion for Sensory Information Processing 
Systems. Boston: Kluwer Academic Publishers. 
Dampster, A. P., Laird, N.M., & Rubin, D. B. (1977). Maximum likelihood from 
incomplete data via the EM algorithm. J. Roy. Star. Soc., 39, 1-38. 
Efron, A. (1982). The jacknife, the bootstrap and other resampling plans. Philadel- 
phia, Pennsylvania: SIAM. 
Gray, M. S., Mo9ellan, J. R., & Sejnowski, T. (1997). Dynamic features for visual 
speechreading: A systematic comparison. In Mozer, Jordan, & Petsche (Eds.), 
Advances in Neural Information Processing Systems, volume 9. MIT Press. 
Movellan, J. R. (1995). Visual speech recognition with stochastic neural networks. 
In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information 
processing systems. Cambridge,Massacusetts: MIT Press. 
Movellan, J. R. & Chadderdon, G. (1996). Channel Separability in the Audio 
Visual Integration of Speech: A Bayesian Approach. In D. G. Stork & M. E. 
Hennecke (Eds.), Speechreading by Humans and Machines: Models, Systems, 
and Applications, pages 473-487. New York: NATO/Springer-Verlag. 
Movellan, J. R. & Prayaga, R. S. (1996). Gabor Mosaics: A description of Local 
Orientation Statistics with Applications to Machine Perception. In G. W. 
Cottrell (Ed.), proceedings of the Eight Annual Conference of the Cognitive 
Science Society, page 817. Mahwah, New Jersey: LEA. 
O'Hagan, A. (1994). Kendall's Advanced Theory of Statistics: Volume 2B, Bayesian 
Inference. volume 2B. Cambridge University Press. 
Wolff, G. J., Prasad, K. V., Stork, D. G., & Hennecke, M. E. (1994). Lipreading by 
Neural Networks: Visual Preprocessing, Learning and Sensory Integration. In 
J. D. Cowan, G. Tesauro, 2 J. Alspector (Eds.), Advances in Neural Informa- 
tion Processing Systems, volume 6, pages 1027-1034. Morgan Kaufmann. 
