Probabilistic Image Sensor Fusion 
Ravi K. Sharma , Todd K. Leen 2 and Misha Pavel  
1Department of Electrical and Computer Engineering 
2Department of Computer Science and Engineering 
Oregon Graduate Institute of Science and Technology 
P.O. Box 91000, Portland, OR 97291-1000 
Email: {ravi,pavel}@ece.ogi.edu, tleen@cse.ogi.edu 
Abstract 
We present a probabilistic method for fusion of images produced 
by multiple sensors. The approach is based on an image formation 
model in which the sensor images are noisy, locally linear functions 
of an underlying, true scene. A Bayesian framework then provides 
for maximum likelihood or maximum a posteriori estimates of the 
true scene from the sensor images. Maximum likelihood estimates 
of the parameters of the image formation model involve (local) 
second order image statistics, and thus are related to local principal 
component analysis. We demonstrate the efficacy of the method 
on images from visible-band and infrared sensors. 
1 Introduction 
Advances in sensing devices have fueled the deployment of multiple sensors in several 
computational vision systems [1, for example]. Using multiple sensors can increase 
reliability with respect to single sensor systems. This work was motivated by a 
need for an aircraft autonomous landing guidance (ALG) system [2, 3] that uses 
visible-band, infrared (IR) and radar-based imaging sensors to provide guidance 
to pilots for landing aircraft in low visibility. IR is suitable for night operation, 
whereas radar can penetrate fog. The application requires fusion algorithms [4] to 
combine the different sensor images. 
Images from different sensors have different characteristics arising from the varied 
physical imaging processes. Local contrast may be polarity reversed between visible- 
band and IR images [5, 6]. A particular sensor image may contain local features 
not found in another sensor image, i.e., sensors may report complementary features. 
Finally, individual sensors are subject to noise. Fig. 1(a) and 1(b) are visible-band 
and IR images respectively, of a runway scene showing polarity reversed (rectangle) 
and complementary (circle) features. These effects pose difficulties for fusion. 
An obvious approach to fusion is to average the pixel intensities from different 
sensors. Averaging, Fig. 1(c), increases the signal to noise ratio, but reduces the 
contrast where there are polarity reversed or complementary features [7]. 
Transform-based fusion methods [8, 5, 9] select from one sensor or another for fusion. 
They consist of three steps: (i) decompose the sensor images using a specified 
transform, e.g. a multiresolution Laplacian pyramid, (ii) fuse at each level of the 
pyramid by selecting the highest-energy transform coefficient, and (iii) invert the 
transform to synthesize the fused image. Since features are selected rather than 
averaged, they are rendered at full contrast, but the methods are sensitive to sensor 
noise, see Fig. 1(d). 
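A minimal numeric sketch of steps (i)-(iii), under simplifying assumptions: square images whose sides are powers of two, a 2x2 block average standing in for the Gaussian REDUCE filter, and nearest-neighbour upsampling standing in for EXPAND (an actual Laplacian pyramid [5] uses a small generating kernel for both).

```python
import numpy as np

def down(x):
    # 2x2 block average: a crude stand-in for Gaussian blur + subsample
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def up(x):
    # nearest-neighbour upsample: a crude stand-in for the expand step
    return x.repeat(2, axis=0).repeat(2, axis=1)

def select_fuse(a, b, levels=3):
    """Steps (i)-(iii): decompose both images, keep the larger-magnitude
    coefficient at each level, and invert the transform."""
    if levels == 0 or min(a.shape) < 2:
        return np.where(np.abs(a) >= np.abs(b), a, b)        # coarsest level
    la, lb = a - up(down(a)), b - up(down(b))                # detail levels
    detail = np.where(np.abs(la) >= np.abs(lb), la, lb)      # (ii) selection
    return detail + up(select_fuse(down(a), down(b), levels - 1))  # (iii)
```

Because each image individually satisfies `x == (x - up(down(x))) + up(down(x))`, fusing an image with itself reconstructs it exactly; with two different inputs the rule keeps whichever sensor's detail coefficient has more energy, which is exactly why it is noise-sensitive.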
To overcome the limitations of averaging or selection methods, and put sensor fusion 
on firm theoretical grounds, we explicitly model the production of sensor images 
from the true scene, including the effects of sensor noise. From the model, and 
sensor images, one can ask: what is the most probable true scene? This forms 
the basis for fusing the sensor images. Our technique uses the Laplacian pyramid 
representation [5], with the step (ii) above replaced by our probabilistic fusion. A 
similar probabilistic framework for sensor fusion is discussed in [10]. 
2 The Image Formation Model 
The true scene, denoted s, gives rise to a sensor image through a noisy, non-linear 
transformation. For ALG, s would be an image of the landing scene under conditions 
of uniform lighting, unlimited visibility, and perfect sensors. We model the map from 
the true scene to a sensor image by a noisy, locally affine transformation whose 
parameters are allowed to vary across the image (actually across the Laplacian 
pyramid) 
a_i(ℓ, t) = β_i(ℓ, t) s(ℓ, t) + α_i(ℓ, t) + ε_i(ℓ, t),   i = 1, …, q   (1) 

where s is the true scene, a_i is the i-th sensor image, ℓ = (x, y, k) is the hyperpixel 
location, with x, y the pixel coordinates and k the level of the pyramid, t is the 
time, α is the sensor offset, β is the sensor gain (which includes the effects of local 
polarity reversals and complementarity), and ε is the (zero-mean) sensor noise. To 
simplify notation, we adopt the matrix form 

a = β s + α + ε   (2) 

where a = [a_1, a_2, …, a_q]^T, β = [β_1, β_2, …, β_q]^T, ε = [ε_1, ε_2, …, ε_q]^T, s is a 
scalar, and we have dropped the reference to location and time. 
Since the image formation parameters β, α, and the sensor noise covariance Σ_ε can 
vary from hyperpixel to hyperpixel, the model can express local polarity reversals, 
complementary features, spatial variation of sensor gain, and noise. 

We do assume, however, that the image formation parameters and sensor noise 
distribution vary slowly¹ with location ℓ. Hence, a particular set of parameters is 
considered to hold true over a spatial region of several square hyperpixels. We will 
use this assumption implicitly when we estimate these parameters from data. 
¹Specifically, the parameters vary slowly on the spatio-temporal scales over which the 
true scene s may exhibit large variations. 

The model (2) fits the framework of the factor analysis model in statistics [11, 
12]. Here the hyperpixel values of the true scene s are the latent variables or 
common factors, β contains the factor loadings, and the sensor noise ε values are 
the independent factors. Estimation of the true scene is equivalent to estimating 
the common factors from the observations a. 
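To make the model concrete, here is a minimal generative sketch of Eq. (2); the gains, offsets and noise levels passed in are illustrative values of my own choosing, not estimates from the paper.

```python
import numpy as np

def sensor_images(s, beta, alpha, noise_std, rng):
    """Eq. (2) per sensor: a_i = beta_i * s + alpha_i + eps_i, with eps_i
    zero-mean Gaussian noise. beta_i < 0 models a local polarity reversal;
    beta_i = 0 models a feature missing from that sensor."""
    return [b * s + a + rng.normal(0.0, n, size=s.shape)
            for b, a, n in zip(beta, alpha, noise_std)]
```

With, say, `beta = [1.0, -1.0]`, the second simulated image is a polarity-reversed view of the same underlying scene, the situation illustrated by the rectangle in Fig. 1.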
3 Bayesian Fusion 
Given the sensor intensities a, we will estimate the true scene s by appeal to a 
Bayesian framework. We assume that the probability density function of the latent 
variable s is a Gaussian with local mean s₀(ℓ, t) and local variance σ_s²(ℓ, t). An 
attractive benefit of this setup is that the prior mean s₀ might be obtained from 
knowledge in the form of maps, or clear-weather images of the scene. Thus, such 
database information can be folded into the sensor fusion in a natural way. 

The density on the sensor images conditioned on the true scene, P(a|s), is normal 
with mean β s + α and covariance Σ_ε = diag[σ_ε1², σ_ε2², …, σ_εq²]. The marginal density 
P(a) is normal with mean μ_a = β s₀ + α and covariance 

C = Σ_ε + σ_s² β β^T .   (3) 

Finally, the posterior density on s given the sensor data a, P(s|a), is also normal, 
with mean M (β^T Σ_ε⁻¹ (a − α) + s₀/σ_s²) and covariance M = (β^T Σ_ε⁻¹ β + 1/σ_s²)⁻¹. 
Given these densities, there are two obvious candidates for probabilistic fusion: 
maximum likelihood (ML), ŝ = arg max_s P(a|s), and maximum a posteriori (MAP), 
ŝ = arg max_s P(s|a). 

The MAP fusion estimate is simply the posterior mean 

ŝ = (β^T Σ_ε⁻¹ (a − α) + s₀/σ_s²) / (β^T Σ_ε⁻¹ β + 1/σ_s²) ,   (4) 

which for two sensors reads 

ŝ = (β₁(a₁ − α₁)/σ_ε1² + β₂(a₂ − α₂)/σ_ε2² + s₀/σ_s²) / (β₁²/σ_ε1² + β₂²/σ_ε2² + 1/σ_s²) .   (5) 

To obtain the ML fusion estimate we take the limit σ_s → ∞ in either (4) or (5). 
For both ML and MAP, the fused image ŝ is a locally linear combination of the sensor 
images that can, through the spatio-temporal variations in β, α, and Σ_ε, properly 
respond to changes in the sensor characteristics that tax averaging or selection 
schemes. For example, if the second sensor has a polarity reversal relative to the 
first, then β₂ is negative and the two sensor contributions are properly subtracted. 
If the first sensor has high noise (large σ_ε1²), its contribution to the fused image is 
attenuated. Finally, a feature missing from sensor 1 corresponds to β₁ = 0. The 
model compensates by accentuating the contribution from sensor 2. 
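The two-sensor MAP rule (5) and its flat-prior ML limit can be sketched directly; the argument names are mine, and the inputs may be scalars or per-hyperpixel numpy arrays.

```python
import numpy as np

def map_fuse(a1, a2, b1, b2, al1, al2, v1, v2, s0=0.0, vs=1.0):
    """Two-sensor MAP fusion, Eq. (5): a precision-weighted combination of
    the offset-corrected sensor values and the prior mean.
    b*: gains beta, al*: offsets alpha, v*: noise variances sigma_eps^2,
    s0/vs: prior mean and variance."""
    num = b1 * (a1 - al1) / v1 + b2 * (a2 - al2) / v2 + s0 / vs
    den = b1**2 / v1 + b2**2 / v2 + 1.0 / vs
    return num / den

def ml_fuse(a1, a2, b1, b2, al1, al2, v1, v2):
    """ML fusion: the flat-prior (vs -> infinity) limit of the MAP rule."""
    num = b1 * (a1 - al1) / v1 + b2 * (a2 - al2) / v2
    return num / (b1**2 / v1 + b2**2 / v2)
```

Note how a polarity-reversed second sensor (b2 < 0) contributes with its sign flipped, so the two measurements reinforce rather than cancel, and a noisy sensor (large v) is automatically down-weighted.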
4 Model Parameter Estimates 
We need to estimate the local image formation model parameters α(ℓ, t), β(ℓ, t) and 
the local sensor noise covariance Σ_ε(ℓ, t). We estimate the latter from successive, 
motion-compensated video frames from each sensor. First we estimate the average 
value at each hyperpixel, ⟨a_i⟩(t), and the average square, ⟨a_i²⟩(t), by exponential 
moving averages. We next estimate the noise variance by the difference 
σ_εi²(t) = ⟨a_i²⟩(t) − ⟨a_i⟩²(t). 
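A sketch of this running-moments noise estimator; the decay rate `rate` is an assumed knob (the paper does not specify the EMA time constant).

```python
import numpy as np

class NoiseVarEstimator:
    """Per-hyperpixel noise variance from motion-compensated frames:
    track exponential moving averages of a and a^2, then use
    sigma_eps^2 = E[a^2] - E[a]^2."""
    def __init__(self, shape, rate=0.05):
        self.rate = rate
        self.m1 = np.zeros(shape)   # running mean of a
        self.m2 = np.zeros(shape)   # running mean of a^2
    def update(self, frame):
        r = self.rate
        self.m1 = (1 - r) * self.m1 + r * frame
        self.m2 = (1 - r) * self.m2 + r * frame**2
        # clamp at zero: the sample difference can go slightly negative
        return np.maximum(self.m2 - self.m1**2, 0.0)
```

For a static, noise-free sequence the estimate decays toward zero; frame-to-frame fluctuations that survive motion compensation show up as positive variance.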
To estimate β and α, we assume that β, α, Σ_ε, s₀ and σ_s² are nearly constant 
over small spatial regions (5 × 5 blocks) surrounding the hyperpixel for which the 
parameters are desired. Essentially we are invoking a spatial analog of ergodicity, 
where ensemble averages are replaced by spatial averages, carried out locally over 
regions in which the statistics are approximately constant. 
To form a maximum likelihood (ML) estimate of α, we extremize the data log- 
likelihood ℒ = Σ_{n=1}^N log[P(a_n)] with respect to α to obtain 

α_ML = μ_a − β s₀ ,   (6) 

where μ_a is the data mean, computed over a 5 × 5 hyperpixel local region (N = 25 
points). 
To obtain an ML estimate of β, we set the derivatives of ℒ with respect to β equal 
to zero and recover 

(C − Σ_a) C⁻¹ β = 0 ,   (7) 

where Σ_a is the data covariance matrix, also computed over a 5 × 5 hyperpixel local 
region. The only non-trivial solution to (7) is 

β_ML = r √((λ̃ − 1)/σ_s²) Σ_ε^(1/2) Ũ ,   (8) 

where Ũ, λ̃ are the principal eigenvector and eigenvalue of the weighted data co- 
variance matrix Σ̃_a = Σ_ε^(−1/2) Σ_a Σ_ε^(−1/2), and r = ±1. 
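A sketch of Eq. (8), assuming a diagonal Σ_ε so its matrix square root is elementwise:

```python
import numpy as np

def beta_ml(Sigma_a, Sigma_eps, sigma_s2=1.0, r=1.0):
    """ML gain estimate, Eq. (8): principal eigenpair of the noise-whitened
    data covariance. Assumes diagonal Sigma_eps and a signal-bearing region
    (leading eigenvalue > 1; clamped at zero otherwise)."""
    W = np.diag(1.0 / np.sqrt(np.diag(Sigma_eps)))   # Sigma_eps^{-1/2}
    vals, vecs = np.linalg.eigh(W @ Sigma_a @ W)     # weighted covariance
    lam, U = vals[-1], vecs[:, -1]                   # principal eigenpair
    return r * np.sqrt(max(lam - 1.0, 0.0) / sigma_s2) * (np.linalg.inv(W) @ U)
```

On model-exact synthetic data, Σ_a = Σ_ε + σ_s² β β^T, this recovers β up to the sign ambiguity r.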
An alternative to maximum likelihood estimation is the least squares (LS) ap- 
proach [11]. We obtain the LS estimate α_LS by minimizing 

E_α = Σ_{n=1}^N ‖a_n − β s₀ − α‖² ,   (9) 

with respect to α. This gives 

α_LS = μ_a − β s₀ .   (10) 

The least squares estimate β_LS is obtained by minimizing 

E_β = ‖Σ_a − Σ_ε − σ_s² β β^T‖² ,   (11) 

with respect to β. The solution to this minimization is 

β_LS = r √(λ/σ_s²) U ,   (12) 

where U, λ are the principal eigenvector and eigenvalue of the noise-corrected co- 
variance matrix (Σ_a − Σ_ε), and r = ±1.² 
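The LS counterpart, Eq. (12), is even simpler; no whitening is needed, only the noise-corrected covariance:

```python
import numpy as np

def beta_ls(Sigma_a, Sigma_eps, sigma_s2=1.0, r=1.0):
    """LS gain estimate, Eq. (12): principal eigenpair of the noise-corrected
    covariance Sigma_a - Sigma_eps (a sketch; the leading eigenvalue is
    clamped at zero in case sampling noise makes it negative)."""
    vals, vecs = np.linalg.eigh(Sigma_a - Sigma_eps)
    lam, U = vals[-1], vecs[:, -1]
    return r * np.sqrt(max(lam, 0.0) / sigma_s2) * U
```

On the same model-exact synthetic covariance as above, the LS and ML estimates coincide, consistent with footnote 2.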
The estimation procedures cannot provide values of the priors σ_s² and s₀. Were we 
dealing with a single global model, this would pose no problem. But we must impose 
a constraint in order to smoothly piece together our local models. We impose that 
‖β‖ = 1 everywhere, or by (12), σ_s² = λ. Recall that λ is the leading eigenvalue of 
Σ_a − Σ_ε, and thus captures the scale of variations in a that arise from variations in 
s. Thus we would expect λ ∝ σ_s². Our constraint insures that the proportionality 
constant be the same for each local model. Next, note that changing s₀ causes a shift 
in ŝ. To maintain consistency between local regions, we take s₀ = 0 everywhere. 
These choices for σ_s² and s₀ constrain the parameter estimates to 

β_LS = r U   and   α_LS = μ_a .   (13) 

²The least squares and maximum likelihood solutions are identical when the model is 
exact, Σ_a = C, i.e. the observed data covariance is exactly of the form dictated by the 
model. Under this condition, Ũ = (U^T Σ_ε⁻¹ U)^(−1/2) Σ_ε^(−1/2) U and (λ̃ − 1) = λ (U^T Σ_ε⁻¹ U). 
The LS and ML solutions are also identical when the noise covariance is homoscedastic, 
Σ_ε = σ_ε² I, even if the model is not exact. 
σ_s² and s₀ are defined at each hyperpixel. However, to estimate them 
we used spatial averages to compute the sample mean and covariance. This is 
somewhat inconsistent, since the spatial variation of s₀ (e.g. when there are edges 
in the scene) is not explicitly captured in the model mean and covariance. These 
variations are, instead, attributed to σ_s², resulting in overestimation of the latter. 
A more complete model would explicitly model the spatial variations of s₀, though 
we expect this will produce only minor changes in the results. 
Finally, the sign parameter r is not specified. In order to properly piece together 
our local models, we must choose r at each hyperpixel in such a way that β changes 
direction slowly as we move from hyperpixel to hyperpixel and encounter changes 
in the local image statistics. That is, large direction changes due to arbitrary sign 
reversals are not allowed. We use a simple heuristic to accomplish this. 
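The paper does not spell the heuristic out; one plausible sketch flips r whenever the newly estimated β points away from a neighbouring region's β, so that sign choices propagate smoothly:

```python
import numpy as np

def fix_sign(beta, beta_neighbor):
    """Hypothetical sign heuristic (my reading, not the paper's exact rule):
    choose r = -1 if beta is anti-aligned with an already-fixed neighbour,
    so that beta changes direction slowly across hyperpixels."""
    return -beta if float(beta @ beta_neighbor) < 0 else beta
```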
5 Relation to PCA 
The MAP and ML fusion rules are closely related to PCA. To see this, assume that 
the noise is homoscedastic, Σ_ε = σ_ε² I, and use the parameter estimates (13) in the 
MAP fusion rule (5), reducing the latter to 

ŝ = 1/(1 + σ_ε²/σ_s²) U_a^T (a − μ_a) + 1/(1 + σ_s²/σ_ε²) s₀ ,   (14) 

where U_a is the principal eigenvector of the data covariance matrix Σ_a. The MAP 
estimate ŝ is simply a scaled and shifted local PCA projection of the sensor data. 
Both the scaling and shift arise because the prior distribution on s tends to bias ŝ 
towards s₀. When the prior is flat, σ_s² → ∞ (or equivalently when using the ML 
fusion estimate), or when the noise variance vanishes, the fused image is given by 
the simple local PCA projection 

ŝ = U_a^T (a − μ_a) . 
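The PCA limit can be checked numerically; this sketch applies the projection to a local block of data, with the sensor images flattened into a q × N matrix (one row per sensor, one column per hyperpixel).

```python
import numpy as np

def pca_fuse(A):
    """ML fusion under the ||beta|| = 1, s0 = 0 constraints with
    homoscedastic noise: project each hyperpixel's sensor vector onto the
    principal eigenvector of the local data covariance. A has shape (q, N)."""
    mu = A.mean(axis=1, keepdims=True)
    vals, vecs = np.linalg.eigh(np.cov(A))   # eigh: ascending eigenvalues
    Ua = vecs[:, -1]                         # principal eigenvector of Sigma_a
    return Ua @ (A - mu)                     # fused values, shape (N,)
```

When the sensors agree perfectly (identical rows), the projection reproduces the common scene up to centering and an overall scale and sign, as Eq. (14) predicts.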
6 Experiments and Results 
We applied our fusion method to visible-band and IR runway images, Fig. 1, con- 
taining additive Gaussian noise. Fig. 1(e) shows the result of ML fusion with β 
and α estimated using (13). ML fusion performs better than either averaging or 
selection in regions that contain local polarity reversals or complementary features. 
ML fusion gives higher weight to IR in regions where the features in the two im- 
ages are common, thus reducing the effects of noise in the visible-band image. ML 
fusion gives higher weight to the appropriate sensor in regions with complementary 
features. 
Fig. 1(f) shows the result of MAP fusion (5) with the priors σ_s² and s₀ those dictated 
by the consistency requirements discussed in section 4. Clearly, the MAP image is 
less noisy than the ML image. In regions of low sensor image contrast, σ_s² is low 
(since λ is low), thus the contribution from the sensor images is attenuated compared 
to the ML fusion rule. Hence the noise is attenuated. In regions containing features 
such as edges, σ_s² is high (since λ is high); hence the contribution from the sensor 
images is similar to that in ML fusion. 
[Figure 1 panels: (a) Visible-band image, (b) IR image, (c) Averaging, (d) Selection, (e) ML, (f) MAP] 
Figure 1: Fusion of visible-band and IR images containing additive Gaussian noise 
In Fig. 2 we demonstrate the use of a database image for fusion. Fig. 2(a) and 2(b) 
are simulated noisy sensor images from visible-band and IR, that depict a runway 
with an aircraft on it. Fig. 2(c) is an image of the same scene as might be obtained 
from a terrain database. Although this image is clean, it does not show the actual 
situation on the runway. One can use the database image pixel intensities as the 
prior mean s₀ in the MAP fusion rule (5). The prior variance σ_s² in (5) can be 
regarded as a measure of confidence in the database image; its value controls the 
relative contribution of the sensors vs. the database image in the fused image. (The 
parameters β and α, and the sensor noise covariance Σ_ε were estimated exactly 
as before.) Fig. 2(d), 2(e) and 2(f) show the MAP-fused image as a function of 
increasing σ_s². Higher values of σ_s² accentuate the contribution of the sensor images, 
whereas lower values of σ_s² accentuate the contribution of the database. 
7 Discussion 
We presented a model-based probabilistic framework for fusion of images from multi- 
ple sensors and exercised the approach on visible-band and IR images. The approach 
provides both a rigorous framework for PCA-like fusion rules, and a principled way 
to combine information from a terrain database with sensor images. 

We envision several refinements of the approach given here. Writing new image 
formation models at each hyperpixel produces an overabundance of models. Early 
experiments show that this can be relaxed by using the same model parameters over 
regions of several square hyperpixels, rather than recalculating for each hyperpixel. 
A further refinement could be provided by adopting a mixture of linear models to 
build up the non-linear image formation model. Finally, we have used multiple 
frames from a video sequence to obtain ML and MAP fused sequences, and one 
should be able to produce superior parameter estimates by suitable use of the video 
sequence. 
[Figure 2 panels: (a) Visible-band image, (b) IR image, (c) Database image, (d) MAP with σ_s² small, (e) MAP with σ_s² larger, (f) MAP with σ_s² larger still] 
Figure 2: Fusion of simulated visible-band and IR images using database image 
Acknowledgments - This work was supported by NASA Ames Research Center 
grant NCC2-811. TKL was partially supported by NSF grant ECS-9704094. 
References 
[1] L. A. Klein. Sensor and Data Fusion Concepts and Applications. SPIE, 1993. 
[2] J. R. Kerr, D. P. Pond, and S. Inman. Infrared-optical multisensor for autonomous 
landing guidance. Proceedings of SPIE, 2463:38-45, 1995. 
[3] B. Roberts and P. Symosek. Image processing for flight crew situation awareness. 
Proceedings of SPIE, 2220:246-255, 1994. 
[4] M. Pavel and R. K. Sharma. Model-based sensor fusion for aviation. In J. G. Verly, 
editor, Enhanced and Synthetic Vision 1997, volume 3088, pages 169-176. SPIE, 1997. 
[5] P. J. Burt and R. J. Kolczynski. Enhanced image capture through fusion. In Fourth 
Int. Conf. on Computer Vision, pages 173-182. IEEE Comp. Soc., 1993. 
[6] H. Li and Y. Zhou. Automatic visual/IR image registration. Optical Engineering, 
35(2):391-400, 1996. 
[7] M. Pavel, J. Larimer, and A. Ahumada. Sensor fusion for synthetic vision. In Pro- 
ceedings of the Society for Information Display, pages 475-478. SPIE, 1992. 
[8] P. Burt. A gradient pyramid basis for pattern-selective image fusion. In Proceedings 
of the Society for Information Display, pages 467-470. SPIE, 1992. 
[9] A. Toet. Hierarchical image fusion. Machine Vision and Applications, 3:1-11, 1990. 
[10] J. J. Clark and A. L. Yuille. Data Fusion for Sensory Information Processing Systems. 
Kluwer, Boston, 1990. 
[11] A. Basilevsky. Statistical Factor Analysis and Related Methods. Wiley, 1994. 
[12] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Tech- 
nical report, NCRG/97/010, Neural Computing Research Group, Aston University, 
UK, 1997. 
