Dual Estimation and the Unscented 
Transformation 
Eric A. Wan
ericwan@ece.ogi.edu
Rudolph van der Merwe
rudmerwe@ece.ogi.edu
Alex T. Nelson
atnelson@ece.ogi.edu
Oregon Graduate Institute of Science & Technology 
Department of Electrical and Computer Engineering 
20000 N.W. Walker Rd., Beaverton, Oregon 97006 
Abstract 
Dual estimation refers to the problem of simultaneously estimating the 
state of a dynamic system and the model which gives rise to the dynam- 
ics. Algorithms include expectation-maximization (EM), dual Kalman 
filtering, and joint Kalman methods. These methods have recently been 
explored in the context of nonlinear modeling, where a neural network 
is used as the functional form of the unknown model. Typically, an ex- 
tended Kalman filter (EKF) or smoother is used for the part of the al- 
gorithm that estimates the clean state given the current estimated model. 
An EKF may also be used to estimate the weights of the network. This 
paper points out the flaws in using the EKF, and proposes an improve- 
ment based on a new approach called the unscented transformation (UT) 
[3]. A substantial performance gain is achieved with the same order of 
computational complexity as that of the standard EKF. The approach is 
illustrated on several dual estimation methods. 
1 Introduction 
We consider the problem of learning both the hidden states x and parameters w of a 
discrete-time nonlinear dynamic system, 
$x_{k+1} = F(x_k, v_k, w)$ (1)
$y_k = H(x_k, n_k, w)$ (2)
where y is the only observed signal. The process noise v drives the dynamic system, and 
the observation noise is given by n. Note that we are not assuming additivity of the noise 
sources. 
A number of approaches have been proposed for this problem. The dual EKF algorithm 
uses two separate EKFs: one for signal estimation, and one for model estimation. The states 
are estimated given the current weights and the weights are estimated given the current 
states. In the joint EKF, the state and model parameters are concatenated within a combined 
state vector, and a single EKF is used to estimate both quantities simultaneously. The 
EM algorithm uses an extended Kalman smoother for the E-step, in which forward and 
backward passes are made through the data to estimate the signal. The model is updated 
during a separate M-step. 
For a more thorough treatment and a theoretical basis on how these algorithms relate, see
Nelson [6]. Rather than provide a comprehensive comparison between the different
algorithms, the goal of this paper is to point out the assumptions and flaws in the EKF
(Section 2), and offer an improvement based on the unscented transformation/filter (Section 3).
The unscented filter has recently been proposed as a substitute for the EKF in nonlinear 
control problems (known dynamic model) [3]. This paper presents new research on the use 
of the UF within the dual estimation framework for both state and weight estimation. In 
the case of weight estimation, the UF represents a new efficient "second-order" method for 
training neural networks in general. 
2 Flaws in the EKF 
Assume for now that we know the model (weight parameters) for the dynamic system in 
Equations 1 and 2. Given the noisy observation $y_k$, a recursive estimate for $x_k$ can be
expressed in the form,
$\hat{x}_k = (\text{optimal prediction of } x_k) + G_k \, [\, y_k - (\text{optimal prediction of } y_k) \,]$ (3)
This recursion provides the optimal MMSE estimate for $x_k$ assuming the prior estimate
of $x_k$ and the current observation $y_k$ are Gaussian. We need not assume linearity of the model. The
optimal terms in this recursion are given by
$\hat{x}_k^- = E[F(\hat{x}_{k-1}, v_{k-1})]$, $G_k = P_{x_k y_k} P_{\tilde{y}_k \tilde{y}_k}^{-1}$, $\hat{y}_k^- = E[H(\hat{x}_k^-, n_k)]$, (4)
where the optimal prediction $\hat{x}_k^-$ is the expectation of a nonlinear function of the random
variables $\hat{x}_{k-1}$ and $v_{k-1}$ (similar interpretation for the optimal prediction of $y_k$). The
optimal gain term is expressed as a function of posterior covariance matrices (with
$\tilde{y}_k = y_k - \hat{y}_k^-$). Note these terms also require taking expectations of a nonlinear function of the
prior state estimates.
The Kalman filter calculates these quantities exactly in the linear case. For nonlinear
models, however, the extended KF approximates these as:
$\hat{x}_k^- \approx F(\hat{x}_{k-1}, \bar{v})$, $G_k \approx \hat{P}_{x_k y_k} \hat{P}_{\tilde{y}_k \tilde{y}_k}^{-1}$, $\hat{y}_k^- \approx H(\hat{x}_k^-, \bar{n})$, (5)
where predictions are approximated as simply the function of the prior mean value for
estimates (no expectation taken). The covariances are determined by linearizing the dynamic
equations ($x_{k+1} \approx A x_k + B v_k$, $y_k \approx C x_k + D n_k$), and then determining the posterior
covariance matrices analytically for the linear system. As such, the EKF can be viewed
as providing "first-order" approximations to the optimal terms (in the sense that
expressions are approximated using a first-order Taylor series expansion of the nonlinear terms
around the mean values). While "second-order" versions of the EKF exist, their increased
implementation and computational complexity tend to prohibit their use.
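The first-order approximations of Equation 5 can be made concrete with a short sketch. This is an illustration only, not the authors' implementation: it assumes additive noise with covariances `Q` and `R` (the formulation above does not require additivity), and the names `numerical_jacobian` and `ekf_step` are hypothetical. Jacobians are taken by finite differences purely to keep the example self-contained.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Central finite-difference Jacobian of func at x (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    fx = np.atleast_1d(func(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.atleast_1d(func(x + dx)) - np.atleast_1d(func(x - dx))) / (2 * eps)
    return J

def ekf_step(x_prev, P_prev, y, f, h, Q, R):
    """One EKF step, assuming x_k = f(x_{k-1}) + v and y_k = h(x_k) + n."""
    # Prediction: the function is evaluated at the prior mean (no expectation taken).
    A = numerical_jacobian(f, x_prev)
    x_pred = np.atleast_1d(f(x_prev))
    P_pred = A @ P_prev @ A.T + Q
    # Covariances follow analytically from the linearized model.
    C = numerical_jacobian(h, x_pred)
    y_pred = np.atleast_1d(h(x_pred))
    P_yy = C @ P_pred @ C.T + R
    P_xy = P_pred @ C.T
    G = P_xy @ np.linalg.inv(P_yy)
    # Recursion of Equation 3 with the approximated terms.
    x_new = x_pred + G @ (np.atleast_1d(y) - y_pred)
    P_new = P_pred - G @ P_yy @ G.T
    return x_new, P_new
```

For a linear model the Jacobians are exact and the step reduces to the ordinary Kalman filter; the approximation error enters only when `f` or `h` is nonlinear.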
3 The Unscented Transformation/Filter 
The unscented transformation (UT) is a method for calculating the statistics of a random 
variable which undergoes a nonlinear transformation [3]. Consider propagating a random 
variable c (dimension L) through a nonlinear function,  = g(c). Assume c has mean & 
and covariance P,. To calculate the statistics of , we form a matrix X of 2L + 1 sigma 
vectors A'i, where the first vector (A'o) corresponds to &, and the rest are computed from 
the mean (+)plus and (-)minus each column of the matrix square-root of P,. These sigma 
668 E. A. Wan, R. v. d. Merwe and A. T. Nelson 
vectors are propagated through the nonlinear function, and the mean and covariance for 1 
are approximated using a weighted sample mean and covariance, 
1 tg('o) + g(rt'i) , (6) 
t, 
1 n[g(&) - 3][g(&) - 3] T +  [g(Xi) - 3][g(Xi) - B] T (7) 
PZ m L + i= 
where n is a scaling hcton Note that this method differs substantially from general "sta- 
pling" methods (e.g., Monte-C1o methods and pticle filters [1]) which requffe orders 
of magnitude more staple poin in an attempt to propagate an accurate (possibly non- 
Gaussian) distribution of the state. The UT approximations e accuram to the third order 
for Gaussian inputs for all nonlineities. For non-Gaussian inputs, approximations e 
accurate to at least the second-order, with the accuracy detemined by the choice of n [3]. 
A simple example is shown in Figure 1 for a 2-dimensional system: the left plots show
the true mean and covariance propagation using Monte-Carlo sampling; the center plots
show the performance of the UT (note only 5 sigma points are required); the right plots
show the results using a linearization approach as would be done in the EKF. The superior
performance of the UT is clear.
[Figure 1 panels: Actual (sampling), UT, Linearized (EKF). Annotations: sigma points, transformed sigma points, true mean and covariance, UT mean and covariance, linearized covariance $A^T P_x A$.]
Figure 1: Example of the UT for mean and covariance propagation. a) actual, b) UT, c) first-order linear (EKF).
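The weighted sample statistics of Equations 6 and 7 can be sketched in a few lines. The function name `unscented_transform` is hypothetical, and the sketch assumes the sigma points are built from the Cholesky factor of $(L + \kappa) P_\alpha$, which is one common choice of matrix square-root and requires $L + \kappa > 0$.

```python
import numpy as np

def unscented_transform(g, mean, cov, kappa=1.0):
    """Approximate the mean and covariance of beta = g(alpha) from 2L+1 sigma points."""
    mean = np.asarray(mean, dtype=float)
    L = mean.size
    # Columns of S are the scaled square-root directions of the covariance.
    S = np.linalg.cholesky((L + kappa) * np.asarray(cov, dtype=float))
    # Rows: the mean, then the mean plus and minus each column of S.
    sigma = np.vstack([mean, mean + S.T, mean - S.T])
    weights = np.full(2 * L + 1, 0.5 / (L + kappa))
    weights[0] = kappa / (L + kappa)
    # Propagate every sigma vector through the nonlinearity.
    propagated = np.array([np.atleast_1d(g(s)) for s in sigma])
    beta_mean = weights @ propagated                  # Equation 6
    dev = propagated - beta_mean
    beta_cov = (weights[:, None] * dev).T @ dev       # Equation 7
    return beta_mean, beta_cov
```

A quick sanity check is that an affine map `g(a) = A @ a + b` is handled exactly: the returned mean is `A @ mean + b` and the returned covariance is `A @ cov @ A.T`, for any admissible `kappa`.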
The unscented filter (UF) [3] is a straightforward extension of the UT to the recursive
estimation in Equation 3, where we set $\alpha = \hat{x}_k$, and denote the corresponding sigma matrix
as $\mathcal{X}(k|k)$. The UF equations are given below. It is interesting to note that no
explicit calculation of Jacobians or Hessians is necessary to implement this algorithm.
The total number of computations is only order $L^2$ as compared to $L^3$ for the EKF (see footnote 1).
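One recursion of the unscented filter can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: noise is taken to be additive with covariances `Q` and `R` (so the noise terms are added to the covariances instead of being passed through the model), sigma points are redrawn after the time update so that `Q` is reflected in the measurement prediction, and the names `sigma_points`, `weighted_cov`, and `uf_step` are hypothetical.

```python
import numpy as np

def sigma_points(mean, cov, kappa):
    """Return the 2L+1 sigma points (as rows) and the weights W_0 ... W_2L."""
    mean = np.asarray(mean, dtype=float)
    L = mean.size
    S = np.linalg.cholesky((L + kappa) * np.asarray(cov, dtype=float))
    points = np.vstack([mean, mean + S.T, mean - S.T])
    weights = np.full(2 * L + 1, 0.5 / (L + kappa))
    weights[0] = kappa / (L + kappa)
    return points, weights

def weighted_cov(W, dA, dB):
    """Sum over i of W_i * dA_i dB_i^T for row-stacked deviations."""
    return (W[:, None] * dA).T @ dB

def uf_step(x_prev, P_prev, y, f, h, Q, R, kappa=1.0):
    """One UF step, assuming x_k = f(x_{k-1}) + v and y_k = h(x_k) + n."""
    # Time update: propagate sigma points through the dynamics.
    X, W = sigma_points(x_prev, P_prev, kappa)
    X_pred = np.array([np.atleast_1d(f(s)) for s in X])
    x_pred = W @ X_pred
    dX = X_pred - x_pred
    P_pred = weighted_cov(W, dX, dX) + Q
    # Redraw sigma points from the predicted statistics (additive-noise assumption).
    X2, W = sigma_points(x_pred, P_pred, kappa)
    Y_pred = np.array([np.atleast_1d(h(s)) for s in X2])
    y_pred = W @ Y_pred
    dX2 = X2 - x_pred
    dY = Y_pred - y_pred
    P_yy = weighted_cov(W, dY, dY) + R
    P_xy = weighted_cov(W, dX2, dY)
    # Measurement update: the recursion of Equation 3.
    G = P_xy @ np.linalg.inv(P_yy)
    x_new = x_pred + G @ (np.atleast_1d(y) - y_pred)
    P_new = P_pred - G @ P_yy @ G.T
    return x_new, P_new
```

No Jacobian appears anywhere in the step; only evaluations of `f` and `h` are needed. For linear `f` and `h` the result coincides with the Kalman filter, which makes a convenient correctness check.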
Footnote 1: Note that a matrix square-root using the Cholesky factorization is of order $L^3/6$. However, the
covariance matrices are expressed recursively, and thus the square-root can be computed in only order
$L^2$ by performing a recursive update to the Cholesky factorization.

UF Equations
$W_0 = \kappa/(L+\kappa)$, $W_1 \ldots W_{2L} = 1/(2(L+\kappa))$
$\mathcal{X}(k|k-1) = F[\mathcal{X}(k-1|k-1), P_{vv}^{1/2}]$
$\hat{x}_k^- = \sum_{i=0}^{2L} W_i \, \mathcal{X}_i(k|k-1)$
$P_k^- = \sum_{i=0}^{2L} W_i \, [\mathcal{X}_i(k|k-1) - \hat{x}_k^-][\mathcal{X}_i(k|k-1) - \hat{x}_k^-]^T$
$\mathcal{Y}(k|k-1) = H[\mathcal{X}(k|k-1), P_{nn}^{1/2}]$
$\hat{y}_k^- = \sum_{i=0}^{2L} W_i \, \mathcal{Y}_i(k|k-1)$
$P_{\tilde{y}_k \tilde{y}_k} = \sum_{i=0}^{2L} W_i \, [\mathcal{Y}_i(k|k-1) - \hat{y}_k^-][\mathcal{Y}_i(k|k-1) - \hat{y}_k^-]^T$
$P_{x_k y_k} = \sum_{i=0}^{2L} W_i \, [\mathcal{X}_i(k|k-1) - \hat{x}_k^-][\mathcal{Y}_i(k|k-1) - \hat{y}_k^-]^T$
$\hat{x}_k = \hat{x}_k^- + P_{x_k y_k} P_{\tilde{y}_k \tilde{y}_k}^{-1} (y_k - \hat{y}_k^-)$
$P_k = P_k^- - P_{x_k y_k} (P_{\tilde{y}_k \tilde{y}_k}^{-1})^T P_{x_k y_k}^T$

4 Application to Dual Estimation
This section shows the use of the UF within several dual estimation approaches. As an
application domain for comparison, we consider modeling a noisy time-series as a nonlinear
autoregression:
$x_k = f(x_{k-1}, \ldots, x_{k-M}, w) + v_{k-1}$
$y_k = x_k + n_k, \quad \forall k \in \{1 \ldots N\}$ (8)
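The underlying clean signal $x_k$ is a nonlinear function of its past $M$ values, driven by
white Gaussian process noise $v_k$ with variance $\sigma_v^2$. The observed data point $y_k$ includes the
additive noise $n_k$, which is assumed to be Gaussian with variance $\sigma_n^2$. The corresponding
state-space representation for the signal $x_k$ is given by:
$\underbrace{\begin{bmatrix} x_k \\ x_{k-1} \\ \vdots \\ x_{k-M+1} \end{bmatrix}}_{\mathbf{x}_k} = \underbrace{\begin{bmatrix} f(x_{k-1}, \ldots, x_{k-M}, w) \\ \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & \ddots & & \vdots & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x_{k-1} \\ \vdots \\ x_{k-M} \end{bmatrix} \end{bmatrix}}_{F(\mathbf{x}_{k-1}, w)} + \underbrace{\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}}_{B} v_{k-1}$ (9)
$y_k = [1 \;\; 0 \;\; \cdots \;\; 0] \cdot \mathbf{x}_k + n_k$ (10)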
In this context, the dual estimation problem consists of simultaneously estimating the clean 
signal xk and the model parameters w from the noisy data y. 
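The shift-register structure of the state-space form in Equations 9 and 10 can be sketched directly. The helper names `ar_state_transition` and `simulate_noisy_ar` are hypothetical, and `f` stands for any autoregressive model taking the lagged state vector and the weights; the zero initial state is an arbitrary choice for illustration.

```python
import numpy as np

def ar_state_transition(x_prev, f, w):
    """F(x_{k-1}, w): the new sample goes on top, older samples shift down."""
    x_prev = np.asarray(x_prev, dtype=float)
    x_next = np.empty_like(x_prev)
    x_next[0] = f(x_prev, w)     # f(x_{k-1}, ..., x_{k-M}, w)
    x_next[1:] = x_prev[:-1]     # delay line: drop the oldest sample
    return x_next

def simulate_noisy_ar(f, w, M, N, sigma_v, sigma_n, seed=0):
    """Generate N samples of the clean signal and its noisy observation."""
    rng = np.random.default_rng(seed)
    B = np.zeros(M)
    B[0] = 1.0                   # process noise enters the newest sample only
    C = B.copy()                 # observation row [1 0 ... 0]
    x = np.zeros(M)
    clean = np.empty(N)
    noisy = np.empty(N)
    for k in range(N):
        x = ar_state_transition(x, f, w) + B * rng.normal(0.0, sigma_v)
        clean[k] = C @ x
        noisy[k] = clean[k] + rng.normal(0.0, sigma_n)
    return clean, noisy
```

For example, `simulate_noisy_ar(lambda x, w: np.tanh(w @ x), w, M=5, N=1000, sigma_v=0.1, sigma_n=0.3)` produces a clean series and a noisy observation of it for a single-unit tanh model with a weight vector `w` of length 5.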
4.1 Dual EKF / Dual UF 
One dual estimation approach is the dual extended Kalman filter developed in [8, 6]. The
dual EKF requires separate state-space representations for the signal and the weights. A
state-space representation for the weights is generated by considering them to be a
stationary process with an identity state transition matrix, driven by process noise $u_k$:
$w_k = w_{k-1} + u_k$ (11)
$y_k = f(\mathbf{x}_{k-1}, w_k) + v_{k-1} + n_k$. (12)
The noisy measurement y has been rewritten as an observation on w. This allows the use 
of an EKF for weight estimation (representing a "second-order" optimization procedure) 
[7]. Two EKFs can now be run simultaneously for signal and weight estimation. At every 
time-step, the current estimate of the weights is used in the signal-filter, and the current 
estimate of the signal-state is used in the weight-filter. 
The dual UF/EKF algorithm is formed by simply replacing the EKF for state-estimation 
with the UF while still using an EKF for weight-estimation. In the dual UF algorithm both 
state- and weight-estimation are done with the UF. Note that the state-transition is linear in 
the weight filter, so the nonlinearity is restricted to the measurement equation. Here, the 
UF gives a more exact measurement-update phase of estimation. The use of the UF for 
weight estimation in general is discussed in further detail in Section 5. 
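The EKF form of the weight filter defined by Equations 11 and 12 can be sketched for a scalar observation. The names `weight_ekf_step` and `grad_f` are hypothetical, and the sketch assumes the combined observation noise $v + n$ has variance `sigma_e2` (for independent noises, $\sigma_v^2 + \sigma_n^2$). In the dual UF, this linearized measurement update would be replaced by a UT over sigma points drawn around the weight estimate.

```python
import numpy as np

def weight_ekf_step(w, P_w, y, x_prev_hat, f, grad_f, R_u, sigma_e2):
    """One EKF weight update given the current signal-state estimate x_prev_hat."""
    w = np.asarray(w, dtype=float)
    # Time update (Equation 11): identity transition, so only the covariance grows.
    P_pred = P_w + R_u
    # Measurement update (Equation 12), linearized around the current weights.
    h = np.asarray(grad_f(x_prev_hat, w), dtype=float)   # d f / d w, shape (n_w,)
    y_pred = f(x_prev_hat, w)
    s = h @ P_pred @ h + sigma_e2                        # innovation variance
    K = P_pred @ h / s                                   # gain, shape (n_w,)
    w_new = w + K * (y - y_pred)
    P_new = P_pred - np.outer(K, h @ P_pred)
    return w_new, P_new
```

In a dual loop, this step would alternate at every time index with the signal filter, each one consuming the other's latest estimate. For a model that is linear in the weights and with `R_u = 0`, the update reduces to recursive least squares.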
4.2 Joint EKF / Joint UF 
An alternative approach to dual estimation is provided by the joint extended Kalman filter
[4, 5]. In this framework the signal-state and weight vector are concatenated into a single,
joint state vector: $z_k = [\mathbf{x}_k^T \;\; w_k^T]^T$. The estimation of $z_k$ can be done recursively by writing
the state-space equations for the joint state as:
$z_k = \begin{bmatrix} F(\mathbf{x}_{k-1}, w_{k-1}) \\ I \cdot w_{k-1} \end{bmatrix} + \begin{bmatrix} B \cdot v_{k-1} \\ u_k \end{bmatrix}$ and $y_k = [1 \;\; 0 \;\; \cdots \;\; 0] \, z_k + n_k$, (13)
and running an EKF on the joint state-space to produce simultaneous estimates of the states 
x and w. As discussed in [6], the joint EKF provides approximate MAP estimates by 
maximizing the joint density of the signal and weights given the noisy data. Again, our ap- 
proach in this paper is to use the UF instead of the EKF to provide more accurate estimation 
of the state, resulting in the joint UF algorithm. 
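The joint state-space of Equation 13 amounts to stacking the signal state and the weights and letting the weights pass through unchanged. A minimal sketch, with hypothetical helper names and with `F` standing for any signal transition function of the form used in Equation 9:

```python
import numpy as np

def joint_transition(z_prev, F, n_x):
    """Noise-free part of Equation 13: [F(x, w); w] for z = [x; w]."""
    x_prev, w_prev = z_prev[:n_x], z_prev[n_x:]
    return np.concatenate([F(x_prev, w_prev), w_prev])

def joint_process_noise_cov(sigma_v2, R_u, n_x):
    """Covariance of [B v; u], with B = [1 0 ... 0]^T and v, u independent."""
    n_w = R_u.shape[0]
    Q = np.zeros((n_x + n_w, n_x + n_w))
    Q[0, 0] = sigma_v2            # v drives only the newest signal sample
    Q[n_x:, n_x:] = R_u           # u drives the weights
    return Q

def joint_observation(z):
    """Noise-free observation [1 0 ... 0] z: the newest signal sample."""
    return z[:1]
```

These three pieces are all that a generic filter step (EKF or UF) needs in order to run on the joint state; the independence of `v` and `u` in `joint_process_noise_cov` is an assumption of the sketch.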
4.3 EM - Unscented Smoothing 
A somewhat different iterative approach to dual estimation is given by the expectation- 
maximization (EM) algorithm applied to nonlinear dynamic systems [2]. In each iteration, 
the conditional expectation of the signal is computed, given the data and the current es- 
timate of the model (E-step). Then the model is found that maximizes a function of this 
conditional mean (M-step). For linear models, the M-step can be solved in closed form. 
The E-step is computed with a Kalman smoother, which combines the forward-time
estimated mean and covariance $(\hat{\mathbf{x}}_k^f, P_k^f)$ of the signal given past data, with the backward-time
predicted mean and covariance $(\hat{\mathbf{x}}_k^b, P_k^b)$ given the future data, producing the following
smoothed statistics given all the data:
$(P_k^s)^{-1} = (P_k^f)^{-1} + (P_k^b)^{-1}$ (14)
$\hat{\mathbf{x}}_k^s = P_k^s \left[ (P_k^b)^{-1} \hat{\mathbf{x}}_k^b + (P_k^f)^{-1} \hat{\mathbf{x}}_k^f \right]$. (15)
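The combination step of Equations 14 and 15 is an inverse-covariance weighted average of the forward and backward estimates. A minimal sketch (the function name is hypothetical, and explicit inverses are used for readability rather than numerical robustness):

```python
import numpy as np

def combine_forward_backward(x_f, P_f, x_b, P_b):
    """Fuse forward and backward estimates into smoothed statistics."""
    P_f_inv = np.linalg.inv(P_f)
    P_b_inv = np.linalg.inv(P_b)
    P_s = np.linalg.inv(P_f_inv + P_b_inv)         # Equation 14
    x_s = P_s @ (P_b_inv @ x_b + P_f_inv @ x_f)    # Equation 15
    return x_s, P_s
```

With equal covariances the smoothed mean is the midpoint of the two estimates and the smoothed covariance is halved, which is why the smoothed error can be lower than both the forward and backward errors.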
When an MLP neural network model is used, the M-step can no longer be computed in
closed-form, and a gradient-based approach is used instead. The resulting algorithm is
usually referred to as generalized EM (GEM) (see footnote 2). The E-step is typically approximated by
an extended Kalman smoother, wherein a linearization of the model is used for backward 
propagation of the state estimates. 
We propose improving the E-step of the EM algorithm for nonlinear models by using a
UF instead of an EKF to compute both the forward and backward passes in the Kalman
smoother. Rather than linearize the model for the backward pass, as in [2], a neural network
is trained on the backward dynamics (as well as the forward dynamics). This allows for 
a more exact backward estimation phase using the UF, and enables the development of an 
unscented smoother (US). 
Footnote 2: An exact M-step is possible using RBF networks [2].
4.4 Experiments 
We present results on two simple time-series to provide a clear illustration of the use of 
the UF over the EKF. The first series is the Mackey-Glass chaotic series with additive
WGN (SNR $\approx$ 3 dB). The second time series (also chaotic) comes from an autoregressive
neural network with random weights driven by Gaussian process noise and also corrupted
by additive WGN (SNR $\approx$ 3 dB). A standard 5-3-1 MLP with tanh hidden activation
functions and a linear output layer was used in all the filters. The process and measurement 
noise variances were assumed to be known. 
Results on training and testing data, as well as training curves for the different dual
estimation methods are shown below. The quoted numbers are mean-square estimation and
prediction errors, normalized by the clean signal variance. The superior performance of the UT based
algorithms (especially the dual UF) is clear. Note also the more stable learning curves
using the UF approaches. These improvements have been found to be consistent and
statistically significant on a number of additional experiments.
Mackey-Glass
                    Train             Test
Algorithm        Est.    Pred.    Est.    Pred.
Dual EKF         0.20    0.50     0.21    0.54
Dual UF/EKF      0.19    0.50     0.19    0.53
Dual UF          0.15    0.45     0.14    0.48
Joint EKF        0.22    0.53     0.22    0.56
Joint UF         0.19    0.50     0.18    0.53

[Figure: Mackey-Glass learning curves versus iteration for Dual EKF, Dual UF/EKF, Dual UF, Joint UF, and Joint EKF.]

Chaotic AR-NN
                    Train             Test
Algorithm        Est.    Pred.    Est.    Pred.
Dual EKF         0.32    0.62     0.36    0.69
Dual UF/EKF      0.26    0.58     0.28    0.69
Dual UF          0.23    0.55     0.27    0.63
Joint EKF        0.29    0.58     0.34    0.72
Joint UF         0.25    0.55     0.30    0.67

[Figure: Chaotic AR-NN learning curves versus iteration; recovered legend entries: Dual EKF, Dual UF/EKF, Dual UF, Joint EKF.]
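The normalized errors quoted in the tables above are mean-square errors divided by the variance of the clean signal. A minimal sketch of that measure follows; the function name is hypothetical, and the use of the population variance (NumPy's default) is an assumption, since the normalization is not spelled out in more detail.

```python
import numpy as np

def normalized_mse(clean, estimate):
    """Mean-square error of the estimate, divided by the clean signal variance."""
    clean = np.asarray(clean, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return np.mean((clean - estimate) ** 2) / np.var(clean)
```

Under this definition a value of 1.0 corresponds to doing no better than predicting the mean of the clean signal, and 0.0 corresponds to perfect recovery.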
The final table below compares smoother performance used for the E-step in the EM
algorithm. In this case, the network models are trained on the clean time-series, and then tested
on the noisy data using the standard Kalman smoother with linearized backward
model (EKS1), a Kalman smoother with a second nonlinear backward model (EKS2), or
the unscented smoother (US). The forward (F), backward (B), and smoothed (S) estimation
errors are reported. Again the performance benefits of the unscented approach are clear.
Mackey-Glass (Norm. MSE)
Algorithm    F       B       S
EKS1         0.20    0.70    0.27
EKS2         0.20    0.31    0.19
US           0.10    0.24    0.08

Chaotic AR-NN (Norm. MSE)
Algorithm    F       B       S
EKS1         0.35    0.32    0.28
EKS2         0.35    0.22    0.23
US           0.23    0.21    0.16
5 UF Neural Network Training 
As part of the dual UF algorithm, we introduced the use of the UF for weight estimation. 
The approach can also be seen as a new method for the general problem of training neural 
networks (i.e., for regression or classification problems where the input x is observed and 
no state-estimation is required). The advantage of the UF over the EKF in this case is not 
as obvious, as the state-transition function is linear (see Equation 11). However, as pointed
out earlier, the observation is nonlinear. Effectively, the EKF builds up an approximation 
to the expected Hessian by taking outer products of the gradient. The UF, however, may 
provide a more accurate estimate through direct approximation of the expectation of the 
Hessian. We have performed a number of preliminary experiments on standard benchmark 
data. The figure below shows the mean and std. of learning curves (computed over 100 
experiments with different initial weights) for the Mackay Robot Arm Mapping dataset. 
Note the faster convergence, lower variance, and lower final MSE performance of the UF 
weight training. While these results are encouraging, further study is still necessary to fully 
contrast differences between UF and EKF weight training. 
[Figure: Learning curves versus epoch on the Mackay Robot Arm Mapping dataset, showing the mean and std. for UF and EKF weight training.]
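The remark in this section that the EKF accumulates outer products of the gradient can be made explicit with a short derivation. The notation here is introduced only for illustration: a scalar output, no weight process noise, observation noise variance $\sigma^2$, and $h_j$ for the gradient of the network output with respect to the weights at sample $j$.

```latex
\begin{align*}
h_j &= \left. \frac{\partial f(x_j, w)}{\partial w} \right|_{w = \hat{w}_{j-1}}, \\
P_k^{-1} &= P_{k-1}^{-1} + \frac{1}{\sigma^2} h_k h_k^T
          = P_0^{-1} + \frac{1}{\sigma^2} \sum_{j=1}^{k} h_j h_j^T, \\
\nabla_w^2 \left[ \frac{1}{2\sigma^2} \sum_{j=1}^{k} \bigl( y_j - f(x_j, w) \bigr)^2 \right]
         &= \frac{1}{\sigma^2} \sum_{j=1}^{k} \Bigl[ \nabla_w f_j \, \nabla_w f_j^T
            - \bigl( y_j - f_j \bigr) \nabla_w^2 f_j \Bigr],
\end{align*}
```

where $f_j = f(x_j, w)$. The second line is the information form of the EKF covariance update; comparing it with the third shows that the inverse covariance retains only the outer-product (Gauss-Newton) term of the Hessian of the squared-error cost and drops the term involving second derivatives of the network.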
6 Conclusions 
The EKF has been widely accepted as a standard tool in the machine learning community. 
In this paper we have presented an alternative to the EKF using the unscented filter. The 
UF consistently achieves a better level of accuracy than the EKF at a comparable level 
of complexity. We demonstrated this performance gain on a number of dual estimation 
methods as well as standard regression modeling. 
Acknowledgements 
This work was sponsored in part by the NSF under grant IRI-9712346. 
References 
[1] J. F. G. de Freitas, M. Niranjan, A. H. Gee, and A. Doucet. Sequential Monte Carlo methods 
for optimisation of neural network models. Technical Report TR-328, Cambridge University 
Engineering Department, Cambridge, England, November 1998. 
[2] Z. Ghahramani and S. T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. 
In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing 
Systems 11: Proceedings of the 1998 Conference. MIT Press, 1999. 
[3] S. J. Julier and J. K. Uhlmann. A New Extension of the Kalman Filter to Nonlinear Systems. In
Proc. of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing,
Simulation and Controls, Orlando, Florida, 1997.
[4] R. E. Kopp and R. J. Orford. Linear regression applied to system identification for adaptive
control systems. AIAA J., 1:2300-06, October 1963.
[5] M. B. Matthews and G. S. Moschytz. Neural-network nonlinear adaptive filtering using the
extended Kalman filter algorithm. In INNC, pages 115-8, 1990.
[6] A. T. Nelson. Nonlinear Estimation and Modeling of Noisy Time-Series by Dual Kalman
Filtering Methods. PhD thesis, Oregon Graduate Institute, 1999. In preparation.
[7] S. Singhal and L. Wu. Training multilayer perceptrons with the extended Kalman filter. In
Advances in Neural Information Processing Systems 1, pages 133-140, San Mateo, CA, 1989.
Morgan Kaufmann.
[8] E. A. Wan and A. T. Nelson. Dual Kalman filtering methods for nonlinear prediction, estimation,
and smoothing. In Advances in Neural Information Processing Systems 9, 1997.
