Local Dimensionality Reduction 
Stefan Schaal 1,2,4 
sschaalusc.edu 
http://www-slab.usc.edu/sschaal 
Sethu Vijayakumar 3,1 
sethucs.titech.ac.jp 
http://ogawa- 
www.cs.titech.ac.jp/~sethu 
Christopher G. Atkeson 4 
cgacc.gatech.edu 
http://www.cc.gatech.edu/ 
fac/Chris.Atkeson 
1ERATO Kawato Dynamic Brain Project (JST), 2-2 Hikaxidai, Seika-cho, Soraku-gun, 619-02 Kyoto 
2Dept. of Comp. Science & Neuroscience, Univ. of South. California HNB-103, Los Angeles CA 90089-2520 
3Department of Computer Science, Tokyo Institute of Technology, Meguro-ku, Tokyo-152 
4College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30332-0280 
Abstract 
If globally high dimensional data has locally only low dimensional distribu- 
tions, it is advantageous to perform a local dimensionality reduction before 
further processing the data. In this paper we examine several techniques for 
local dimensionality reduction in the context of locally weighted linear re- 
gression. As possible candidates, we derive local versions of factor analysis 
regression, principle component regression, principle component regression 
on joint distributions, and partial least squares regression. After outlining the 
statistical bases of these methods, we perform Monte Carlo simulations to 
evaluate their robustness with respect to violations of their statistical as- 
sumptions. One surprising outcome is that locally weighted partial least 
squares regression offers the best average results, thus outperforming even 
factor analysis, the theoretically most appealing of our candidate techniques. 
1 TRODUCTION 
Regression tasks involve mapping a n-dimensional continuous input vector x  filn onto 
a m-dimensional output vector y  fil m. They form a ubiquitous class of problems found 
in fields including process control, sensorimotor control, coordinate transformations, and 
various stages of information processing in biological nervous systems. This paper will 
focus on spatially localized learning techniques, for example, kernel regression with 
Gaussian weighting functions. Local learning offer advantages for real-time incremental 
learning problems due to fast convergence, considerable robustness towards problems of 
negative interference, and large tolerance in model selection (Atkeson, Moore, & Schaal, 
1997; Schaal & Atkeson, in press). Local learning is usually based on interpolating data 
from a local neighborhood around the query point. For high dimensional learning prob- 
lems, however, it suffers from a bias/variance dilemma, caused by the nonintuitive fact 
that "... [in high dimensions] if neighborhoods are local, then they are almost surely 
empty, whereas if a neighborhood is not empty, then it is not local." (Scott, 1992, p. 198). 
Global learning methods, such as sigmoidal feedforward networks, do not face this 
634 S. SchaaI, S. Vijayakumar and C. G. Atkeson 
problem as they do not employ neighborhood relations, although they require strong 
prior knowledge about the problem at hand in order to be successful. 
Assuming that local learning in high dimensions is a hopeless, however, is not necessar- 
ily warranted: being globally high dimensional does not imply that data remains high di- 
mensional if viewed locally. For example, in the control of robot arms and biological 
arms we have shown that for estimating the inverse dynamics of an arm, a globally 21- 
dimensional space reduces on average to 4-6 dimensions locally (Vijayakumar & Schaal, 
1997). A local learning system that can robustly exploit such locally low dimensional 
distributions should be able to avoid the curse of dimensionality. 
In pursuit of the question of what, in the context of local regression, is the "fight" 
method to perform local dimensionality reduction, this paper will derive and compare 
several candidate techniques under i) perfectly fulfilled statistical prerequisites (e.g., 
Gaussian noise, Gaussian input distributions, perfectly linear data), and ii) less perfect 
conditions (e.g., non-Gaussian distributions, slightly quadratic data, incorrect guess of 
the dimensionality of the true data distribution). We will focus on nonlinear function ap- 
proximation with locally weighted linear regression (LWR), as it allows us to adapt a va- 
riety of global linear dimensionality reduction techniques, and as LWR has found wide- 
spread application in several local learning systems (Atkeson, Moore, & Schaal, 1997; 
Jordan & Jacobs, 1994; Xu, Jordan, & Hinton, 1996). In particular, we will derive and 
investigate locally weighted principal component regression (LWPCR), locally weighted 
joint data principal component analysis (LWPCA), locally weighted factor analysis 
(LWFA), and locally weighted partial least squares (LWPLS). Section 2 will briefly out- 
line these methods and their theoretical foundations, while Section 3 will empirically 
evaluate the robustness of these methods using synthetic data sets that increasingly vio- 
late some of the statistical assumptions of the techniques. 
2 METHODS OF DIMENSIONAI,ITY REDUCTION 
We assume that our regression data originate from a generating process with two sets of 
observables, the "inputs" i and the "outputs" . The characteristics of the process en- 
sure a functional relation  = f(i). Both i and  are obtained through some measure- 
ment device that adds independent mean zero noise of different magnitude in each ob- 
servable, such that x = i + c x and y = y + cy. For the sake of simplicity, we will only fo- 
cus on one-dimensional output data (rn=l) and functions f that are either linear or 
slightly quadratic, as these cases are the most common in nonlinear function approxima- 
tion with locally linear models. Locality of the regression is ensured by weighting the er- 
ror of each data point with a weight from a Gaussian kernel: 
x denotes the query point, and D a positive semi-definite distance metric which deter- 
rmnes the size and shape of the neighborhood contributing to the regression (Atkeson et 
al., 1997). The parameters xq and D can be determined in the framework of nonparamet- 
ric statistics (Schaal & Atkeson, in press) or parametric maximum likelihood estimations 
(Xu et al, 1995). for the present study they are determined manually since their origin is 
secondary to the results of this paper. Without loss of generality, all our data sets will set 
xq to the zero vector, compute the weights, and then translate the input data such that the 
lbcally weighted mean,  = E w, xi /E wi, is zero. The output data is equally translated to 
be mean zero. Mean zero data is necessary for most of techniques considered below. The 
(translated) input data is summarized in the rows of the matrix X, the corresponding 
(translated) outputs are the elements of the vector y, and the corresponding weights are in 
the diagonal matrix W. In some cases, we need the joint input and output data, denoted 
as Z=[X y]. 
Local DimensionaIity Reduction 635 
2.1 FACTOR ANALYSIS (LWFA) 
Factor analysis (Everitt, 1984) is a technique of dimensionality reduction which is the 
most appropriate given the generating process of our regression data. It assumes the ob- 
served data z was produced by a mean zero independently distributed k -dimensional 
vector of factors v, transformed by the matrix U, and contaminated by mean zero inde- 
pendent noise e with diagonal covariance matrix 
z=Uv+e, where z= xr,y and e= ex, (2) 
If both v and  are normally distributed, the parameters C and U can be obtained itera- 
tively by the Expectation-Maximization algorithm (EM) (Rubin & Thayer, 1982). For a 
linear regression problem, one assumes that z was generated with U=[I,/5 ]r and v = 
where/ denotes the vector of regression coefficients of the linear model y =/Yx, and I 
the identity matrix. After calculating C and U by EM in joint data space as formulated in 
(2), an estimate of / can be derived from the conditional probability P{yl x). As all 
distributions are assumed to be normal, the expected value ofy is the mean of this condi- 
tional distribution. The locally weighted version (LWFA) of/ can be obtained together 
with an estimate of the factors v from the joint weighted covariance matrix W of z and v: 
U r [2(=(m+k) xn) 22(=(m+k)x(m+k)).] 
(3) 
where E{.} denotes the expectation operator and B a matrix of coefficients involved in 
estimating the factors v. Note that unless the noise e is zero, the estimated/ is different 
from the true ]3 as it tries to average out the noise in the data. 
2.2 JOINT-SPACE PRINCIPAL COMPONENT ANALYSIS (LWPCA) 
An alternative way of determining the parameters/ in a reduced space employs locally 
weighted principal component analysis (LWPCA) in the joint data space. By defining the 
largest k+l principal components of the weighted covariance matrix of Z as U: 
U=[eigertvectors(Zwi(zi_Xzi_)T/Zwi)]max(l:k+l) (4) 
and noting that the eigenvectors in U are unit length, the matrix inversion theorem (Horn 
& Johnson, 1994) provides a means to derive an efficient estimate of/ 
[Ux(= n x k)] 
/ = U(Uy r- ( I)-' where [Uy(=mxk)J 
Uy T UyUy T - VyUyT 1 e-- 
In our one dimensional output case, Uy is just a (1 x k)-dimensional row vector and the 
evaluation of (5) does not require a matrix inversion anymore but rather a division. 
If one assumes normal distributions in all variables as in LWFA, LWPCA is the special 
case of LWFA where the noise covariance f is spherical, i.e., the same magnitude of 
noise in all observables. Under these circumstances, the subspaces spanned by U in both 
methods will be the same. However, the regression coefficients of LWPCA will be dif- 
ferent from those of LWFA unless the noise level is zero, as LWFA optimizes the coeffi- 
cients according to the noise in the data (Equation (3)). Thus, for normal distributions 
and a correct guess of k, LWPCA is always expected to perform worse than LWFA. 
636 $. $chaal, $. Vijayakumar and C. G. Atkeson 
2.3 PARTIAL LEAST SQUARES (LWPLS, LWPLS_I) 
Partial least squares (Wold, 1975; Frank & Friedman, 1993) recursively computes or- 
thogonal projections of the input data and performs single variable regressions along 
these projections on the residuals of the previous iteration step. A locally weighted ver- 
sion of partial least squares (LWPLS) proceeds as shown in Equation (6) below. 
As all single variable regressions are ordinary uni- For Training: For Lookup: 
variate least-squares minim izations, LWPLS Initialize: Initialize: 
makes the same statistical assumption as ordinary 
linear regressions, i.e., that only output variables Do = X, e 0 = y d o = x, y = 0 
have additive noise, but input variables are noise- For i = 1 to k: For i = 1 to k: 
less. The choice of the projections u, however, in- 
troduces an element in LWPLS that remains staffs- ui = D,'r-Wei- s = dr_u 
tically still debated (Frank & Friedman, 1993), al- s = Di_u i y = y +/3is  
though, interestingly, there exists a strong similar- srWei_ d = d4 -sip 
ity with the way projections are chosen in Cascade /3 i _ 
Correlation (Fahlman & Lebiere, 1990). A peculi- srWs 
arity of LWPLS is that it also regresses the inputs Dr_ws 
of the previous step against the projected inputs s Pi - srWs 
in order to ensure the orthogonality of all the pro- 
jections u. Since LWPLS chooses projections in a D = D_ - sip r (6) 
very powerful way, it can accomplish optimal 
function fits with only one single projections (i.e., 
k=-l) for certain input distributions. We will address this issue in our empirical evalua- 
tions by comparing k-step LWPLS with 1-step LWPLS, abbreviated LWPLS_I. 
2.4 PRINCIPAL COMPONENT REGRESSION (LWPCR) 
Although not optimal, a computationally efficient techniques of dimensionality reduction 
for linear regression is principal component regression (LWPCR) (Massy, 1965). The in- 
puts are projected onto the largest k principal components of the weighted covariance 
matrix of the input data by the matrix U: 
g=[eigertvectoFs(Zwi(xi-XXi-)T/Ewi)]max(l:k) 
The regression coefficients/3 are thus calculated as: 
= (UrXrWXU) UrXrWy 
(7) 
(8) 
Equation (8) is inexpensive to evaluate since after projecting X with U, UrXrWXU be- 
comes a diagonal matrix that is easy to invert. LWPCR assumes that the inputs have ad- 
ditive spherical noise, which includes the zero noise case. As during dimensionality re- 
duction LWPCR does not take into account the output data, it is endangered by clipping 
input dimensions with low variance which nevertheless have important contribution to 
the regression output. However, from a statistical point of view, it is less likely that low 
variance inputs have significant contribution in a linear regression, as the confidence 
bands of the regression coefficients increase inversely proportionally with the variance of 
the associated input. If the input data has non-spherical noise, LWPCR is prone to focus 
the regression on irrelevant projections. 
3 MONTE CARLO EVALUATIONS 
In order to evaluate the candidate methods, data sets with 5 inputs and 1 output were ran- 
domly generated. Each data set consisted of 2,000 training points and 10,000 test points, 
distributed either uniformly or nonuniformly in the unit hypercube. The outputs were 
Local Dimensionality Reduction 637 
generated by either a linear or quadratic function. Afterwards, the 5-dimensional input 
space was projected into a 10-dimensional space by a randomly chosen distance pre- 
serving linear transformation. Finally, Gaussian noise of various magnitudes was added 
to both the 10-dimensional inputs and one dimensional output. For the test sets, the addi- 
tive noise in the outputs was omitted. Each regression technique was localized by a 
Gaussian kernel (Equation (1)) with a 10-dimensional distance metric D=10*I (D was 
manually chosen to ensure that the Gaussian kernel had sufficiently many data points and 
no "data holes" in the fringe areas of the kernel). The precise experimental conditions 
followed closely those suggested by Frank and Friedman (1993): 
 2 kinds of linear functions y = ]3irx for: i) ]3i, . =[1,1,1,1,1] r, ii) ]3i. =[1,2,3,4,5] r 
T T 2 2 2 2 2 T 
 2 kinds of quadratic functions y = ]3ii.x + ]3,a[xl, x2, x3, x4, xs ] for: 
i) 13i .= [1,1,1,1,1] r and ]3qua = 0.111,1, 1,1,1] r, and ii) 13z .= [1,2,3,4,5] r and 13q, -- 0.111,4,9,16,25] r 
 3 kinds of noise conditions, each with 2 sub-conditions: 
i) only output noise: a) low noise: local signal/noise ratio lsnr=20, 
and b) high noise: lsnr=2, 
ii) equal noise in inputs and outputs: 
a) low noise ex,.= y = N(0,0'012), n [1,2 ..... 10], 
and b) high noise ex..= . = N(0,0.12), n [1,2 ..... 10], 
iii) unequal noise in inputs and outputs: 
a) low noise: e,.= N(0,(0.01n)2), n [1,2 ..... 10] and lsnr=20, 
and b) high noise: e,.= N(0,(0.0ln)2), n [1,2,...,10] and lsnr=2, 
 2 kinds of input distributions: i) uniform in unit hyper cube, ii) uniform in unit hyper cube excluding data 
points which activate a Gaussian weighting function (1) at c = [0.5,0,0,0,0] r with D=10*I more than 
w=0.2 (this forms a "hyper kidney" shaped distribution) 
Every algorithm was ran* 30 times on each of the 48 combinations of the conditions. 
Additionally, the complete test was repeated for three further conditions varying the di~ 
mensionality called factors in accordance with LWFA that the algorithms assumed to 
be the true dimensionality of the 1 O-dimensional data from k--4 to 6, i.e., too few, correct, 
and too many factors. The average results are summarized in Figure 1. 
Figure 1 a,b,c show the summary results of the three factor conditions. Besides averaging 
over the 30 trials per condition, each mean of these charts also averages over the two in- 
put distribution conditions and the linear and quadratic function condition, as these four 
cases are frequently observed violations of the statistical assumptions in nonlinear func- 
tion approximation with locally linear models. In Figure lb the number of factors equals 
the underlying dimensionality of the problem, and all algorithms are essentially per- 
forming equally well. For perfectly Gaussian distributions in all random variables (not 
shown separately), LWFA's assumptions are perfectly fulfilled and it achieves the best 
results, however, almost indistinguishable closely followed by LWPLS. For the "unequal 
noise condition", the two PCA based techniques, LWPCA and LWPCR, perform the 
worst since as expected they choose suboptimal projections. However, when violat- 
ing the statistical assumptions, LWFA loses parts of its advantages, such that the sum- 
mary results become fairly balanced in Figure 1 b. 
The quality of function fitting changes significantly when violating the correct number of 
factors, as illustrated in Figure I a,c. For too few factors (Figure 1 a), LWPCR performs 
worst because it randomly omits one of the principle components in the input data, with- 
out respect to how important it is for the regression. The second worse is LWFA: ac- 
cording to its assumptions it believes that the signal it cannot model must be noise, lead- 
ing to a degraded estimate of the data's subspace and, consequently, degraded regression 
results. LWPLS has a clear lead in this test, closely followed by LWPCA and LWPLS_I. 
Except for LWFA, all methods can evaluate a data set in non-iterative calculations. LWFA was mined with EM for maxi- 
mally 1000 iterations or until the log-likelihood increased less than l.e-10 in one iteration. 
638 S. SchaaI, S. Vijayakumar and C. G. Atkeson 
For too many factors than necessary (Figure 1 c), it is now LWPCA which degrades. This 
effect is due to its extracting one very noise contaminated projection which strongly in- 
fluences the recovery of the regression parameters in Equation (4). All other algorithms 
perform almost equally well, with LWFA and LWPLS taking a small lead. 
Only Output Equal Noise In all Unequal Noise In all 
Noise Inputs and Outputs Inputs and Outputs 
1 
u) 0.1- 
I- 
o 
tU 0.01- 
 0.001 
0.0001 
3=1,>O 3=1,>>O 1.>O l.r;>>O 3=1.>0 3=1.>>0 1.>O 1.>>O Jl,a>O Jl,a>>O 1.>O l.a>>O 
a) Regression Results with 4 Factors 
- . 
o.1 
o.ol 
o.oolll 
0.0001  
b) Regrssslon Results with 5 Factors 
0.01 
0.001 
0.0001 
c) Regression Results with 6 Factors 
1- 
 o.1- 
I.-- 
tU 0.01 
0.001 
0.0001 
d) Summary Results 
Figure 1: Average summary results of Monte Carlo experiments. Each chart is primarily 
divided into the three major noise conditions, of. headers in chart (a). In each noise con- 
dition, there are four further subdivision: i) coefficients of linear or quadratic model are 
equal with low added noise; ii) like i) with high added noise; iii) coefficients of linear or 
quadratic model are different with low noise added; iv) like iii) with high added noise. 
Refer to text and descriptions of Monte Carlo studies for further explanations. 
Local DimensionaIity Reduction 639 
4 SUMMARY AND CONCLUSIONS 
Figure 1 d summarizes all the Monte Carlo experiments in a final average plot. Except for 
LWPLS, every other technique showed at least one clear weakness in one of our "robust- 
ness" tests. It was particularly an incorrect number of factors which made these weak- 
nesses apparent. For high-dimensional regression problems, the local dimensionality, i.e., 
the number of factors, is not a clearly defined number but rather a varying quantity, de- 
pending on the way the generating process operates. Usually, this process does not need 
to generate locally low dimensional distributions, however, it often "chooses" to do so, 
for instance, as human ann movements follow stereotypic patterns despite they could 
generate arbitrary ones. Thus, local dimensionality reduction needs to find autonomously 
the appropriate number of local factor. Locally weighted partial least squares turned out 
to be a surprisingly robust technique for this purpose, even outperforming the statistically 
appealing probabilistic factor analysis. As in principal component analysis, LWPLS's 
number of factors can easily be controlled just based on a variance-cutoff threshold in in- 
put space (Frank & Friedman, 1993), while factor analysis usually requires expensive 
cross-validation techniques. Simple, variance-based control over the number of factors 
can actually improve the results of LWPCA and LWPCR in practice, since, as shown in 
Figure 1 a, LWPCR is more robust towards overestimating the number of factors, while 
LWPCA is more robust towards an underestimation. If one is interested in dynamically 
growing the number of factors while obtaining already good regression results with too 
few factors, LWPCA and, especially, LWPLS seem to be appropriate it should be 
noted how well one factor LWPLS (LWPLS_I) already performed in Figure 1 ! 
In conclusion, since locally weighted partial least squares was equally robust as local 
weighted factor analysis towards additive noise in both input and output data, and, 
moreover, superior when mis-guessing the number of factors, it seems to be a most fa- 
vorable technique for local dimensionality reduction for high dimensional regressions. 
Acknowledgments 
The authors are grateful to Geoffrey Hinton for reminding them of parfal least squares. This work was sup- 
ported by the ATR Human Information Processing Research Laboratories. S. Schaal's support includes the 
German Research Association, the Alexander von Humboldt Foundation, and the German Scholarship Founda- 
tion. S. Vijayakumar was supported by the Japanese Ministry of Education, Science, and Culture (Monbusho). 
C. G. Atkeson acknowledges the Air Force Office of Scientific Research grant F49-6209410362 and a National 
Science Foundation Presidential Young Investigators Award. 
References 
Atkeson, C. G., Moore, A. W., & Schaal, S, (1997a). 
"Locally weighted learning." Artificial Intelligence Re- 
view, 11, 1-5, pp. 11-73. 
Atkeson, C. G., Moore, A. W., & Schaal, S, (1997c). 
"Locally weighted learning for control." Artificial In- 
telligence Review, 11, 1-5, pp.75-113. 
Belsley, D. A., Kuh, E., & Welsch, R. E, (1980). Re- 
gression diagnostics: Identiing influential data and 
sources ofcollinearity. New York: Wiley. 
Everitt, B. S, (1984). An introduction to latent variable 
models. London: Chapman and Hall. 
Fahlman, S. E., Lebiere, C, (1990). "The cascade- 
correlation learning architecture." In: Touretzky, D. S. 
(Ed.), Advances in Neural Information Processing 
Systems II, pp.524-532. Morgan Kaufmann. 
Frank, I.E., & Friedman, J. H, (1993). "A statistical 
view of some chemometric regression tools." Tech- 
nometrics, 35, 2, pp.109-135. 
Geman, S., Bienenstock, E., & Doursat, R, (1992). 
"Neural networks and the bias/variance dilemma." 
Neural Computation, 4, pp. 1-58. 
Horn, R. A., & Johnson, C. R, (1994). Matrix analysis. 
Press Syndicate of the University of Cambridge. 
Jordan, M. I., & Jacobs, R, (1994). "Hierarchical mix- 
tures of experts and the EM algorithm." Neural Com- 
putation, 6, 2, pp. 181-214. 
Massy, W. F, (1965). "Principle component regression 
in exploratory statistical research." Journal of the 
American Statistical Association, 60, pp.234-246. 
Rubin, D. B., & Thayer, D. T, (1982). "EM algorithms 
for ML factor analysis." Psychornetrika, 47, 1, 69-76. 
Schaal, S., & Atkeson, C. G, (in press). "Constructive 
incremental learning from only local information." 
Neural Computation. 
Scott, D. W, (1992). Multivariate Density Estimation. 
New York: Wiley. 
Vijayakumar, S., & Schaal, S, (1997). "Local dimen- 
sionality reduction for locally weighted learning." In: 
International Conference on Computational Intelli- 
gence in Robotics and Automation, pp.220-225, Mon- 
teray, CA, July 10-11, 1997. 
Wold, H. (1975). "Soft modeling by latent variables: 
the nonlinear iterative partial least squares approach." 
In: Gani, J. (Ed.), Perspectives in Probability and Sta- 
tistics, Papers in Honour of M. S. Bartlett. Acai. Press. 
Xu, L., Jordan, M. I., & Hinton, G. E, (1995). "An al- 
ternative model for mixture of experts." In: Tesauro, 
G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in 
Neural Information Processing Systems 7, pp.633-640. 
Cambridge, MA: MIT Press. 
