On the Use of Projection Pursuit Constraints for 
Training Neural Networks 
Nathan Intrator* 
Computer Science Department 
Tel-Aviv University 
Ramat-Aviv, 69978 ISRAEL 
and 
Institute for Brain and Neural Systems, 
Browu University 
ninmah. au. ac. il 
Abstract 
XVe present a novel classification and regression method tha. t coin- 
bines exploratory projection pursuit (unSUl)ervised training) with pro- 
jection pursuit regression (supervised training), to yield a new family of 
cost/complexity penalty terms. Some improved generalization properties 
are demonstrated on real world problems. 
1 Introduction 
Parameter estimation becomes difficult in high-dimensional spaces due t.o the in- 
creasing sparseness of the data. Therefore, when a low dimeusional representation 
is embedded in the data, dinensionality reduction methods become useful. One 
such method - projection pursuit regression (Friedman and St. uetzle, 1981) (PPR) 
is capable of performing dimensionality reduction by composition, namely, it con- 
structs an approximation to the desired response function using a coinposition of 
lower dimensional smooth functions. These functions depend ou low dimensional 
projections through the data. 
*Research was supported by the National Science Foundation, the Artny Research Of- 
rice, and the Office of Naval Research. 
3 
4 Intrator 
When the dimensionality of the problem is in the thousands, even projection pur- 
suit methods are almost always over-parametrized, therefore, additional smoothing 
is needed for low variance estimation. Exploratory Projection Pursuit (Friedman 
and Tukey, 1974; Friedman, 1987) (EPP) may be useful for that. It searches in a 
high dimensional space for structure in the form of (semi) linear projections with 
constraints characterized by a projection index. The projection index may be con- 
sidered as a universal prior for a large class of problems, or may be tailored to a 
specific problem based on prior knowledge. 
In this paper, the general form of exploratory projection pursuit is formulated to be 
an additional constraint for projection pursuit regression. In particular, a hybrid 
combination of supervised and unsupervised artificial neural network (ANN) is de- 
scribed as a special case. In addition, a specific projection index that is particularly 
useful for classification (Intrator, 1990; Intrator and Cooper, 1992) is introduced in 
this context. A more detailed discussion appears in Intrator (1993). 
2 Brief Description of Projection Pursuit Reression 
Let (X, Y) be a. pair of random variables, X  R a, and Y  t?. The problem is to 
approximate the d dimeusional surface 
h'om n observations ( x  , y ), . . . , ( x, , y,, ). 
PPR tries to approximate a function f by a suni of ridge functions (,functions that 
are constant along lines) 
The fitting procedure alt, ernates between an estimation of a direction fi and an 
estimation of a smooth hmction g, uch that at iteration j, t, he sqnare average of 
the residuals 
is minimized. This process is init. ialized by setting rio = yi. Usually, the initial 
values of aj are taken to be the first few principal components of the data. 
Estimation of the ridge t'unctions can be achieved by various nonparametric smooth- 
ing techniques such a.s locally linear functions (Friedman and Stuetzle, 1981), 
k-nea. rest neighbors (Hall, 1989b), splines or variable degree polynomials. The 
smoothness constraint imposed on g, iml)lies that the actual projection pursuit 
is achieved by minimizing at. it. eratiou j, l.le suni 
+ 
for some smoothness measure C. 
Although PPR converges to the desired response function (Jones, 1987), the use 
of non-pa. rametric function estimation is likely to lead to overfitting. Recent re- 
sults (Hornik, 1991)suggest that a feed forward network architecture with a single 
On the Use of Projection Pursuit Constraints for Training Neural Networks 
hidden layer and a rather general fixed activation function is a universal approxi- 
mator. Therefore, the use of a non-parametric single ridge function estimation can 
be avoided. It is thus appropriate to concentrate on the estimation of good pro- 
jections. In the next section we present. a general h'amework of PPR architecture, 
and in section 4 we restrict it. to a feed-forward architecture xvith sigmoidal hidden 
units. 
3 
Estimating The Projections Using Exploratory 
Projection Pursuit 
Exploratory projection pursuit.is based on seeking interesting projections of high 
dimensional data points (Kruskal, 1969; Switzer, 1970; Kruskal, 1972; Friedman 
and Tukey, 1974; Friedman, 1987; Jones and Sibson, 1987; Hall, 1988; Huber, 1985, 
for review). The notion of interesting projections is motivated by an observation 
that for most high-dimensional data clouds, most low-dimensional projections are 
approximately normal (Diaconis and Froedmau, 1984). This finding suggests that 
the important information in the data is conveyed in those directions whose single 
dimensional projected distribution is far from Gaussiat. Various projection indices 
(measures for the goodness of a. projection) differ ou the assumptions abont the 
nature of deviation fi'om nornmlity, and in their computational efficiency. They can 
be considered as different priors motivated by specific assuniptions on the underlying 
model. 
To partially decouple the search for a projection vector from the search for a non- 
parametric ridge function, we propose to add a penalty term, which is based on 
a projection index, to the energy minimization associated with the estimation of 
the ridge functions and the projections. Specifically, let p(a) be a. projection index 
which is minimized for projections with a certain devia. tion from normality; At the 
j'th iteration, we minimize the sum 
p(a. ). 
i 
When a concurrent minimization over several projections/fuuctions is practical, we 
get a penalty term of the forin 
B(f) = 
J 
Since C and p may not be linear, the more general measure that does not assume a 
stepwise approach, but instead seeks l projections and ridge functions concurrently, 
is given by 
B(f) = C'(g .... ,gt) + p(a,...,at), 
In practice, p depends implicitly on t. he training data, (the empirical density) and 
is therefore replaced by its empirical measure 
3.1 Some Possible Measures 
Some applicable projection indices are discussed in (Huber, 1985; Jones and Sib- 
son, 1987; Friedman, 1987; Hall, 1989a; Intrator, 1990). Probably, all the possible 
6 Intrator 
measures should emphasize some form of deviation from normality bnt the spe- 
cific type may depend on the problem at hand. For example, a measure based 
on the I,;arhunen Leave expansiou (Mougeot et al., 1991) may be useful for image 
compression with autoassociative networks, since in this case one is interested in 
minimizing the L 2 norm of the distance between the reconstructed image and the 
original one, and under mild conditious, the Karhunen Lo/ve expansion gives the 
optimal solution. 
A different type of prior knowledge is required for classification problems. The 
underlying assumption then is that the data is clustered (when projecting in the 
right directions) and tha.[ the classification may be achieved by some (nonlinear) 
mapping of these clusters. Iu such a case, the projection index should emphasize 
multi-modality as a specific deviation from normality. A projection index that em- 
phasizes multimodalities in tile projected distribution (without relytug on the class 
labels) has recently been introduced (Intrator, 1990) and implemented efficiently us- 
ing a variant of a biologically motivated unsupervised uetwork (Intrator and Cooper, 
1992). Its integration into a back-propagation classifier will be discussed below. 
3.2 Adding EPP constraints to back-propagation network 
One xvay of adding sense prior knowledge into tile architecture is by minimizing 
the effective number of parameters using weight sharing, in which a single weight 
is shared among many connections in the network (Waibel et al., 1989; Le Cun 
et al., 1989). An extension of this idea is the "soft. weight sharing" which favors 
irregularities in the weight distribution in the form of multimodality (Nowlan and 
Hinton, 1992). This penalty improved generalization results obtained by weight 
elimination penalty. Both these methods make an explicit assumption about the 
structure of the weight space, but xvit. h no regard to the structure of the input space. 
As described in the coutext of projection pursuit regression, a penalty term may 
be added to the energy functioual lninimized by error back propagation, for the 
purpose of measuring directly the goodness of the projections songht by the network. 
Since our main interest is in reducing overfitting for high dinensional problems, our 
underlying assumption is that the stirface fuHction to be estimated can be faithfully 
represented using a lmv dimensiom.l composition of sigmoidal functions, namely, 
using a back-propagation network in which the number of hidden units is much 
smaller than the number of input units. Therefore, the penalty term may be added 
only to the hidden layer. The synal)tic modification equations of the hidden units' 
weights become 
Ow 0 _ OS ( u,, .r ) 
at - a,7,;- 
+(Contribution of cost/complexity tem]s)]. 
An approach of this type has been used in image compression, with a penalty 
aimed at minimizing th(' entropy of the projected distribution (Bichsel and Seitz, 
1989). This penalty certainly measures deviation from normality, since entropy is 
maximized for a Gaussian distribut, ion. 
On the Use of Projection Pursuit Constraints for Training Neural Networks 7 
4 
Projection Index for Classification: The Unsupervised 
BCM Neuron 
Intrator (1990) has recently shown that a variant of the Bienenstock, Cooper and 
Munro neuron (Bienenstock et al., 1982) performs exploratory projection pursuit 
using a projection index that measures nmlti-modality. This neuron version allows 
theoretical analysis of some visual deprivation experiments (Intrator and Cooper, 
1992), and is in agreement with the vt experimeutal results on visual cortical 
plasticity (Clothiaux et a.l., 1991). A network implementation which can find several 
projections in parallel while retaining its computational ecieucy, was found to be 
applicable for extracting features from very high dimensional vector spaces (Intrator 
and Gold, 1993; Intrator et al., 1991; Intrator, 1992) 
The activity of neuron k in the network is c = i xiwi + wo. The inhibited 
activity and threshold of the k'th neuron is given by 
. a(c. q  c ), - ' 
= - o. = 
The threshold O,, is the point, at which the modification function 0 changes sign 
(see Intrator and Cooper, 1992 for further details). The function 0 is given by 
0(c, O,,,) = c(c- 
The risk (projection index) for a single ueuron is given by 
The total risk is the sum of each local risk. The negative gradient of the risk that 
leads to the synaptic modification equations is given by 
This last equation is an additional peua[ty to the energy minimization of the super- 
vised network. Note that there is an interaction between adjacent neurons in the 
hidden layer. In practice, the st. ochastic version of the differential equation can be 
used  the learning rule. 
5 Applications 
We have applied this hybrid classification method to varions speech and image 
recognition problems in high dimensional space. In one speech application we used 
voiceless stop consonants extracted from the TIMIT database as training tokens 
(Intrator and Tajchman, 1991). A detailed biologically motivated speech represen- 
tation was produced by Lyon's cochlear model (Lyon, 1982; Slaney, 1988). This 
representation produced 5040 dimensions (84 channels x 60 time slices). In ad- 
dition to an initial voiceless stop, each token contained a final vowel from the set 
[aa, ao, er, iy]. Classification of the voiceless stop consonants using a test set that 
included 7 vowels [uh, ih, eh, ae, ah, uw, ow] produced an average error of 18.8% 
8 Intrator 
while on the same task classification using back-propagation network produced an 
average error of 20.9% (a significant difference, P < .0013). Additional experiments 
on vowel tokens appear in Tajchman and Intrator (1992). 
Another application is in the area of face recognition froin gray level pixels (Intrator 
et al., 1992). After aligning and normalizing the images, the input was set to 37 
x 62 pixels (total of 2294 dimensions). The recognition performance was tested on 
a snbset of the MIT Media Lab database of face images made available by Turk 
and Pentland (1991) which contained 27 face images of each of 16 different persons. 
The images were taken under varying illumination and camera location. Of the 27 
images available, 17 randomly chosen ones served for training and the remaining 
10 were used for testing. Using an ensemble average of hybrid networks (Lincoln 
and Skrzypek, 1990; Pearlmutter and Rosenreid, 1991; Pertone and Cooper, 1992) 
we obtained an error rate of 0.62% as opposed to 1.2% using a similar ensemble of 
back-prop networks. A single back-prop network achieves an error between 2.5% to 
6% on this data. The experiments were done using 8 hidden units. 
6 Sulnmary 
A penalty that allows the incorporation of additional prior in[brmatiou on the un- 
derlying model was presented. This prior was introduced in the context of projection 
pursuit regression, classification, and in the context of back-l)ropagation network. 
It achieves partial alecoupling ot' estinat. ion of the ridge functions (in PPR) or the 
regressiou function in back-propagation net from the estimation of the projections. 
Thus it is potentially usefnl in reducing problems a.ssociated with overfitting which 
are more pronounced in high dimensional data. 
Some possible projection indices were discussed and a specific projection index that 
is particularly useful for classification was presented in this context. This measure 
that emphasizes multi-modality in the projected distribution, was found useful in 
several very high dimensional problems. 
6.1 Acknowledgments 
I wish to thank Leon Cooper, Stu Geman and Michael Pertone for many fruitful 
conversations and to the referee for helpfi[ comments. The speech experiments were 
performed using the computational facilities of the Coguitive Science Department 
at Brown University. Research was supported by the National Science Foundation, 
the Army Research Office, and the Office of Naval Research. 
References 
Bichsel, M. and Seitz, P. (1989). Minimunn class entropy: A maximuni information ap- 
proach to layered netowrks. Neural Networks, 2:133-141. 
Bienenstock, E. L., Cooper, L. N., and Munro, P. W. (1982). Theory for the development 
of 11etlroll selectivity: orientation specificity and binocular interaction in visual cortex. 
Journal Nc,rosci,cc. 2:32-48. 
On the Use of Projection Pursuit Constraints for Training Neural Networks 9 
Clothiaux, E. E., Cooper, L. N., and Bear, M. F. (1991). Synaptic plasticity in visual 
cortex: Comparison of theory with experiment. Journal o*f Neurophysiology, 66:1785- 
1804. 
Diaconis, P. and Freedman, D. (1984). Asymptotics of graphical projection. pursuit. Annals 
o*f Statistics, 12:793-815. 
Friedman, J. H. (1987). Exploratory projection pursuit. Journal o*f the American Statistical 
Association, 82:249-266. 
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal o*f the 
American Statistical Association, 76:817-823. 
Friedman, J. H. and Tnkey, J. W. (1974). A projection pursuit algorithm for exploratory 
data analysis. IEEE Transactions on Computers, C(23):881-889. 
Hall, P. (1988). Estimating the direction in which data set is most interesting. Probah. 
Theory Rel. Fields, 80:51-78. 
Hall, P. (1989a). On polynomial-based projection indices for exploratory projection pur- 
suit.. The Annals o.f Statistics, 17:589-605. 
Hall, P. (1989b). On projection pursuit regression. The Annals o*f Statistics, 17:573-588. 
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural 
Networks, 4:251-257. 
Huber, P. J. (1985). Projection pursuit. (with discussion). Tile Anals o*f Statistics, 
13:435-475. 
Intrator, N. (1990). Fealure extraction using an unsnpervised neural network. In Touret- 
zky, D. S., Elhnan, J. L., Sejnowski, T. J., and ttinton, (4. E.. editors, Proceedings o*f 
the 1990 Connectionist Models Summer School, pages 310-318. Morgan I(aufmann, 
San Mateo, CA. 
Intrator, N. (1992). Feature extraction using an unsupervised neural network. Neural 
Computation, 4:98-107. 
Intrator, N. (1993). Combining exploratory projection pursuit and projection pursuit 
regression with application to neural networks. Neural Computation. In press. 
Intrator, N. and Cooper, L. N. (1992). Objective function formulation of the BCM the- 
ory of visual cortical plasticity: Statistical connections, stability conditions. Neural 
Networks, 5:3-17. 
Intrator, N. and (4old, J. I. (1993). Three-dimensional object recognition of gray level 
images: The usefuhless of distinguishing features. Neural Uomputation. In press. 
Intrator, N., Gold, J. I., Bfilthoff, H. H., and Edehnan, S. (1991). Three-dimensional object 
recognition using an unsupervised neural network: Understanding the distinguishing 
features. In Feldman, Y. and Bruckstein, A., editors, Pvceedings o*f the 8th Israeli 
Con.fereace on AICV, pages 113-123. Elsevier. 
Intrator, N., Reisfeld, D., and Yesllurnn, Y. (1992). Face recognition using a hybrid 
supervised/unsupervised neural network. Preprint. 
Intrator, N. and Tajchman, G. (1991). Supervised and unsupervised feature extraction 
from a cochlear model for speech recognition. In Juang, B. H., Kung, S. Y., and 
Kamm, C. A., editors, Neural Networks .for Signal Processing - Proceedings o] the 
1991 IEEE Workshop, pages 460-469. IEEE Press, New York, NY. 
Jones, L. (1987). On a conjecture of huber concerning the convergence of projection pursuit 
regression. Annals o*f Statistics, 15:880-882. 
Jones, M. C. and Sibson, R. (1987). What is projection pursuii? (with discussion). J. 
Roy. Statist. Soc., Set. A(150):1-36. 
10 Intrator 
Kruskal, J. B. (1969). Ibward a practical method which helps uncover tile structure of the 
set of multivariate observat. ions by finding tile linear transformation which optimizes 
a new 'index of condensat. ion'. In Milton, R. C. and Nelder, J. A., editors, Statistical 
Compttation, pages 427-440. Academic Press, New York. 
Kruskal, J. B. (1972). Linear transformation of multivariate data to reveal clustering. In 
Shepard, R. N., Romney, A. K., and Nerlove, S. B., editors, Multidimensional Scaling: 
Theorgt and Application in the Behavioral Sciences, I, Theorgt, pages 179-191. Seminar 
Press, New York and London. 
Le Cun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, 
L. (1989). Backpropagation applied to handwritten zip code recognition. Neural 
Computation, 1:541-551. 
Lincoln, W. P. and Skrzypek, J. (1990). Synergy of clustering multiple back-propagation 
networks. In Touretzky. D. S. and Lippmann, R. P., editors, Advances in Neural 
Information Pvcessing Systems, volume 2, pages 650-657. Morgan Kaufinann, San 
Mateo, CA. 
Lyon, R. F. (1982). A computat. ional model of filtering, detect. ion, and compression in 
the cochlea. In Proceedings IEEE btertational Uonfereace on Acoustics, Speech, and 
Signol Processing, Paris, France. 
Mougeot, M., Azencott, R., and Angeniol, B. (1991). hnage compression with back prop- 
agation: Improvement of i. he visual restoralion using differenl. cost functions. Neural 
Networks, 4:467-476. 
Nowlan, S. J. and Hinton, G. E. (1992). Simplif,ing neural networks by soft. weight-sharing. 
Neural Computation. In press. 
Pearlmutter, B. A. and Rosenreid, R. (1991). Chailin-kohnogorov complexity and gen- 
eralizatiou in eural networks. In Lippmann, R. P., Moody. J. E., and Touretzky, 
D. S., editors, Advaces in Neural lnformatio Processbg Systems, volume 3, pages 
925-931. Morgan Kaufmann, San b, lat. eo, CA. 
Perrone, M.P. and Cooper, L. N. (1992). When networks disagree: Generalized ensemble 
method for neural networks. ht Mammone, R. J. and Zeevi, Y., editors, Neural 
Networks: Theory at, d Applications, volume 2. Academic Press. 
Slaney, M. (1988). Lyon's cochlear model. Technical report., Apple Corporate Library, 
Cupertino, CA 95014. 
Switzer, P. (1970). Numerical classificatiou. in Bariteft, V., editor, Geostatistics. Plenum 
Press, New York. 
Tajchman, G. N. and lntrttor, N. (1992). Phonetic classification of TIMIT segments pre~ 
processed with lyon's cochlear model using a supervised/unsupervised hybrid neural 
network. In Proccdiags l,teratioal Uofcrcncc o Spoke Language Processing, 
Banff, Alberta, Caada. 
Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. J. of Cognitive Neuroscience, 
3: 1--86. 
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, I(. (1989). Phoneme 
recognition using time-delay neural networks. IEEE Transactions on ASSP, 37:328- 
339. 
