Bayesian Backpropagation Over I-O Functions 
Rather Than Weights 
David H. Wolpert 
The Santa Fe Institute 
1660 Old Pecos Trail 
Santa Fe, NM 87501 
Abstract 
The conventional Bayesian justification of backprop is that it finds the 
MAP weight vector. As this paper shows, to find the MAP i-o function 
instead one must add a correction term to backprop. That term biases one 
towards i-o functions with small description lengths, and in particular fa- 
vors (some kinds of) feature-selection, pruning, and weight-sharing. 
INTRODUCTION 
In the conventional Bayesian view of backpropagation (BP) (Buntine and Weigend, 1991; 
Nowlan and Hinton, 1994; MacKay, 1992; Wolpert, 1993), one starts with the "likelihood" 
conditional distribution P(training set = t I weight vector w) and the "prior" distribution 
P(w). As an example, in regression one might have a "Gaussian likelihood", P(t I w) o, 
exp[-z2(w, 0] -- IIi exp [-{net(w, tx(i)) - ty(i)} 2 / 202] for some constant c. (tx(i) and ty(i) 
are the successive input and output values in the training set respectively, and net(w, .) is 
the function, induced by w, taking input neuron values to output neuron values.) As another 
example, the "weight decay" (Gaussian) prior is P(w) o, exp(.a(w2)) for some constant a. 
Bayes' theorem tells us that P(w I 0 o P(t I w) P(w). Accordingly, the most probable weight 
given the data - the "maximum a postefiofi" (MAP) w - is the mode over w of P(t I w) P(w), 
which equals the mode over w of the "cost function" L(w, 0 -- ln[P(t I w)] + ln[P(w)]. So 
for example with the Gaussian likelihood and weight decay prior, the most probable w giv- 
en the data is the w minimizing Z2(w, 0 + ct w2. Accordingly BP with weight decay can be 
viewed as a scheme for trying to find the function from input neuron values to output neu- 
ron values (i-o function) induced by the MAP w. 
200 
Bayesian Backpropagation over I-O Functions Rather Than Weights 201 
One peculiar aspect of this justification of weight-decay BP is the fact that rather than the 
i-o function induced by the most probable weight vecwr, in practice one would usually pre- 
fer to know the most probable i-o function. (In few situations would one care more about a 
weight vector than about what that weight vector parameterizes.) Unfortunately, the differ- 
ence between these two i-o functions can be large; in general it is not true that "the most 
probable output corresponds to the most probable parameter" (Denker and LeCun, 1991). 
This paper shows that to find the MAP i-o function rather than the MAP w one adds a "cor- 
rection term" to conventional BP. That term biases one towards i-o functions with small 
description lengths, and in particular favors feature-selection, pruning and weight-sharing. 
In this that term constitutes a theoretical justification for those techniques. 
Although cast in terms of neural nets, this paper's analysis applies to any case where con- 
vention is to use the MAP value of a parameter encoding Z to estimate the value of Z. 
2 
BACKPROP OVER I-O FUNCTIONS 
Assume the net's architecture is fixed, and that weight vectors w live in a Euclidean vector 
space W of dimension IWl. Let X be the set of vectors x which can be loaded on the input 
neurons, and O the set of vectors o which can be read off the output neurons. Assume that 
the number of elements in X (IXl) is finite. This is always the case in the real world, where 
measuring devices have finite accuracy, and where the computers used to emulate neural 
nets are finite state machines. For similar reasons O is also finite in practice. However for 
now assume that O is very large and "fine-grained", and approximate it as a Euclidean vec- 
tor space of dimension IOI. (This assumption usually holds with neural nets, where output 
values are treated as real-valued vectors.) This assumption will be relaxed later. 
Indicate the set of functions taking X to O by . (net(w, .) is an element of .) Any    
is an (IXl x IOI)-dimensional Euclidean vector. Accordingly, densities over W are related 
to densities over  by the usual rules for transforming densities between IWl-dimensional 
and (IXl x IOI)-dimensional Euclidean vector spaces. There are three cases to consider: 
1) IWl < IXl IOI. In general, as one varies over all w's the corresponding i-o func- 
tions net(w, .) map out a sub-manifold of  having lower dimension than . 
2) IWl > IXl IOI. There are an infinite number of w's corresponding to each . 
3) IWl - IXl IOI. This is the easiest case to analyze in detail. Accordingly I will deal 
with it first, deferring discussion of cases one and two until later. 
With some abuse of notation, let capital letters indicate random variables and lower case 
letters indicate values of random variables. So for example w is a value of the weight vector 
random variable W. Use 'p' to indicate probability densities. So for example Pn, IT( I t) is 
the density of the i-o function random variable , conditioned on the training set random 
variable T, and evaluated at the values  =  and T = t. 
In general, any i-o function not expressible as net(w, .) for some w has zero probability. For 
the other i-o functions, with 8(.) being the multivariable Dirac delta function, 
pn,(net(w, .)) = ldw' Pw(W') 8(net(w', .) - net(w, .)). 
When the mapping  = net(W, .) is one-to-one, we can evaluate equation (1) to get 
pT(net(w, .)l t) = PwiT(W I t) / Jl,,w(W), 
where Jn,,w(W) is the Jacobian of the W --)  mapping: 
(1) 
(2) 
202 Wolpert 
Jtl,,w(W) -- I det[/I i //}Wj ](w) I= I det[/} net(w, -)i //}wj ] I. 0) 
"net(w, .)i"means the i'th component of the i-o function net(w, .). "net(w, x)" means the 
vector o mapped by net(w, .) from the input x, and "net(w, X)k"is the k'th component of o. 
So the "i" in "net(w, .)i" refers to a pair of values {x, k}. Each matrix value  //)wj is the 
partial derivative of net(w, X)k with respect to some weight, for some x and k. Jq,,w(w) can 
be rewduen as der it2 [gij(w)], where gij(w) -- Ek [( / wi) OR / wj)] is the metric of 
the W -->  mapping. This form of Jtl,.w(w) is usually more difficult to evaluate though. 
Unfortunately, { = net(w, .) is not one-to-one; where Jn,,w(W) ,e 0 the mapping is locally 
one-to-one, but there are global symmetries which ensure that more than one w corre- 
sponds to each {. (Such symmetries arise from things like permuting the hidden neurons 
or changing the sign of all weights leading into a hidden neuron - see (Fefferman, 1993) 
and references therein.) To circumvent this difficulty we must make a pair of assumptions. 
To begin, restrict attention to Winj, those values w of the variable W for which the Jaco- 
bian is non-zero. This ensures local injectivity of the map between W and . Given a par- 
ticular w  Winj, let k be the number of w'  Winj such that net(w, .) = net(w', .). (Since 
net(w, .) = net(w, .), k > 1.) Such a set of k vectors form an equivalence class, {w}. 
The first assumption is that for all w  Winj the size of {w} (i.e., k) is the same. This will 
be the case if we exclude degenerate w (e.g., w's with all first layer weights set to 0). The 
second assumption is that for all w' and w in the same equivalence class, PwlD (w I d) = 
PwI) (w' I d). This is usually the case. (For example, start with w' and telabel hidden neu- 
rons to get a new w  {w'}. If we assume the Gaussian likelihood and prior, then since 
neither differs for the two w's the weight-posterior is also the same for the two w's.) Given 
these assumptions, plT(net(w, .) I t) = k PWlT(W I t) / J.w(W). So rather than minimize the 
usual cost function, L(w, t), to find the MAP  BP should minimize L'(w, 0 -= L(w, t) + 
In[ Jn,,w(W) ]. The In[ Jn,.w(W) ] term constitutes a correction term to conventional BP. 
One should not confuse the correction term with the other quantities in the neural net liter- 
ature which involve partial derivative matrices. As an example, one way to characterize 
the "quality" of a local peak w' of a cost function involves the Hessian of that cost function 
(Buntine and Weigend, 1991). The correction term doesn't direcfiy concern the validity of 
such a Hessian-based quality measure. However it does concern the validity of some 
implementations of such a measure. In particular, the correction term changes the location 
of the peak w'. It also suggests that a peak's quality be measured by the Hessian of 
L'(w', 0 with respect to , rather than by the Hessian of L(w', 0 with respect to w. (As an 
aside on the subject of Hessians, note that some workers incorrectly use Hessians when 
they attempt to evaluate quantities like output-variances. See (Wolpert, 1994).) 
If we stipulate that the PqT({ I t) one encounters in the real world is independent of how 
one chooses to parameterize , then the probability density of our parameter must depend 
on how it gets mapped to . This is the basis of the correction term. As this suggests, the 
correction term won't arise if we use non-pqT(  I O-based estimators, like maximum-like- 
lihood estimators. (This is a basic difference between such estimators and MAP estimators 
with a uniform prior.) The correction term is also irrelevant if it we use an MAP estimate 
but J,t,,w(W) is independent of w (as when net (w, .) depends linearly on w). And for non- 
linear net(w, .), the correction term has no effect for some non-MAP-based ways to apply 
Bayesianism to neural nets, like guessing the posterior average  (Neal, 1993): 
Bayesian Backpropagation over I-O Functions Rather Than Weights 203 
E(, 10 -- $d paT( I 0  = ldw PWlT(W I 0 net(w, .), 
(4) 
so one can calculate E(d> I t) by working in W, without any concern for a correction term. 
(Loosely speaking, the Jacobian associated with changing integration variables cancels the 
Jacobian associated with changing the argument of the probability density. A formal deri- 
vation - applicable even when IWl ,e IXI x IOI - is in the appendix of (Wolpert, 1994).) 
One might think that since it's independent of t, the correction term can be absorbed into 
Pw(W). Ironically, it is precisely because quantities like E(d> I t) aren't affected by the cor- 
rection term that this is impossible: Absorb the correction term into the prior, giving a new 
prior p*w(W)  d x pw(w) x Jn,,w(w) (asterisks refers to new densities, and d is a normal- 
ization constan 0. Then p*n, iT(net(w, .) I t) = PWlT(W I 0. So by redefining what we call the 
prior we can justify use of conventional uncorrected BP; the (new) MAP { corresponds to 
the w minimizing L(w, 0. However such a redefinition changes E(d> I t) (amongst other 
things): ld{ P*n, IT({ It) q = /dw p*WT(W I t) net(w, .) , ldw Pwrr(w I t) net(w, .) = 
ld{ PqT({ I 0 {. So one can either modify BP (by adding in the correction term) and leave 
E(d> I t) alone, or leave BP alone but change E(d> I 0; one can not leave both unchanged. 
Moreover, some procedures involve both prior-based modes and prior-based integrals, and 
therefore are affected by the correction term no mauer how pw(w) is redefined. For exam- 
ple, in the evidence procedure (Wolpert, 1993; MacKay, 1992) one fixes the value of a 
hyperparameter F (e.g., ct from the introduction) to the value y maximizing PrlT(Y I 0. 
Next one find the value s' maximizing PSlT, r' (s' I t, y) for some variable S. Finally, one 
guesses the { associated with s'. Now it's hard to see why one should use this procedure 
with S = W (as is conventional) rather than with S = d>. But with S = d> rather than W, one 
must factor in the correction term when calculating PSIT, F (S I t, ), and therefore the 
guessed { is different from when S = W. If one tries to avoid this change in the guessed { 
by absorbing the correction term into the prior Pwlr(w I ht), then Prl T(h t I 0 - which is given 
by an integral involving that prior - changes. This in turn changes y, and therefore the 
guessed { again is different. So presuming one is more directly interested in d> rather than 
W, one can't avoid having the correction term affect the evidence procedure. 
It should be noted that calculating the correction term can be laborious in large nets. One 
should bear in mind the determinant-evaluation tricks mentioned in (Bunfine and 
Weigend, 1991), as well as others like the identity In[ Jn,,w(w) ] = Tr(ln[  //}wj ]) -_- 
Tr(ln*[ /wj ]), where In*(.) is In(.) evaluated to several orders. 
3 
EFFECTS OF THE CORRECTION TERM 
To illustrate the effects of the correction term, consider a perceptton with a single output 
neuron, N input neurons and a unary input space: o = tanh(w  x), and x always consist of 
a single one and N - 1 zeroes. For this scenario {i //}vj is an N x N diagonal matrix, and 
In[ Jn,.w(w) ] TM -2 EkN=l In[ cosh(wk) ]. Assume the Gaussian prior and likelihood of the 
introduction, and for simplicity take 2o 2 = 1. Both L(w, t) and L'(w, 0 are sums of terms 
each of which only concerns one weight and the corresponding input neuron. Accord- 
ingly, it suffices to consider just the i'th weight and the corresponding input neuron. 
Let x(i) be the input vector which has its 1 in neuron i. Let oj(i) be the output of the j'th 
of the pairs in the training set with input x(i), and m i the number of such pairs. With a = 0 
204 Wolpert 
(no weight decay), L(w, t) = X2(t, w), which is minimized by w' i = tanh '1 [ ji 1 oj(i) / mi]. 
If we insWad try to minimize Z2(t, w) + Jkw(w) though, then for low enough m i (e.g., m i 
= 1), we find that there is no minimum. The correction term pushes w away from 0, and 
for low enough m i the likelihood isn't strong enough to counteract this push. 
o o 
Figures 1 through 3: Train using unmodified BP on training set t, and feed input x into 
the resultant net. The horizontal axis gives the output you get. If t and x were still used 
but training had been with modified BP, the output would have been the value on the 
vertical axis. In succession, the three figures have c = .6, .4, .4, and m = 1, 4, 1. 
i I 
, , I . , . . I 
0 1 
Figure 3. 
zo 
0 S 10 
Figure 4: The horizontal axis is Iwi I. The top 
curve depicts the weight decay regularizer, 
cwi 2, and the bottom curve shows that regu- 
larizer modified by the correction term.  = .2. 
When weight-decay is used though, modified BP finds a solution, just like unmodified BP 
does. Since the correction term "pushes out" w, and since tanh(.) grows with its argument, 
a  found by modified BP has larger (in magnitude) values of o than does the correspond- 
ing  found by unmodified BP. In addition, unlike unmodified BP, modified BP has multi- 
ple extrema over certain regimes. All of this is illustrated in figures (1) through (3), which 
graph the value of o resulting from using modified BP with a particular training set t and 
input value x vs. the value of o resulting from using unmodified BP with t and x. Figure 
(4) depicts the wi-dependences of the weight decay term and of the weight-decay term 
plus the correction term. (When there's no data, BP searches for minima of those curves.) 
Now consider multi-layer nets, possibly with non-unary X. Denote a vector of the compo- 
Bayesian Backpropagation over I-O Functions Rather Than Weights 205 
nents of w which lead from the input layer into hidden neuron K by wtg]. Let x' be the 
input vector consisting of all O's. Then  tanh(wtg ]  x') / wj = 0 for any j, w, and K, and 
for any w, there is a row of  / vj which is all zeroes. This in turn means that Jn,,w(W) = 
0 for any w, which means that Win j is empty, and PIT( I t) is independent of the data t. 
(Intuitively, this problem arises since the o corresponding to x' can't vary with w, and 
therefore the dimension of  is less than IWl) So we must forbid such an all-zeroes x'. The 
easiest way to do this is to require that one input neuron always be on, i.e., introduce a bias 
unit. An alternative is to redefine  to be the functions from the set {X - (0, 0, ..., 0)} to O 
rather than from the set X to O. Another alternative, appropriate when the original X is the 
set of all input neuron vectors consisting of 0's and 1 's, is to instead have input neuron val- 
ues  {z  0, 1}. (In general z  -1 though; due to the symmetries of the tanh, for many 
architectures z = -1 means that two rows of  / wj are identical up to an overall sign, 
which means that Jn,,w(W) = 0.) This is the solution implicitly assumed from now on. 
J,w(w) will be small - and therefore p(net(w, .)) will be large - whenever one can make 
large changes to w without affecting  = net(w, .) much. In other words, p(net(w, .)) will 
be large whenever we don't need to specify w very accurately. So the correction factor 
favors those w which can be expressed with few bits. In other words, the correction factor 
enforces a sort of automatic MDL (Rissanen, 1986; Nowlan and Hinton, 1994). 
More generally, for any multi-layer architecture there are many "singular weights" wsi n  
Win j such that Jn,.w(Wsin) is not just small but equals zero exactly. pw(w) must compen- 
sate for these singularities, or the peaks of PT({ I t) won't depend on t. So we need to 
have Pw(W) --> 0 as w --> wsi n. Sometimes this happens automatically. For example often 
Wsin includes infinite-valued w's, since tanh'(oo) = 0. Because pw(oo) = 0 for the weight- 
decay prior, that prior compensates for the infmite-w singularities in the correction tenn. 
For other wsin there is no such automatic compensation, and we have to explicitly modify 
pw(w) to avoid singularities. In doing so though it seems reasonable to maintain a "bias" 
towards the wsin, that Pw(W) goes to zero slowly enough so that the values pn,(net(w, .)) 
are "enhanced" for w near wsi n. Although a full characterization of such enhanced w is not 
in hand, it's easy to see that they include certain kinds of pruned nets (Hassibi and Stork, 
1992), weight-shared nets (Nowlan and Hinton, 1994), and feature-selected nets. 
To see that (some kinds of) pruned nets have singular weights, let w* be a weight vector 
with a zero-valued weight coming out of hidden neuron K. By (1) pn,(net(w*, .)) = 
ldw' pw(w) 5(net(w', .) - net(w*, .)). Since we can vary the value of each weight w* i lead- 
ing into neuron K without affecting net(w*, .), the integral diverges. So w* is singular; 
removing a hidden neuron results in an enhanced probability. This constitutes an a priori 
argument in favor of trying to remove hidden neurons during training. 
This argument does not apply to weights leading into a hidden neuron; Jq,,w(w) treats 
weights in different layers differently. This fact suggests that however Pw(W) compen- 
sates for the singularities in Jn,,w(W), weights in different layers should be treated differ- 
ently by pw(w). This is in accord with the advice given in (MacKay, 1992). 
To see that some kinds of weight-shared nets have singular weights, let w' be a weight vec- 
tor such that for any two hidden neurons K and K' the weight from input neuron i to K 
equals the weight from i to K', for all input neurons i. In other words, w is such that all hid- 
206 Wolpert 
den neurons compute identical functions of x. (For some architectures we'll actually only 
need a single pair of hidden neurons to be identical.) Usually for such a situation there is a 
pair of columns of the matrix b0i / wj which are exactly proportional to one another. (For 
example, in a 3-2-1 architecture, with X = {z, 1} 3, IWl -- IXl x IOI -- 8, and there are four 
such pairs of columns.) This means that Jl,.w(W') = 0; w' has an enhanced probability, and 
we have an a priori argument in favor of trying to equate hidden neurons during training. 
The argument that feature-selected nets have singular weights is architecture-dependent, 
and there might be reasonable architectures for which it fails. To illustrate the argument, 
consider the 3-2-1 architecture. Let Xl(k) and x2(k) with k = { 1, 2, 3} designate three dis- 
tinct pairs of input vectors. For each k have Xl(k ) and x2(k) be identical for all input neu- 
rons except neuron A, for which they differ. (Note there are four pairs of input vectors 
with this property, one for each of the four possible patterns over input neurons B and C.) 
Let w' be a weight vector such that both weights leaving A equal zero. For this situation 
net(w', Xl(k)) = net(w', x2(k)) for all k. In addition  net(w, Xl(k)) / wj = 
 net(w, x2(k)) / wj for all weights wj except the two which lead out of A. So k = 1 gives 
us a pair of rows of the matrix i / wj which are identical in all but two entries (one row 
for Xl(k ) and one for x2(k)). We get another such pair of rows, differing from each other in 
the exact same two entries, for k = 2, and yet another pair for k = 3. So there is a linear 
combination of these six rows which is all zeroes. This means that Jl,.w(W') = 0. This con- 
stitutes an a priori argument in favor of trying to remove input neurons during training. 
Since it doesn't favor any pw(w), the analysis of this paper doesn't favor any p({). How- 
ever when combined with empirical knowledge it suggests certain Pn,({). For example, 
there are functions g(w) which empirically are known to be good choices for p(net(w, .)) 
(e.g., g(w) oexp[ctw2]). There are usually problems with such choices of p({) though. 
For example, these g(w) usually make more sense as a prior over W than as a prior over 
tb, which would imply p(net(w, .)) = g(w) / J,w(W). Moreover it's empirically true that 
enhanced w should be favored over other w, as advised by the correction tenn. So it makes 
sense to choose a compromise between g(w) and g(w) / J,w(w). An example is p() o 
g(w) / [ 1 + tanh02 X Jo, w(W)) ] for two hyperparameters kl > 0 and k, 2 > 0. 
4 BEYOND THE CASE OF BACKPROP WITH IWl-- IXl IOl 
When O does not approximate a Euclidean vector space, elements of  have probabilities 
rather than probability densities, and P(O I t) = $dw PwiT(W I t) J(net(w, .), ), ((., .) being 
a Kronecker delta function). Moreover, if O is a Euclidean vector space but Wl > IXl IOI, 
then again one must evaluate a difficult integral;  = net(W, .) is not one-to-one so one 
must use equation (1) rather than (2). Fortunately these two situations are relatively rare. 
The final case to consider is IWl < IXl IOI (see section two). Let S(W) be the surface in  
which is the image (under net(W, .)) of W. For all  po(O) is either zero (when   S(W)) 
or infinite (when  e S(W)). So as conventionally defined, "MAP " is not meaningful. 
One way to deal with this case is to embed the net in a larger net, where that larger net's 
output is relatively insensitive to the values of the newly added weights. An alternative 
that is applicable when IWl ! IOI is an integer is to reduce X by removing "uninteresting" 
x's. A third alternative is to consider surface densities over S(W), Ps(w)(O), instead of vol- 
Bayesian Backpropagation over I-O Functions Rather Than Weights 207 
ume densities over , Pit,({). Such surface densities are given by equation (2), if one uses 
the metric form of Jn,,w(W). (Buntine has emphasized that the Jacobian form is not even 
defined for IWl < IXI IOI, since }{i / Wj is not square then (personal communication).) 
As an aside, note that restricting Pit,({) to S(W) is an example of the common theoretical 
assumption that "target functions" come from a pre-chosen "concept class". In practice 
such an assumption is usually ludicrous - whenever it is made there is an implicit hope that 
it constitutes a valid approximation to a more reasonable Pn,({). 
When decision theory is incorporated into Bayesian analysis, only rarely does it advise us 
to evaluate an MAP quantity (i.e., use BP). Instead Bayesian decision theory usually 
advises us to evaluate quantities like E( I 0 (Wolpert, 1994). Just as it does for the use of 
MAP estimators, the analysis of this paper has implications for the use of such E( I 0 
esfimators. In particular, one way to evaluate E(l 0 = ldw PWlT(W I 0 net(w, .) is to 
expand net(w, .) to low order and then approximate PWlT(W I 0 as a sum of Gaussians 
(Buntine and Weigend, 1991). Equation (4) suggests that instead we write E( I t) as 
ld{ PqT({ I t) { and approximate Pn, IT({ I 0 as a sum of Gaussians. Since fewer approxi- 
mations are used (no low order expansion of net(w, .)), this might be more accurate. 
Acknowledgements 
Thanks to David Rosen and Wray Buntine for stimulating discussion, and to TXN and the 
SFI for funding. This paper is a condensed version of (Wolpert 1994). 
References 
B untine, W., Weigend, A. (1991). Bayesian back-propagation. Complex Systems, 5, p. 603. 
Denker, J., LeCun, Y. (1991). Transforming neural-net output levels to probability distri- 
butions. In Neural Information Processing Systems 3, R. Lippman et al. (Eds). 
Fefferman, C. (1993). Reconstructing a neural net from its output. Samoff Research Cen- 
ter TR 93-01. 
Hassibi, B., and Stork, D. (1992). Second order derivatives for network pruning: optimal 
brain surgeon. Ricoh Tech Report CRC-TR-9214. 
MacKay, D. (1992). Bayesian Interpolation, and A Practical Framework for Backpropa- 
gation Networks. Neural Computation, 4, pp. 415 and 448. 
Neal, R. (1993). Bayesian learning via stochastic dynamics. In Neural Information Pro- 
cessing Systems 5, S. Hanson et al. (Eds). Morgan Kaufmann. 
Nowlan, S., and Hinton, G. (1994). Simplifying Neural Networks by Soft Weight-Sharing. 
In Theories of Induction: Proceedings of the SFI/CNLS Workshop on Formal Approaches 
to Supervised Learning, D. Wolpert (F_A.). Addison-Wesley, to appear. 
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Stat., 14, p. 1080. 
Wolpert, D. (1993). On the use of evidence in neural networks. In Neural Information Pro- 
cessing Systems 5, S. Hanson et al. (Eds). Morgan-Kauffman. 
Wolpert, D. (1994). Bayesian back-propagation over i-o functions rather than weights. SFI 
tech. report. ftp'able from archive.cis.ohio-state.edu, as pub/neuroprose/wolpert.nips.93.Z. 
