Computing with infinite networks 
Christopher K. I. Williams 
Neural Computing Research Group 
Department of Computer Science and Applied Mathematics 
Aston University, Birmingham B4 7ET, UK 
c. k. i. williamsason. ac. uk 
Abstract 
For neural networks with a wide class of weight-priors, it can be 
shown that in the limit of an infinite number of hidden units the 
prior over functions tends to a Gaussian process. In this paper an- 
alytic forms are derived for the covariance function of the Gaussian 
processes corresponding to networks with sigmoidal and Gaussian 
hidden units. This allows predictions to be made efficiently using 
networks with an infinite number of hidden units, and shows that, 
somewhat paradoxically, it may be easier to compute with infinite 
networks than finite ones. 
1 Introduction
To someone training a neural network by maximizing the likelihood of a finite 
amount of data it makes no sense to use a network with an infinite number of hidden 
units; the network will "overfit" the data and so will be expected to generalize 
poorly. However, the idea of selecting the network size depending on the amount 
of training data makes little sense to a Bayesian; a model should be chosen that 
reflects the understanding of the problem, and then application of Bayes' theorem 
allows inference to be carried out (at least in theory) after the data is observed. 
In the Bayesian treatment of neural networks, a question immediately arises as to 
how many hidden units are believed to be appropriate for a task. Neal (1996) has 
argued compellingly that for real-world problems, there is no reason to believe that 
neural network models should be limited to nets containing only a "small" number 
of hidden units. He has shown that it is sensible to consider a limit where the 
number of hidden units in a net tends to infinity, and that good predictions can be 
obtained from such models using the Bayesian machinery. He has also shown that 
for fixed hyperparameters, a large class of neural network models will converge to 
a Gaussian process prior over functions in the limit of an infinite number of hidden 
units. 
Neal's argument is an existence proof--it states that an infinite neural net will 
converge to a Gaussian process, but does not give the covariance function needed 
to actually specify the particular Gaussian process. In this paper I show that 
for certain weight priors and transfer functions in the neural network model, the 
covariance function which describes the behaviour of the corresponding Gaussian 
process can be calculated analytically. This allows predictions to be made using 
neural networks with an infinite number of hidden units in time $O(n^3)$, where $n$
is the number of training examples [1]. The only alternative currently available is to
use Markov Chain Monte Carlo (MCMC) methods (e.g. Neal, 1996) for networks 
with a large (but finite) number of hidden units. However, this is likely to be 
computationally expensive, and we note possible concerns over the time needed for 
the Markov chain to reach equilibrium. The availability of an analytic form for 
the covariance function also facilitates the comparison of the properties of neural 
networks with an infinite number of hidden units as compared to other Gaussian 
process priors that may be considered. 
The Gaussian process analysis applies for fixed hyperparameters $\theta$. If it were
desired to make predictions based on a hyperprior $P(\theta)$ then the necessary
$\theta$-space integration could be achieved by MCMC methods. The great advantage of
integrating out the weights analytically is that it dramatically reduces the
dimensionality of the MCMC integrals, and thus improves their speed of convergence.
1.1 From priors on weights to priors on functions 
Bayesian neural networks are usually specified in a hierarchical manner, so that the
weights $w$ are regarded as being drawn from a distribution $P(w|\theta)$. For example,
the weights might be drawn from a zero-mean Gaussian distribution, where $\theta$
specifies the variance of groups of weights. A full description of the prior is given by
specifying $P(\theta)$ as well as $P(w|\theta)$. The hyperprior can be integrated out to give
$P(w) = \int P(w|\theta) P(\theta) \, d\theta$, but in our case it will be advantageous not to do this as
it introduces weight correlations which prevent convergence to a Gaussian process.
In the Bayesian view of neural networks, predictions for the output value $y_*$
corresponding to a new input value $x_*$ are made by integrating over the posterior in
weight space. Let $D = ((x_1, t_1), (x_2, t_2), \ldots, (x_n, t_n))$ denote the $n$ training data
pairs, $t = (t_1, \ldots, t_n)^T$ and $f_*(w)$ denote the mapping carried out by the network
on input $x_*$ given weights $w$. $P(w|t, \theta)$ is the weight posterior given the training
data [2]. Then the predictive distribution for $y_*$ given the training data and
hyperparameters $\theta$ is
$$P(y_*|t, \theta) = \int \delta(y_* - f_*(w)) \, P(w|t, \theta) \, dw \tag{1}$$
We will now show how this can also be viewed as making the prediction using priors
over functions rather than weights. Let $f(w)$ denote the vector of outputs
corresponding to inputs $(x_1, \ldots, x_n)$ given weights $w$. Then, using Bayes' theorem we
have $P(w|t, \theta) = P(t|w) P(w|\theta) / P(t|\theta)$, and $P(t|w) = \int P(t|y) \, \delta(y - f(w)) \, dy$.
Hence equation 1 can be rewritten as
$$P(y_*|t, \theta) = \frac{1}{P(t|\theta)} \int\!\!\int P(t|y) \, \delta(y_* - f_*(w)) \, \delta(y - f(w)) \, P(w|\theta) \, dw \, dy \tag{2}$$
However, the prior over $(y_*, y_1, \ldots, y_n)$ is given by $P(y_*, y|\theta) = P(y_*|y, \theta) P(y|\theta) =
\int \delta(y_* - f_*(w)) \, \delta(y - f(w)) \, P(w|\theta) \, dw$ and thus the predictive distribution can be
written as
$$P(y_*|t, \theta) = \frac{1}{P(t|\theta)} \int P(t|y) \, P(y_*|y, \theta) \, P(y|\theta) \, dy = \int P(y_*|y, \theta) \, P(y|t, \theta) \, dy \tag{3}$$
[1] For large $n$, various approximations to the exact solution which avoid the inversion of
an $n \times n$ matrix are available.
[2] For notational convenience we suppress the $x$-dependence of the posterior.
Hence in a Bayesian view it is the prior over function values $P(y_*, y|\theta)$ which is
important; specifying this prior by using weight distributions is one valid way to
achieve this goal. In general we can use the weight space or function space view,
whichever is more convenient, and for infinite neural networks the function space
view is more useful.
2 Gaussian processes 
A stochastic process is a collection of random variables $\{Y(x) \mid x \in X\}$ indexed by
a set $X$. In our case $X$ will be $\mathbb{R}^d$, where $d$ is the number of inputs. The stochastic
process is specified by giving the probability distribution for every finite subset
of variables $Y(x_1), \ldots, Y(x_k)$ in a consistent manner. A Gaussian process (GP)
is a stochastic process which can be fully specified by its mean function $\mu(x) =
E[Y(x)]$ and its covariance function $C(x, x') = E[(Y(x) - \mu(x))(Y(x') - \mu(x'))]$;
any finite set of $Y$-variables will have a joint multivariate Gaussian distribution. For
a multidimensional input space a Gaussian process may also be called a Gaussian
random field.
Below we consider Gaussian processes which have $\mu(x) \equiv 0$, as is the case for the
neural network priors discussed in section 3. A non-zero $\mu(x)$ can be incorporated
into the framework at the expense of a little extra complexity.
A widely used class of covariance functions is the stationary covariance functions,
whereby $C(x, x') = C(x - x')$. These are related to the spectral density (or power
spectrum) of the process by the Wiener-Khinchine theorem, and are particularly
amenable to Fourier analysis as the eigenfunctions of a stationary covariance kernel
are $\exp(i k \cdot x)$. Many commonly used covariance functions are also isotropic, so that
$C(\mathbf{h}) = C(h)$ where $\mathbf{h} = x - x'$ and $h = |\mathbf{h}|$. For example $C(h) = \exp(-(h/\sigma)^\nu)$
is a valid covariance function for all $d$ and for $0 < \nu \le 2$. Note that in this case
$\sigma$ sets the correlation length-scale of the random field, although other covariance
functions (e.g. those corresponding to power-law spectral densities) may have no
preferred length scale.
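As a rough illustration of what "valid covariance function" means in practice, the sketch below builds the matrix of $C(h) = \exp(-(h/\sigma)^\nu)$ on an arbitrary set of points and inspects its smallest eigenvalue, which should be non-negative up to rounding for $0 < \nu \le 2$. The function name `powered_exponential`, the random test points and the chosen values of $\nu$ are assumptions made for this example only.

```python
import numpy as np

def powered_exponential(X1, X2, sigma=1.0, nu=1.0):
    """Isotropic covariance C(h) = exp(-(h / sigma)^nu), h = Euclidean distance."""
    diff = X1[:, None, :] - X2[None, :, :]
    h = np.sqrt(np.sum(diff ** 2, axis=-1))
    return np.exp(-(h / sigma) ** nu)

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(30, 2))   # 30 arbitrary points in R^2

for nu in (0.5, 1.0, 2.0):
    K = powered_exponential(X, X, sigma=1.0, nu=nu)
    # A valid covariance matrix is symmetric positive semi-definite.
    min_eig = np.linalg.eigvalsh(K).min()
    print(f"nu = {nu}: smallest eigenvalue = {min_eig:.3e}")
```

A numerical check on one point set is of course not a proof of validity; it only shows the property one would expect to see.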
2.1 Prediction with Gaussian processes 
The model for the observed data is that it was generated from the prior stochastic
process, and that independent Gaussian noise (of variance $\sigma_\nu^2$) was then added.
Given a prior covariance function $C_P(x_i, x_j)$, a noise process $C_N(x_i, x_j) = \sigma_\nu^2 \delta_{ij}$
(i.e. independent noise of variance $\sigma_\nu^2$ at each data point) and the training data,
the prediction for the distribution of $y_*$ corresponding to a test point $x_*$ is obtained
simply by applying equation 3. As the prior and noise model are both Gaussian the
integral can be done analytically and $P(y_*|t, \theta)$ is Gaussian with mean and variance
$$\hat{y}(x_*) = k_P^T(x_*) (K_P + K_N)^{-1} t \tag{4}$$
$$\sigma_{\hat{y}}^2(x_*) = C_P(x_*, x_*) - k_P^T(x_*) (K_P + K_N)^{-1} k_P(x_*) \tag{5}$$
where $[K_\alpha]_{ij} = C_\alpha(x_i, x_j)$ for $\alpha = P, N$ and $k_P(x_*) = (C_P(x_*, x_1), \ldots,
C_P(x_*, x_n))^T$. $\sigma_{\hat{y}}^2(x_*)$ gives the "error bars" of the prediction.
Equations 4 and 5 are the analogue for spatial processes of Wiener-Kolmogorov 
prediction theory. They have appeared in a wide variety of contexts including 
geostatistics where the method is known as "kriging" (Journel and Huijbregts, 1978; 
Cressie, 1993), multidimensional spline smoothing (Wahba, 1990), in the derivation
of radial basis function neural networks (Poggio and Girosi, 1990) and in the work 
of Whittle (1963). 
3 Covariance functions for Neural Networks 
Consider a network which takes an input $x$, has one hidden layer with $H$ units and
then linearly combines the outputs of the hidden units with a bias to obtain $f(x)$.
The mapping can be written
$$f(x) = b + \sum_{j=1}^{H} v_j h(x; u_j) \tag{6}$$
where $h(x; u)$ is the hidden unit transfer function (which we shall assume is
bounded) which depends on the input-to-hidden weights $u$. This architecture is
important because it has been shown by Hornik (1993) that networks with one
hidden layer are universal approximators as the number of hidden units tends to
infinity, for a wide class of transfer functions (but excluding polynomials). Let $b$
and the $v$'s have independent zero-mean distributions of variance $\sigma_b^2$ and $\sigma_v^2$,
respectively, and let the weights $u_j$ for each hidden unit be independently and identically
distributed. Denoting all weights by $w$, we obtain (following Neal, 1996)
$$E_w[f(x)] = 0 \tag{7}$$
$$E_w[f(x) f(x')] = \sigma_b^2 + \sum_j \sigma_v^2 E_u[h_j(x; u) h_j(x'; u)] \tag{8}$$
$$= \sigma_b^2 + H \sigma_v^2 E_u[h(x; u) h(x'; u)] \tag{9}$$
where equation 9 follows because all of the hidden units are identically distributed.
The final term in equation 9 becomes $\omega^2 E_u[h(x; u) h(x'; u)]$ by letting $\sigma_v^2$ scale as
$\omega^2 / H$.
As the transfer function is bounded, all moments of the distribution will be bounded
and hence the Central Limit Theorem can be applied, showing that the stochastic
process will become a Gaussian process in the limit as $H \to \infty$.
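The scaling argument can be checked numerically by drawing networks from the prior. The sketch below samples one-hidden-layer networks with $\sigma_v^2 = \omega^2 / H$ and estimates both $E_w[f(x) f(x')]$, which by equation 9 should not depend on $H$, and the excess kurtosis of $f(x)$, which is zero for a Gaussian. The tanh transfer function, the unit-variance Gaussian prior on the input weights, scalar inputs and all function names are assumptions made for illustration.

```python
import numpy as np

def sample_prior_outputs(x, H, n_nets, sigma_b=1.0, omega=1.0, seed=0):
    """Draw n_nets networks with H hidden units; return f at scalar inputs x.

    x has shape (m,). Each hidden unit computes tanh(u0 + u1 * x) with
    u0, u1 ~ N(0, 1) (an illustrative bounded transfer function and prior).
    """
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, sigma_b, size=(n_nets, 1))
    v = rng.normal(0.0, omega / np.sqrt(H), size=(n_nets, H))  # sigma_v^2 = omega^2 / H
    u0 = rng.normal(size=(n_nets, H, 1))
    u1 = rng.normal(size=(n_nets, H, 1))
    h = np.tanh(u0 + u1 * x)                    # shape (n_nets, H, m)
    return b + np.einsum("nh,nhm->nm", v, h)    # equation 6 for every network

def second_moments(f):
    """Estimate E_w[f(x) f(x')]; the prior mean is zero (equation 7)."""
    return f.T @ f / f.shape[0]

def excess_kurtosis(f):
    """Zero for a Gaussian; computed separately for each input point."""
    m2 = np.mean(f ** 2, axis=0)
    m4 = np.mean(f ** 4, axis=0)
    return m4 / m2 ** 2 - 3.0

x = np.array([-0.5, 1.0])
for H in (1, 10, 100):
    f = sample_prior_outputs(x, H, n_nets=20000, sigma_b=0.0, seed=H)
    print(H, second_moments(f)[0, 1], excess_kurtosis(f))
```

With the bias switched off (`sigma_b=0.0`) the departure from Gaussianity at small $H$ is easier to see; the second moments stay roughly constant while the excess kurtosis shrinks as $H$ grows.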
By evaluating $E_u[h(x) h(x')]$ for all $x$ and $x'$ in the training and testing sets we can
obtain the covariance function needed to describe the neural network as a Gaussian 
process. These expectations are, of course, integrals over the relevant probability 
distributions of the biases and input weights. In the following sections two specific 
choices for the transfer functions are considered, (1) a sigmoidal function and (2) a 
Gaussian. Gaussian weight priors are used in both cases. 
It is interesting to note why this analysis cannot be taken a stage further to integrate
out any hyperparameters as well. For example, the variance $\sigma_v^2$ of the $v$ weights
might be drawn from an inverse Gamma distribution. In this case the distribution
$P(v) = \int P(v|\sigma_v^2) P(\sigma_v^2) \, d\sigma_v^2$ is no longer the product of the marginal distributions
for each $v$ weight (in fact it will be a multivariate t-distribution). A similar analysis
can be applied to the u weights with a hyperprior. The effect is to make the hidden 
units non-independent, so that the Central Limit Theorem can no longer be applied. 
3.1 Sigmoidal transfer function 
A sigmoidal transfer function is a very common choice in neural networks research; 
nets with this architecture are usually called multi-layer perceptrons. 
Below we consider the transfer function $h(x; u) = \Phi(u_0 + \sum_{j=1}^{d} u_j x_j)$, where
$\Phi(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \, dt$ is the error function, closely related to the cumulative distribution
function for the Gaussian distribution. Appropriately scaled, the graph of this
function is very similar to the tanh function which is more commonly used in the
neural networks literature.
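In calculating $V_{erf}(x, x') \stackrel{\mathrm{def}}{=} E_u[h(x; u) h(x'; u)]$ we make the usual assumptions (e.g.
MacKay, 1992) that $u$ is drawn from a zero-mean Gaussian distribution with
covariance matrix $\Sigma$, i.e. $u \sim N(0, \Sigma)$. Let $\tilde{x} = (1, x_1, \ldots, x_d)^T$ be an augmented input
vector whose first entry corresponds to the bias. Then $V_{erf}(x, x')$ can be written as
$$V_{erf}(x, x') = \frac{1}{(2\pi)^{(d+1)/2} |\Sigma|^{1/2}} \int \Phi(u^T \tilde{x}) \, \Phi(u^T \tilde{x}') \exp\left(-\tfrac{1}{2} u^T \Sigma^{-1} u\right) du \tag{10}$$
This integral can be evaluated analytically [3] to give
$$V_{erf}(x, x') = \frac{2}{\pi} \sin^{-1} \frac{2 \tilde{x}^T \Sigma \tilde{x}'}{\sqrt{(1 + 2 \tilde{x}^T \Sigma \tilde{x})(1 + 2 \tilde{x}'^T \Sigma \tilde{x}')}} \tag{11}$$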
We observe that this covariance function is not stationary, which makes sense as 
the distributions for the weights are centered about zero, and hence translational 
symmetry is not present. 
Consider a diagonal weight prior so that $\Sigma = \mathrm{diag}(\sigma_0^2, \sigma_I^2, \ldots, \sigma_I^2)$, so that the inputs
$i = 1, \ldots, d$ have a different weight variance to the bias $\sigma_0^2$. Then for $|x|^2, |x'|^2 \gg
(1 + 2\sigma_0^2) / 2\sigma_I^2$, we find that $V_{erf}(x, x') \simeq 1 - 2\theta/\pi$, where $\theta$ is the angle between $x$ and
$x'$. Again this makes sense intuitively; if the model is made up of a large number of
sigmoidal functions in random directions (in $x$ space), then we would expect points
that lie diametrically opposite (i.e. at $x$ and $-x$) to be anti-correlated, because
they will lie in the $+1$ and $-1$ regions of the sigmoid function for most directions.
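The closed form in equation 11 can be compared against a direct Monte Carlo estimate of $E_u[h(x; u) h(x'; u)]$. The sketch below does this for the diagonal prior $\Sigma = \mathrm{diag}(\sigma_0^2, \sigma_I^2, \ldots, \sigma_I^2)$ discussed above; the function names, the sample size and the test inputs are assumptions for illustration.

```python
import math
import numpy as np

def v_erf(x, x_prime, sigma0=1.0, sigma_I=1.0):
    """Equation 11 for Sigma = diag(sigma0^2, sigma_I^2, ..., sigma_I^2)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    xp = np.atleast_1d(np.asarray(x_prime, dtype=float))

    def quad(a, b):
        # a~^T Sigma b~ for augmented vectors a~ = (1, a), b~ = (1, b)
        return sigma0 ** 2 + sigma_I ** 2 * float(a @ b)

    num = 2.0 * quad(x, xp)
    den = math.sqrt((1.0 + 2.0 * quad(x, x)) * (1.0 + 2.0 * quad(xp, xp)))
    return (2.0 / math.pi) * math.asin(num / den)

def v_erf_monte_carlo(x, x_prime, sigma0=1.0, sigma_I=1.0,
                      n_samples=200_000, seed=0):
    """Estimate E_u[erf(u^T x~) erf(u^T x~')] by sampling u ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    xp = np.atleast_1d(np.asarray(x_prime, dtype=float))
    u0 = rng.normal(0.0, sigma0, size=n_samples)             # bias weight
    u = rng.normal(0.0, sigma_I, size=(n_samples, x.size))   # input weights
    erf = np.vectorize(math.erf)
    return float(np.mean(erf(u0 + u @ x) * erf(u0 + u @ xp)))

print(v_erf(0.3, -1.2), v_erf_monte_carlo(0.3, -1.2))
```

For inputs far from the origin and a large $\sigma_I$, `v_erf` should also approach $1 - 2\theta/\pi$, which gives a second, independent check of the formula.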
3.2 Gaussian transfer function 
One other very common transfer function used in neural networks research is the
Gaussian, so that $h(x; u) = \exp[-(x - u)^T (x - u) / 2\sigma_g^2]$, where $\sigma_g^2$ is the width
parameter of the Gaussian. Gaussian basis functions are often used in Radial Basis
Function (RBF) networks (e.g. Poggio and Girosi, 1990).
For a Gaussian prior over the distribution of $u$ so that $u \sim N(0, \sigma_u^2 I)$,
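$$V_G(x, x') = \frac{1}{(2\pi\sigma_u^2)^{d/2}} \int \exp\left(-\frac{(x - u)^T (x - u)}{2\sigma_g^2}\right) \exp\left(-\frac{(x' - u)^T (x' - u)}{2\sigma_g^2}\right) \exp\left(-\frac{u^T u}{2\sigma_u^2}\right) du \tag{12}$$
By completing the square and integrating out $u$ we obtain
$$V_G(x, x') = \left(\frac{\sigma_e}{\sigma_u}\right)^d \exp\left\{-\frac{x^T x}{2\sigma_m^2}\right\} \exp\left\{-\frac{(x - x')^T (x - x')}{2\sigma_s^2}\right\} \exp\left\{-\frac{x'^T x'}{2\sigma_m^2}\right\} \tag{13}$$
where $1/\sigma_e^2 = 2/\sigma_g^2 + 1/\sigma_u^2$, $\sigma_s^2 = 2\sigma_g^2 + \sigma_g^4/\sigma_u^2$, and $\sigma_m^2 = 2\sigma_u^2 + \sigma_g^2$. This formula
can be generalized by allowing covariance matrices $\Sigma_b$ and $\Sigma_u$ in place of $\sigma_g^2 I$ and
$\sigma_u^2 I$; rescaling each input variable $x_i$ independently is a simple example.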
[3] Introduce a dummy parameter $\lambda$ to make the first term in the integrand $\Phi(\lambda u^T \tilde{x})$.
Differentiate the integral with respect to $\lambda$ and then use integration by parts. Finally
recognize that $dV_{erf}/d\lambda$ is of the form $(1 - \theta^2)^{-1/2} \, d\theta/d\lambda$ and hence obtain the $\sin^{-1}$ form
of the result, and evaluate it at $\lambda = 1$.
Again this is a non-stationary covariance function, although it is interesting
to note that if $\sigma_u^2 \to \infty$ (while scaling $\omega^2$ appropriately) we find that
$V_G(x, x') \propto \exp\{-(x - x')^T (x - x') / 4\sigma_g^2\}$ [4]. For a finite value of $\sigma_u^2$, $V_G(x, x')$
is a stationary covariance function "modulated" by the Gaussian decay function
$\exp(-x^T x / 2\sigma_m^2) \exp(-x'^T x' / 2\sigma_m^2)$. Clearly if $\sigma_m^2$ is much larger than the largest
distance in $x$-space then the predictions made with $V_G$ and a Gaussian process with
only the stationary part of $V_G$ will be very similar.
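Equation 13 and its large-$\sigma_u$ limit can be checked in the same spirit as the sigmoidal case. The sketch below implements the closed form with the definitions of $\sigma_e$, $\sigma_s$ and $\sigma_m$ given after equation 13, and compares it with a Monte Carlo estimate of equation 12; function names, parameter values and sample sizes are illustrative assumptions.

```python
import numpy as np

def v_gauss(x, x_prime, sigma_g=1.0, sigma_u=2.0):
    """Equation 13: covariance for Gaussian hidden units with u ~ N(0, sigma_u^2 I)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    xp = np.atleast_1d(np.asarray(x_prime, dtype=float))
    d = x.size
    sigma_e2 = 1.0 / (2.0 / sigma_g ** 2 + 1.0 / sigma_u ** 2)
    sigma_s2 = 2.0 * sigma_g ** 2 + sigma_g ** 4 / sigma_u ** 2
    sigma_m2 = 2.0 * sigma_u ** 2 + sigma_g ** 2
    diff = x - xp
    return float((sigma_e2 / sigma_u ** 2) ** (d / 2.0)     # (sigma_e / sigma_u)^d
                 * np.exp(-(x @ x) / (2.0 * sigma_m2))
                 * np.exp(-(diff @ diff) / (2.0 * sigma_s2))
                 * np.exp(-(xp @ xp) / (2.0 * sigma_m2)))

def v_gauss_monte_carlo(x, x_prime, sigma_g=1.0, sigma_u=2.0,
                        n_samples=200_000, seed=0):
    """Estimate equation 12 by averaging h(x; u) h(x'; u) over sampled centres u."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    xp = np.atleast_1d(np.asarray(x_prime, dtype=float))
    u = rng.normal(0.0, sigma_u, size=(n_samples, x.size))
    h_x = np.exp(-np.sum((x - u) ** 2, axis=1) / (2.0 * sigma_g ** 2))
    h_xp = np.exp(-np.sum((xp - u) ** 2, axis=1) / (2.0 * sigma_g ** 2))
    return float(np.mean(h_x * h_xp))

a, b = np.array([0.5, -0.2]), np.array([1.0, 0.3])
print(v_gauss(a, b), v_gauss_monte_carlo(a, b))
```

Dividing `v_gauss(a, b, sigma_u=1e4)` by `v_gauss(0 * a, 0 * a, sigma_u=1e4)` removes the prefactor and should reproduce the stationary form $\exp\{-(x - x')^T (x - x') / 4\sigma_g^2\}$ to high accuracy.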
It is also possible to view the infinite network with Gaussian transfer functions as
an example of a shot-noise process based on an inhomogeneous Poisson process
(see Parzen (1962) §4.5 for details). Points are generated from an inhomogeneous
Poisson process with the rate function $\propto \exp(-x^T x / 2\sigma_u^2)$, and Gaussian kernels of
height $v$ are centered on each of the points, where $v$ is chosen iid from a distribution
with mean zero and variance $\sigma_v^2$.
3.3 Comparing covariance functions 
The priors over functions specified by sigmoidal and Gaussian neural networks differ 
from covariance functions that are usually employed in the literature, e.g. splines 
(Wahba, 1990). How might we characterize the different covariance functions and 
compare the kinds of priors that they imply?
The complex exponential $\exp(i k \cdot x)$ is an eigenfunction of a stationary and isotropic
covariance function, and hence the spectral density (or power spectrum) $S(k)$
($k = |\mathbf{k}|$) nicely characterizes the corresponding stochastic process. Roughly
speaking the spectral density describes the "power" at a given spatial frequency $k$; for
example, splines have $S(k) \propto k^{-\beta}$. The decay of $S(k)$ as $k$ increases is essential,
as it provides a smoothing or damping out of high frequencies. Unfortunately
non-stationary processes cannot be analyzed in exactly this fashion because the complex
exponentials are not (in general) eigenfunctions of a non-stationary kernel. Instead,
we must consider the eigenfunctions defined by $\int C(x, x') \phi(x') \, dx' = \lambda \phi(x)$.
However, it may be possible to get some feel for the effect of a non-stationary covariance
function by looking at the diagonal elements in its $2d$-dimensional Fourier
transform, which correspond to the entries in the power spectrum for stationary covariance
functions.
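A small discrete analogue may help make the eigenfunction statement concrete. On a periodic grid (an assumption made purely so that the stationary kernel matrix is exactly circulant), the discrete Fourier modes diagonalize a stationary kernel but not the same kernel "modulated" by a Gaussian decay, as in $V_G$. Grid size, length scales and names below are illustrative.

```python
import numpy as np

def fourier_basis(n):
    """Unitary matrix whose columns are the discrete complex exponentials."""
    idx = np.arange(n)
    return np.exp(2j * np.pi * np.outer(idx, idx) / n) / np.sqrt(n)

def off_diagonal_mass(K):
    """Largest |off-diagonal| entry of K when expressed in the Fourier basis."""
    F = fourier_basis(K.shape[0])
    D = F.conj().T @ K @ F
    return float(np.abs(D - np.diag(np.diag(D))).max())

n = 64
idx = np.arange(n)
x = -np.pi + 2.0 * np.pi * idx / n                 # grid on a circle
lag = (idx[:, None] - idx[None, :]) % n
ring = np.minimum(lag, n - lag) * (2.0 * np.pi / n)  # distance around the circle
K_stat = np.exp(-0.5 * (ring / 0.5) ** 2)          # depends only on the lag
decay = np.exp(-x ** 2 / 2.0)                      # Gaussian modulation
K_mod = decay[:, None] * K_stat * decay[None, :]   # non-stationary kernel

print(off_diagonal_mass(K_stat))   # at rounding-error level
print(off_diagonal_mass(K_mod))    # clearly non-zero
```

The diagonal of the transformed stationary matrix plays the role of the power spectrum; for the modulated kernel the diagonal is still available, but the non-zero off-diagonal entries show that it no longer tells the whole story.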
3.4 Convergence of finite network priors to GPs 
From general Central Limit Theorem results one would expect a rate of convergence
of $H^{-1/2}$ towards a Gaussian process prior. How many units will be required
in practice would seem to depend on the particular values of the weight-variance
parameters. For example, for Gaussian transfer functions, $\sigma_m$ defines the radius
over which we expect the process to be significantly different from zero. If this
radius is increased (while keeping the variance of the basis functions $\sigma_g^2$ fixed) then
naturally one would expect to need more hidden units in order to achieve the same
level of approximation as before. Similar comments can be made for the sigmoidal 
case, depending on (1 + 
I have conducted some experiments for the sigmoidal transfer function, comparing
the predictive performance of a finite neural network with one input unit to the
equivalent Gaussian process on data generated from the GP. The finite network
simulations were carried out using a slightly modified version of Neal's MCMC
Bayesian neural networks code (Neal, 1996) and the inputs were drawn from a
$N(0, 1)$ distribution. The hyperparameter settings were $\sigma_I = 10.0$, $\sigma_0 = 2.0$, $\sigma_v =
1.189$ and $\sigma_b = 1.0$. Roughly speaking the results are that hundreds of hidden units
are required before similar performance is achieved by the two methods, although
there is considerable variability depending on the particular sample drawn from the
prior; sometimes 10 hidden units appear sufficient for good agreement.
[4] Note that this would require $\omega^2 \to \infty$ and hence the Central Limit Theorem would no
longer hold, i.e. the process would be non-Gaussian.
4 Discussion 
The work described above shows how to calculate the covariance function for
sigmoidal and Gaussian basis function networks. It is probable that similar techniques will
allow covariance functions to be derived analytically for networks with other kinds 
of basis functions as well; these may turn out to be similar in form to covariance 
functions already used in the Gaussian process literature. 
In the derivations above the hyperparameters $\theta$ were fixed. However, in a real data
analysis problem it would be unlikely that appropriate values of these parameters
would be known. Given a prior distribution $P(\theta)$ predictions should be made by
integrating over the posterior distribution $P(\theta|t) \propto P(\theta) P(t|\theta)$, where $P(t|\theta)$ is
the likelihood of the training data $t$ under the model; $P(t|\theta)$ is easily computed for
a Gaussian process. The prediction $\hat{y}(x)$ for test input $x$ is then given by
$$\hat{y}(x) = \int \hat{y}_\theta(x) \, P(\theta|D) \, d\theta \tag{14}$$
where $\hat{y}_\theta(x)$ is the predicted mean (as given by equation 4) for a particular value
of $\theta$. This integration is not tractable analytically but Markov Chain Monte Carlo
methods such as Hybrid Monte Carlo can be used to approximate it. This strategy 
was used in Williams and Rasmussen (1996), but for stationary covariance functions,
not ones derived from neural networks; it would be interesting to compare results.
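To make the remark that $P(t|\theta)$ is easily computed concrete: for a zero-mean Gaussian process with noise, $t \sim N(0, K_P + K_N)$, so $\log P(t|\theta) = -\frac{1}{2} t^T K^{-1} t - \frac{1}{2} \log |K| - \frac{n}{2} \log 2\pi$ with $K = K_P + K_N$. The sketch below evaluates this and then averages per-$\theta$ predictive means with normalized weights. Weighting samples drawn from the prior $P(\theta)$ by their likelihood is a simple importance-sampling stand-in used here for brevity, not the Hybrid Monte Carlo scheme mentioned in the text; all names are assumptions.

```python
import numpy as np

def log_marginal_likelihood(K, t):
    """log P(t | theta) for t ~ N(0, K), where K = K_P + K_N depends on theta."""
    n = len(t)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))    # K^{-1} t
    return (-0.5 * t @ alpha
            - np.sum(np.log(np.diag(L)))                   # -0.5 * log |K|
            - 0.5 * n * np.log(2.0 * np.pi))

def average_over_hyperparameters(means, log_weights):
    """Weighted average of predicted means, one row of `means` per theta sample.

    With theta sampled from the prior, log_weights = log P(t | theta) gives a
    self-normalized importance-sampling estimate of equation 14.
    """
    w = np.exp(log_weights - np.max(log_weights))          # avoid overflow
    w /= w.sum()
    return w @ means

K_example = np.array([[2.0, 0.5], [0.5, 1.0]])
t_example = np.array([1.0, -1.0])
print(log_marginal_likelihood(K_example, t_example))
```

If the $\theta$ samples instead come from a Markov chain targeting $P(\theta|D)$, the weights are all equal and the average reduces to a plain mean of the per-sample predictions.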
Acknowledgements 
I thank David Saad and David Barber for help in obtaining the result in equation 11, and
Chris Bishop, Peter Dayan, Ian Nabney, Radford Neal, David Saad and Huaiyu Zhu for
comments on an earlier draft of the paper. This work was partially supported by EPSRC
grant GR/J75425, "Novel Developments in Learning Theory for Neural Networks".
References 
Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley.
Hornik, K. (1993). Some new results on neural network approximation. Neural Networks 6(8), 1069-1072.
Journel, A. G. and C. J. Huijbregts (1978). Mining Geostatistics. Academic Press.
MacKay, D. J. C. (1992). A Practical Bayesian Framework for Backpropagation Networks. Neural Computation 4(3), 448-472.
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer. Lecture Notes in Statistics 118.
Parzen, E. (1962). Stochastic Processes. Holden-Day.
Poggio, T. and F. Girosi (1990). Networks for approximation and learning. Proceedings of the IEEE 78, 1481-1497.
Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics. CBMS-NSF Regional Conference Series in Applied Mathematics.
Whittle, P. (1963). Prediction and regulation by linear least-square methods. English Universities Press.
Williams, C. K. I. and C. E. Rasmussen (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8, pp. 514-520. MIT Press.
