On the Use of Evidence in Neural Networks 
David H. Wolpert 
The Santa Fe Institute 
1660 Old Pecos Trail 
Santa Fe, NM 87501 
Abstract 
The Bayesian "evidence" approximation has recently been employed to determine the noise and weight-penalty terms used in back-propagation. This paper shows that for neural nets it is far easier to use the exact result than it is to use the evidence approximation. Moreover, unlike the evidence approximation, the exact result neither has to be re-calculated for every new data set, nor requires the running of computer code (the exact result is closed form). In addition, it turns out that the evidence procedure's MAP estimate for neural nets is, in toto, approximation error. Another advantage of the exact analysis is that it does not lead one to incorrect intuition, like the claim that using evidence one can "evaluate different priors in light of the data". This paper also discusses sufficiency conditions for the evidence approximation to hold, why it can sometimes give "reasonable" results, etc.
1 THE EVIDENCE APPROXIMATION 
It has recently become popular to consider the problem of training neural nets from a Bayesian viewpoint (Buntine and Weigend 1991, MacKay 1992). The usual way of doing this starts by assuming that there is some underlying target function f from $R^n$ to $R$, parameterized by an N-dimensional weight vector w. We are provided with a training set L of noise-corrupted samples of f. Our goal is to make a guess for w, basing that guess only on L. Now assume we have i.i.d. additive gaussian noise resulting in $P(L \mid w, \beta) \propto \exp(-\beta \chi^2(w, L))$, where $\chi^2(w, L)$ is the usual sum-squared training set error, and $\beta$ reflects the noise level. Assume further that $P(w \mid \alpha) \propto \exp(-\alpha W(w))$, where $W(w)$ is the sum of the squares of the weights. If the values of $\alpha$ and $\beta$ are known and fixed, to the values $\alpha_t$ and $\beta_t$ respectively, then $P(w) = P(w \mid \alpha_t)$ and $P(L \mid w) = P(L \mid w, \beta_t)$. Bayes' theorem then tells us that the posterior is proportional to the product of the likelihood and the prior, i.e., $P(w \mid L) \propto P(L \mid w) P(w)$. Consequently, finding the maximum a posteriori (MAP) w - the w which maximizes $P(w \mid L)$ - is equivalent to finding the w minimizing $\chi^2(w, L) + (\alpha_t / \beta_t) W(w)$. This can be viewed as a justification for performing gradient descent with weight-decay.
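As an illustration of the equivalence just described, the following is a minimal sketch (not taken from the paper) of gradient descent on $\chi^2(w, L) + (\alpha_t / \beta_t) W(w)$. It assumes a linear model f(x) = w . x standing in for a neural net; the names `fit_map` and `DATA`, the learning rate, and the step count are hypothetical choices made only for this example.

```python
# Sketch only: MAP estimation as gradient descent with weight decay.
# Assumption: a linear model f(x) = w . x replaces the neural net.

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def chi2(w, data):
    """Sum-squared training set error chi^2(w, L)."""
    return sum((y - predict(w, x)) ** 2 for x, y in data)

def weight_penalty(w):
    """W(w): the sum of the squares of the weights."""
    return sum(wi * wi for wi in w)

def map_gradient(w, data, alpha_t, beta_t):
    """Gradient of chi^2(w, L) + (alpha_t / beta_t) * W(w) with respect to w."""
    grad = [2.0 * (alpha_t / beta_t) * wi for wi in w]  # weight-decay term
    for x, y in data:
        residual = predict(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * residual * xj  # data-fit term
    return grad

def fit_map(data, alpha_t, beta_t, n_weights, lr=0.01, steps=5000):
    """Plain gradient descent from w = 0 (hypothetical settings)."""
    w = [0.0] * n_weights
    for _ in range(steps):
        g = map_gradient(w, data, alpha_t, beta_t)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical noise-free training set generated by w = (2, -1).
DATA = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
W_NO_DECAY = fit_map(DATA, alpha_t=0.0, beta_t=1.0, n_weights=2)
W_DECAY = fit_map(DATA, alpha_t=1.0, beta_t=1.0, n_weights=2)
```

With $\alpha_t = 0$ the sketch recovers the generating weights; with $\alpha_t / \beta_t = 1$ the weights shrink toward zero, which is the weight-decay effect of the prior.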
One of the difficulties with the foregoing is that we almost never know $\alpha_t$ and $\beta_t$ in real-world problems. One way to deal with this is to estimate $\alpha_t$ and $\beta_t$, for example via a technique like cross-validation. In contrast, a Bayesian approach to this problem would be to set priors over $\alpha$ and $\beta$, and then examine the consequences for the posterior of w.
This Bayesian approach is the starting point for the "evidence" approximation created by Gull (Gull 1989). One makes three assumptions, for $P(w \mid \gamma)$, $P(L \mid w, \gamma)$, and $P(\gamma)$. (For simplicity of the exposition, from now on the two hyperparameters $\alpha$ and $\beta$ will be expressed as the two components of the single vector $\gamma$.) The quantity of interest is the posterior:
$P(w \mid L) = \int d\gamma \, P(w, \gamma \mid L)$
$\qquad = \int d\gamma \, [\{P(w, \gamma \mid L) / P(\gamma \mid L)\} \times P(\gamma \mid L)]$   (1)
The evidence approximation suggests that if $P(\gamma \mid L)$ is sharply peaked about $\gamma = \gamma'$, while the term in curly brackets is smooth about $\gamma = \gamma'$, then one can approximate the w-dependence of $P(w \mid L)$ as $P(w, \gamma' \mid L) / P(\gamma' \mid L) \propto P(L \mid w, \gamma') P(w \mid \gamma')$. In other words, with the evidence approximation, one sets the posterior by taking $P(w) = P(w \mid \gamma')$ and $P(L \mid w) = P(L \mid w, \gamma')$, where $\gamma'$ is the MAP $\gamma$. $P(L \mid \gamma) = \int dw \, [P(L \mid w, \gamma) P(w \mid \gamma)]$ is known as the "evidence" for L given $\gamma$. For relatively smooth $P(\gamma)$, the peak of $P(\gamma \mid L)$ is the same as the peak of the evidence (hence the name "evidence approximation"). Although the current discussion will only explicitly consider using evidence to set hyperparameters like $\alpha$ and $\beta$, most of what will be said also applies to the use of evidence to set other characteristics of the learner, like its architecture.
MacKay has applied the evidence approximation to finding the posterior for the neural net $P(w \mid \alpha)$ and $P(L \mid w, \beta)$ recounted above combined with a $P(\gamma) = P(\alpha, \beta)$ which is uniform over all $\alpha$ and $\beta$ from 0 to $+\infty$ (MacKay 1992). In addition to the error introduced by the evidence approximation, additional error is introduced by his need to approximate $\gamma'$. MacKay states that although he expects his approximation for $\gamma'$ to be valid, "it is a matter of further research to establish [conditions for] this approximation to be reliable".
2 THE EXACT CALCULATION 
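It is always true that the exact posterior is given by
$P(w) = \int d\gamma \, P(w \mid \gamma) P(\gamma)$,
$P(L \mid w) = \int d\gamma \, \{P(L \mid w, \gamma) \times P(w \mid \gamma) \times P(\gamma)\} / P(w)$;
$P(w \mid L) \propto \int d\gamma \, \{P(L \mid w, \gamma) \times P(w \mid \gamma) \times P(\gamma)\}$   (2)
where the proportionality constant, being independent of w, is irrelevant.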
Using the neural net $P(w \mid \alpha)$ and $P(L \mid w, \beta)$ recounted above, and MacKay's $P(\gamma)$, it is trivial to use equation 2 to calculate that $P(w) \propto [W(w)]^{-(N/2 + 1)}$, where N is the number of weights. Similarly, with m the number of pairs in L, $P(L \mid w) \propto [\chi^2(w, L)]^{-(m/2 + 1)}$. (See (Wolpert 1992) and (Buntine and Weigend 1991), and allow the output values in L to range from $-\infty$ to $+\infty$.) These two results give us the exact expression for the posterior $P(w \mid L)$. In contrast, the evidence-approximated posterior $\propto \exp[-\alpha'(L) W(w) - \beta'(L) \chi^2(w, L)]$.
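For concreteness, the closed-form integrals behind the two power laws can be sketched as follows. The explicit Gaussian normalization constants are an assumption here (they are the standard ones for densities of the stated exponential form over all of $R^N$ and $R^m$, and are not written out in the text); the flat prior over each hyperparameter on $[0, +\infty)$ is MacKay's choice as described above.

```latex
% Assumed normalizations (standard Gaussian forms):
%   P(w | \alpha)   = (\alpha/\pi)^{N/2} \exp(-\alpha W(w)),
%   P(L | w, \beta) = (\beta/\pi)^{m/2}  \exp(-\beta \chi^2(w, L)).
% Both lines use \int_0^\infty x^{s} e^{-a x} dx = \Gamma(s+1) a^{-(s+1)} for a > 0.
\begin{align*}
P(w) &\propto \int_0^\infty d\alpha \, \alpha^{N/2} e^{-\alpha W(w)}
      = \Gamma\!\left(\tfrac{N}{2} + 1\right) [W(w)]^{-(N/2 + 1)}, \\
P(L \mid w) &\propto \int_0^\infty d\beta \, \beta^{m/2} e^{-\beta \chi^2(w, L)}
      = \Gamma\!\left(\tfrac{m}{2} + 1\right) [\chi^2(w, L)]^{-(m/2 + 1)}.
\end{align*}
```

Neither result involves the particular data set beyond its size m and the value of $\chi^2(w, L)$, which is why the formula does not need to be re-derived for new data.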
It is illuminating to compare this exact calculation to the calculation based on the evidence approximation. A good deal of relatively complicated mathematics followed by some computer-based numerical estimation is necessary to arrive at the answer given by the evidence approximation. (This is due to the need to approximate $\gamma'$.) In contrast, to perform the exact calculation one only need evaluate a simple gaussian integral, which can be done in closed form, and in particular one doesn't need to perform any computer-based numerical estimation. In addition, with the evidence procedure $\gamma'$ must be re-evaluated for each new data set, which means that the formula giving the posterior must be re-derived every time one uses a new data set. In contrast, the exact calculation's formula for the posterior holds for any data set; no re-calculations are required. So as a practical tool, the exact calculation is both far simpler and quicker to use than the calculation based on the evidence approximation.
Another advantage of the exact calculation, of course, is that it is exact. Indeed, consider the simple case where the noise is fixed, i.e., $P(\gamma) = P(\gamma_1) \, \delta(\gamma_2 - \beta_t)$, so that the only term we must "deal with" is $\gamma_1 = \alpha$. Set all other distributions as in (MacKay 1992). For this case, the w-dependence of the exact posterior can be quite different from that of the evidence-approximated posterior. In particular, note that the MAP estimate based on the exact calculation is w = 0. This is, of course, a silly answer, and reflects the poor choice of distributions made in (MacKay 1992). In particular, it directly reflects the un-normalizability of MacKay's $P(\alpha)$. However the important point is that this is the exactly correct answer for those distributions. On the other hand, the evidence procedure will result in an MAP estimate of $\mathrm{argmin}_w \, [\chi^2(w, L) + (\alpha'/\beta') W(w)]$, where $\alpha'$ and $\beta'$ are derived from L. This answer will often be far from the correct answer of w = 0. Note also that the evidence approximation's answer will vary, perhaps greatly, with L, whereas the correct answer is L-independent. Finally, since the correct answer is w = 0, the difference between the evidence procedure's answer and the correct answer is equal to the evidence procedure's answer. In other words, although there exist scenarios for which the evidence approximation is valid, neural nets with flat $P(\gamma)$ is not one of them; for this scenario, the evidence procedure's answer is in toto approximation error. (A possible reason for this is presented in section 4.)
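The divergence of the exact posterior at w = 0 in the fixed-noise case can be checked numerically. The sketch below is illustrative only: it assumes a one-weight linear model f(x) = w x and a small made-up data set, and evaluates the log of $\exp(-\beta_t \chi^2(w, L)) \times [W(w)]^{-(N/2 + 1)}$ up to an additive constant. The names `exact_log_posterior`, `DATA`, and `BETA_T` are hypothetical.

```python
import math

# Sketch only: exact posterior for fixed noise beta_t and flat P(alpha),
# on a hypothetical one-weight linear model f(x) = w * x (so N = 1).

def chi2(w, data):
    """Sum-squared training set error for the one-weight model."""
    return sum((y - w * x) ** 2 for x, y in data)

def exact_log_posterior(w, data, beta_t, n_weights=1):
    """log{ exp(-beta_t * chi^2) * W^-(N/2 + 1) }, up to an additive constant."""
    big_w = w * w  # W(w) for a single weight
    return -beta_t * chi2(w, data) - (n_weights / 2.0 + 1.0) * math.log(big_w)

DATA = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # made-up samples near y = 2x
BETA_T = 1.0

# Evaluate at a well-fitting weight and at weights approaching zero.
LOG_POST = {w: exact_log_posterior(w, DATA, BETA_T)
            for w in (2.0, 1e-2, 1e-6, 1e-12)}
```

In this example the log posterior keeps growing as w shrinks toward zero and eventually exceeds its value at the well-fitting weight w = 2, consistent with the statement that the exact MAP estimate is w = 0 regardless of L.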
If one were to use a more reasonable $P(\alpha)$, uniform only from 0 to an upper cut-off $\alpha_{max}$, the results would be essentially the same, for large enough $\alpha_{max}$. The effect on the exact posterior, to first order, is to introduce a small region around w = 0 in which $P(w)$ behaves like a decaying exponential in $W(w)$ (the exponent being set by $\alpha_{max}$) rather than like $[W(w)]^{-(N/2 + 1)}$ (T. Wallstrom, private communication). For large enough $\alpha_{max}$, the region is small enough so that the exact posterior still has a peak very close to 0. On the other hand, for large enough $\alpha_{max}$, there is no change in the evidence procedure's answer. (Generically, the major effect on the evidence procedure of modifying $P(\gamma)$ is not to change its guess for $P(w \mid L)$, but rather to change the associated error, i.e., change whether sufficiency conditions for the validity of the approximation are met. See below.) Even with a normalizable prior, the evidence procedure's answer is still essentially all approximation error.
Consider again the case where the prior over both $\alpha$ and $\beta$ is uniform. With the evidence approximation, the log of the posterior is $-\{\chi^2(w, L) + (\alpha'/\beta') W(w)\}$, where $\alpha'$ and $\beta'$ are set by the data. On the other hand, the exact calculation shows that the log of the posterior is really given by $-\{\ln[\chi^2(w, L)] + ((N+2)/(m+2)) \ln[W(w)]\}$ (in both cases up to an overall positive factor and an additive constant). What's interesting about this is not simply the logarithms, absent from the evidence approximation's answer, but also the factor multiplying the term involving the "weight penalty" quantity $W(w)$. In the evidence approximation, this factor is data-dependent, whereas in the exact calculation it only depends on the number of data. Moreover, the value of this factor in the exact calculation tells us that if the number of weights increases, or alternatively the number of training examples decreases, the "weight penalty" term becomes more important, and fitting the training examples becomes less important. (It is not at all clear that this trade-off between N and m is reflected in $(\alpha'/\beta')$, the corresponding factor from the evidence approximation.) As before, if we have upper cut-offs on $P(\gamma)$, so that the MAP estimate may be reasonable, things don't change much. For such a scenario, the N vs. m trade-off governing the relative importance of $W(w)$ and $\chi^2(w, L)$ still holds, but only to lowest order, and only in the region sufficiently far from the ex-singularities (like w = 0) so that $P(w \mid L)$ behaves like $[W(w)]^{-(N/2 + 1)} \times [\chi^2(w, L)]^{-(m/2 + 1)}$.
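The N vs. m trade-off can be made concrete with a few lines of code. This is a sketch under the flat-prior assumptions of this section; the function names are hypothetical, and both objectives are written only up to an overall positive factor and additive constant.

```python
import math

# Sketch only: the two objectives compared in the text, as functions of
# the scalar values chi^2(w, L) and W(w).

def exact_penalty_factor(n_weights, m_examples):
    """Factor (N + 2) / (m + 2) multiplying ln W(w) in the exact log posterior."""
    return (n_weights + 2) / (m_examples + 2)

def exact_neg_log_posterior(chi2_value, w_penalty, n_weights, m_examples):
    """ln chi^2 + ((N + 2) / (m + 2)) ln W: depends on the data only via chi^2 and m."""
    factor = exact_penalty_factor(n_weights, m_examples)
    return math.log(chi2_value) + factor * math.log(w_penalty)

def evidence_neg_log_posterior(chi2_value, w_penalty, alpha_prime, beta_prime):
    """chi^2 + (alpha'/beta') W: alpha' and beta' must be re-estimated for each L."""
    return chi2_value + (alpha_prime / beta_prime) * w_penalty

# Penalty factor for a few hypothetical (N, m) combinations.
FACTORS = {(n, m): exact_penalty_factor(n, m)
           for n in (10, 100) for m in (50, 500)}
```

In this sketch the factor grows when N increases at fixed m and shrinks when m increases at fixed N, matching the trade-off described above.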
All of this notwithstanding, the evidence approximation has been reported to give good results in practice. This should not be all that surprising. There are many procedures which are formally illegal but which still give reasonable advice. (Some might classify all of non-Bayesian statistics that way.) The evidence procedure fixes $\gamma$ to a single value, essentially by maximum likelihood. That's not unreasonable, just usually illegal (as well as far more laborious than the correct Bayesian procedure).
In addition, the tests of the evidence approximation reported in (MacKay 1992) are not all that convincing. For paper 1, the evidence approximation gives $\alpha' = 2.5$. For any other $\alpha$ in an interval extending three orders of magnitude about this $\alpha'$, test set error is essentially unchanged (see figure 5 of (MacKay 1992)). Since such error is what we're ultimately interested in, this is hardly a difficult test of the evidence approximation. In paper 2 of (MacKay 1992) the initial use of the evidence approximation is "a failure of Bayesian prediction"; $P(\gamma \mid L)$ doesn't correlate with test set error (see figure 7). MacKay addresses this by arguing that poor Bayesian results are never wrong, but only "an opportunity to learn" (in contrast to poor non-Bayesian results?). Accordingly, he modifies the system while looking at the test set, to get his desired correlation on the test set. To do this legally, he should have instead modified his system while looking at a validation set, separate from the test set. However if he had done that, it would have raised the question of why one should use evidence at all; since one is already assuming that behavior on a validation set corresponds to behavior on a test set, why not just set $\alpha$ and $\beta$ via cross-validation?
3 EVIDENCE AND THE PRIOR 
Consider again equation 1. Since $\gamma'$ depends on the data L, it would appear that when the evidence approximation is valid, the data determines the prior, or as MacKay puts it, "the modern Bayesian ... does not assign the priors - many different priors can be ... compared in the light of the data by evaluating the evidence" (MacKay 1992). If this were true, it would remove perhaps the major objection which has been raised concerning Bayesian analysis - the need to choose priors in a subjective manner, independent of the data. However the exact $P(w)$ given by equation 2 is data-independent. So one has chosen the prior, in a subjective way. The evidence procedure is simply providing a data-dependent approximation to a data-independent quantity. In no sense does the evidence procedure allow one to side-step the need to make subjective assumptions which fix $P(w)$.
Since the true $P(w)$ doesn't vary with L whereas the evidence approximation's $P(w)$ does, one might suspect that that approximation to $P(w)$ can be quite poor, even when the evidence approximation to the posterior is good. Indeed, if $P(w \mid \gamma_1)$ is exponential, there is no non-pathological scenario for which the evidence approximation to $P(w)$ is correct:
Theorem 1: Assume that $P(w \mid \gamma_1) \propto e^{-\gamma_1 U(w)}$. Then the only way that one can have $P(w) \propto e^{-\alpha U(w)}$ for some constant $\alpha$ is if $P(\gamma_1) = 0$ for all $\gamma_1 \ne \alpha$.
Proof: Our proposed equality is $\exp(-\alpha \times U) = \int d\gamma_1 \, \{P(\gamma_1) \times \exp(-\gamma_1 \times U)\}$ (the normalization factors having all been absorbed into $P(\gamma_1)$). We must find an $\alpha$ and a normalizable $P(\gamma_1)$ such that this equality holds for all allowed U. Let u be such an allowed value of U. Take the derivative with respect to U of both sides of the proposed equality t times, and evaluate for U = u. The result is $\alpha^t = \int d\gamma_1 \, ((\gamma_1)^t \times R(\gamma_1))$ for any integer $t \ge 0$, where $R(\gamma_1) \equiv P(\gamma_1) \exp(u(\alpha - \gamma_1))$. Using this, we see that $\int d\gamma_1 \, ((\gamma_1 - \alpha)^2 \times R(\gamma_1)) = 0$. Since both $R(\gamma_1)$ and $(\gamma_1 - \alpha)^2$ are nowhere negative, this means that for all $\gamma_1$ for which $(\gamma_1 - \alpha)^2 \ne 0$, $R(\gamma_1)$ must equal zero. Therefore $R(\gamma_1)$ must equal zero for all $\gamma_1 \ne \alpha$. QED.
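The step from the moment conditions to the vanishing integral in the proof of theorem 1 can be written out explicitly. The following worked step uses only the three lowest moments, where the t = 0 case is just the proposed equality itself evaluated at U = u and divided by $e^{-\alpha u}$.

```latex
% Moments used: \int d\gamma_1 \, \gamma_1^t R(\gamma_1) = \alpha^t for t = 0, 1, 2,
% with R(\gamma_1) = P(\gamma_1) \exp(u(\alpha - \gamma_1)) \ge 0.
\begin{align*}
\int d\gamma_1 \, (\gamma_1 - \alpha)^2 R(\gamma_1)
  &= \int d\gamma_1 \, \gamma_1^2 R(\gamma_1)
     - 2\alpha \int d\gamma_1 \, \gamma_1 R(\gamma_1)
     + \alpha^2 \int d\gamma_1 \, R(\gamma_1) \\
  &= \alpha^2 - 2\alpha \cdot \alpha + \alpha^2 \cdot 1 = 0 .
\end{align*}
```

In other words, $R$ behaves like a distribution with mean $\alpha$ and zero variance, which forces all of its mass onto $\gamma_1 = \alpha$.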
Since the evidence approximation for the prior is always wrong, how can its approximation for the posterior ever be good? To answer this, write $P(w \mid L) = P(L \mid w) \times [P'(w) + E(w)] / P(L)$, where $P'(w)$ is the evidence approximation to $P(w)$. (It is assumed that we know the likelihood exactly.) This means that $P(w \mid L) - \{P(L \mid w) \times P'(w) / P(L)\}$, the error in the evidence procedure's estimate for the posterior, equals $P(L \mid w) \times E(w) / P(L)$. So we can have arbitrarily large $E(w)$ and not introduce sizable error into the posterior of w, but only for those w for which $P(L \mid w)$ is small. As L varies, the w with non-negligible likelihood vary, and the $\gamma$ such that for those w $P(w \mid \gamma)$ is a good approximation to $P(w)$ varies. When it works, the $\gamma'$ given by the evidence approximation reflects this changing of $\gamma$ with L.
4 SUFFICIENCY CONDITIONS FOR EVIDENCE TO WORK 
Note that regardless of how peaked the evidence is, $-\{\chi^2(w, L) + (\alpha'/\beta') W(w)\} \ne -\{\ln[\chi^2(w, L)] + ((N+2)/(m+2)) \ln[W(w)]\}$; the evidence approximation always has non-negligible error for neural nets used with flat $P(\gamma)$. To understand this, one must carefully elucidate a set of sufficiency conditions for the evidence approximation to be valid. (Unfortunately, this has never been done before. A direct consequence is that no one has ever checked, formally, that a particular use of the evidence approximation is justified.)
One such set of sufficiency conditions, the one implicit in all attempts to date to justify the evidence approximation (i.e., the one implicit in the logic of equation 1), is the following:
(i) $P(\gamma \mid L)$ is sharply peaked about a particular $\gamma$, $\gamma'$.
(ii) $P(w, \gamma \mid L) / P(\gamma \mid L)$ varies slowly around $\gamma = \gamma'$.
(iii) $P(w, \gamma \mid L)$ is infinitesimal for all $\gamma$ sufficiently far from $\gamma'$.
Formally, condition (iii) can be taken to mean that there exists a not too large positive constant k, and a small positive constant $\delta$, such that $| P(w \mid L) - k \int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, P(w, \gamma \mid L) |$ is bounded by a small constant $\varepsilon$ for all w. (As stated, (iii) has k = 1. This will almost always be the case in practice and will usually be assumed, but it is not needed to prove theorem 2.) Condition (ii) can be taken to mean that across $[\gamma' - \delta, \gamma' + \delta]$, $|P(w \mid \gamma, L) - P(w \mid \gamma', L)| < \tau$, for some small positive constant $\tau$, for all w. (Here and throughout this paper, when $\gamma$ is multi-dimensional, "$\delta$" is taken to be a small positive vector.)
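Theorem 2: When conditions (i), (ii), and (iii) hold, $P(w \mid L) \cong P(L \mid w, \gamma') \times P(w \mid \gamma')$, up to an (irrelevant) overall proportionality constant.
Proof: Condition (iii) gives $| P(w \mid L) - k \int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, [P(w \mid \gamma, L) \times P(\gamma \mid L)] | < \varepsilon$ for all w. However $| k \int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, [P(w \mid \gamma, L) \times P(\gamma \mid L)] - k P(w \mid \gamma', L) \int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, P(\gamma \mid L) | < \tau k \times \int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, P(\gamma \mid L)$, by condition (ii). If we now combine these two results, we see that $| P(w \mid L) - k P(w \mid \gamma', L) \int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, P(\gamma \mid L) | < \varepsilon + \tau k \times \int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, P(\gamma \mid L)$. Since the integral is bounded by 1, $| P(w \mid L) - k P(w \mid \gamma', L) \int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, P(\gamma \mid L) | < \varepsilon + \tau k$. Since the integral is independent of w, up to an overall proportionality constant (that integral times k) the w-dependence of $P(w \mid L)$ can be approximated by that of $P(w \mid \gamma', L) \propto P(L \mid w, \gamma') \times P(w \mid \gamma')$, incurring error less than $\varepsilon + \tau k$. Take k not too large and both $\varepsilon$ and $\tau$ small. QED.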
Note that the proof would go through even if $P(\gamma \mid L)$ were not peaked about $\gamma'$, or if $P(\gamma \mid L)$ were peaked about some point far from the $\gamma'$ for which (ii) and (iii) hold; nowhere in the proof is the definition of $\gamma'$ from condition (i) used. However in practice, when condition (iii) is met, k = 1, $P(\gamma \mid L)$ falls to 0 outside of $[\gamma' - \delta, \gamma' + \delta]$, and $P(w \mid \gamma, L)$ stays reasonably bounded for all such $\gamma$. (If this weren't the case, then $P(w \mid \gamma, L)$ would have to fall to 0 outside of $[\gamma' - \delta, \gamma' + \delta]$, something which is rarely true.) So we see that we could either just give conditions (ii) and (iii), or we could give (i), (ii), and the extra condition that $P(w \mid \gamma, L)$ is bounded small enough so that condition (iii) is met. (In addition, one can prove that if the evidence approximation is valid, then conditions (i) and (ii) give condition (iii).)
In any case, it should be noted that conditions (i) and (ii) by themselves are not sufficient for the evidence approximation to be valid. To see this, have w be one-dimensional, and let $P(w, \gamma \mid L) = 0$ both for $\{|\gamma - \gamma'| < \delta, |w - w^*| < \nu\}$ and for $\{|\gamma - \gamma'| > \delta, |w - w^*| > \nu\}$. Let it be constant everywhere else (within certain bounds of allowed $\gamma$ and w). Then for both $\delta$ and $\nu$ small, conditions (i) and (ii) hold: the evidence is peaked about $\gamma'$, and $\tau = 0$. Yet for the true MAP w, $w^*$, the evidence approximation fails badly. (Generically, this scenario will also result in a big error if rather than using the evidence-approximated posterior to guess the MAP w, one instead uses it to evaluate the posterior-averaged f, $\int df \, f \, P(f \mid L)$.)
Gull mentions only condition (i). MacKay also mentions condition (ii), but not condition (iii). Neither author plugs in for $\varepsilon$ and $\tau$, or in any other way uses their distributions to infer bounds on the error accompanying their use of the evidence approximation.
Since by (i) $P(\gamma \mid L)$ is sharply peaked about $\gamma'$, one would expect that for (ii) to hold $P(w, \gamma \mid L)$ must also be sharply peaked about $\gamma'$. Although this line of reasoning can be formalized, it turns out to be easier to prove the result using sufficiency condition (iii):
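Theorem 3: If condition (iii) holds, then for all w such that $P(w \mid L) > c > \varepsilon$, for each component i of $\gamma$, $P(w, \gamma_i \mid L)$ must have a $\gamma_i$-peak somewhere within $\delta_i [1 + 2\varepsilon / (c - \varepsilon)]$ of $(\gamma')_i$.
Proof: Condition (iii) with k = 1 tells us that $P(w \mid L) - \int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, P(w, \gamma \mid L) < \varepsilon$. Extending the integrals over $\gamma_{j \ne i}$ gives $P(w \mid L) - \int_{(\gamma' - \delta)_i}^{(\gamma' + \delta)_i} d\gamma_i \, P(w, \gamma_i \mid L) < \varepsilon$. From now on the i subscript on $\gamma$ and $\delta$ will be implicit. We have $\varepsilon > \int_{\gamma' + \delta}^{\gamma' + \delta + r} d\gamma \, P(w, \gamma \mid L)$ for any scalar r > 0. Assume that $P(w, \gamma \mid L)$ doesn't have a peak anywhere in $[\gamma' - \delta, \gamma' + \delta + r]$. Without loss of generality, assume also that $P(w, \gamma' + \delta \mid L) > P(w, \gamma' - \delta \mid L)$. These two assumptions mean that for any $\gamma \in [\gamma' + \delta, \gamma' + \delta + r]$, the value of $P(w, \gamma \mid L)$ exceeds the maximal value it takes on in the interval $[\gamma' - \delta, \gamma' + \delta]$. Therefore $\int_{\gamma' + \delta}^{\gamma' + \delta + r} d\gamma \, P(w, \gamma \mid L) > (r / 2\delta) \times \int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, P(w, \gamma \mid L)$. This means that $\int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, P(w, \gamma \mid L) < 2\delta\varepsilon / r$. But since $P(w \mid L) < \varepsilon + \int_{\gamma' - \delta}^{\gamma' + \delta} d\gamma \, P(w, \gamma \mid L)$, this means that $P(w \mid L) < \varepsilon (1 + 2\delta / r)$. So if $P(w \mid L) > c > \varepsilon$, $r < 2\delta\varepsilon / (c - \varepsilon)$, and there must be a peak of $P(w, \gamma \mid L)$ within $\delta (1 + 2\varepsilon / (c - \varepsilon))$ of $\gamma'$. QED.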
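To get a feel for how tight the bound of theorem 3 is, the window $\delta_i [1 + 2\varepsilon / (c - \varepsilon)]$ can be evaluated for sample values. The sketch below is purely arithmetic; the function name and the numbers plugged in are hypothetical.

```python
# Sketch only: the half-width of the window in which Theorem 3 says a
# gamma_i-peak of P(w, gamma_i | L) must lie, around (gamma')_i.

def peak_window_halfwidth(delta_i, eps, c):
    """Return delta_i * [1 + 2*eps / (c - eps)]; requires c > eps > 0."""
    if not (c > eps > 0.0):
        raise ValueError("Theorem 3 needs P(w | L) > c > eps > 0")
    return delta_i * (1.0 + 2.0 * eps / (c - eps))

# Hypothetical values: evidence window delta_i = 0.5, error bound eps = 0.01,
# and posterior threshold c = 0.11 for a "non-negligible" w.
HALFWIDTH = peak_window_halfwidth(delta_i=0.5, eps=0.01, c=0.11)
```

For these values the window is only 20% wider than $\delta_i$ itself, and it shrinks toward $\delta_i$ as $\varepsilon$ becomes small relative to c.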
So for those w with non-negligible posterior, for $\varepsilon$ small, the $\gamma$-peak of $P(w, \gamma \mid L) \propto P(L \mid w, \gamma) \times P(w \mid \gamma) \times P(\gamma)$ must lie essentially within the peak of $P(\gamma \mid L)$. Therefore:
Theorem 4: Assume that $P(w \mid \gamma_1) = \exp(-\gamma_1 U(w)) / Z_1(\gamma_1)$ for some function U(.), $P(L \mid w, \gamma_2) = \exp(-\gamma_2 V(w, L)) / Z_2(\gamma_2, w)$ for some function V(., .), and $P(\gamma) = P(\gamma_1) P(\gamma_2)$. (The $Z_i$ act as normalization constants.) Then if condition (iii) holds, for all w with non-negligible posterior the $\gamma$-solution to the equations
$-U(w) + \partial_{\gamma_1} [\ln(P(\gamma_1)) - \ln(Z_1(\gamma_1))] = 0$
$-V(w, L) + \partial_{\gamma_2} [\ln(P(\gamma_2)) - \ln(Z_2(\gamma_2, w))] = 0$
must lie within the $\gamma$-peak of $P(\gamma \mid L)$.
Proof: $P(w, \gamma \mid L) \propto \{P(\gamma_1) \times P(\gamma_2) \times \exp[-\gamma_1 U(w) - \gamma_2 V(w, L)]\} / \{Z_1(\gamma_1) \times Z_2(\gamma_2, w)\}$. For both i = 1 and i = 2, evaluate $\partial_{\gamma_i} \{\int d\gamma_{j \ne i} \, P(w, \gamma \mid L)\}$, and set it equal to zero. This gives the two equations. Now define "the $\gamma$-peak of $P(\gamma \mid L)$" to mean a cube with i-component width $\delta_i [1 + 2\varepsilon / (c - \varepsilon)]$, centered on $\gamma'$, where having a "non-negligible posterior" means $P(w \mid L) > c$. Applying theorem 3, we get the result claimed. QED.
In particular, in MacKay's scenario, $P(\gamma)$ is uniform, $U(w) = W(w) = \sum_{i=1}^{N} (w_i)^2$, and $V(w, L) = \chi^2(w, L)$. Therefore $Z_1$ and $Z_2$ are proportional to $(\gamma_1)^{-N/2}$ and $(\gamma_2)^{-m/2}$ respectively. This means that if the vector $\{\gamma_1, \gamma_2\} = \{N / [2W(w)], m / [2\chi^2(w, L)]\}$ does not lie within the peak of the evidence for the MAP w, condition (iii) does not hold. That $\gamma_1 / \gamma_2$ must approximately equal $[N \chi^2(w, L)] / [m W(w)]$ should not be too surprising. If we set the w-gradient of both the evidence-approximated and exact $P(w \mid L)$ to zero, and demand that the same w, w', solves both equations, we get $\gamma_1 / \gamma_2 = [(N + 2) \chi^2(w', L)] / [(m + 2) W(w')]$. (Unfortunately, if one continues and evaluates $\partial_{w_i} \partial_{w_j} P(w \mid L)$ at w', often one finds that it has opposite signs for the two posteriors - a graphic failure of the evidence approximation.)
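The values quoted for MacKay's scenario follow directly from the two equations of theorem 4. The worked derivation below is a sketch under the assumptions stated in the text (flat $P(\gamma)$, $U = W$, $V = \chi^2$, and $Z_1$, $Z_2$ proportional to $(\gamma_1)^{-N/2}$ and $(\gamma_2)^{-m/2}$); the gradient-matching lines additionally assume that the gradient of $W$ does not vanish at w'.

```latex
% Flat P(\gamma): \partial_{\gamma_i} \ln P(\gamma_i) = 0.
% -\ln Z_1 = (N/2) \ln \gamma_1 + const,  -\ln Z_2 = (m/2) \ln \gamma_2 + const.
\begin{align*}
-W(w) + \frac{N}{2\gamma_1} = 0 \;&\Rightarrow\; \gamma_1 = \frac{N}{2 W(w)}, \\
-\chi^2(w, L) + \frac{m}{2\gamma_2} = 0 \;&\Rightarrow\; \gamma_2 = \frac{m}{2 \chi^2(w, L)}, \\
\frac{\gamma_1}{\gamma_2} &= \frac{N \, \chi^2(w, L)}{m \, W(w)} .
\end{align*}
% Gradient matching at a common stationary point w':
%   evidence-approximated: \nabla \chi^2 + (\gamma_1/\gamma_2) \nabla W = 0,
%   exact: \nabla \chi^2 / \chi^2 + ((N+2)/(m+2)) \nabla W / W = 0.
\begin{align*}
\frac{\gamma_1}{\gamma_2} = \frac{(N + 2) \, \chi^2(w', L)}{(m + 2) \, W(w')} .
\end{align*}
```

The two ratios agree to leading order when N and m are both large, which is the sense in which the first result "should not be too surprising".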
It is not clear from the provided neural net data whether this condition is met in (MacKay 1992). However it appears that the corresponding condition is not met, for $\gamma_1$ at least, for the scenario in (Gull 1989) in which the evidence approximation is used with U(.) being the entropy. (See (Strauss et al. 1993, Wolpert et al. 1993).) Since conditions (i) through (iii) are sufficient conditions, not necessary ones, this does not prove that Gull's use of evidence is invalid. (It is still an open problem to delineate the full iff for when the evidence approximation is valid, though it appears that matching of peaks as in theorem 3 is necessary. See (Wolpert et al. 1993).) However this does mean that the justification offered by Gull for his use of evidence is apparently invalid. It might also help explain why Gull's results were "visually disappointing and ... clearly ... 'over-fitted'", to use his terms.
The first equation in theorem 4 can be used to set restrictions on the set of w which both have non-negligible posterior and for which condition (iii) holds. Consider for example MacKay's scenario, where that equation says that $N / [2W(w)]$ must lie within the width of the evidence peak. If the evidence peak is sharp, this means that unless all w with non-negligible posterior have essentially the same $W(w)$, condition (iii) cannot hold for all of them.
Finally, if for some reason one wishes to know $\gamma'$, theorem 4 can sometimes be used to circumvent the common difficulty of evaluating $P(\gamma \mid L)$. To do this, one assumes that conditions (i) through (iii) hold. Then one finds any w with a non-negligible posterior (say by use of the evidence approximation coupled with approximations to $P(\gamma \mid L)$) and uses it in theorem 4 to find a $\gamma$ which must lie within the peak of $P(\gamma \mid L)$, and therefore must lie close to the correct value of $\gamma'$.
To summarize, there might be scenarios in which the exact calculation of the quantity of interest is intractable, so that some approximation like evidence is necessary. Alternatively, if one's choice of $P(w \mid \gamma)$, $P(\gamma)$, and $P(L \mid w, \gamma)$ is poor, the evidence approximation would be useful if the error in that approximation somehow "cancels" error in the choice of distributions. However if one believes one's choice of distributions, and if the quantity of interest is $P(w \mid L)$, then at a minimum one should check conditions (i) through (iii) before using the evidence approximation. When one is dealing with neural nets, one needn't even do that; the exact calculation is quicker and simpler than using the evidence approximation.
Acknowledgments 
This work was done at the SFI and was supported in part by NLM grant F37 LM00011. I 
would like to thank Charlie Strauss and Tim Wallstrom for stimulating discussion. 
References 
Buntine, W., Weigend, A. (1991). Bayesian back-propagation. Complex Systems, 5, 603.
Gull, S.F. (1989). Developments in maximum entropy data analysis. In "Maximum-entropy and Bayesian methods", J. Skilling (Ed.). Kluwer Academic Publishers.
MacKay, D.J.C. (1992). Bayesian Interpolation. A Practical Framework for Backpropagation Networks. Neural Computation, 4, 415 and 448.
Strauss, C.E.M., Wolpert, D.H., Wolf, D.R. (1993). Alpha, Evidence, and the Entropic Prior. In "Maximum-entropy and Bayesian methods", A. Mohammed-Djafari (Ed.). Kluwer Academic Publishers. In press.
Wolpert, D.H. (1992). A Rigorous Investigation of "Evidence" and "Occam Factors" in Bayesian Reasoning. SFI TR 92-03-13. Submitted.
Wolpert, D.H., Strauss, C.E.M., Wolf, D.R. (1993). On evidence and the marginalization of alpha in the entropic prior. In preparation.
