Acquisition in Autoshaping 
Sham Kakade Peter Dayan 
Gatsby Computational Neuroscience Unit 
17 Queen Square, London, England, WC1N 3AR. 
sham@gatsby. ucl. ac. uk dayan@gatsby. ucl. ac. uk 
Abstract 
Quantitative data on the speed with which animals acquire behav- 
ioral responses during classical conditioning experiments should 
provide strong constraints on models of learning. However, most 
models have simply ignored these data; the few that have attempt- 
ed to address them have failed by at least an order of magnitude. 
We discuss key data on the speed of acquisition, and show how to 
account for them using a statistically sound model of learning, in 
which differential reliabilities of stimuli play a crucial role. 
1 Introduction 
Conditioning experiments probe the ways that animals make predictions about 
rewards and punishments and how those predictions are used to their advantage. 
Substantial quantitative data are available as to how pigeons and rats acquire con- 
ditioned responses during autoshaping, which is one of the simplest paradigTns 
of classical conditioning. 4 These data are revealing about the statistical, and ulti- 
mately also the neural, substrate underlying the ways that animals learn about the 
causal texture of their environments. 
In autoshaping experiments on pigeons, the birds acquire a peck response to a 
lighted key associated (irrespective of their actions) with the delivery of food. One 
attractive feature of autoshaping is that there is no need for separate 'probe trials' 
to assess the degree of association formed between the light and the food by the 
animal rather, the rate of key pecking during the light (and before the food) can 
be used as a direct measure of this association. In particular, acquisition speeds are 
often measured by the number of trials until a certain behavioral criterion is met, 
such as pecking during the light on three out of four successive trials. a, 8' 0 
As stressed persuasively by Gallistel & Gibbon 4 (GG; forthcoming), the critical 
feature of autoshaping is that there is substantial experimental evidence on how 
acquisition speed depends on the three critical variables shown in figure 1A. The 
first is l, the inter-trial interval; the second is T, the time during the trial for which 
the light is presented; the third is the training schedule, l/S, which is the fractional 
number of deliveries per light -- some birds were only partially reinforced. 
Figure 1 makes three key points. First, figure lB shows that the median number of 
trials to the acquisition criterion depends on the ratio of I/T, and not on I and T 
separately - experiments reporte d for the same I/2" are actually performed with l 
and T differing by more than an order of magnitude? 8 Second, figure lB shows 
convincingly that the number of reinforcements is approximately inversely pro- 
portional to I/2" the relatively shorter presentation of light, the faster the learn- 
Acquisition in Autoshaping 25 
A B  
light  
\ /x time -' 
reward S = 2 
time 
1 2 5 10 20 50 
I/T 
10o00 
lO 2 3 4 5 lO 
s 
Figure 1: Autoshaping. A) Experimental paradigm. Top: the light is presented for T seconds every C 
seconds and is always followed by the delivery of food (filled circle). Bottom: the food is delivered with 
probability 1IS = 1/2 per trial. In some cases I is stochastic, with the appropriate mean. B) Log-log 
plot 4 of the number of reinforcements to a given acquisition criterion versus the I/T ratio for S = 1. 
The data are median acquisition times from 12 different laboratories. C) Log-log acquisition curves for 
various I/T ratios and S values. The main graph shows trials versus S; the inset shows reinforcements 
versus S. (1999). 
ing. Third, figure 1C shows that partial reinforcement has almost no effect when 
measured as a function of the number of reinforcements (rather than the number 
of trials), 4' 0 since although it takes $ times as many trials to acquire, there are rein- 
forcements on only 1/S trials. Changing $ does not change the effective I/T when 
measured as a function of reinforcements, so this result might actually be expected 
on the basis of figure lB, and we only consider S = 1 in this paper. Altogether, the 
data show that: 
n  300 * T/I (1) 
where n is the number of rewards to the acquisition criterion. Remarkably, these 
effects seem to hold for over an order of magnitude in both I/T and S. 
These quantitative data should be a most seductive target for statistically sound 
models of learning. However, few models have even attempted to capture the 
strong constraints they provide, and those that have attempted, all fail in critical 
aspects. The best of them, rate estimation theory 4 (RET), is closely related to the 
Rescorla-Wagner 13 (RW) model, and actually captures the proportionality in equa- 
tion 1. However, as shown below, RET grossly overestimates the observed speed 
of acquisition (underestimating the proportionality constant). Further, RET is de- 
signed to account for the time at which a particular, standard, acquisition criterion 
is met. Figure 2A shows that this is revealing only about the very early stages of 
learning -- RET is silent about the remainder of the learning curve. 
We look at additional quantitative data on learning, which collectively suggest 
that stimuli compete to predict the delivery of reward. Dayan & Long 3 (DL) dis- 
cussed various statistically inspired competitive models of classical conditioning, 
concluding with one in which stimuli are differently reliable as predictors of re- 
ward. However, DL ignored the data shown in figures 1 and 2, basing their anal- 
ysis on conditioning paradigms in which I/T was not a factor. Figures 1 and 2 
demand a more sophisticated statistical model -- building such a model is the 
focus of this paper. 
2 Rate Estimation Theory 
Gallistel & Gibbon 4 (GG; forthcoming) are amongst the strongest proponents of the 
quantitative relationships in figure 1. To account for them, GG suggest that animals 
are estimating the rates of rewards -- one, ,Xt, for the rate associated with the light 
and another, ,Xb, for the rate associated with the background context. The context is 
the ever-present environment which can itself gain associative value. The overall 
26 S. Kakade and P Dayan 
0.5 
r- 14o 
B 
o 
.120 
_o 80 
 6o 
E 
, 40 
 20 
._c 
E o 
100 200 300 400 0 4 8 64 128 256 1200 
reinforcements prior context reinforcements 
Figure 2: Additional Autoshaping Data. A) Acquisition of keypecking. The figure shows response 
rate versus reinforcements. 6 The acquisition criterion is satisfied at a relatively early time when the 
response curve crosses the acquisition criterion line. B) The effects of prior context reinforcements on 
subsequent acquisition speed. The data are taken from two experiments, 1,2 with I/T= 6. 
predicted reward rate while the light is on is At + Ao, and the rate without the light 
is just Ao. 
The additive form of the model makes it similar to Rescorla-Wagner's 13 (RW) s- 
tandard delta-rule model, for which the net prediction of the expected reward in a 
trial is the sum of the associative values of each active predictor (in this case, the 
context and light). If the rewards are modeled as being just present or absent, the 
expected value for a reward is just its probability of occurrence. Instead, RET uses 
rates, which are just probabilities per unit time. 
GG 4 formulated their model from a frequentist viewpoint. However, it is easier to 
discuss a closely related Bayesian model which suffers from the same underlying 
problem. Instead of using RW's delta-rule for learning the rates, GG assume that 
reinforcements come from a constant rate Poisson process, and make sound statis- 
tical inferences about the rates given the data on the rewards. Using an improper 
flat prior over the rates, we can write the joint distribution as: 
7(AtA0 I data) vc 7(n]A1Aotlto) c (At + A0)"e-(x*+xb)t*e 
(2) 
since all n rewards occur with the light, at rate At + A0. Here, tt = nT is the total 
time the light is on, and to = nI is the total time the light is off. 
GG take the further important step of relating the inferred rates At and A0 to the 
decision of the animals to start responding (ie to satisfy the acquisition criterion). 
GG suggest that acquisition should occur when the animals have strong evidence 
that the fractional increase in the reward rate, whilst the light is on, is greater than 
some threshold. More formally, acquisition should occur when: 
79((At + A0)/A0 > filn) = I- a 
(3) 
where ct is the uncertainty threshold and fi is slightly greater than one, reflect- 
ing the fractional increase. The n that first satisfies equation 3 can be found by 
integrating the joint probability in equation 2. It turns out that n oc tt/to, which 
has the approximate, linear dependence on the ratio I/T (as in figure lB), since 
h/to = nT/n! = T/I. It also has no dependence on partial reinforcement, as 
observed in figure 1C. 
However, even with a very low uncertainty, ct = 0.001, and a reasonable fractional 
increase, fi = 1.5, this model predicts that learning should be more than ten times 
as fast as observed, since we get n  20  T/! as opposed to the 300  T/I observed. 
20 50 
Equation 1 can only be satisfied by setting ct between 10- and 10- (depending 
on the precise values of I/T and fi)! This spells problems for GG as a normafive, 
ideal detector model of learning  it cannot, for instance, be repaired with any 
reasonable prior for the rates, as ct drops drastically with n. In other circumstances, 
A cquisia'on in 4utoshaping 2 7 
though, Gallistel, Mark & King 5 (forthcoming) have shown that animals can be 
ideal detectors of changes in rates. 
One hint of the flaw with GG is that simple manipulations to the context before 
starting autoshaping (in particular extinction) can produce very rapid learning. 2 
More generally, the data show that acquisition speed is strongly controlled by pri- 
or rewards being given only in the context (without the light present). 2 Figure 2B 
shows a parametric study of subsequent acquisition speeds during autoshaping as 
a function of the number of rewards given only with the context. This effect cannot 
simply be modeled by assuming a different prior distribution for the rates (which 
does not fix the problem of the speed of acquisition in any case), since the rate at 
which these prior context rewards were given has little effect on subsequent ac- 
quisition speed for a given number of prior reinforcements. 9 Note that the data in 
figure 2B (ie equation 1) suggest that there were about thirty prior rewards in the 
context -- this is consistent with the experimental procedures used, s-l although 
prior experience was not a carefully controlled factor. 
3 The Competitive Model 
Five sets of constraints govern our new model. First, since animals can be ideal 
detectors of rates in some circumstances? we only consider accounts under which 
their acquisition of responding has a rational statistical basis. Second, the number 
of reinforcements to acquisition must be n  300  T/I, as in equation 1. This re- 
quires that the constant of proportionality should come from rational, not absurd, 
uncertainties. Third, pecking rates after the acquisition criterion is satisfied should 
also follow the form of figure 2A (in the end, we are preventing from a norma five 
account of this by a dearth of data). Fourth, the overall learning speed should be 
strongly affected by the number of prior context rewards (figure 2B), but not by the 
rate at which they were presented. That is, the context, as an established predic- 
tor, regardless of the rate it predicts, should be able to substantially block learning 
to a less established predictor. Finally, the asymptotic accuracy of rate estimates 
should satisfy the substantial experimental data on the intrinsic uncertainty in the' 
predictions in the form of a quantitative account called scalar expectancy theory ? 
(SET). 
In our model, as in DL, an independent prediction of the rate of reward delivery is 
made on the basis of each stimulus that is present (we, for the context; cot for the 
light). These separate predictions are combined based on estimated reliabilities of 
the predictions. Here, we present a heuristic version of a more rigorously specified 
model. 2 
3.1 Rate Predictions 
SET 7 was originally developed to capture the nature of uncertainty in the way 
that animals estimate time intervals. Its most important result is that the standard 
deviation of an estimate is consistently proportional to the mean, even after an 
asymptotic number of presentations of the interval. Since the estimated time to a 
reward is just the inverse rate, asymptotic rate estimates might also be expected 
to have constant coefficients of variation. Therefore, we constrain the standard 
deviations of rate estimates not to drop below a multiple of their means. Evidence 
suggests that this multiple is about 0.2. 7 RET clearly does not satisfy this constraint 
as the joint distribution (equation 2) becomes arbitrarily accurate over time. 
Inspired by Sutton, 14 we consider Kalman filter models for independent log- 
predictions, log Wc (m) and log cot (m), on trial m. The output models for the filters 
28 $. Kakade and P Dayan 
specify the relationship between the predicted and observed rates. We use a simple 
log-normal, Af, approximation (to an underlying truly Poisson model): 
P(o(m) I w(m)) V(w(m) v  
'" , c) P(ol(m) {wl(m)) ", EJV'(wl(m),v) (4) 
where o. (m) is the observed average reward whilst predictor  is present, so if a 
reward occurs with the light in trial m, then or(m) = 1/T and oc(m) = 1/6' (where 
2 
6' = T + I). The values of v. can be determined, from the Poisson model, to be 
2 
v c = v: 1. 
The other part of the Kalman filter is a model of change in the world for the co's: 
logcoc(m) - logcoc(m -1) + ec(m) ec(m)  Af(O, (707 +1)) -) (5) 
logcot(m) = logcot(m - 1) + et(m) et(m) - A/'(0, (07 + 1)) -) (6) 
We use log(rates) so that there is no inherent scale to change in the world. Here, 
/is a constant chosen to satisfy the SET constraint, imposed as er. = */V at 
asymptote. Notice that r/acts as the effective number of rewards remembered, 
which will be less than 30, to get the observed coefficient of variation above 0.2. 
After observing the data from m trials, the posterior distributions for the predic- 
tions will become approximately: 
P(coc(m) I data) '-A/'(1/C, ac(m)) P(cot(m) { data) ,-,N'(1/T,a(m)) (7) 
and, in about m = /trials, ac(rn) - (1/C)/x/ and at(m) - (liT)Ix/. This 
captures the fastest acquisition in figure 2, and also extinction. 
3.2 Cooperative Mixture of Experts 
The two predictions (equation 7) are combined using the factorial experts model of 
Jacobs et al ii that was also used by DL. For this, during the presentation of the light 
(and the context, of course), we consider that, independently, the relationships 
between the actual reward rate r(m) and the outputs col (m) and coo(m) of 'experts' 
associated with each stimulus are: 
P(cot(m)lr(m))  JV(r(m), m--) ' P(coc(m)lr(m)) ' J(r(m), 1 
(8) 
where pt(m) - and pc(m) - are inverse variances, or reliabilities for the stimuli. 
These reliabilities reflect the belief as to how close col(m) and coc(m) are to r(m). 
The estimates are combined, giving 
P(r(m) l cot(m),coc(m))  JV(V(m), (pt(m) + pc(m)) -) 
'(rn) = 7rt(m)wt(m) + (1- 7rt(m))wc(m) 7rt(m) = pt(m)/(pt(m) + pc(m)) 
The prediction of the reward rate without the light rc (m) is determined just by the 
context value coc (m). 
In this formulation, the context can block the light's prediction if it is more reliable 
(Pc >> Pt), since rt  0, making the mean '(m)  coo(m), and this blocking occurs 
regardless of the context's rate, co (m). If Pt slowly increases, then ?(m) - cot slowly 
as rt (m) - 1. We expect this to model the post-acquisition part of the learning 
shown in figure 2A. 
A fully norma five model of acquisition would come from a statistically correct ac- 
count of how the reliabilities should change over time, which, in turn, would come 
from a statistical model of the expectations the animal has of how predictabilities 
change in the world. Unfortunately, the slow phase of learning in figure 2A, which 
should provide the most useful data on these expectations, is almost ubiquitously 
Acquisition in Autoshaping 29 
.... 
100 200 300 400 
reinforcements 
1/C 100 200 300 400 
reinforcements 
1 2 5 10 20 50 
I/T 
Figure 3: Satisfaction of the Constraints. A) The fit to the behavioral response curve (figure 2B), using 
equation 9 and r0 - 0.004. B) Possible acquisition curves showing (rn) versus m. The < > on the 
criterion line denotes the range of 15 to 120 reinforcements that are indicated by figure 2B. The -- 
curve is the same as in Fig 3A. The parameters displayed are values for r0 in multiples of r0 for the 
center curve. C) A theoretical fit to the data using equation 11. Here, a = 5% and r0v/- = 0.004. 
ignored in experiments. We therefore make two assumptions about this, which are 
chosen to fit the acquisition data, but whose normative underpinnings are unclear. 
The first assumption, chosen to obtain the slow learning curve, is that: 
rt(m) = tanh r0m (9) 
Assuming that the strength of the behavioral response is approximately propor- 
tional to r(m) - re (m), which we will estimate by rt (m) (t (m) - (m)), figure 3A 
compares the rate of key pecking in the model with the data from figure 2A. Fig- 
ure 3B shows the effect on the behavioral response of varying r0. Within just a half 
an order magnitude of variation of r0, the acquisition speeds (judged at the criteri- 
on line shown) due to between 1200 and 0 prior context rewards (figure 2B) can be 
obtained. Note the slightly counter-intuitive explanation -- the actual reward rate 
associated with the light is established very quickly -- slow learning comes from 
slow changes in the importance paid to these rates. 
We make a second assumption that the coefficient of variation of the context's pre- 
diction, from equation 8, does not change significantly for the early trials before 
the acquisition criterion is met (it could change thereafter). This gives: 
pc(m)  po/c(rn) 2 for early rn (10) 
It is plausible that the context is not becoming a relatively worse 'expert' for early 
m, since no other predictor has yet proven more reliable. 
Following GG's suggestion, we model acquisition as occurring on trial rn if 
7(r(rn) > rc(rn)ldata ) >_ 1 - c, ie if the animal has sound reasons to expect a 
higher reward rate with the light. Integrating over the Kalman filter distributions 
in equation 7 gives the distribution of r(m) - re(m) for early rn as 
P(r(rn) - re(m) I data) ,,0 A/'((tanh rom)(1/T - l/C), (p0C2) -) 
where a, (m) has dropped out due to rt (m) being small at early m. Finding the 
number of rewards, n, that satisfies the acquisition criterion gives: 
c T 
n m (11) 
I 
where the factor of c depends on the uncertainty, c, used. Figure 3C shows the 
theoretical fit to the data. 
4 Discussion 
Although a noble attempt, RET fails to satisfy the strong body of constraints under 
which any acquisition model must labor. Under RET, the acquisition of respond- 
ing cannot have a rational statistical basis, as the animal's modeled uncertainty in 
30 S. Kakade and P Dayan 
the association between light and reward at the time of acquisition is below 10 -20 . 
Further, RET ignores constraints set forth by the data establishing SET and also 
data on prior context manipulations. These latter data show that the context, re- 
gardless of the rate it predicts, will substantially block learning to a less established 
predictor. Additive models, such as RET, are unable to capture this effect. 
We have suggested a model in which each stimulus is like an 'expert' that learns 
independently about the world. Expert predictions can adapt quickly to changes 
in contingencies, as they are based on a Kalman filter model, with variances chosen 
to satisfy the constraint suggested by SET, and they can be combined based on their 
reliabilities. We have demonstrated the model's close fit to substantial experimental 
data. In particular, the new model captures the I/T dependence of the number 
of rewards to acquisition, with a constant of proportionality that reflects rational 
statistical beliefs. The slow learning that occurs in some circumstances, is due to 
a slow change in the reliabilities of predictors, not due to the rates being unable 
to adapt quickly. Although we have not shown it here, the model is also able 
to account for quantitative data as to the speed of extinction of the association 
between the light and the reward. 
The model leaves many directions for future study. In particular, we have not 
specified a sound statistical basis for the changes in reliabilities given in equation- 
s 9 and 10. Such a basis is key to understanding the slow phase of learning. Second, 
we have not addressed data from more sophisticated conditioning paradigms. For 
instance, overshadowing, in which multiple conditioned stimuli are similarly pre- 
dictive of the reward, should be able to be incorporated into the model in a natural 
way. 
Acknowledgements 
We are most grateful to Randy Gallistel and John Gibbon for freely sharing, prior 
to publication, their many ideas about timing and conditioning. We thank Sam 
Roweis for comments on an earlier version of the manuscript. Funding is from a 
NSF Graduate Research Fellowship (SK) and the Gatsby Charitable Foundation. 
References 
[1] Balsam, PD, & Gibbon, J (1988). Journal of ExperimentaI Psychology: Animal Behavior Processes, 14: 
401-412. 
[2] Balsam, PD, & Schwartz, AL (1981). Journal of Experimental Psychology: Animal Behavior Processes, 
7: 382-393. 
[3] Dayan, P, & Long, T, (1997) Neural Information Processing Systems, 10:117-124. 
[4] Gallistel, CR, & Gibbon, J (1999). Time, Rate, and Conditioning. Forthcoming. 
[5] Gallistel, CR, Mark, TS & King, A (1999). Is the Rat an Ideal Detector of Changes in Rates of Reward?. 
Forthcoming. 
[6] Gamzu, ER, & Williams, DR (1973). Journal of the Experimental Analysis of Behavior, 19:225-232. 
[7] Gibbon, J (1977). Psychological Review 84:279-325. 
[8] Gibbon, J, Baldock, MD, Locurto, C, Gold, L & Terrace, HS (1977). Journal of Experimental Psychol- 
ogy: Animal Behavior Processes, 3: 264-284. 
[9] Gibbon, J & Balsam, P (1981). In CM Locurto, HS Terrace, & J Gibbon, editors, Autoshaping and 
Conditioning Theory. 219-253. New York, NY: Academic Press. 
[10] Gibbon, J, Farrell, L, Locurto, CM, Duncan, JH & Terrace, HS (1980). Animal Learning and Behavior, 
8:45-59. 
[11] Jacobs, RA, Jordan, MI, & Barto, AG (1991). Cognitive Science 15:219-250. 
[12] Kakade, S & Dayan, P (2000). In preparation. 
[13] Rescorla, RA & Wagner, AR (1972). In AH Black & WF Prokasy, editors, Classical Conditioning II: 
Current Research and Theory, 64-69. New York, NY: Appleton-Century-Crofts. 
[14] Sutton, R (1992). In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems. 
