Generalisation of A Class of Continuous 
Neural Networks 
John Shawe-Taylor 
Dept of Computer Science, 
Royal Holloway, University of London, 
Egham, Surrey TW20 0EX, UK 
Emai]: j ohndcs. rhbnc. ac. uk 
Jieyu Zhao* 
IDSIA, Corso Elvezia 36, 
6900-Lugano, Switzerland 
Email: jieyu@carota.idsia.ch
Abstract 
We propose a way of using boolean circuits to perform real valued
computation that naturally extends their boolean functionality.
The functionality of multiple fan-in threshold gates in
this model is shown to mimic that of a hardware implementation
of continuous Neural Networks. A Vapnik-Chervonenkis dimension
and sample size analysis for the systems is performed, giving the best
known sample sizes for a real valued Neural Network. Experimental
results confirm the conclusion that the sample sizes required for
the networks are significantly smaller than for sigmoidal networks.
1 Introduction 
Recent developments in complexity theory have addressed the question of com- 
plexity of computation over the real numbers. More recently attempts have been 
made to introduce some computational cost related to the accuracy of the compu- 
tations [5]. The model proposed in this paper weakens the computational power 
still further by relying on classical boolean circuits to perform the computation
using a simple encoding of the real values. Using this encoding we also show that
$TC^0$ circuits interpreted in the model correspond to a Neural Network design
referred to as Bit Stream Neural Networks, which have been developed for hardware
implementation [8].
With the perspective afforded by the general approach considered here, we are also 
able to analyse the Bit Stream Neural Networks (or indeed any other adaptive sys- 
tem based on the technique), giving VC dimension and sample size bounds for PAC 
learning. The sample sizes obtained are very similar to those for threshold networks, 
*Work performed while at Royal Holloway, University of London 
despite their being derived by very different techniques. They give the best bounds 
for neural networks involving smooth activation functions, being significantly lower 
than the bounds obtained recently for sigmoidal networks [4, 7]. 
We subsequently present simulation results showing that Bit Stream Neural Net- 
works based on the technique can be used to solve a standard benchmark problem. 
The results of the simulations support the theoretical finding that for the same 
sample size generalisation will be better for the Bit Stream Neural Networks than 
for classical sigmoidal networks. It should also be stressed that the approach is 
very general - being applicable to any boolean circuit - and by its definition em- 
ploys compact digital hardware. This fact motivates the introduction of the model, 
though it will not play an important part in this paper. 
2 Definitions and Basic Results 
A boolean circuit is a directed acyclic graph whose nodes are referred to as gates, 
with a single output node of out-degree zero. The nodes with in-degree zero are 
termed input nodes. The nodes that are not input nodes are computational nodes. 
There is a boolean function associated with each computational node of arity equal 
to its in-degree. The function computed by a boolean network is determined by 
assigning (input) values to its input nodes and performing the function at each 
computational node once its input values are determined. The result is the value 
at the output node. The class $TC^0$ is defined to be those functions that can be
computed by a family of polynomially sized Boolean circuits with unrestricted fan-in
and constant depth, where the gates are either NOT or THRESHOLD.
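The evaluation rule described above can be sketched in a few lines. This is a minimal illustration, not part of the original model description: the dictionary-based circuit representation and the names `threshold`, `negate`, `evaluate` and `circuit` are hypothetical choices made for the example.

```python
from typing import Callable, Dict, List, Tuple

# A gate is (boolean function, list of predecessor node names).
# This representation is an assumption made for illustration only.
Gate = Tuple[Callable[[List[int]], int], List[str]]


def threshold(k: int) -> Callable[[List[int]], int]:
    """THRESHOLD gate: outputs 1 when at least k of its inputs are 1."""
    return lambda bits: 1 if sum(bits) >= k else 0


def negate(bits: List[int]) -> int:
    """NOT gate on a single input."""
    return 1 - bits[0]


def evaluate(gates: Dict[str, Gate], inputs: Dict[str, int], output: str) -> int:
    """Evaluate a circuit (a DAG) by memoised recursion from the output node."""
    values = dict(inputs)  # input nodes already have their values

    def value(node: str) -> int:
        if node not in values:
            fn, preds = gates[node]
            values[node] = fn([value(p) for p in preds])
        return values[node]

    return value(output)


# A depth-2 example: majority of three inputs, followed by NOT.
circuit = {
    "maj": (threshold(2), ["x1", "x2", "x3"]),
    "out": (negate, ["maj"]),
}
```

Each computational node is evaluated once its inputs are known, and the value at the designated output node is the result, as in the definition above.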
In order to use the boolean circuits to compute with real numbers we use the method
of stochastic computing to encode real numbers as bit streams. The encoding we
will use is to consider the stream of binary bits, for which the 1's are generated
independently at random with probability $p$, as representing the number $p$. This
is referred to as a Bernoulli sequence of probability $p$. In this representation, the
multiplication of two independently generated streams can be achieved by a simple
AND gate, since the probability of a 1 on the output stream is equal to $p_1 p_2$, where
$p_1$ is the probability of a 1 on the first input stream and $p_2$ is the probability of
a 1 on the second input stream. Hence, in this representation the boolean circuit
consisting of a single AND gate can compute the product of its two inputs.
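The multiplication property can be checked empirically. The sketch below is a software simulation only, with assumed values 0.6 and 0.5 for the two stream probabilities and a fixed seed. The helper names are hypothetical and nothing here models the hardware generation of bit streams.

```python
import random


def bernoulli_stream(p: float, length: int, rng: random.Random) -> list:
    """Generate a bit stream whose bits are 1 independently with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def and_streams(a: list, b: list) -> list:
    """Bitwise AND of two streams, one AND gate operation per time step."""
    return [x & y for x, y in zip(a, b)]


def decode(stream: list) -> float:
    """Estimate the encoded real value as the fraction of 1's in the stream."""
    return sum(stream) / len(stream)


rng = random.Random(0)            # fixed seed so the run is repeatable
n_bits = 200_000                  # assumed stream length for the experiment
s1 = bernoulli_stream(0.6, n_bits, rng)
s2 = bernoulli_stream(0.5, n_bits, rng)
estimate = decode(and_streams(s1, s2))   # should be close to 0.6 * 0.5
```

The estimate fluctuates around the exact product, with a spread that shrinks as the stream length grows.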
More background information about stochastic computing can be found in the work 
of Gaines [1]. The analysis we provide is made by treating the calculations as exact 
real valued computations. In a practical (hardware) implementation real bit streams 
would have to be generated [3] and the question of the accuracy of a delivered result 
arises. 
In the applications considered here the output values are used to determine a binary
value by comparing with a threshold of 0.5. Unless the actual output is exactly 1 or
0 (which can happen), then however many bits are collected at the output there is a
slight probability that an incorrect classification will be made. Hence, the number
of bits required is a function of the difference between the actual output and 0.5
and the level of confidence required in the correctness of the classification.
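One way to make this dependence concrete is Hoeffding's inequality, which is not used in the paper and is introduced here only as an illustrative assumption. If the true output probability is $p$ and $N$ output bits are averaged, the probability that the average falls on the wrong side of 0.5 is at most $\exp(-2N(p - 0.5)^2)$, so $N \ge \ln(1/\delta) / (2(p - 0.5)^2)$ bits suffice for confidence $1 - \delta$. The function name `bits_required` is hypothetical.

```python
import math


def bits_required(p: float, delta: float) -> int:
    """Sufficient number of output bits (by Hoeffding's inequality) so that
    thresholding the empirical mean at 0.5 misclassifies with probability
    at most delta, when the true output probability is p."""
    margin = abs(p - 0.5)
    if margin == 0.0:
        raise ValueError("an output of exactly 0.5 cannot be classified reliably")
    return math.ceil(math.log(1.0 / delta) / (2.0 * margin ** 2))
```

The bound grows as the inverse square of the margin, so outputs close to 0.5 need far longer streams than outputs close to 0 or 1.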
Definition 1 The real function computed by a boolean circuit $C$, which computes
the boolean function
$$f_C : \{0, 1\}^n \to \{0, 1\},$$
is the function
$$g_C : [0, 1]^n \to [0, 1],$$
obtained by coding each input independently as a Bernoulli sequence and interpreting 
the output as a similar sequence. 
Hence, by the discussion above, for the circuit $C$ consisting of a single AND
gate the function $g_C$ is given by $g_C(x_1, x_2) = x_1 x_2$.
We now give a proposition showing that the definition of real computation given 
above is well-defined and generalises the Boolean computation performed by the 
circuit. 
Proposition 2 The bit stream on the output of a boolean circuit computing a real
function is a Bernoulli sequence. The real function $g_C$ computed by an $n$ input
boolean circuit $C$ can be expressed in terms of the corresponding boolean function
$f_C$ as follows:
$$g_C(x) = \sum_{\alpha \in \{0,1\}^n} P_x(\alpha) f_C(\alpha),$$
where
$$P_x(\alpha) = \prod_{i=1}^{n} x_i^{\alpha_i} (1 - x_i)^{(1 - \alpha_i)}.$$
In particular,
$$g_C|_{\{0,1\}^n} = f_C.$$
Proof: The output bit stream is a Bernoulli sequence, since the behaviour at each
time step is independent of the behaviour at previous time steps, assuming the
input sequences are independent. Let the probability of a 1 in the output sequence
be $p$. Hence, $g_C(x) = p$. At any given time the input to the circuit must be one
of the $2^n$ possible binary vectors $\alpha$. $P_x(\alpha)$ gives the probability of the vector $\alpha$
occurring. Hence, the expected value of the output of the circuit is given in the
proposition statement, but by the properties of a Bernoulli sequence this value is
also $p$. The final claim holds since $P_\alpha(\alpha) = 1$, while $P_\alpha(\alpha') = 0$ for $\alpha \neq \alpha'$.
Hence, the function computed by a circuit can be represented by a polynomial of degree
at most $n$, though the representation given above may involve exponentially many terms.
This representation will therefore only be used for theoretical analysis.
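For small $n$ the representation can be evaluated directly. The sketch below weights each binary input vector by the probability that independent Bernoulli inputs produce it and sums the boolean outputs under those weights. The function name `real_function` and the two example circuits are assumptions made for illustration, and the enumeration is exponential in $n$, so it is only a check on the definition.

```python
from itertools import product
from typing import Callable, Sequence


def real_function(f: Callable[[Sequence[int]], int], n: int):
    """Return g with g(x) = sum over binary vectors a of P_x(a) * f(a),
    where P_x(a) is the probability of a under independent inputs x."""

    def g(x: Sequence[float]) -> float:
        total = 0.0
        for alpha in product((0, 1), repeat=n):
            prob = 1.0
            for xi, ai in zip(x, alpha):
                prob *= xi if ai == 1 else (1.0 - xi)
            total += prob * f(alpha)
        return total

    return g


# Single AND gate on two inputs, and a majority gate on three inputs.
g_and = real_function(lambda a: a[0] & a[1], 2)
g_maj = real_function(lambda a: 1 if sum(a) >= 2 else 0, 3)
```

On inputs restricted to 0 and 1 the returned function agrees with the boolean function, and for the AND gate it reduces to the product of the two inputs.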
3 Bit Stream Neural Networks 
In this section we describe a neural network model based on stochastic computing
and show that it corresponds to taking $TC^0$ circuits in the framework considered
in Section 2.
A Stochastic Bit Stream Neuron is a processing unit which carries out very simple 
operations on its input bit streams. All input bit streams are combined with their 
corresponding weight bit streams and then the weighted bits are summed up. The 
final total is compared to a threshold value. If the sum is larger than the threshold 
the neuron gives an output 1, otherwise 0. 
There are two different versions of the Stochastic Bit Stream Neuron corresponding 
to the different data representations. The definitions are given as follows. 
Definition 3 (AND-SBSN): An $n$-input AND version Stochastic Bit Stream Neuron
has $n$ weights in the range $[-1, 1]$ and $n$ inputs in the range $[0, 1]$, which are all
unipolar representations of Bernoulli sequences. An extra sign bit is attached to
each weight Bernoulli sequence. The threshold $\theta$ is an integer lying between $-n$ and $n$
which is randomly generated according to the threshold probability density function
$\phi(\theta)$. The computations performed during each operational cycle are
(1) combining respectively the n bits from n input Bernoulli sequences with the 
corresponding n bits from n weight Bernoulli sequences using the AND operation. 
(2) assigning n weight sign bits to the corresponding output bits of the AND gate, 
summing up all the n signed output bits and then comparing the total with the 
randomly generated threshold value. If the total is not less than the threshold value, 
the AND-SBSN outputs 1, otherwise it outputs 0.
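The output probability of a single AND-SBSN can be computed exactly from this definition. The sketch below assumes that each weight stream carries a 1 with probability equal to the weight's absolute value, that all streams are independent, and that the threshold distribution is supplied as a dictionary from integer thresholds to probabilities. The function names are hypothetical and this is not claimed to be the simulation code used in Section 5.

```python
from typing import Dict, Sequence


def sum_distribution(x: Sequence[float], w: Sequence[float]) -> Dict[int, float]:
    """Distribution of the signed sum of the n AND-gate outputs in one cycle.
    Term i contributes sign(w_i) with probability x_i * |w_i|, else 0."""
    dist = {0: 1.0}
    for xi, wi in zip(x, w):
        q = xi * abs(wi)               # probability the AND gate outputs 1
        step = 1 if wi >= 0 else -1    # sign bit attached to the weight
        new: Dict[int, float] = {}
        for total, prob in dist.items():
            new[total] = new.get(total, 0.0) + prob * (1.0 - q)
            new[total + step] = new.get(total + step, 0.0) + prob * q
        dist = new
    return dist


def output_probability(x: Sequence[float], w: Sequence[float],
                       phi: Dict[int, float]) -> float:
    """Probability that the neuron outputs 1: the signed total is not less
    than a threshold drawn from the distribution phi."""
    dist = sum_distribution(x, w)
    return sum(
        p_theta * sum(p for total, p in dist.items() if total >= theta)
        for theta, p_theta in phi.items()
    )
```

For example, `output_probability((0.5,), (0.5,), {1: 1.0})` is the probability that a single AND gate fires, since the threshold is fixed at 1.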
We can now present the main result characterising the functionality of a Stochastic 
Bit Stream Neural Network as the real function of a $TC^0$ circuit.
Theorem 4 The functionality of a family of feedforward networks of Bit Stream 
Neurons with constant depth organised into layers with interconnections only between
adjacent layers corresponds to the function $g_C$ for a $TC^0$ circuit $C$ of depth
twice that of the network. The number of input streams is equal to the number 
of network inputs while the number of parameters is at most twice the number of 
weights. 
Proof: Consider first an individual neuron. We construct a circuit whose real 
functionality matches that of the neuron. The circuit has two layers. The first 
consists of a series of AND gates. Each gate links one input line of the neuron 
with its corresponding weight input. The outputs of these gates are linked into a 
threshold gate with fixed threshold 2d for the AND-SBSN, where d is the number 
of input lines to the neuron. The threshold distribution of the AND-SBSN is now
simulated by having a series of 2d additional inputs to the threshold gate. The 
number of additional input streams required to simulate the threshold depends on 
how general a distribution is allowed for the threshold. We consider three cases: 
1. If the threshold is fixed (i.e. not programmable), then no additional inputs 
are required, since the actual threshold can be suitably adapted. 
2. If the threshold distribution is always focussed on one value (which can
be varied), then an additional $\lceil \log_2(2d) \rceil$ ($\lceil \log_2(d) \rceil$) inputs are required to
specify the binary value of this number. A circuit feeding the corresponding
number of 1's to the threshold gate is not hard to construct.
3. In the fully general case any series of $2d + 1$ ($d + 1$) numbers summing to
one can be assigned as the probabilities of the possible values
$$\phi(0), \phi(1), \ldots, \phi(t),$$
where $t = 2d$ for the AND-SBSN. We now construct a circuit
which takes $t$ input streams and passes to the threshold gate the 1-bits
of all the inputs up to the first input stream carrying a 0.
No further input is passed to the threshold gate. In other words, the
threshold gate receives $s$ bits of input if and only if input streams
$1, \ldots, s$ have bit 1 and either $s = t$ or input stream $s + 1$ has
input 0.
We now set the probability $p_s$ of stream $s$ as follows:
$$p_1 = 1 - \phi(0), \qquad p_s = \frac{1 - \sum_{i=0}^{s-1} \phi(i)}{1 - \sum_{i=0}^{s-2} \phi(i)} \quad \text{for } s = 2, \ldots, t.$$
With these values the probability of the threshold gate receiving $s$ bits is
$\phi(s)$ as required.
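The construction in case 3 can be checked numerically. The sketch below takes a target distribution over the number of bits received, sets the first stream probability to one minus the mass at zero and each later one to the ratio of successive tail sums, and then recomputes the distribution of the number of leading 1's. The function names and the example distribution are assumptions made for illustration.

```python
from typing import List, Sequence


def stream_probabilities(phi: Sequence[float]) -> List[float]:
    """Given phi[s], the desired probability that the threshold gate receives
    s bits (s = 0..t), return the stream probabilities [p_1, ..., p_t]."""
    t = len(phi) - 1
    probs = []
    tail = 1.0                      # 1 minus the sum of phi[0..s-2]
    for s in range(1, t + 1):
        next_tail = tail - phi[s - 1]
        # If the tail mass is already zero, stream s is never reached,
        # so its probability is arbitrary; 0.0 is used here.
        probs.append(next_tail / tail if tail > 0 else 0.0)
        tail = next_tail
    return probs


def received_distribution(probs: Sequence[float]) -> List[float]:
    """Probability that exactly s leading streams carry a 1 before the
    first stream carrying a 0, for s = 0..t."""
    dist = []
    reach = 1.0                     # probability that streams 1..s are all 1
    for p in probs:
        dist.append(reach * (1.0 - p))
        reach *= p
    dist.append(reach)
    return dist


target = [0.1, 0.2, 0.3, 0.4]       # assumed example distribution, t = 3
recovered = received_distribution(stream_probabilities(target))
```

The product of the first $s$ stream probabilities telescopes to the tail mass beyond $s - 1$, which is why the recovered distribution matches the target.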
This completes the replacement of a single neuron. Clearly, we can replace all 
neurons in a network in the same manner and construct a network with the required 
properties provided connections do not 'shortcut' layers, since this would create 
interactions between bits in different time slots.  
4 VC Dimension and Sample Sizes 
In order to perform a VC Dimension and sample size analysis of the Bit Stream 
Neural Networks described in the previous section we introduce the following general 
framework. 
Definition 5 For a set $\mathcal{G}$ of smooth functions $f : \mathbb{R}^n \times \mathbb{R}^\ell \to \mathbb{R}$, the class $\mathcal{F}$ is
defined as
$$\mathcal{F} = \mathcal{F}_\mathcal{G} = \{ f_w \mid f_w(x) = f(x, w),\ f \in \mathcal{G} \}.$$
The corresponding classification class obtained by taking a fixed set of $s$ of the functions
from $\mathcal{G}$, thresholding the corresponding functions from $\mathcal{F}$ at 0 and combining
them (with the same parameter vector) in some logical formula will be denoted
$H_s(\mathcal{F})$.
We will denote by 
In our case we will consider a set of circuits $\mathcal{C}$ each with $n + \ell$ input connections, $n$
labelled as the input vector and $\ell$ identified as parameter input connections. Note
that if circuits have too few input connections, we can pad them with dummy ones.
The set $\mathcal{G}$ will then be the set
$$\mathcal{G} = \mathcal{G}_\mathcal{C} = \{ g_C \mid C \in \mathcal{C} \},$$
while $\mathcal{F}_{\mathcal{G}_\mathcal{C}}$ will be denoted by $\mathcal{F}_\mathcal{C}$.
We now quote some of the results of [7] which uses the techniques of Karpinski and 
MacIntyre [4] to derive sample sizes for classes of smoothly parametrised functions. 
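Proposition 6 [7] Let $\mathcal{G}$ be the set of polynomials $p$ of degree at most $d$ with
$p : \mathbb{R}^n \times \mathbb{R}^\ell \to \mathbb{R}$ and $\mathcal{F} = \mathcal{F}_\mathcal{G}$.
Hence, there are $\ell$ adjustable parameters and the input dimension is $n$. Then the
VC-dimension of the class $H(\mathcal{F}_\mathcal{G})$ is bounded above by
$$\log_2\bigl(2(2d)^\ell\bigr) + 17 \ell \log_2(s).$$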
Corollary 7 For a set of circuits $\mathcal{C}$, with $n$ input connections and $\ell$ parameter
connections, the VC-dimension of the class $H(\mathcal{F}_\mathcal{C})$ is bounded above by
$$1 + \ell \bigl(1 + \log_2(n + \ell) + 17 \log_2(s)\bigr).$$
Proof: By Proposition 2 the function $g_C$ computed by a circuit $C$ with $t$ input
connections has the form
$$g_C(x) = \sum_{\alpha \in \{0,1\}^t} P_x(\alpha) f_C(\alpha),$$
where
$$P_x(\alpha) = \prod_{i=1}^{t} x_i^{\alpha_i} (1 - x_i)^{(1 - \alpha_i)}.$$
Hence, $g_C(x)$ is a polynomial of degree at most $t$. In the case considered the number $t$ of
input connections is $n + \ell$. The result follows from the proposition.
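The substitution behind Corollary 7 can be written out explicitly. The following worked step reads the bound of Proposition 6 as $\log_2(2(2d)^\ell) + 17 \ell \log_2(s)$, with the constant 17 taken as it appears in the text, and sets $d = n + \ell$:

```latex
\begin{align*}
\log_2\bigl(2(2d)^{\ell}\bigr) + 17\,\ell \log_2(s)
  &= 1 + \ell \log_2(2d) + 17\,\ell \log_2(s) \\
  &= 1 + \ell\bigl(1 + \log_2(d)\bigr) + 17\,\ell \log_2(s) \\
  &= 1 + \ell\bigl(1 + \log_2(n + \ell) + 17 \log_2(s)\bigr)
  \qquad \text{with } d = n + \ell .
\end{align*}
```

The bound therefore grows as $\ell \log_2(n + \ell)$ in the number of parameters and inputs.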
Proposition 8 [7] Let $\mathcal{G}$ be the set of polynomials $p$ of degree at most $d$ with
$p : \mathbb{R}^n \times \mathbb{R}^\ell \to \mathbb{R}$ and $\mathcal{F} = \mathcal{F}_\mathcal{G}$.
Hence, there are $\ell$ adjustable parameters and the input dimension is $n$. If a function
$h \in H_s(\mathcal{F})$ correctly computes a function on a sample of $m$ inputs drawn independently
according to a fixed probability distribution, where
1[ (4e v/)(2/(-1))] 
re>too(e, 5)_ e(1 -v ) 21n +In  
then with probability at least $1 - \delta$ the error rate of $h$ will be less than $\epsilon$ on inputs
drawn according to the same distribution.
Corollary 9 For a set of circuits $\mathcal{C}$, with $n$ input connections and $\ell$ parameter
connections, if a function $h \in H_s(\mathcal{F}_\mathcal{C})$ correctly computes a function on a sample
of $m$ inputs drawn independently according to a fixed probability distribution, where
1[ /4ev/s( + ))/2/(- 1))] 
m > mo(e, 5 )---- e(1-v/) 21n +In 
then with probability at least $1 - \delta$ the error rate of $h$ will be less than $\epsilon$ on inputs
drawn according to the same distribution.
Proof: As in the proof of the previous corollary, we need only observe that the
functions $g_C$ for $C \in \mathcal{C}$ are polynomials of degree at most $n + \ell$.
Note that the best known sample sizes for threshold networks are given in [6]: 
m > too(e, 5 ) = e(1 -- v/) 2Wln + In , 
where W is the number of adaptable weights (parameters) and N is the number 
of computational nodes in the network. Hence, the bounds given above are almost 
identical to those for threshold networks, despite the underlying techniques used to 
derive them being entirely different. 
One surprising fact about the above results is that the VC dimension and sample 
sizes are independent of the complexity of the circuit (except in as much as it must 
have the required number of inputs). Hence, additional layers of fixed computation 
cannot increase the sample complexity above the bound given.
5 Simulation Results 
The Monk's problems, which were the basis of a first international comparison of
learning algorithms, are derived from a domain in which each training example 
is represented by six discrete-valued attributes. Each problem involves learning a 
binary function defined over this domain, from a sample of training examples of 
this function. The 'true' concepts underlying each Monk's problem are given by: 
MONK-1: (attribute_1 = attribute_2)
        or (attribute_5 = 1)
MONK-2: (attribute_i = 1)
        for EXACTLY TWO i ∈ {1, 2, ..., 6}
MONK-3: (attribute_5 = 3 and attribute_4 = 1)
        or (attribute_5 ≠ 4 and attribute_2 ≠ 3)
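The three target concepts can be written as predicates over an attribute tuple. The sketch below assumes the standard attribute domains of the Monk's benchmark (sizes 3, 3, 2, 3, 4 and 2), which the text does not list, and a one-hot input encoding, which is only a guess at how 17 network inputs could arise. Attributes are indexed from 0 in the code, so `a[4]` is attribute 5.

```python
from itertools import product

# Assumed domain sizes of attributes 1..6 in the standard Monk's benchmark.
DOMAIN_SIZES = (3, 3, 2, 3, 4, 2)


def monk1(a) -> bool:
    """(attribute_1 = attribute_2) or (attribute_5 = 1)."""
    return a[0] == a[1] or a[4] == 1


def monk2(a) -> bool:
    """attribute_i = 1 for exactly two of the six attributes."""
    return sum(1 for v in a if v == 1) == 2


def monk3(a) -> bool:
    """(attribute_5 = 3 and attribute_4 = 1) or
    (attribute_5 != 4 and attribute_2 != 3), without label noise."""
    return (a[4] == 3 and a[3] == 1) or (a[4] != 4 and a[1] != 3)


def one_hot(a) -> list:
    """Assumed encoding: one binary input per attribute value."""
    bits = []
    for value, size in zip(a, DOMAIN_SIZES):
        bits.extend(1 if value == k else 0 for k in range(1, size + 1))
    return bits


# Every combination of attribute values, with values numbered from 1.
ALL_PATTERNS = list(product(*(range(1, size + 1) for size in DOMAIN_SIZES)))
```

Under these assumptions the full domain contains one pattern per combination of attribute values, and each pattern maps to a binary vector with one position per attribute value.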
There are 124, 169 and 122 samples in the training sets of MONK-1, MONK-2 and
MONK-3 respectively. The testing set has 432 patterns. The network had 17 input
units, 10 hidden units, 1 output unit, and was fully connected. Two networks were
used for each problem. The first was a standard multi-layer perceptron with sigmoid 
activation function trained using the backpropagation algorithm (BP Network). 
The second network had the same architecture, but used bit stream neurons in place 
of sigmoid ones (BSN Network). The functionality of the neurons was simulated 
using probability generating functions to compute the probability values of the bit 
streams output at each neuron. The backpropagation algorithm was adapted to 
train these networks by computing the derivative of the output probability value 
with respect to the individual inputs to that neuron [8]. 
Experiments were performed with and without noise in the training examples. 
There is 5% additional noise (misclassifications) in the training set of MONK-3. 
The results for the Monk's problems using the moment generating function simula- 
tion are shown as follows: 
          BP Network            BSN Network
          training  testing     training  testing
MONK-1    100%      86.6%       100%      97.7%
MONK-2    100%      84.2%       100%      100%
MONK-3    97.1%     83.3%       98.4%     98.6%
It can be seen that the generalisation of the BSN network is much better than
that of a general multilayer backpropagation network. The result on the MONK-3
problem is extremely good. The results reported by Hassibi and Stork [2] using a
sophisticated weight pruning technique are only 93.4% correct for the training set
and 97.2% correct for the testing set.
References 
[1] B. R. Gaines, Stochastic Computing Systems, Advances in Information Systems
Science 2 (1969) pp. 37-172.
[2] B. Hassibi and D.G. Stork, Second order derivatives for network pruning: Optimal
brain surgeon, Advances in Neural Information Processing Systems, Vol
5 (1993) 164-171.
[3] P. Jeavons, D.A. Cohen and J. Shawe-Taylor, Generating Binary Sequences 
for Stochastic Computing, IEEE Trans on Information Theory, 40 (3) (1994) 
716-720. 
[4] M. Karpinski and A. MacIntyre, Bounding VC-Dimension for Neural Networks: 
Progress and Prospects, Proceedings of EuroCOLT'95, 1995, pp. 337-341, 
Springer Lecture Notes in Artificial Intelligence, 904. 
[5] P. Koiran, A Weak Version of the Blum, Shub and Smale Model, ESPRIT 
Working Group NeuroCOLT Technical Report Series, NC-TR-94-5, 1994. 
[6] J. Shawe-Taylor, Threshold Network Learning in the Presence of Equivalences, 
Proceedings of NIPS 4, 1991, pp. 879-886. 
[7] J. Shawe-Taylor, Sample Sizes for Sigmoidal Networks, to appear in the Proceedings
of Eighth Conference on Computational Learning Theory, COLT'95,
1995.
[8] John Shawe-Taylor, Peter Jeavons and Max van Daalen, "Probabilistic Bit 
Stream Neural Chip: Theory", Connection Science, Vol 3, No 3, 1991. 
