Summed Weight Neuron Perturbation: An O(N) 
Improvement over Weight Perturbation. 
Barry Flower and Marwan Jabri 
SEDAL 
Department of Electrical Engineering 
University of Sydney 
NSW 2006 Australia 
Abstract 
The algorithm presented performs gradient descent on the weight space 
of an Artificial Neural Network (ANN), using a finite difference to 
approximate the gradient. The method is novel in that it achieves a com- 
putational complexity similar to that of Node Perturbation, O(N3), but 
does not require access to the activity of hidden or intemal neurons. 
This is possible due to a stochastic relation between perturbations at the 
weights and the neurons of an ANN. The algorithm is also similar to 
Weight Perturbation in that it is optimal in terms of hardware require- 
ments when used for the training of VLSI implementafiom of ANN's. 
1 INTRODUCTION 
Optimization of the weights of an ANN may be performed by, the application of a gradi- 
ent descent technique. The gradient may be calculated directly as in Backpropagation, or it 
may be approximated by a Finite Difference Method which is what we concem ourselves 
with in this paper. These methods lend themselves to the task of training hardware imple- 
mentations of ANNs where real estate is at a premium and synaptic density is of great 
importance. Neuron Perturbation (NP), as described by the Madaline Rule 1II (MRlll) 
0Nidrow and Lehr, 1990), is a technique that approximates the gradient of the Mean 
Square Error (MSE) with respect to the change at a given neuron by applying a small per- 
turbation to the input of the neuron and measuring the change in the MSE. The weight 
8E 
Awij -- -l.  .xj, 
Onet. 
! 
update is then calculated from the product of this gradient measure and the activation of 
212 
Summed Weight Neuron Perturbation: An (O)N Improvement over Weight Perturbation 213 
the neuron from which the weight is fed, as described by (1). 
Weight Perturbation (WP), as described by Jabfi and Flower (Jabfi and Flower, 1992) is a 
neural network training techniques based on gradient descent using a Finite Difference 
method to approximate the gradient. The gradient of the MSE with respect to a weight is 
approximated by applying a small permbafion to the weight and measuring the change in 
the MSE. This gradient is then used to calculated the weight ulxlate such that: 
(2) 
The advantages of WP over NP are that it performs better when limited precision weights 
are used, as shown by Xie and Jabfi (Xie and Jabfi, 1992), and is optimal with respect to 
hardware requirements when used to train VLSI implementations of ANNs. However, WP 
has O(N 4) computational complexity whilst NP has O(N 3) computational complexity. 
Summed Weight Neuron Perturbation (SWNP) is similar to NP in that it has a computa- 
tional complexity of O(N 3) but it has the added advantage that the activation of intemal 
neurons does not need to be known. The cost of this reduced computational complexity is 
that SWNP needs to save the perturbation vector used. 
In the following sections a description of the SWNP algorithm is provided and, finally, 
some experimental results are presented. 
2 
THE SUMMED WEIGHT NEURON PERTURBATION 
ALGORITHM 
A subsection of a feedforward ANN containing N neurons is shown in Figure 1. on which 
nomenclature the following derivation is based. 
r.m 1 
k 
FIGURE 1: Description Of Indices Used To Describe The Neurons Weights And 
Perturbations In An ANN. 
In a feedforward network of size N neurons the activation of a given neuron is determined 
by: 
xi(P) ---- fi(neti(P) ) , and (3) 
tteti(P) -- WilX l (p) , 
and f/(y) is the ith neuron transfer function, X i (p) is the activation of the ith neuron for 
the pth pattern, and wit is the weight connecting the/th neuron's output to the ith neuron's 
input. The error function, (MSE), is defmed as in (4), where T is the set of output neurons 
and d: (p) is the expected value of the output on the kth neuron. The change in E (p) 
214 Flower and Jabri 
with respect to a given weight may then be expressed as (5). 
1 
E(p) = :jr(d:(p) -x:(p) )2 (,l) 
(p) (p) 
}w i j -- i}neti (P ) .xj (p) . (5) 
The first term of on the right-hand side of (5) can be determined using a Finite Difference, 
which in this case is a Forward Difference, so that: 
8E (p) aEr (P) 
i}net i (p) F i 
+ O (ri), (s) 
where, 
AEF. (p) = E r (p) - E (p), {7) 
and F i is the perturbation applied to the ith neuron, Eri (p) is the error for the pth pattem 
with a perturbation applied to the ith neuron and E (p) is the error for the pth pattem 
without a perturbation applied to any neurons. The error introduced by the approximation 
is represented by the last term on the fight-hand side in (6). 
The perturbation of one or more of the weights that are inputs to the qth neuron can be 
thought of as being equal to some perturbation applied directly to that neuron. Hence: 
Fq ---- qlXl (p), 
(0) 
where Tq I is the perturbation applied to weight Wql. AS will be shown, perturbing the qth 
neuron by perturbing all the weights feeding into it, enables the sign of the gradient 
aE (p) 
to be determined without performing the product on the right-hand side of (5). 
Further more, the activation of hidden neurons, (i.e. xj (p) in (5)) need not be known. The 
contribution of the perturbation of weight wij to the perturbation of the ith neuron is 
ijxj (p). (9) 
Let us take the degenerate case where there is only one weight for the ith neuron. Then the 
gradient of the MSE with respect to weight wij is: 
8E (p) AEr (p) xj (p) AEr, (p) xj (p) 
- + O (Fi) = + O (Fi) 
v ij I"i ij'Xj (P) 
AEr.(P) 
+0 (Fi) , 
(10) 
Summed Weight Neuron Perturbation: An (O)N Improvement over Weight Perturbation 215 
noting that xj (p) has been eliminated. In the general case where the ith neuron has more 
than one weight the gradient with respect to weight wq is shown in (11). 
aE(p) 
- + 0 (Fi) 
ij Fi 
aEr.(p) 
+o(r) 
(11) 
where, 
- t (12) 
xj (p)  
The form of (10) and (11) are the same and it will be shown that ij can be substituted for 
Pij in (11) due to a stochastic relationship between them. 
Let us represent the sign of ij and Pij as either +1 or -1 such that: 
Pij - ij and eg iJ 
T/j v/j- ij. 0 . (13) 
The set of all possible states for the system represented by the vector (!x/j, vij), assuming 
ij and egtj are never zero, is: 
{ (-1,-1), (-1, 1), (1,-1), (1, 1) }. (14) 
and it can be seen that when Ixij = vii then the sign of the gradient of the MSE with 
respectto weight Wij given by (10) is the same as that given by (11). If the sign of ij is 
chosen randomly then the probability of Ixij = vij being true is 0.5, from (14), and so (10) 
will generate a gradient that is in the correct direction 50% of the time. This in itself is not 
sufficient to allow the network to be trained as it will take as many steps in the incorrect 
direction as the correct direction ff the steps themselves are of the same size, (i.e. the mag- 
nitude of F i is the same for a step in the correct direction as a step in the incorrect direc- 
tion). 
Fortunately it can be shown that the size of the steps in the correct direction are greater 
than those in the incorrect direction. Let us take the case where a particular ij is chosen 
such that 
Ixij = vij' 
Now by substituting (8), (12) and (13) into (15) we get: 
(15) 
216 Flower and Jabri 
rearranging to give, 
j 
j 
Iifxjl 
Tifxj 
= (16) 
ikxk (p) 
= , (17) 
TikXk 
which implies that the contribution to F i made by the perturbation ij is of the same sign 
as r i. Let us designate this neuron perturbation as F i (A). Now we take the other possible 
case where, 
[tij #:  ij' (18) 
assuming every other parameter is the same, and only the sign of /j is changed. The 
equality in (17) is now untrue and the contribution to F i made by the perturbation 70 is of 
the opposite sign as F i. Let us designate this neuron perturbation as F i (B). From (8) we 
can determine that, 
ir(A) I -- ri(B)l + 2 'Yij' (19) 
Equation (19) shows the relatiouship between the two possible states of the system where 
F i (A) represents the slimmed neuron perturbation for a selected weight perturbation ij 
that generates a step in the corrected direction and F i (B) is similar but for a step in the 
incorrect direction. Qearly the correct step is always calculated from an approximated 
gradient that is larger than that for an incorrect step as the neuron perturbation is larger. 
The weight update rule then becomes: 
AEr. (P) 
Awij -- -ll. ' (20) 
The algorithm for SWNP is shown as pseudo code in Figure 2. 
2.1 HARDWARE COMPATIBILITY OF SWNP 
This optimisation technique is ideally suited to the training of hardware implementations 
of ANN's whether they consist of discrete components or are VLSI technology. The speed 
up over WP of O (N) achieved is at the cost of an 0 (N) storage requirement but this 
sto_"age can be achieved with a single bit per neuron. SWNP is the same order of complex- 
ity as NP but does not require access to the activation of intemal neurons and therefore can 
treat a network as a "black box" into which an input vector and weight matrix is fed and an 
Summed Weight Neuron Perturbation: An (O)N Improvement over Weight Perturbation 217 
output vector is received. 
While (total error > error threshold) { 
For (all patems in training set) { 
Select next pattern and training vector; 
Forward Prop. ;Measure, (calculate) and save error; 
Accumulate total error; 
For (all non-input neurons) { 
For (all weights of current neuron) { 
| Apply & Save perturbation of random polarity; 
Forward Prop. ;Measure, (calculate) and save Aerror; 
For (all weights of current neuron) { 
Restore value of weight; 
Calculate weight delta using saved perturbation value; 
} If (Online Mode) Update current weight; 
If (Online Mode) 
} Forward Prop.; Measure, (calculate) and save new error; 
If (Batch Mode) I 
For (all weights) 
} Update current weight; 
FIGURE 2: Algorithm in Pseudo Code for Summed Weight Neuron Perturbation. 
3 TEST RESULTS USING SWNP 
The results for a series of tests are shown in the next three tables and are snmmarised in 
Figure 4. The headings are, N the number of neurons in the network, P the number of pat- 
terns in the training set, FF-SWNP the number of feedforward passes for the SWNP 
Algorithm, FF-WP the number of feedforward passes for the WP Algorithm, and RATIO 
the ratio between the number of feedforward passes for WP against SWNP. The feedfor- 
ward passes are recorded to 1 significant figure. 
The results for a series of simulations comparing the performance of SWNP against WP 
are shown in Table 1. The simulations utilised floating point synapfic and neuron preci- 
sions. 
The results for a series of simulations comparing the performance of SWNP against WP 
are shown in Table 2. The simulations utilised limited synapfic precision, (i.e. 6 bits) and 
floating point neuron precisions 
The results for a series of experiments comparing the performance of SWNP against WP 
are shown in Table 2. Note: the training algorithm are the variations of WP and SWNP 
that are combined with the Random Search Algorithm (RSA). The results reported are 
averaged over 10 trials. 
An example of the training error trajectories of WP and SWNP for the Monk 2 problem 
are shown in Figure 3. 
218 Flower and Jabri 
Table 1: Performance Of SWMP Versus WP, Comparing Feedforward Operations To 
Convergence. (Simulations With Floating Point Precision) 
PROBLEM N P FF-SWNP FF-WP ERROR RATIO 
XOR 3 4 1.6x103 1.9x103 0.0125 1.22 
4 Encoder 5 4 0.9x10  1.8x10  0.0125 1.84 
8 Encoder 11 8 1.Sx10  4.5x10  0.0125 2.88 
ICEG 15 119 3.7x10  7.9x10  0.0125 21.34 
Table 2: Performance Of SWNP Versus WP, Comparing Feedforward Operations To 
Convergence. (Simulations With Limited Precision) 
PROBLEM N P SWNP WP ERROR RATIO 
Monkl 4 129 1.0x10  1.9x106 0.001 19.38 
Monk2 17 169 3.6x10  6.8x106 0.0005 18.71 
Monk3 17 122 1.2x106 7.1x106 0.022 5.87 
IECG 55 5 8 1.6x10 n 7.2x104 0.0001 4.2 
Table 3: Performance Of SWNP Versus WP, Comparing Feedforward Operations To 
Convergence. (Hardware Implementation) 
PROBLEM N P SWNP WP ERROR RATIO 
ECG 55 5 8 3.1x10  3.6x10  0.00001 1.13 
ECG 045 5 8 1.1x10 n 2.0x104 0.001 1.78 
MON! 2 PROBLEI[ 
FIGURE 3: Comparison of WP and SWNP For Monk 2 Problem 
Summed Weight Neuron Perturbation: An (O)N Improvement over Weight Perturbation 219 
FIGURE 4: 
Comparison of the number of Feedforward passes performed to achieve 
convergence on a range of problems using SWNP and WP. 
-- 20 
.- 18 
-- 16 
-- 14 
-. 12 
.- 10 
.-8 
.-6 
.-4 
.-2 
xo'x?0 
4 
8 ENCODERx 11 
,-.' SWNP Feedforward Passes 
 WP Feedforward Passes 
MONKlxl( 
ICEG 
ECG 
ECG 
4 CONCLUSION 
The algorithm presented, SWNP, performs gradient descent on the weight space of an 
ANN, using a finite difference to approximate the gradient. The method is novel in that it 
achieves O (N ) computational complexity similar to that of Node Perturbation but does 
not require access to the activity of hidden or internal neurons. The algorithm is also simi- 
lar to Weight Perturbation in that it is optimal in terms of hardware requirements when 
used for the training of VLSI implementations of ANN's. Results are presented that show 
the algorithm in operation on floating point simulations, limited precision simulations and 
an actual hardware implementation of an ANN. 
References 
Jabri, M. and Flower, B. (1992). Weight perturbation: An optimal architecture and leaming 
technique for analog vlsi feedforward and recurrent multilayer networks. IEEE 
Transactions on Neural Networks, 3 (1): 154-157. 
Widrow, B. and Lehr, M. A. (1990). 30 years of adaptive neural networks: Perceptron, 
madaline, and backpropagation. Proceedings of the IEEE, 78(9): 1415-1442. 
Xie, Y. and Jabri, M. (1992). Analysis of the effects of quantization in multilayer neural 
networks using a statistical model. IEEE Transactions on Neural Networks, 
3(2):334-338. 
