Interior Point Implementations of 
Alternating Minimization Training 
Michael Lemmon 
Dept. of Electrical Engineering 
University of Notre Dame 
Notre Dame, IN 46556 
lemmon@maddog.ee.nd.edu 
Peter T. Szymanski 
Dept. of Electrical Engineering 
University of Notre Dame 
Notre Dame, IN 46556 
pszymans@maddog.ee.nd.edu 
Abstract 
This paper presents an alternating minimization (AM) algorithm 
used in the training of radial basis function and linear regressor 
networks. The algorithm is a modification of a small-step interior 
point method used in solving primal linear programs. The algo- 
rithm has a convergence rate of O(v/-dL) iterations where n is a 
measure of the network size and L is a measure of the resulting 
solution's accuracy. Two results are presented that specify how 
aggressively the two steps of the AM may be pursued to ensure 
convergence of each step of the alternating minimization. 
I Introduction 
In recent years, considerable research has investigated the use of alternating min- 
imization (AM) techniques in the supervised training of radial basis function 
networks. AM techniques were first introduced in soft-competitive learning 
gorithms[1]. This training procedure was later shown to be closely related to 
Expectation-Maximization algorithms used by the statistical estimation commu- 
nity[2]. Alternating minimizations search for optimal network weights by breaking 
the search into two distinct minimization problems. A given network performance 
functional is extremalized first with respect to one set of network weights and then 
with respect to the remaining weights. These learning procedures have found ap- 
plications in the training of local expert systems [3], and in Boltzmann machine 
training [4]. More recently, convergence rates have been derived by viewing the AM 
5 70 Michael Lemmon, Peter T. Szymanski 
method as a variable metric algorithm [5]. 
This paper examines AM as a perturbed linear programming (LP) problem. Recent 
advances in the application of barrier function methods to LP problems have re- 
sulted in the development of "path following" or "interior point" (IP) algorithms [6]. 
These algorithms are characterized by a fast convergence rate that scales in a sub- 
linear manner with problem size. This paper shows how a small-step IP algorithm 
for solving a primal LP problem can be modified into an AM training procedure.. 
The principal results of the paper are bounds on how aggressively the legs of the 
alternating minimization can be pursued so that the AM algorithm maintains the 
sublinear convergence rate characteristic of its LP counterpart and so that both legs 
converge to an optimal solution. 
2 Problem Statement 
Consider a function approximation problem where a stochastic approximator learns 
a mapping f: lR N - lR. The approximator computes a predicted output, 0 E lR, 
given the input z E lR N. The prediction is computed using a finite set of M 
regressors. The rn h regressor is characterized by the pair (Om,m)  lR N x lR N 
(m - 1,..., M). The output of the rn h regressor, 0m  lR, in response to an input, 
z  lR N is given by the linear function 
9m =0mz. (1) 
The m *' regressor (m = 1,...,M) has an associated radial basis function (RBF) 
with parameter vector Wm  IR N. The m t' RBF weights the contribution of the 
m t' output in computing 0 and is defined as a normal probability density function 
Q(mlz) =km exp(-r-2 [[Om - zll 2) (2) 
where km is a normalizing constant. The set of all weights or gating probabilities is 
denoted by Q. The parameterization of the m t' regressor is Om= (oTto ,WTm)T  IR 2N 
(m - 1,..., M) and the parameterization of the set of M linear regressors is 
e = (3) 
The preceding stochastic approximator can be viewed as a neural network. The 
network consists of M + 1 neurons. M of the neurons are agent neurons while the 
other neuron is referred to as a gating neuron. The rn *h alent neuron is parame- 
terized by Ore, the first element of the pair Om- (OTm,WTm) '' (m = 1, ...,M). The 
agent neurons receive as input, the vector z 6 lR N. The output of the rn ta agent 
neuron in response to an input z is 0m = 0Tin z (m = 1,..., M). The gating neuron is 
parameterized by the conditional gating probabilities, Q. The gating probabilities 
are defined by the set of vectors, & = {wl,..., WM}. The gating neuron receives the 
agent neurons' outputs and the vector z as its input. The gating neuron computes 
the network's output, 0, as a hard (4) or soft (5) choice 
0- 0m; m = arg max Q(mlz ) (4) 
m=l,...,M 
.0 = E"""'v-:I g(mlz).O.,., (5) 
M 
Q(mlz) 
Interior Point Implementations of Alternating Minimization Training 571 
The network will be said to be "optimal" with respect to a training set T = {(zi, Yi): 
Yi -- f(zi), i -- 1,..., R} if a mean square error criterion is minimized. Define the 
square output error of the rn th agent neuron to be em(Zi) -- (Yi -- oTmzi) 2 and the 
square weighting or classifier error of the rn th RBF to be cm(zi) = IIw,-zill 2. Let 
the combined square approximation error of the rn ta neuron be dm (zi) = eem (Zi) + 
cCm(Zi) and let the average square approximation error of the network be 
M R 
d(Q, O, T): y. y.'.p(zi)Q(mlzi)dm(zi ). (6) 
m--1 i----1 
Minimizing (6) corresponds to minimizing both the output error in the M linear 
regressors and the classifier error in assigning inputs to the M regressors_. Since T 
is a discrete set and the Q are gating probabilities, the minimization of d(Q, O, T) 
is constrained so that Q is a valid conditional probability mass function over the 
training set, T. 
Network training can therefore be viewed as a constrained optimization problem. 
In particular, this optimization problem can be expressed in a form very similar 
to conventional LP problems. The following notational conventions are adopted to 
highlight this connection. Let x  IP MR be the gating neuron's weight vector where 
x = (Q(llzx),..., Q(llzn), 
(7) 
Let O = (oT,aT) T  lR 2N denote the parameter vectors associated with the m  
regressor and define the cost vector conditioned on O - (oT,..., Ot) T as 
c(O) = (p(Zl)dl(Zl),...,p(zn)dl(Zn),p(z2)d2(z2),...,p(zi)dm(zi),...) T (8) 
With this notation, the network training problem can be stated as follows, 
minimize c"(O)x 
with respect to x, O (9) 
subject to Ax - b, x > 0 
where b = (1,..., 1) "  lB, n, A = [Inxn'" Inxn]  IR nxMn, and x >_ 0 implies 
xi >_ 0 for i = 1,...,MR. 
One approach for solving this problem is to break up the optimization into two 
steps. The first step involves minimizing the above cost functional with respect to 
x assuming a fixed O. This is the Q-update of the algorithm. The second leg of the 
algorithm minimizes the functional with respect to O assuming fixed x. This leg 
is called the O-update. Because the proposed optimization alternates between two 
different subsets of weights, this training procedure is often referred to as alternating 
minimization. Note that the Q-update is an LP problem while the O-update is a 
quadratic programming problem. Consequently, the AM training procedure can be 
viewed as a perturbed LP problem. 
3 Proposed Training Algorithm 
The preceding section noted that network training can be viewed as a perturbed 
LP problem. This observation is significant for there exist very efficient LP solvers 
5 72 Michael Lemmon, Peter T. Szymanski 
based on barrier function methods used in non-linear optimization. Recent advances 
in path following or interior point (IP) methods have developed LP solvers which 
exhibit convergence rates which scale in a sublinear way with problem size [6]. This 
section introduces a modification of a small-step primal IP algorithm that can be 
used for neural network training. The proposed modification is later shown to 
preserve the computational efficiency enjoyed by its LP counterpart. 
To see how such a modification might arise, we first need to examine path following 
LP solvers. Consider the following LP problem. 
minimize cTx 
with respect to x  lR" (10) 
subject to Ax=b,x_>0 
This problem can be solved by solving a sequence of augmented optimization prob- 
lems arising from the primal parameterization of the LP problem. 
minimize O(k)cTx(k) - Y'-i log x? ) 
with respect to x (k) 6 IR n (11) 
subject to Ax  = b, x() _> 0 
where a() > 0 (k = 1,..., K) is a finite length, monotone increasing sequence of 
real numbers. x*(a()) denotes the solution for the kth optimization problem in 
the sequence and is referred to as a central point. The locus of all points, x* (c ) 
where a() >_ 0 is called the central path. The augmented problem takes the original 
LP cost function and adds a logarithmic barrier which keeps the central point away 
from the boundaries of the feasible set. As a increases, the effect of the barrier 
is decreased, thereby allowing the ktn central point to approach the LP problem's 
solution in a controlled manner. 
Path following (IP) methods solve the LP problem by approximately solving 
the sequence of augmented problems shown in (11). The parameter sequence, 
c(), c0),..., a(K), is chosen to be a monotone increasing sequence so that the 
central points, x*(a(k)), of each augmented optimization approach the LP solution 
in a monotone manner. It has been shown that for specific choices of the c sequence, 
that the sequence of approximate central points will converge to an e-neighborhood 
of the LP solution after a finite number of iterations. For primal IP algorithms, the 
required condition is that successive approximations of the central points lie within 
the region of quadratic convergence for a scaling steepest descent (SSD) algorithm 
[6]. In particular, it has been shown that if the ktn approximating solution is suffi- 
ciently close to the k t central point and if c (+1) = c()(1 + U/v/ ) where u _ 0.1 
controls the distance between successive central points, then the "closeness" to the 
(k + 1) st central point is guaranteed and the resulting algorithm will converge in 
O(v/L) iterations where L = n +p+ 1 specifies the size of the LP problem and p is 
the total number of bits used to represent the data in A, b, and c. If the algorithm 
takes small steps, then it is guaranteed to converge efficiently. 
The preceding discussion reveals that a key component to a path following method's 
computational efficiency lies in controlling the iteration so that successive central 
points lie within the SSD algorithm's region of quadratic convergence. If we are to 
successfully extend such methods to (9), then this "closeness" of successive solutions 
must be preserved by the O-update of the algorithm. Due to the quadratic nature 
Interior Point Implementations of Alternating Minimization Training 573 
of the O-update, this minimization can be done exactly using a single Newton- 
Raphson iteration. Let 0* denote 0-update's minimizer. If we update 0 to 0*, 
it is quite possible that the cost vector, c(0), will be rotated in such a way that 
the current solution, x(k), no longer lies in the region of quadratic convergence. 
Therefore, if we are to preserve the IP method's computational efficiency it will be 
necessary to be less "aggressive" in the O-update. In particular, this paper proposes 
the following convex combination as the O-update 
O( +) - (1 - 7(k))O ) + 7()O(m +)'* (12) 
where O ) is the m h parameter vector at time k and 0 ( 7  ( 1 controls the size 
of the update. This will ensure convergence of the Q-update. 
Convergence of the AM algorithm also requires convergence of the O-update. For 
the O-update to converge, 7  in (12) must go to unity as k increases. Convergence 
of 7() to unity requires that the sequence iIO+x),*-O)ll be monotone decreasing. 
As the O-update minimizer, 0 (+x),*, depends upon the current weights, Q(k)(mlz), 
large changes to Q can prevent the sequence from being monotone decreasing. Thus, 
it is necessary to also be less "aggressive" in the Q-update. An appropriate bound 
on v is the proposed solution to guarantee convergence of the O-update. 
Algorithm 1 (Proposed Training Algorithm) 
Initialize 
k=0. 
Choose x! k), O ), and a  for i---1,...,(MR), 
repeat 
o (+) = a()(1 + v/V/), ,here , _ 0.1. 
Q-update: 
Xo = x () 
for i=O,...,P- 1 
(k+) 
xi+ = ScalingSteepestD.escent(x i , a(+), 0 ) 
X (+) = Xp 
O-update: For m=l,...,M 
= (1 - + 
k=k+l 
until(A < e) 
and for m=l,...,M. 
4 Theoretical Results 
This section provides bounds on the parameter, 7(k), controlling the AM algorithm's 
0-update so that successive x  vector solutions lie within the SSD algorithm's 
region of quadratic convergence and on v controlling the Q-update so that successive 
central points are not too far apart, thus allowing convergence of the O-update. 
Theorem 1 Let 0 ) and 0 )'* be the current and minimizing parameter vec- 
tors at time k. Let c  = c(0()) and c ()'* = c(0(k)'*). Let 6(x,a, 0) = 
5 74 Michael Lemmon, Peter T. Szymanski 
IIPAxX (-(O)-x -1) II be the step size 
AT(AAT)-iA and X = diag(xl,...,x,). 
6 < 0.5 and let 0 +) = (1--rCk))O ) +-rCk)o +l)'*. If 7() is chosen as 
6 - 61 
)  
where 6 < 6 = 0.5, then 6(x (+), (+), O (+))  6 = 0.5. 
of the SSD update where PA = I- 
Assume that 6(x(+l), c(k+l), 0 ) = 
(13) 
Proof: The proof must show that the choice of 7  maintains the nearness ofx (+l) 
to the central path after the O-update. Let h(x,c, 
hi = h(x (k+l), c (+l), 0 ) and h2 = h(x (+l), c (+l), O(k+l)). Using the triangle 
inequality produces 
IIn211 < IIn - hill + Ilnlll. 
Ilhlll = 61 by assumption, so 
IIh11 _< Ilc(+l)PAxX(c(+l) - c(*))11 + 
Using the convexity of the cost vectors produces (c (+l) - e ) _< (1 - 7())e  + 
7(t)e( k+l),* - e  resulting in 
Ilh=ll _< Ila(k+l)7()PAxX(c(+l)'* - c())ll + 61. 
Using the fact that IIPxXll _< x - n/() (x is the duality gap), 
IIh=11 _< k)(+l)11PxXll IIc(+x),*-c()11 
< ?()n1 +./v)ll(+l), *- 
Plugging in the value of 7  from (13) and simplifying produces the desired result 
Ilh211 _< 52 _< 0.5, guaranteeing that x (+l) remains close to x( +l),* after the O- 
update. [] 
Theorem 1 shows that the non-linear optimization can be embedded within the steps 
of the path following algorithm without it taking solutions too far from successive 
central points. The following two results, found in [7], provide a bound on , to 
guarantee convergence of the O-update. The bound on, forces successive central 
points to be close and ensures convergence of the O-update. 
Proposition 1 Let B = YzPzQ(mlz) zzT, E = YzPzAQ(mlz) zzT, w = 
EzpzQ(mlz)y(z)z , and Aw = EzpzAQ(mlz)y(z)z , where Q(mlz ) = Q(*)(mlz ) 
and AQ(mlz ) = Q(+l)(m]z)- Q(()mlz). Assume that ly(z)l < Y, ]lzll _< , and 
that B is of full rank for all valid Q's. Finally, let !amax =supQ IIB-tI]. Then, 
[[O  - O)'*1[ _< 2(, 2 + 2.)K (14) 
where Zf - t'.aS (1 + .m.) and r- liB-ill IIEII < 1. 
1-r 
Theorem 2 Assume that I1,.11 < C, ly(z)l 5 Y, IIO11 < o., and that B = 
z z(mlz)zz r is offutt rank for atl valid Q's and that liB-ill I111- r < 1. zf 
. < min{0.1, 
-1 + V'I + r/(2C2pma), (15) 
-1 + /1 +7mine/(2K)} 
Interior Point Implementations of Alternating Minimization Training 575 
where K =/treatY((1 + (2#ma/(1 -- r)), 7win= (62 - 5x)/(-(1 + - 
(o){{) and es is the largest o){I such that 7(k) = 1, then the O-update 
will converge with IIO +)'* - O)[[--+ 0 and 7  --+ 1 as k increases. 
The preceding results guarantee the convergence of the component minimizations 
separately. Convergence of the total algorithm relies on the simultaneous conver- 
gence of both steps. This is currently being addressed using contraction mapping 
concepts and stability results from nonlinear stability analysis [8]. 
The convergence rate of the algorithm is established using the LP problem's duality 
gap. The duality gap is the difference between the current solutions for the primal 
and dual formulations of the LP problem. Path following algorithms allow the 
duality gap to be expressed as follows 
A((()) _ n + 0.5V (16) 
(() 
and thus provide a convenient stopping criterion for the algorithm. Note that 
c  = c()//  where/ < (l+v/x/-). This implies that A  =/A  _</2 . Ifk 
is chosen so that/2  < 2 -, then A  < 2 - which implies that k k 2L/log(1/t). 
Inserting our choice of/ one finds that k ) (2v/-L/v)+2L. The preceding argument 
establishes that the .proposed convergence rate of O(v/-L) iterations. In other 
words, the procedure's training time scales in a sublinear manner with network size. 
5 Simulation Example 
Simulations were run on a time series prediction task to test the proposed algorithm. 
The training set is 7' = {(zl,yi) ' Yi = y(iT),zi = (Yi-x,Yi-2, ...,Yi-N) T  IR,N} 
for i = 0, 1, ..., 100, N - 4, and T = 0.04 where the time series is defined as 
y(t) = sin(rt) - sin(2rt) + sin(3rt) - sin(rt/2) (17) 
The results describe the convergence of the algorithm. These experiments con- 
sisted of 100 randomly chosen samples with N = 4 and a number of agent neurons 
ranging from M = 4 to 20. This corresponds to an LP problem dimension of 
n = 404 to 2020. The stopping criteria for the tests was to run until the solution 
was within e = 10 -3 of a local minimum. The number of iterations and floating 
point operations (FLOPS) for the AM algorithm to converge are shown in Fig- 
ures l(a) and l(b) with AM results denoted by "o" and the theoretical rates by a 
solid line. The algorithm exhibits approximately O(v/L) iterations to converge as 
predicted. The computational cost, however is O(n2L) FLOPS which is better than 
the predicted O(n3'SL). The difference is due to the use of sparse matrix techniques 
which reduce the number of computations. The resulting AM algorithm then has 
the complexity of a matrix multiplication instead of a matrix inversion. The use of 
the algorithm resulted in networks having mean square errors on the order of 10 -3 . 
6 Discussion 
This paper has presented an AM algorithm which can be proven to converge in 
O(vL ) iterations. The work has established a means by which IP methods can be 
576 Michael Lemrnon, Peter T. Szymanski 
= 10 4 
._o 10 3 
.,.., 
o 
e 10 5 
0 
o 
10 m 
o 
10 2 .........  10 5 
10 2 10 3 10 4 10 2 
LP Problem Size (n) 
(a) 
O0 O0 
10 3 10 4 
LP Problem Size (n) 
(b) 
Figure 1: Convergence rates as a function of n 
applied to NN training in a way which preserves the computational efficiency of IP 
solvers. The AM algorithm can be used to solve off-line problems such as codebook 
generation and parameter identification in colony control applications. The method 
is currently being used to solve hybrid control problems of the type in [9]. Areas of 
future research concern the study of large-step IP methods and extensions of AM 
training to other EM algorithms. 
References 
[1] S. Nowlan, "Maximum likelihood competitive learning," in Advances in Neural Infor- 
mation Processing Systems , pp. 574-5827 San Mateo, California: Morgan Kaufmann 
Publishers, Inc., 1990. 
[2] M. Jordan and R. Jacobs, "Hierarchical mixtures of experts and the EM algorithm," 
Tech. Rep. 9301, MIT Computational Cognitive Science, Apr. 1993. 
[3] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton, "Adaptive mixtures of local experts," 
Neural Computation, vol. 3, pp. 79-87, 1991. 
[4] W. Byrne, "Alternating minimization and Boltzmann machine learning," IEEE Trans- 
actions on Neural Networks, vol. 3, pp. 612-620, July 1992. 
[5] M. Jordan and L. Xu, "Convergence results for the EM approach to mixtures of experts 
architectures," Tech. Rep. 9303, MIT Computational Cognitive Science, Sept. 1993. 
[6] C. Gonzaga, "Path-following methods for linear programming," SIAM Review, vol. 34, 
pp. 167-224, June 1992. 
[7] P. Szymanski and M. Lemmon, "A modified interior point method for supervisory con- 
troller design," in Proceedings of the 33rd IEEE Conference on Decision and Control, 
pp. 1381-1386, Dec. 1994. 
[8] M. Vidyasagar, Nonlinear Systems Analysis. Englewood Cliffs, New Jersey: Prentice- 
Hall, Inc., 1993. 
[9] M. Lemmon, J. Stiver, and P. Antsaklis, "Event identification and intelligent hybrid 
control," in Hybrid Systems (R. L. Grossman, A. Nerode, A. P. Ravn, and H. Rischel, 
eds.), vol. 736 of Lecture Notes in Computer Science, pp. 265-296, Springer-Verlag, 
1993. 
