Algebraic Analysis for Non-Regular 
Learning Machines 
Sumio Watanabe 
Precision and Intelligence Laboratory 
Tokyo Institute of Technology 
4259 Nagatsuta, Midori-ku, Yokohama 223 Japan 
swatanabpi. titech. ac.jp 
Abstract 
Hierarchical learning machines are non-regular and non-identifiable 
statistical models, whose true parameter sets are analytic sets with 
singularities. Using algebraic analysis, we rigorously prove that 
the stochastic complexity of a non-identifiable learning machine 
is asymptotically equal to $\lambda_1 \log n - (m_1 - 1) \log\log n + \text{const.}$,
where $n$ is the number of training samples. Moreover we show that
the rational number $\lambda_1$ and the integer $m_1$ can be algorithmically
calculated using resolution of singularities in algebraic geometry.
Also we obtain inequalities $0 < \lambda_1 \le d/2$ and $1 \le m_1 \le d$, where $d$
is the number of parameters.
1 Introduction
Hierarchical learning machines such as multi-layer perceptrons, radial basis functions,
and normal mixtures are non-regular and non-identifiable learning machines.
If the true distribution is almost contained in a learning model, then the set of 
true parameters is not one point but an analytic variety [4][9][3][10]. This paper 
establishes the mathematical foundation to analyze such learning machines based 
on algebraic analysis and algebraic geometry. 
Let us consider a learning machine represented by a conditional probability density
$p(x|w)$, where $x$ is an $M$-dimensional vector and $w$ is a $d$-dimensional parameter. We
assume that $n$ training samples $X^n = \{X_i;\ i = 1, 2, ..., n\}$ are independently taken
from the true probability distribution $q(x)$, and that the set of true parameters
$$W_0 = \{w \in W;\ p(x|w) = q(x)\ (\text{a.s. } q(x))\}$$
is not empty. In Bayes statistics, the estimated distribution $p(x|X^n)$ is defined by
$$p(x|X^n) = \int p(x|w)\,p_n(w)\,dw, \qquad p_n(w) = \frac{1}{Z_n} \prod_{i=1}^{n} p(X_i|w)\,\varphi(w),$$
where $\varphi(w)$ is an a priori probability density on $R^d$, and $Z_n$ is a normalizing constant.
The generalization error is defined by
$$K(n) = E_{X^n}\left\{ \int q(x) \log \frac{q(x)}{p(x|X^n)}\,dx \right\},$$
where Ex, {.} shows the expectation value over all training samples X '. One of 
the main purposes in learning theory is to clarify how fast K(n) converges to zero 
as n tends to infinity. Using the log-loss function h(x, w) = log q(x) - logp(x, w), 
we define the Kullback distance and the empirical one, 
I  h(Xi, w). 
Z(w) ---- h(x, w)q(x)dx, Z(w,X n) -- n i--1 
Note that the set of true parameters is equal to the set of zeros of H(w), W0 = 
{w E W; H(w) = 0}. If the true parameter set W0 consists of only one point, the 
learning machine p(xlw ) is called identifiable, if otherwise non-identifiable. It should 
be emphasized that, in non-identifiable learning machines, W0 is not a manifold 
but an analytic set with singular points, in general. Let us define the stochastic 
complexity by 
F(n) = -Ex.{log / exp(-nH(w, Xn)p(w)dw}. 
(1) 
Then we have an important relation between the stochastic complexity F(n) and 
the generalization error K(n) 
$$K(n) = F(n + 1) - F(n),$$
which shows that $K(n)$ is equal to the increase of $F(n)$ [1]. In this paper, we
show the rigorous asymptotic form of the stochastic complexity F(n) for general 
non-identifiable learning machines. 
2 Main Results 
We need three assumptions upon which the main results are proven. 
(A.1) The probability density q(w) is infinite times continuously differentiable and 
its support, W = supp q, is compact. In other words, q E C. 
(A.2) The log loss function, h(x, w) = log q(x) - log p(x, w), is continuous for x in 
the support Q = suppq, and is analytic for w in an open set W  D W. 
(A.3) Let {rj(x, w*);j = 1,2,..., d} be the associated convergence radii of h(x, w) 
at w*, in other words, Taylor expansion of h(x, w) at w* = (w, ..., w), 
h(x,w) -- 
kl,..,kd=O 
aklk2...kl(x)(w 1 -- w)kl(w2 -- w)k2 ...(W d -- W) kd 
absolutely converges in Iwj - 
j = 1,2,...,d. 
Assume inf inf 
xEQw*EW 
rj(x, w*) > 0 for 
Theorem 1 Assume (A.1), (A.2), and (A.3). Then, there exist a rational number
$\lambda_1 > 0$, a natural number $m_1$, and a constant $C$, such that
$$|F(n) - \lambda_1 \log n + (m_1 - 1) \log\log n| < C$$
holds for an arbitrary natural number $n$.
Remarks. (1) If $q(x)$ is compactly supported, then the assumption (A.3) is automatically
satisfied. (2) Without assumptions (A.1) and (A.3), we can prove the upper
bound, $F(n) \le \lambda_1 \log n - (m_1 - 1) \log\log n + \text{const.}$
From Theorem 1, if the generalization error $K(n)$ has an asymptotic expansion,
then it should be
$$K(n) = \frac{\lambda_1}{n} - \frac{m_1 - 1}{n \log n} + o\!\left(\frac{1}{n \log n}\right).$$
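The expansion above can be checked numerically under the assumption that the bounded term in Theorem 1 is an exact constant, so that it cancels in $F(n+1) - F(n)$. The following Python sketch compares the exact difference with the two leading terms; the function names and the sample values $\lambda_1 = 1/2$, $m_1 = 2$ are illustrative assumptions, not taken from the paper.

```python
import math

def stochastic_complexity_increment(n, lam, m):
    """F(n+1) - F(n) for F(n) = lam*log(n) - (m-1)*log(log(n)) + const.

    Assumes the O(1) term is exactly constant, so it cancels.
    log1p keeps the small differences accurate for large n.
    """
    dlog = math.log1p(1.0 / n)                 # log(n+1) - log(n)
    dloglog = math.log1p(dlog / math.log(n))   # log(log(n+1)) - log(log(n))
    return lam * dlog - (m - 1) * dloglog

def leading_terms(n, lam, m):
    """lam/n - (m-1)/(n log n), the expansion of K(n) stated in the text."""
    return lam / n - (m - 1) / (n * math.log(n))

if __name__ == "__main__":
    lam, m = 0.5, 2   # illustrative values only
    for n in (10**2, 10**4, 10**6):
        exact = stochastic_complexity_increment(n, lam, m)
        approx = leading_terms(n, lam, m)
        # The last column is the remainder rescaled by n log n.
        print(n, exact, approx, abs(exact - approx) * n * math.log(n))
```

If the remainder is indeed $o(1/(n \log n))$, the rescaled remainder in the last column should shrink as $n$ grows.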
As is well known, if the model is identifiable and has a positive definite Fisher
information matrix, then $\lambda_1 = d/2$ ($d$ is the dimension of the parameter space) and
$m_1 = 1$. However, hierarchical learning models such as multi-layer perceptrons,
radial basis functions, and normal mixtures have smaller $\lambda_1$ and larger $m_1$; in other
words, hierarchical models are better learning machines than regular ones if Bayes
estimation is applied. The constants $\lambda_1$ and $m_1$ are characterized by the following
theorem.
Theorem 2 Assume the same conditions as Theorem 1. Let $\epsilon > 0$ be a sufficiently
small constant. The holomorphic function in $\mathrm{Re}(z) > 0$,
$$J(z) = \int_{H(w) < \epsilon} H(w)^z\,\varphi(w)\,dw,$$
can be analytically continued to the entire complex plane as a meromorphic function
whose poles are on the negative part of the real axis, and the constants $-\lambda_1$ and $m_1$
in Theorem 1 are equal to the largest pole of $J(z)$ and its multiplicity, respectively.
The proofs of the above theorems are outlined in the following section. Let $w = g(u)$
be an arbitrary analytic function from a set $U \subset R^d$ to $W$. Then $J(z)$ is invariant
under the mapping
$$\{H(w),\ \varphi(w)\} \to \{H(g(u)),\ \varphi(g(u))\,|g'(u)|\},$$
where $|g'(u)| = |\det(\partial w_i/\partial u_j)|$ is the Jacobian. This fact shows that $\lambda_1$ and $m_1$ are
invariant under a bi-rational mapping. In Section 4, we show an algorithm to calculate
$\lambda_1$ and $m_1$ by using this invariance and resolution of singularities.
3 Mathematical Structure 
In this section, we present an outline of the proof and its mathematical structure. 
3.1 Upper bound and b-function 
For a sufficiently small constant $\epsilon > 0$, we define $F^*(n)$ by
$$F^*(n) = -\log \int_{H(w) < \epsilon} \exp(-n H(w))\,\varphi(w)\,dw.$$
Then, by using Jensen's inequality, we obtain $F(n) \le F^*(n)$. To evaluate $F^*(n)$,
we need the b-function in algebraic analysis [6][7]. Sato, Bernstein, Björk, and
Kashiwara proved that, for an arbitrary analytic function $H(w)$, there exist a
differential operator $D(w, \partial_w, z)$ which is a polynomial in $z$, and a polynomial $b(z)$
whose zeros are rational numbers on the negative part of the real axis, such that
$$D(w, \partial_w, z)\,H(w)^{z+1} = b(z)\,H(w)^z \tag{2}$$
for any $z \in C$ and any $w \in W_\epsilon = \{w \in W;\ H(w) < \epsilon\}$. By using the relation eq.(2),
the holomorphic function $J(z)$ in $\mathrm{Re}(z) > 0$,
$$J(z) = \int_{H(w) < \epsilon} H(w)^z\,\varphi(w)\,dw = \frac{1}{b(z)} \int_{H(w) < \epsilon} H(w)^{z+1}\,D_w^*\,\varphi(w)\,dw,$$
can be analytically continued to the entire complex plane as a meromorphic function
whose poles are on the negative part of the real axis. The poles, which are
rational numbers and ordered from the origin to minus infinity, are referred to
as $-\lambda_1, -\lambda_2, -\lambda_3, ...$, and their multiplicities are referred to as $m_1, m_2, m_3, ...$
Let $c_{km}$ be the coefficient of the $m$-th order of the Laurent expansion of $J(z)$ at $-\lambda_k$.
Then,
$$J(z) - \sum_{k=1}^{K} \sum_{m=1}^{m_k} \frac{c_{km}}{(z + \lambda_k)^m}$$
is holomorphic in $\mathrm{Re}(z) > -\lambda_{K+1}$.
Let us define a function
$$I(t) = \int \delta(t - H(w))\,\varphi(w)\,dw$$
for $0 < t < \epsilon$ and $I(t) = 0$ for $\epsilon \le t \le 1$. Then $I(t)$ connects the function $F^*(n)$
with $J(z)$ by the relations
$$J(z) = \int_0^1 t^z\,I(t)\,dt, \qquad F^*(n) = -\log \int_0^1 \exp(-nt)\,I(t)\,dt.$$
The inverse Laplace transform gives the asymptotic expansion of $I(t)$ as $t \to 0$,
$$I(t) = \sum_{k=1}^{\infty} \sum_{m=1}^{m_k} c_{km}\,t^{\lambda_k - 1}\,\frac{(-\log t)^{m-1}}{(m-1)!},$$
resulting in the asymptotic expansion of $F^*(n)$,
$$F^*(n) = -\log \int_0^n \exp(-t)\,I\!\left(\frac{t}{n}\right) \frac{dt}{n} = \lambda_1 \log n - (m_1 - 1) \log\log n + O(1),$$
which is the upper bound of $F(n)$.
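As a rough numerical illustration of this upper bound, consider a toy case that is not from the paper: $H(a, b) = a^2 b^2$ with a uniform prior on the unit square. Here $J(z) = 1/(2z+1)^2$ has a double pole at $z = -1/2$, so one expects $\lambda_1 = 1/2$ and $m_1 = 2$. The sketch below is written under these assumptions. The uniform prior does not satisfy (A.1), and the integral is taken over the whole square rather than over $H(w) < \epsilon$, which changes $F^*(n)$ only by an exponentially small amount. All function names are hypothetical.

```python
import math

def integral_erf_over_s(upper, steps=20000):
    """Midpoint-rule value of the integral of erf(s)/s over 0 < s < upper (upper >= 1)."""
    # Part 1: 0 < s < 1, where erf(s)/s is smooth (it tends to 2/sqrt(pi) at 0).
    h = 1.0 / steps
    part1 = sum(math.erf((i + 0.5) * h) / ((i + 0.5) * h) for i in range(steps)) * h
    # Part 2: 1 < s < upper, substituting s = exp(u) so that ds/s = du.
    top = math.log(upper)
    hu = top / steps
    part2 = sum(math.erf(math.exp((i + 0.5) * hu)) for i in range(steps)) * hu
    return part1 + part2

def f_star(n):
    """F*(n) = -log of the integral of exp(-n a^2 b^2) over the unit square.

    The inner integral over a equals sqrt(pi) * erf(sqrt(n) b) / (2 sqrt(n) b);
    substituting s = sqrt(n) b turns the outer integral into one of erf(s)/s.
    """
    root = math.sqrt(n)
    z = math.sqrt(math.pi) / (2.0 * root) * integral_erf_over_s(root)
    return -math.log(z)

def predicted(n, lam=0.5, m=2):
    """lam*log(n) - (m-1)*log(log(n)), i.e. the expansion without its bounded term."""
    return lam * math.log(n) - (m - 1) * math.log(math.log(n))

if __name__ == "__main__":
    for n in (10**4, 10**6, 10**8):
        print(n, f_star(n), predicted(n), f_star(n) - predicted(n))
```

The last column should stay bounded as $n$ grows. For comparison, a regular two-parameter model would give a leading term of $(d/2) \log n = \log n$, which is larger.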
3.2 Lower Bound 
We define a random variable
$$A(X^n) = \sup_{w \in W} \left|\, n^{1/2} (H(w, X^n) - H(w)) / H(w)^{1/2} \,\right|. \tag{4}$$
Then, we prove in the Appendix that there exists a constant $c_0$ which is independent
of $n$ such that
$$E_{X^n}\{A(X^n)^2\} < c_0. \tag{5}$$
By using the inequality $ab \le (a^2 + b^2)/2$,
$$n H(w, X^n) \ge n H(w) - A(X^n)\,(n H(w))^{1/2} \ge \frac{1}{2}\{n H(w) - A(X^n)^2\},$$
which derives a lower bound,
$$F(n) \ge -E_{X^n}\left\{ \log \int \exp\!\left(-\frac{1}{2}\{n H(w) - A(X^n)^2\}\right) \varphi(w)\,dw \right\} = -\frac{1}{2} E_{X^n}\{A(X^n)^2\} - \log \int \exp\!\left(-\frac{n H(w)}{2}\right) \varphi(w)\,dw. \tag{6}$$
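The first term in eq.(6) is bounded. Let the second term be $F_*(n)$; then
$$F_*(n) = -\log(Z_1 + Z_2),$$
$$Z_1 = \int_{H(w) < \epsilon} \exp\!\left(-\frac{n H(w)}{2}\right) \varphi(w)\,dw \cong \text{const.}\ n^{-\lambda_1} (\log n)^{m_1 - 1},$$
$$Z_2 = \int_{H(w) \ge \epsilon} \exp\!\left(-\frac{n H(w)}{2}\right) \varphi(w)\,dw \le \exp\!\left(-\frac{n \epsilon}{2}\right),$$
which proves the lower bound of $F(n)$,
$$F(n) \ge \lambda_1 \log n - (m_1 - 1) \log\log n + \text{const.}$$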
4 Resolution of Singularities 
In this section, we construct a method to calculate $\lambda_1$ and $m_1$. First of all, we cover
the compact set $W_0$ with a finite union of open sets $W_\alpha$. In other words, $W_0 \subset
\bigcup_\alpha W_\alpha$. Hironaka's resolution of singularities [5][2] ensures that, for an arbitrary
analytic function $H(w)$, we can algorithmically find an open set $U_\alpha \subset R^d$ ($U_\alpha$
contains the origin) and an analytic function $g_\alpha : U_\alpha \to W_\alpha$ such that
$$H(g_\alpha(u)) = a(u)\,u_1^{k_1} u_2^{k_2} \cdots u_d^{k_d} \quad (u \in U_\alpha), \tag{7}$$
where $a(u) > 0$ is a positive function and $k_i \ge 0$ $(1 \le i \le d)$ are even integers ($a(u)$
and $k_i$ depend on $U_\alpha$). Note that the Jacobian $|g_\alpha'(u)| = 0$ if and only if $u \in g_\alpha^{-1}(W_0)$.
$$\varphi(g_\alpha(u))\,|g_\alpha'(u)| = \sum_{(p_1, p_2, ..., p_d)}^{\text{finite}} c_{p_1, p_2, ..., p_d}\,u_1^{p_1} u_2^{p_2} \cdots u_d^{p_d} + \cdots \tag{8}$$
By combining eq.(7) with eq.(8), we obtain
$$J_\alpha(z) \equiv \int_{W_\alpha} H(w)^z\,\varphi(w)\,dw = \int_{U_\alpha} \{a(u)\,u_1^{k_1} u_2^{k_2} \cdots u_d^{k_d}\}^z \sum c_{p_1, ..., p_d}\,u_1^{p_1} u_2^{p_2} \cdots u_d^{p_d}\,du_1 du_2 \cdots du_d.$$
For real $z$, $\max_\alpha J_\alpha(z) \le J(z) \le \sum_\alpha J_\alpha(z)$, hence
$$\lambda_1 = \min_\alpha \min_{(p_1, ..., p_d)} \min_{1 \le q \le d} \frac{p_q + 1}{k_q},$$
and $m_1$ is equal to the number of $q$ which attains the minimum $\min_{1 \le q \le d}$.
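The rule above for a single chart and a single monomial can be sketched in Python as follows. The function name `largest_pole` and its interface are hypothetical. The sketch also assumes that coordinates with $k_q = 0$ contribute no pole and can be skipped, and it leaves the outer minima over charts $\alpha$ and over monomials $(p_1, ..., p_d)$ to the caller.

```python
from fractions import Fraction

def largest_pole(k, p):
    """Return (lambda_1, m_1) for one chart and one monomial.

    k[q] is the exponent of u_q in H(g(u)) as in eq.(7); p[q] is the exponent
    of u_q in one monomial of the expansion in eq.(8).
    Coordinates with k[q] == 0 give no pole and are skipped (an assumption).
    """
    ratios = [Fraction(pq + 1, kq) for kq, pq in zip(k, p) if kq > 0]
    if not ratios:
        raise ValueError("H(g(u)) does not vanish on this chart")
    lam = min(ratios)
    # m_1 is the number of coordinates q that attain the minimum.
    return lam, ratios.count(lam)

if __name__ == "__main__":
    print(largest_pole((2, 2), (0, 0)))
    print(largest_pole((2, 4), (0, 0)))
```

For instance, `largest_pole((2, 2), (0, 0))` returns `(Fraction(1, 2), 2)`: both coordinates give the ratio $1/2$, so the multiplicity is 2. Exact rational arithmetic is used because $\lambda_1$ is a rational number and ties between coordinates must be detected exactly.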
Remark. In a neighborhood of $w_0 \in W_0$, the analytic function $H(w)$ is equivalent
to a polynomial $H_{w_0}(w)$; in other words, there exist constants $c_1, c_2 > 0$ such that
$c_1 H_{w_0}(w) \le H(w) \le c_2 H_{w_0}(w)$. Hironaka's theorem constructs the resolution map
$g_\alpha$ for any polynomial $H_{w_0}(w)$ algorithmically in finitely many steps (blowing-ups
for nonsingular manifolds in singularities are recursively applied [5]). From the
above discussion, we obtain the inequality $1 \le m_1 \le d$. Moreover, since there exists $\gamma > 0$
such that $H(w) \le \gamma |w - w_0|^2$ in the neighborhood of $w_0 \in W_0$, we obtain $\lambda_1 \le d/2$.
Example. Let us consider a model with $(x, y) \in R^2$ and $w = (a, b, c, d) \in R^4$,
$$p(x, y|w) = p_0(x)\,\frac{1}{(2\pi)^{1/2}} \exp\!\left(-\frac{1}{2}(y - \psi(x, w))^2\right),$$
$$\psi(x, a, b, c, d) = a \tanh(bx) + c \tanh(dx),$$
where $p_0(x)$ is a compact support probability density (not estimated). We also
assume that the true regression function is $y = \psi(x, 0, 0, 0, 0)$. The set of true
parameters is
$$W_0 = \{E_X \psi(X, a, b, c, d)^2 = 0\} = \{ab + cd = 0 \text{ and } ab^3 + cd^3 = 0\}.$$
Assumptions (A.1), (A.2), and (A.3) are satisfied. The singularity in $W_0$ which gives
the smallest $\lambda$ is the origin, and the average loss function in a neighborhood
of the origin is equivalent to the polynomial $H_0(a, b, c, d) = (ab + cd)^2 + (ab^3 + cd^3)^2$
(see [9]). Using blowing-ups, we find a map $g : (x, y, z, w) \mapsto (a, b, c, d)$,
$$a = x, \quad b = y^3 w - yzw, \quad c = zwx, \quad d = y,$$
by which the singularity at the origin is resolved. With the integral taken over this neighborhood,
$$J(z) = \int H_0(a, b, c, d)^z\,\varphi(a, b, c, d)\,da\,db\,dc\,dd = \int \{x^2 y^6 w^2 [1 + (z + w^2 (y^2 - z)^3)^2]\}^z\,|x y^3 w|\,\varphi(g(x, y, z, w))\,dx\,dy\,dz\,dw,$$
which shows that $\lambda_1 = 2/3$ and $m_1 = 1$, resulting in $F(n) = (2/3) \log n + \text{const.}$
If the generalization error can be asymptotically expanded, then $K(n) \cong 2/(3n)$.
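The algebra in this example can be checked mechanically. The following Python sketch uses exact rational arithmetic to confirm, at random rational points, that the map $g$ turns $H_0$ into the stated normal-crossing form and that the Jacobian determinant is $-x y^3 w$. It then reads off $\lambda_1$ and $m_1$ from the exponents. The helper names are hypothetical, the Jacobian is expanded by hand from $g$, and the prior is assumed to be positive at the origin so that only the Jacobian contributes exponents.

```python
from fractions import Fraction
import random

def g(x, y, z, w):
    """Resolution map (x, y, z, w) -> (a, b, c, d) from the example."""
    return x, y**3 * w - y * z * w, z * w * x, y

def h0(a, b, c, d):
    """Polynomial H_0 = (ab + cd)^2 + (ab^3 + cd^3)^2."""
    return (a * b + c * d) ** 2 + (a * b**3 + c * d**3) ** 2

def h0_resolved(x, y, z, w):
    """Stated form x^2 y^6 w^2 [1 + (z + w^2 (y^2 - z)^3)^2]."""
    return x**2 * y**6 * w**2 * (1 + (z + w**2 * (y**2 - z) ** 3) ** 2)

def jacobian_det(x, y, z, w):
    """det d(a,b,c,d)/d(x,y,z,w); since a = x and d = y, it reduces to a 2x2 minor."""
    db_dz, db_dw = -y * w, y**3 - y * z
    dc_dz, dc_dw = w * x, z * x
    return db_dz * dc_dw - db_dw * dc_dz

def check_example(trials=20, seed=0):
    """Verify the identities at random rational points, then return (lambda_1, m_1)."""
    rng = random.Random(seed)
    for _ in range(trials):
        pt = [Fraction(rng.randint(-9, 9), rng.randint(1, 9)) for _ in range(4)]
        x, y, z, w = pt
        assert h0(*g(*pt)) == h0_resolved(*pt)
        assert jacobian_det(*pt) == -x * y**3 * w
    k = {"x": 2, "y": 6, "w": 2}   # exponents of H_0(g(u)); z does not appear
    p = {"x": 1, "y": 3, "w": 1}   # exponents of |x y^3 w|
    ratios = [Fraction(p[v] + 1, k[v]) for v in k]
    lam = min(ratios)
    return lam, ratios.count(lam)

if __name__ == "__main__":
    print(check_example())
```

The ratios $(p_q + 1)/k_q$ are $1$, $2/3$ and $1$ for $x$, $y$ and $w$, so the minimum $2/3$ is attained by $y$ alone.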
5 Conclusion 
A mathematical foundation for non-identifiable learning machines is constructed based
on algebraic analysis and algebraic geometry. We obtained both the rigorous asymptotic
form of the stochastic complexity and an algorithm to calculate it.
Appendix 
In the appendix, we show the inequality eq.(5). 
Lemma 1
Assume conditions (A.1), (A.2) and (A.3). Then
$$E_{X^n}\left\{ \sup_{w \in W} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} [\,h(X_i, w) - E_X h(X, w)\,] \right|^2 \right\} < \infty.$$
This lemma is proven by using just the same method as [10]. In order to prove
(5), we divide '$\sup_{w \in W}$' in eq.(4) into '$\sup_{H(w) \ge \epsilon}$' and '$\sup_{H(w) < \epsilon}$'. Finiteness of
the first half is directly proven by Lemma 1. Let us prove that the second half is also
finite. We can assume without loss of generality that $w$ is in the neighborhood
of $w_0 \in W_0$, because $W$ can be covered by a finite union of neighborhoods. In
each neighborhood, by using the Taylor expansion of an analytic function, we can find
functions $\{f_j(x, w)\}$ and $\{g_j(w) = \prod_i (w_i - w_{0i})^{a_i}\}$ such that
$$h(x, w) = \sum_{j=1}^{J} g_j(w)\,f_j(x, w), \tag{9}$$
where $\{f_j(x, w_0)\}$ are linearly independent functions of $x$ and $g_j(w_0) = 0$. Since
$g_j(w) f_j(x, w)$ is a part of the Taylor expansion around $w_0$, $f_j(x, w)$ satisfies
$$E_{X^n}\left\{ \sup_{w \in W} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (f_j(X_i, w) - E_X f_j(X, w)) \right|^2 \right\} < \infty. \tag{10}$$
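By using the definition $\hat{H}(w) \equiv |H(w, X^n) - H(w)|$,
$$\hat{H}(w)^2 = \left| \frac{1}{n} \sum_{i=1}^{n} \Big\{ \sum_{j=1}^{J} g_j(w)\,(f_j(X_i, w) - E_X f_j(X, w)) \Big\} \right|^2 \le \sum_{j=1}^{J} g_j(w)^2 \sum_{j=1}^{J} \Big\{ \frac{1}{n} \sum_{i=1}^{n} (f_j(X_i, w) - E_X f_j(X, w)) \Big\}^2,$$
where we used the Cauchy-Schwarz inequality. On the other hand, the inequality
$\log x \ge (1/2)(\log x)^2 - x + 1$ $(x > 0)$ shows that
$$H(w) = \int q(x) \log \frac{q(x)}{p(x, w)}\,dx \ge \frac{1}{2} \int q(x) \Big( \log \frac{q(x)}{p(x, w)} \Big)^2 dx \ge \frac{a_0}{2} \sum_{j=1}^{J} g_j(w)^2,$$
where $a_0 > 0$ is the smallest eigenvalue of the positive definite symmetric matrix
$E_X\{f_j(X, w_0) f_k(X, w_0)\}$. Lastly, combining
$$A(X^n)^2 = \sup_{w \in W} \frac{n \hat{H}(w)^2}{H(w)} \le \frac{2}{a_0} \sup_{w \in W} \sum_{j=1}^{J} \Big\{ \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (f_j(X_i, w) - E_X f_j(X, w)) \Big\}^2$$
with eq.(10), we obtain eq.(5).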
Acknowledgments 
This research was partially supported by the Ministry of Education, Science, Sports 
and Culture in Japan, Grant-in-Aid for Scientific Research 09680362. 
References 
[1] Amari,S., Murata, N.(1993) Statistical theory of learning curves under entropic loss. 
Neural Computation, 5 (4) pp.140-153. 
[2] Atiyah, M.F. (1970) Resolution of singularities and division of distributions. Comm. 
Pure and Appl. Math., 13 pp.145-150. 
[3] Fukumizu,K. (1999) Generalization error of linear neural networks in unidentifiable 
cases. Lecture Notes in Computer Science, 1720 Springer, pp.51-62. 
[4] Hagiwara,K., Toda,N., Usui,S. (1993) On the problem of applying AIC to determine 
the structure of a layered feed-forward neural network. Proc. of IJCNN, 3 pp.2263-2266. 
[5] Hironaka, H. (1964) Resolution of singularities of an algebraic variety over a field of
characteristic zero, I,II. Annals of Math., 79 pp.109-326.
[6] Kashiwara, M. (1976) B-functions and holonomic systems, Invent. Math., 38 pp.33-53. 
[7] Oaku, T. (1997) An algorithm of computing b-functions. Duke Math. J., 87 pp.115-132.
[8] Sato, M., Shintani,T. (1974) On zeta functions associated with prehomogeneous vector 
space. Annals of Math., 100, pp.131-170. 
[9] Watanabe, S.(1998) On the generalization error by a layered statistical model with 
Bayesian estimation. IEICE Trans., J81-A pp.1442-1452. English version: Elect. Comm. 
in Japan., to appear. 
[10] Watanabe, S. (1999) Algebraic analysis for singular statistical estimation. Lecture 
Notes in Computer Science, 1720 Springer, pp.39-50. 
