Support Vector Method for Multivariate 
Density Estimation 
Vladimir N. Vapnik 
Royal Holloway College and
AT&T Labs, 100 Schultz Dr. 
Red Bank, NJ 07701 
vlad res earch. a tt. corn 
Sayan Mukherjee 
CBCL, MIT E25-201 
Cambridge, MA 02142 
sayanai. mit. edu 
Abstract 
A new method for multivariate density estimation is developed 
based on the Support Vector Method (SVM) solution of inverse 
ill-posed problems. The solution has the form of a mixture of densities.
This method with Gaussian kernels compared favorably to
both Parzen's method and the Gaussian Mixture Model method. 
For synthetic data we achieve more accurate estimates for densities 
of 2, 6, 12, and 40 dimensions. 
1 Introduction
The problem of multivariate density estimation is important for many applications, 
in particular, for speech recognition [1] [7]. When the unknown density belongs to a 
parametric set satisfying certain conditions one can estimate it using the maximum 
likelihood (ML) method. Often these conditions are too restrictive. Therefore,
non-parametric methods were proposed. 
The most popular of these, Parzen's method [5], uses the following estimate given
data $x_1, \ldots, x_\ell$:
$$p(t, \gamma_\ell) = \frac{1}{\ell} \sum_{i=1}^{\ell} K_{\gamma_\ell}(t - x_i), \tag{1}$$
where $K_{\gamma_\ell}(t - x_i)$ is a smooth function such that
$\int K_{\gamma_\ell}(t - x)\,dt = 1$. Under some conditions on $\gamma_\ell$ and
$K_{\gamma_\ell}(t - x)$, Parzen's method converges with a fast asymptotic rate. An
example of such a function is a Gaussian with one free parameter $\gamma_\ell^2$ (the
width)
$$K_{\gamma_\ell}(t - x_i) = \frac{1}{(2\pi)^{n/2} \gamma_\ell^{n}} \exp\left\{-\frac{1}{2}(t - x_i)^T (\gamma_\ell^2 I)^{-1} (t - x_i)\right\}. \tag{2}$$
The structure of the Parzen estimator is too complex: the number of terms in (1)
is equal to the number of observations (which can be hundreds of thousands).
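A minimal sketch of the estimate (1) with the Gaussian kernel (2) may make the cost argument concrete: every evaluation of the density touches every observation. The function names and the three-point sample are hypothetical, chosen only for illustration, and the kernel is assumed to be the normalized isotropic Gaussian with covariance equal to the squared width times the identity.

```python
import math

def gaussian_kernel(t, x, gamma):
    """Isotropic Gaussian kernel of eq. (2) with width gamma.

    t and x are sequences of equal length n (assumed normalization:
    covariance gamma^2 * I, so the kernel integrates to one).
    """
    n = len(t)
    sq_dist = sum((a - b) ** 2 for a, b in zip(t, x))
    norm = (2.0 * math.pi) ** (n / 2.0) * gamma ** n
    return math.exp(-0.5 * sq_dist / gamma ** 2) / norm

def parzen_density(t, data, gamma):
    """Parzen estimate of eq. (1): the average of one kernel per observation."""
    return sum(gaussian_kernel(t, x, gamma) for x in data) / len(data)

# Hypothetical two-dimensional sample, for illustration only.
data = [(0.0, 0.0), (1.0, 0.5), (-0.5, 1.5)]
print(parzen_density((0.2, 0.3), data, gamma=0.7))
```

The sum in `parzen_density` runs over the whole sample, which is the complexity the text refers to.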
Researchers believe that for practical problems densities can be approximated by 
a mixture with few elements (Gaussians for Gaussian Mixture Models (GMM)). 
Therefore, the following parametric density model was introduced 
$$P(x, a, \Sigma) = \sum_{i=1}^{m} \alpha_i P(x, a_i, \Sigma_i), \quad \alpha_i \ge 0, \quad \sum_{i=1}^{m} \alpha_i = 1, \tag{3}$$
where $P(x, a_i, \Sigma_i)$ are Gaussians with different vectors $a_i$ and different diagonal
covariance matrices $\Sigma_i$; $\alpha_i$ is the proportion of the $i$-th Gaussian in the mixture.
It is known [9] that for general forms of Gaussian mixtures the ML estimate does not 
exist. To use the ML method two values are specified: a lower bound on diagonal 
elements of the covariance matrix and an upper bound on the number of mixture 
elements. Under these constraints one can estimate the mixture parameters using 
the EM algorithm. This solution, however, is based on predefined parameters. 
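The role of the two predefined values can be illustrated with a small EM loop for a univariate mixture of the form (3). This is only a sketch under stated assumptions: the data are one-dimensional, the number of components is fixed by the length of the initial `means` list, and `var_floor` plays the part of the lower bound on the diagonal covariance elements. All names and the sample are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, means, var_floor, n_iter=50):
    """EM for a univariate Gaussian mixture; variances are clipped at var_floor."""
    m = len(means)
    means = list(means)
    variances = [1.0] * m
    weights = [1.0 / m] * m
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(m)]
            total = sum(joint) or 1e-300  # guard against underflow to zero
            resp.append([j / total for j in joint])
        # M-step: re-estimate proportions, means and (floored) variances.
        for k in range(m):
            nk = max(sum(r[k] for r in resp), 1e-300)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, var_floor)  # the predefined lower bound
    return weights, means, variances

# Hypothetical sample with two tight clusters, for illustration only.
data = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8, 2.0]
weights, means, variances = em_gmm_1d(data, means=[-1.0, 1.0], var_floor=0.01)
print(weights, means, variances)
```

On this sample the first cluster is tighter than the floor allows, so its variance ends at the bound rather than at its unconstrained value; without the clip, a component centred on a single point could shrink indefinitely.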
In this article we use an SVM approach to obtain an estimate in the form of a
mixture of densities. The approach has no free parameters. In our experiments it 
performs better than the GMM method. 
2 Density estimation is an ill-posed problem 
A density $p(t)$ is defined as the solution of the equation
$$\int_{-\infty}^{x} p(t)\,dt = F(x), \tag{4}$$
where $F(x)$ is the probability distribution function. Estimating a density from data
involves solving equation (4) on a given set of densities when the distribution
function $F(x)$ is unknown but a random i.i.d. sample $x_1, \ldots, x_\ell$ is given. The empirical
distribution function $F_\ell(x)$ is a good approximation of the actual distribution,
$$F_\ell(x) = \frac{1}{\ell} \sum_{i=1}^{\ell} \theta(x - x_i),$$
where $\theta(u)$ is the step-function. In the univariate case, for sufficiently large $\ell$
the distribution of the supremum error between $F(x)$ and $F_\ell(x)$ is given by the
Kolmogorov-Smirnov distribution
$$P\left\{\sup_x |F(x) - F_\ell(x)| < \varepsilon / \sqrt{\ell}\right\} = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} \exp\{-2 \varepsilon^2 k^2\}. \tag{5}$$
Hence, the problem of density estimation can be restated as solving equation (4)
but replacing the distribution function $F(x)$ with the empirical distribution function
$F_\ell(x)$ which converges to the true one with the (fast) rate $O(1/\sqrt{\ell})$ for univariate and
multivariate cases.
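The empirical distribution function is simple to compute directly from the sample. The sketch below assumes the usual multivariate convention that the step function is applied coordinate-wise, with the value one at zero, so an observation counts when it is less than or equal to the query point in every coordinate. The helper names, the four-point sample, and the uniform reference distribution are hypothetical.

```python
def empirical_cdf(x, sample):
    """Fraction of observations that are <= x in every coordinate."""
    count = sum(all(xi_k <= x_k for xi_k, x_k in zip(xi, x)) for xi in sample)
    return count / len(sample)

def sup_deviation(cdf, sample):
    """Largest |F(x) - F_l(x)| over the sample points.

    Evaluating only at the sample points gives a lower bound on the
    supremum over all x.
    """
    return max(abs(cdf(x) - empirical_cdf(x, sample)) for x in sample)

# Hypothetical sample on the unit square; F(x) = x1 * x2 is the uniform CDF there.
sample = [(0.2, 0.7), (0.5, 0.5), (0.9, 0.1), (0.6, 0.8)]
uniform_cdf = lambda x: x[0] * x[1]
print(empirical_cdf((0.6, 0.8), sample))
print(sup_deviation(uniform_cdf, sample))
```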
The problem of solving the linear operator equation $Ap = F$ with approximation
$F_\ell(x)$ is ill-posed.
In the 1960's methods were proposed for solving ill-posed problems using
approximations $F_\ell$ converging to $F$ as $\ell$ increases. The idea of these methods was to
introduce a regularizing functional $\Omega(p)$ (a semi-continuous, positive functional for
which $\Omega(p) \le c$, $c > 0$ is a compactum) and define the solution $p_\ell$ which is a
trade-off between $\Omega(p)$ and $\|Ap - F_\ell\|$.
The following two methods which are asymptotically equivalent [11] were proposed
by Tikhonov [8] and Phillips [6]
$$\min_p \left[ \|Ap - F_\ell\|^2 + \gamma_\ell \Omega(p) \right], \quad \gamma_\ell > 0, \quad \gamma_\ell \to 0, \tag{6}$$
$$\min_p \Omega(p) \quad \text{s.t.} \quad \|Ap - F_\ell\| < \varepsilon_\ell, \quad \varepsilon_\ell > 0, \quad \varepsilon_\ell \to 0. \tag{7}$$
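A small discretized example can show why regularization is needed and what method (6) does. The sketch below is an illustration under several assumptions that are not taken from the paper: the density is piecewise constant on a grid of cell width `h`, the operator is the cumulative integral on that grid, the norm is Euclidean, and the regularizing functional is the squared Euclidean norm of the density values. All names are hypothetical.

```python
def solve_linear(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z

def tikhonov_density(F_vals, h, gamma):
    """Minimise ||A p - F||^2 + gamma * ||p||^2 via the normal equations."""
    m = len(F_vals)
    # A[i][j] = h for j <= i: cumulative integral of a piecewise-constant p.
    A = [[h if j <= i else 0.0 for j in range(m)] for i in range(m)]
    AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) + (gamma if i == j else 0.0)
            for j in range(m)] for i in range(m)]
    AtF = [sum(A[k][i] * F_vals[k] for k in range(m)) for i in range(m)]
    return solve_linear(AtA, AtF)

h = 0.1
exact_F = [h * (i + 1) for i in range(10)]  # CDF of the uniform density on [0, 1]
# Hypothetical perturbation standing in for the error of the empirical CDF.
noisy_F = [f + 0.02 * (-1) ** i for i, f in enumerate(exact_F)]
for gamma in (1e-6, 1e-3, 1e-1):
    p = tikhonov_density(noisy_F, h, gamma)
    print(gamma, [round(v, 2) for v in p])
```

With almost no regularization the small perturbation of the right-hand side is amplified into a strongly oscillating density; increasing the regularization weight damps the oscillation at the price of a larger residual.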
For the stochastic case it can be shown for both methods that if $F_\ell(x)$ converges in
probability to $F(x)$ and $\gamma_\ell \to 0$ then for sufficiently large $\ell$ and arbitrary $\nu$ and $\mu$
the following inequality holds [10] [9] [3]
$$P(\rho_{E_1}(p, p_\ell) > \mu) \le P(\rho_{E_2}(F, F_\ell) > \sqrt{\gamma_\ell \nu}), \tag{8}$$
where $\ell > \ell_0(\nu, \mu)$ and $\rho_{E_1}(p, p_\ell)$, $\rho_{E_2}(F, F_\ell)$ are metrics in the spaces of $p$ and $F$.
Since $F_\ell(x) \to F(x)$ in probability with the rate $O(1/\sqrt{\ell})$, from equation (8) it follows
that if $\gamma_\ell > O(1/\ell)$ the solutions of equation (4) are consistent.
3 Choice of regularization parameters 
For the deterministic case the residual method [2] can be used to set the
regularization parameters ($\gamma_\ell$ in (6) and $\varepsilon_\ell$ in (7)) by setting the parameter
($\gamma_\ell$ or $\varepsilon_\ell$) such that $p_\ell$ satisfies the following
$$\|A p_\ell - F_\ell\| = \|F(x) - F_\ell(x)\| = \sigma_\ell, \tag{9}$$
where $\sigma_\ell$ is the known accuracy of approximation of $F(x)$ by $F_\ell(x)$. We use this idea
for the stochastic case. The Kolmogorov-Smirnov distribution is used to set $\sigma_\ell$,
$\sigma_\ell = c/\sqrt{\ell}$, where $c$ corresponds to an appropriate quantile. For the multivariate case
one can either evaluate the appropriate quantile analytically [4] or by simulations.
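For the univariate case the quantile can be read off the series in (5) numerically. The following sketch truncates the series and inverts it by bisection; the 0.95 level and the sample size are hypothetical choices, since the text does not state which quantile is appropriate, and the function names are illustrative.

```python
import math

def ks_cdf(eps, terms=100):
    """Right-hand side of eq. (5), truncated after `terms` terms.

    The truncation is accurate for eps that is not very close to zero.
    """
    if eps <= 0.0:
        return 0.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * eps ** 2 * k ** 2)
                 for k in range(1, terms + 1))
    return 1.0 - 2.0 * series

def ks_quantile(level, lo=0.0, hi=5.0, tol=1e-10):
    """Value c with ks_cdf(c) = level, found by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ks_cdf(mid) < level:
            lo = mid
        else:
            hi = mid
    return hi

def sigma_for(level, n_obs):
    """Accuracy c / sqrt(n_obs) used in condition (9)."""
    return ks_quantile(level) / math.sqrt(n_obs)

print(ks_quantile(0.95))      # about 1.358
print(sigma_for(0.95, 100))   # about 0.136
```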
The density estimation problem can be solved using either regularization method 
(6) or (7). Using method (6) with an $L_2$ norm in image space $F$ and regularization
functional $\Omega(p) = (Tp, Tp)$ where $T$ is a convolution operator, one obtains Parzen's
method [10] [9] with kernels defined by operator $T$.
4 SVM for density estimation 
We apply the SVM technique to equation (7) for density estimation. We use the $C$
norm in (7) and solve equation (4) in a set of functions belonging to a Reproducing
Kernel Hilbert Space (RKHS). We use the regularization functional
$$\Omega(p) = \|p\|_{\mathcal{H}}^2 = (p, p)_{\mathcal{H}}. \tag{10}$$
A RKHS can be defined by a positive definite kernel $K(x, y)$ and an inner product
$(f, g)_{\mathcal{H}}$ in Hilbert space $\mathcal{H}$ such that
$$(f(x), K(x, y))_{\mathcal{H}} = f(y) \quad \forall f \in \mathcal{H}. \tag{11}$$
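Note that any positive definite function $K(x, y)$ has an expansion
$$K(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y), \tag{12}$$
where $\lambda_i$ and $\phi_i(x)$ are eigenvalues and eigenfunctions of $K(x, y)$. Consider the set
of functions
$$f(x, c) = \sum_{i=1}^{\infty} c_i \phi_i(x) \tag{13}$$
and the inner product
$$(f(x, c^*), f(x, c^{**}))_{\mathcal{H}} = \sum_{i=1}^{\infty} \frac{c_i^* c_i^{**}}{\lambda_i}. \tag{14}$$
Kernel (12), inner product (14), and set (13) define a RKHS and
$$(f(x), K(x, y))_{\mathcal{H}} = \left( \sum_{i=1}^{\infty} c_i \phi_i(x), \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y) \right)_{\mathcal{H}} = \sum_{i=1}^{\infty} \frac{c_i \lambda_i \phi_i(y)}{\lambda_i} = f(y).$$
For functions from a RKHS the functional (10) has the form
$$\Omega(p) = \sum_{i=1}^{\infty} \frac{c_i^2}{\lambda_i}, \tag{15}$$
where $\lambda_i$ is the $i$-th eigenvalue of the kernel $K(x, y)$. The choice of the kernel defines
smoothness properties on the solution.
To use method (7) to solve for the density in equation (4) in a RKHS with a solution
satisfying condition (9) we minimize
$$\Omega(p) = (p, p)_{\mathcal{H}}$$
subject to the constraint
$$\max_i \left| F_\ell(x) - \int_{-\infty}^{x} p(t)\,dt \right|_{x = x_i} = \sigma_\ell.$$
We look for a solution of equation (4) with the form
$$p(t) = \sum_{i=1}^{\ell} \beta_i K(x_i, t). \tag{16}$$
Accounting for (16) and (11) minimizing (10) is equivalent to minimizing
$$\Omega(p) = (p, p)_{\mathcal{H}} = \sum_{i,j=1}^{\ell} \beta_i \beta_j K(x_i, x_j) \tag{17}$$
subject to constraints
$$\max_i \left| F_\ell(x) - \sum_{j=1}^{\ell} \beta_j \int_{-\infty}^{x} K(x_j, t)\,dt \right|_{x = x_i} = \sigma_\ell, \tag{18}$$
$$\beta_i \ge 0, \quad \sum_{i=1}^{\ell} \beta_i = 1. \tag{19}$$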
This optimization problem is closely related to the SV regression problem with a
$\sigma_\ell$-insensitive zone [9]. It can be solved using the standard SVM technique.
Generally, only a few of the $\beta_i$ will be nonzero; the $x_i$ corresponding to these $\beta_i$ are
called support vectors.
Note that kernel (2) has width parameter $\gamma_\ell$. We call the value of this parameter
admissible if it satisfies constraint (18) (the solution satisfies condition (9)). The
admissible set $\gamma_{\min} \le \gamma_\ell \le \gamma_{\max}$ is not empty since for Parzen's method (which
also has form (16)) such a value does exist. Among the $\gamma_\ell$ in this admissible set
we select the one for which $\Omega(p)$ is smallest or the number of support vectors is
minimum.
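The ingredients of the optimization problem can be written down explicitly for univariate data and a Gaussian kernel, where the integral of the kernel in (18) is a normal distribution function. The sketch below does not solve the quadratic program; it only evaluates the objective (17) and the left-hand side of (18) for a given weight vector, here the uniform weights of Parzen's method, and checks admissibility of a width. Treating (18) as an inequality, the tolerance value, and all names are assumptions made for illustration.

```python
import math

def gauss_gram(xs, gamma):
    """K(x_i, x_j) for the univariate Gaussian kernel of width gamma."""
    norm = math.sqrt(2.0 * math.pi) * gamma
    return [[math.exp(-0.5 * ((a - b) / gamma) ** 2) / norm for b in xs] for a in xs]

def objective(beta, xs, gamma):
    """Quadratic form of eq. (17)."""
    K = gauss_gram(xs, gamma)
    n = len(xs)
    return sum(beta[i] * beta[j] * K[i][j] for i in range(n) for j in range(n))

def max_residual(beta, xs, gamma):
    """Left-hand side of eq. (18), evaluated at the sample points."""
    n = len(xs)
    worst = 0.0
    for x in xs:
        f_emp = sum(1 for x_j in xs if x_j <= x) / n
        # The integral of a Gaussian kernel up to x is a normal CDF (via erf).
        f_fit = sum(b * 0.5 * (1.0 + math.erf((x - x_j) / (gamma * math.sqrt(2.0))))
                    for b, x_j in zip(beta, xs))
        worst = max(worst, abs(f_emp - f_fit))
    return worst

def is_admissible(beta, xs, gamma, sigma):
    """Assumption: the width is accepted when the residual does not exceed sigma."""
    return max_residual(beta, xs, gamma) <= sigma

# Hypothetical evenly spaced sample and tolerance, for illustration only.
xs = [i / 10 for i in range(20)]
beta = [1.0 / len(xs)] * len(xs)        # Parzen's uniform weights
sigma = 0.36 / math.sqrt(len(xs))
for gamma in (0.01, 0.1, 1.0, 100.0):
    print(gamma, is_admissible(beta, xs, gamma, sigma), objective(beta, xs, gamma))
```

For these uniform weights a narrow kernel reproduces the empirical distribution function closely and is admissible, while a very wide kernel is not; the objective decreases as the width grows, which is why the selection rule favours the widest admissible kernel.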
Choosing other kernels (for example Laplacians) one can estimate densities using 
non-Gaussian mixture models which for some problems are more appropriate [1]. 
5 Experiments 
Several trials of estimates constructed from sampling distributions were examined. 
Boxplots were made of the $L_1(p)$ norm over the trials. The horizontal lines of the
boxplot indicate the 5%, 25%, 50%, 75%, and 95% quantiles of the error distribution.
For the SVM method we set $\sigma_\ell = c/\sqrt{\ell}$, where $c$ = .36, .41, .936, and 1.75 for
two, six, twelve and forty dimensions. For Parzen's method $\gamma_\ell$ was selected using
a leave-one-out procedure. The GMM method uses the EM algorithm and sets all
parameters except n, the upper bound on the number of terms in the mixture [7]. 
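The text does not say which criterion the leave-one-out procedure optimizes. One common choice, assumed here purely for illustration, is the leave-one-out log-likelihood: each observation is scored under the Parzen density built from the remaining observations, and the width with the highest mean score is kept. The sketch is univariate and all names and values are hypothetical.

```python
import math

def loo_log_likelihood(data, gamma):
    """Mean log Parzen density at each point, estimated from the other points."""
    n = len(data)
    norm = math.sqrt(2.0 * math.pi) * gamma
    total = 0.0
    for i, x in enumerate(data):
        dens = sum(math.exp(-0.5 * ((x - y) / gamma) ** 2)
                   for j, y in enumerate(data) if j != i) / ((n - 1) * norm)
        total += math.log(max(dens, 1e-300))  # floor avoids log(0) on underflow
    return total / n

def select_width(data, candidates):
    """Candidate width with the largest leave-one-out score."""
    return max(candidates, key=lambda g: loo_log_likelihood(data, g))

# Hypothetical evenly spaced sample and candidate widths.
data = [i / 10 for i in range(20)]
print(select_width(data, [0.001, 0.1, 10.0]))
```

Leaving the scored point out matters: if it were included, the narrowest kernel would always win because each point would sit under its own spike.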
Figure (1) shows plots of the SVM estimate using a Gaussian kernel and the GMM 
estimate when 60 points were drawn from a mixture of a Gaussian and Laplacian
in two dimensions. 
Figure (2a) shows four boxplots of estimating a density defined by a mixture of 
two Laplacians in a two dimensional space using 200 observations. Each boxplot 
shows outcomes of 100 trials: for the SVM method, Parzen's method, and the GMM 
method with parameters n = 2, and n = 4. Figure (2c) shows the distribution of 
the number of terms for the SVM method. 
Figure (2b) shows boxplots of estimating a density defined by the mixture of four 
Gaussians in a six dimensional space using 600 observations. Each boxplot shows 
outcomes of 50 trials: for the SVM method, Parzen's method, and the GMM method 
with parameters n = 4, and n = 8. Figure (2c) shows the distribution of the number
of terms for the SVM method. 
Figure (3a) shows boxplots of outcomes of estimating a density defined by the 
mixture of four Gaussians and four Laplacians in a twelve dimensional space using 
400 observations. Each boxplot shows outcomes of 50 trials: for the SVM method,
Parzen's method, and the GMM method with parameter n = 8. Figure (3c) shows 
the distribution of the number of terms for the SVM method. 
Figure (3b) shows boxplots of outcomes of estimating a density defined by the 
mixture of four Gaussians and four Laplacians in a forty dimensional space using 
480 observations. Each boxplot shows outcomes of 50 trials: for the SVM method,
Parzen's method, and the GMM method with parameter n = 8. Figure (3c) shows
the distribution of the number of terms for the SVM method. 
6 Summary 
A method for multivariate density estimation based on the SVM technique for 
solving ill-posed problems is introduced. This method has a form of a mixture of 
densities. The estimate in general has only a few terms. In experiments on synthetic 
data this method is more accurate than the GMM method. 
References 
[1] S. Basu and C.A. Micchelli. Parametric density estimation for the classification of
acoustic feature vectors in speech recognition. In Nonlinear Modeling, Advanced
Black-Box Techniques. Kluwer Publishers, 1998.
[2] V.A. Morozov. Methods for solving incorrectly posed problems. Springer-Verlag, 
Berlin, 1984. 
[3] S. Mukherjee and V. Vapnik. Multivariate density estimation: An SVM approach. AI
Memo 1653, Massachusetts Institute of Technology, 1999. 
[4] S. Paramasamy. On multivariate Kolmogorov-Smirnov distribution. Statistics &
Probability Letters, 15:140-155, 1992.
[5] E. Parzen. On estimation of a probability density function and mode. Ann. Math. 
Statis., 33:1065-1076, 1962. 
[6] D.L. Phillips. A technique for the numerical solution of integral equations of the first 
kind. J. Assoc. Comput. Machinery, 9:84-97, 1962.
[7] D. Reynolds and R. Rose. Robust text-independent speaker identification using
Gaussian mixture speaker models. IEEE Trans on Speech and Audio Processing, 3(1):1-27,
1995.
[8] A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization 
method. Soviet Math. Dokl., 4:1035-1038, 1963. 
[9] V. N. Vapnik. Statistical learning theory. J. Wiley, 1998. 
[10] V.N. Vapnik and A.R. Stefanyuk. Nonparametric methods for restoring probability 
densities. Avtomatika i Telemekhanika, (8):38-52, 1978. 
[11] V.V. Vasin. Relationship of several variational methods for the approximate solution 
of ill-posed problems. Math Notes, 7:161-166, 1970. 
Figure 1: (a) The true distribution (b) the GMM case with 4 mixtures (c) the 
Parzen case (d) the SVM case for 60 points. 
[Figure 2 plots. Horizontal-axis labels: (a) SVM, Parzen, GMM 2, GMM 4; (b) SVM, Parzen, GMM 4, GMM 8; (c) 2 dimensions, 6 dimensions.]
Figure 2: (a) Boxplots of the $L_1(p)$ error for the mixture of two Laplacians in two
dimensions for the SVM method, Parzen's method, and the GMM method with 2
and 4 Gaussians. (b) Boxplots of the $L_1(p)$ error for mixture of four Gaussians
in six dimensions with the SVM method, Parzen's method, and the GMM method
with 4 mixtures. (c) Boxplots of distribution of the number of terms for the SVM
method for the two and six dimensional cases.
[Figure 3 plots. Horizontal-axis labels: (a) SVM, Parzen, GMM 8; (b) SVM, Parzen, GMM 8; (c) 12 Dim, 40 Dim.]
Figure 3: (a) Boxplots of the $L_1(p)$ error for the mixture of four Laplacians and
four Gaussians in twelve dimensions for the SVM method, Parzen's method, and the
GMM method with 8 Gaussians. (b) Boxplots of the $L_1(p)$ error for the mixture
of four Laplacians and four Gaussians in forty dimensions for the SVM method,
Parzen's method, and the GMM method with 8 Gaussians. (c) Boxplots of
distribution of the number of terms for the SVM method for the twelve and forty
dimensional cases.
