{"title": "An Extended Level Method for Efficient Multiple Kernel Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1825, "page_last": 1832, "abstract": "We consider the problem of multiple kernel learning (MKL), which can be formulated as a convex-concave problem. In the past, two efficient methods, i.e., Semi-Infinite Linear Programming (SILP) and Subgradient Descent (SD), have been proposed for large-scale multiple kernel learning. Despite their success, both methods have their own shortcomings: (a) the SD method utilizes the gradient of only the current solution, and (b) the SILP method does not regularize the approximate solution obtained from the cutting plane model. In this work, we extend the level method, which was originally designed for optimizing non-smooth objective functions, to convex-concave optimization, and apply it to multiple kernel learning. The extended level method overcomes the drawbacks of SILP and SD by exploiting all the gradients computed in past iterations and by regularizing the solution via a projection to a level set. Empirical study with eight UCI datasets shows that the extended level method can significantly improve efficiency by saving on average 91.9% of computational time over the SILP method and 70.3% over the SD method.", "full_text": "An Extended Level Method for\n\nEf\ufb01cient Multiple Kernel Learning\n\nZenglin Xu\u2020\n\nRong Jin\u2021\n\nIrwin King\u2020\n\nMichael R. Lyu\u2020\n\n\u2020 Dept. of Computer Science & Engineering\n\nThe Chinese University of Hong Kong\n\nShatin, N.T., Hong Kong\n\n{zlxu, king, lyu}@cse.cuhk.edu.hk\n\n\u2021Dept. of Computer Science & Engineering\n\nMichigan State University\nEast Lansing, MI, 48824\n\nrongjin@cse.msu.edu\n\nAbstract\n\nWe consider the problem of multiple kernel learning (MKL), which can be for-\nmulated as a convex-concave problem. 
In the past, two efficient methods, i.e., Semi-Infinite Linear Programming (SILP) and Subgradient Descent (SD), have been proposed for large-scale multiple kernel learning. Despite their success, both methods have their own shortcomings: (a) the SD method utilizes the gradient of only the current solution, and (b) the SILP method does not regularize the approximate solution obtained from the cutting plane model. In this work, we extend the level method, which was originally designed for optimizing non-smooth objective functions, to convex-concave optimization, and apply it to multiple kernel learning. The extended level method overcomes the drawbacks of SILP and SD by exploiting all the gradients computed in past iterations and by regularizing the solution via a projection to a level set. Empirical study with eight UCI datasets shows that the extended level method can significantly improve efficiency by saving on average 91.9% of computational time over the SILP method and 70.3% over the SD method.

1 Introduction

Kernel learning [5, 9, 7] has received a lot of attention in recent studies of machine learning. This is due to the importance of kernel methods: kernel functions define a generalized similarity measure among data. A generic approach to learning a kernel function is known as multiple kernel learning (MKL) [5]: given a list of base kernel functions/matrices, MKL searches for the linear combination of base kernel functions that maximizes a generalized performance measure. Previous studies [5, 14, 13, 4, 1] have shown that MKL is usually able to identify an appropriate combination of kernel functions and, as a result, to improve performance.

A variety of methods have been used to create base kernels. For instance, base kernels can be created by using different kernel functions; they can also be created by using a single kernel function but with different subsets of features.
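Section 4.1 later lists the concrete base kernels used in the experiments: Gaussian kernels with 10 widths (2^-3, ..., 2^6) and polynomial kernels of degree 1 to 3, each built on all features and on every single feature, and each normalized to unit trace. A minimal numpy sketch of that construction is below; note that the exact Gaussian parameterization (width as the denominator of the squared distance) and the inhomogeneous polynomial form (x_a^T x_b + 1)^d are our assumptions, since the paper does not spell them out:

```python
import numpy as np

def gaussian_kernel(X, width):
    # K[a, b] = exp(-||x_a - x_b||^2 / width); one common parameterization
    # (the paper only specifies the width grid, not the exact formula)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / width)

def make_base_kernels(X):
    """Base kernels as in the paper's setup: Gaussian kernels with widths
    2^-3..2^6 and polynomial kernels of degree 1-3, each built on all
    features and on every single feature, then normalized to unit trace."""
    kernels = []
    feature_views = [X] + [X[:, [j]] for j in range(X.shape[1])]
    for Xv in feature_views:
        for width in 2.0 ** np.arange(-3, 7):          # 10 Gaussian widths
            kernels.append(gaussian_kernel(Xv, width))
        for degree in (1, 2, 3):                       # 3 polynomial kernels
            kernels.append((Xv @ Xv.T + 1.0) ** degree)
    # normalize each base kernel to unit trace
    return [K / np.trace(K) for K in kernels]

X = np.random.RandomState(0).randn(8, 3)   # 8 samples, 3 features (toy data)
Ks = make_base_kernels(X)
print(len(Ks))  # 13 kernels per view, (1 + d) views -> 52
```

With d features this yields 13(d + 1) base kernels, which matches the kernel counts reported in Table 1 (e.g., Pima with 8 features gives 9 x 13 = 117 kernels).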
As for the performance measures needed to find the optimal kernel function, several measures have been studied for multiple kernel learning, including the maximum margin classification error [5], kernel-target alignment [4], and Fisher discriminative analysis [13].

The multiple kernel learning problem was first formulated as a semi-definite programming (SDP) problem by [5]. An SMO-like algorithm was proposed in [2] in order to solve medium-scale problems. More recently, a Semi-Infinite Linear Programming (SILP) approach was developed for MKL [12]. SILP is an iterative algorithm that alternates between the optimization of the kernel weights and the optimization of the SVM classifier. In each step, given the current solution of kernel weights, it solves a classical SVM with the combined kernel; it then constructs a cutting plane model for the objective function and updates the kernel weights by solving a corresponding linear programming problem. Although the SILP approach can be employed for large-scale MKL problems, it often suffers from slow convergence. One shortcoming of the SILP method is that it updates kernel weights solely based on the cutting plane model. Given that a cutting plane model usually differs significantly from the original objective function when the solution is far away from the points where the cutting plane model is constructed, the optimal solution to the cutting plane model could be significantly off target. In [10], the authors addressed the MKL problem with a simple Subgradient Descent (SD) method.
However, since the SD method is memoryless, it does not utilize the gradients computed in previous iterations, which could be very useful in boosting the efficiency of the search.

To further improve the computational efficiency of MKL, we extend the level method [6], which was originally designed for optimizing non-smooth functions, to the optimization of convex-concave problems. In particular, we regard the MKL problem as a saddle point problem. Similar to the SILP method, we construct in each iteration a cutting plane model for the target objective function using the solutions to the intermediate SVM problems. A new solution for the kernel weights is obtained by solving the cutting plane model. We furthermore adjust the new solution via a projection to a level set. This adjustment is critical in that it ensures, on the one hand, that the new solution is sufficiently close to the current solution and, on the other hand, that the new solution significantly reduces the objective function. We show that the extended level method has a convergence rate of O(1/ε²) for an ε-accurate solution. Although this is similar to that of the SD method, the extended level method is advantageous in that it utilizes all the gradients that have been computed so far. Empirical results with eight UCI datasets show that the extended level method is able to greatly improve the efficiency of multiple kernel learning in comparison with the SILP method and the SD method.

The rest of this paper is organized as follows. In section 2, we review the efficient algorithms that have been designed for multiple kernel learning. In section 3, we describe the details of the extended level method for MKL, including a study of its convergence rate. In section 4, we present experimental results comparing both the effectiveness and the efficiency of the extended level method with the corresponding measures of SILP and SD.
We conclude this work in section 5.

2 Related Work

Let X = (x_1, . . . , x_n) ∈ R^{n×d} denote the collection of n training samples that are in a d-dimensional space. We further denote by y = (y_1, y_2, . . . , y_n) ∈ {−1, +1}^n the binary class labels for the data points in X. We employ the maximum margin classification error, an objective used in SVM, as the generalized performance measure. Following [5], the problem of multiple kernel learning for classification in the primal form is defined as follows:

min_{p∈P} max_{α∈Q} f(p, α) = α⊤e − (1/2)(α ∘ y)⊤(Σ_{i=1}^m p_i K_i)(α ∘ y),   (1)

where P = {p ∈ R^m : p⊤e = 1, 0 ≤ p ≤ 1} and Q = {α ∈ R^n : α⊤y = 0, 0 ≤ α ≤ C} are two solid convex regions, denoting the set of kernel weights and the set of SVM dual variables, respectively. Here, e is a vector of all ones, C is the trade-off parameter in SVM, {K_i}_{i=1}^m is a group of base kernel matrices, and ∘ denotes the element-wise product between two vectors. It is easy to verify that f(p, α) is convex in p and concave in α; thus the above optimization problem is indeed a convex-concave problem. It is important to note that the block-minimization formulation of MKL presented in [10, 2] is equivalent to (1).

A straightforward approach toward solving the convex-concave problem in (1) is to transform it into a Semi-definite Programming (SDP) or a Quadratically Constrained Quadratic Programming (QCQP) problem [5, 2]. However, given their computational complexity, these formulations cannot be applied to large-scale MKL problems. Recently, Semi-infinite Linear Programming (SILP) [12] and Subgradient Descent (SD) [10] have been applied to handle large-scale MKL problems. We summarize them into a unified framework in Algorithm 1.
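Once the base kernels are fixed, the objective in (1) is cheap to evaluate, and for a fixed α it is affine in p. A minimal numpy sketch (the dual point α below is chosen by hand to satisfy α⊤y = 0, not obtained from an SVM solve):

```python
import numpy as np

def mkl_objective(p, alpha, y, kernels):
    """f(p, alpha) = alpha^T e - 1/2 (alpha o y)^T (sum_i p_i K_i) (alpha o y),
    the convex-concave objective in Eq. (1)."""
    ay = alpha * y                                # element-wise product alpha o y
    K = sum(pi * Ki for pi, Ki in zip(p, kernels))  # combined kernel
    return alpha.sum() - 0.5 * ay @ K @ ay

# toy instance: two PSD base kernels on 3 points
rng = np.random.RandomState(0)
A, B = rng.randn(3, 3), rng.randn(3, 3)
kernels = [A @ A.T, B @ B.T]
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.5, 0.5, 0.0])   # hand-picked so that alpha^T y = 0
p = np.array([0.5, 0.5])
print(mkl_objective(p, alpha, y, kernels))
```

Because f(p, α) is affine in p for fixed α, the value at p = (0.5, 0.5) equals the average of the values at the two simplex vertices; the cutting plane constructions below rely on exactly this linearity.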
Note that a superscript is used to indicate the index of iteration, a convention that is used throughout this paper. We use [x]^t to denote x to the power of t in case of ambiguity.

As indicated in Algorithm 1, both methods divide the MKL problem into two cycles: the inner cycle solves a standard SVM problem to update α, and the outer cycle updates the kernel weight vector p.

Algorithm 1 A general framework for solving MKL
1: Initialize p^0 = e/m and i = 0
2: repeat
3:   Solve the dual of SVM with kernel K = Σ_{j=1}^m p^i_j K_j and obtain the optimal solution α^i
4:   Update the kernel weights by p^{i+1} = arg min {φ^i(p; α) : p ∈ P}
5:   Update i = i + 1 and calculate the stopping criterion Δ_i
6: until Δ_i ≤ ε

They differ in the 4th step of Algorithm 1: the SILP method updates p by solving a cutting plane model, while the SD method updates p using the subgradient of the current solution. More specifically, φ^i(p; α) for SILP and SD are defined as follows:

φ^i_SILP(p; α) = min_ν {ν : ν ≥ f(p, α^j), j = 0, . . . , i},   (2)

φ^i_SD(p; α) = (1/2)‖p − p^i‖²_2 + γ_i (p − p^i)⊤ ∇_p f(p^i, α^i),   (3)

where γ_i is the step size that needs to be decided dynamically (e.g., by a line search), and ∇_p f(p^i, α^i) = −(1/2)[(α^i ∘ y)⊤K_1(α^i ∘ y), . . . , (α^i ∘ y)⊤K_m(α^i ∘ y)]⊤ denotes the subgradient of f(·, ·) with respect to p at (p^i, α^i). Comparing the two methods, we observe:

• In SILP, the cutting plane model φ_SILP(p) utilizes all the {α^j}^i_{j=1} obtained in past iterations. In contrast, SD only utilizes α^i of the current solution p^i.

• SILP updates the solution for p based solely on the cutting plane model φ_SILP(p). Since the cutting plane model is usually inaccurate when p is far away from {p^j}^i_{j=1}, the updated solution p could be significantly off target [3]. In contrast, a regularization term ‖p − p^i‖²_2/2 is introduced in SD to prevent the new solution from being far from the current one, p^i.

The proposed level method combines the strengths of both methods. Similar to SILP, it utilizes the gradient information of all the iterations; similar to SD, a regularization scheme is introduced to prevent the updated solution from being too far from the current solution.

3 Extended Level Method for MKL

We first introduce the basic steps of the level method, followed by its extension to convex-concave problems and its application to MKL.

3.1 Introduction to the Level Method

The level method [6] is from the family of bundle methods, which have recently been employed to efficiently solve regularized risk minimization problems [11]. It is an iterative approach designed for optimizing a non-smooth objective function. Let f(x) denote the convex objective function to be minimized over a convex domain G. In the ith iteration, the level method first constructs a lower bound for f(x) by a cutting plane model, denoted by g^i(x). The optimal solution, denoted by x̂^i, that minimizes the cutting plane model g^i(x) is then computed. An upper bound f̄^i and a lower bound f^i for the optimal value of the target optimization problem are computed based on x̂^i. Next, a level set for the cutting plane model g^i(x) is constructed, denoted by L^i = {x ∈ G : g^i(x) ≤ λ f̄^i + (1 − λ) f^i}, where λ ∈ (0, 1) is a tradeoff constant. Finally, a new solution x^{i+1} is computed by projecting x^i onto the level set L^i.
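Since f(p, α^j) is affine in p for each stored α^j, minimizing a cutting plane model max_j (c_j + g_j⊤p) over the simplex P reduces to a small epigraph linear program. This is the computation behind both step 4 of SILP and the lower bound of the level method. A sketch using scipy's linprog, with made-up plane coefficients c_j and g_j standing in for values that would come from past SVM solutions:

```python
import numpy as np
from scipy.optimize import linprog

def minimize_cutting_plane(intercepts, gradients):
    """Minimize  max_j (c_j + g_j^T p)  over the simplex P via the epigraph LP
      min_{nu, p} nu   s.t.  nu >= c_j + g_j^T p,  p^T e = 1,  0 <= p <= 1.
    intercepts: list of c_j; gradients: list of g_j (one per stored plane)."""
    m = len(gradients[0])
    c = np.zeros(m + 1)
    c[0] = 1.0                                        # objective: minimize nu
    A_ub = np.array([[-1.0, *g] for g in gradients])  # c_j + g_j^T p - nu <= 0
    b_ub = -np.asarray(intercepts, dtype=float)
    A_eq = np.array([[0.0] + [1.0] * m])              # weights sum to one
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, 1.0)] * m        # nu free, 0 <= p <= 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0], res.x[1:]                        # (model minimum, new p)

# two illustrative planes over m = 2 kernel weights:
#   nu >= 1 - p_1  and  nu >= 2 - 2 p_2, with p_1 + p_2 = 1
nu, p = minimize_cutting_plane(intercepts=[1.0, 2.0],
                               gradients=[[-1.0, 0.0], [0.0, -2.0]])
print(nu, p)  # balancing the two planes gives nu = 2/3 at p = (1/3, 2/3)
```

SILP takes the minimizer p directly as the next iterate; the level method instead uses nu only as the lower bound f^i and then projects, as described next.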
It is important to note that the projection step, serving a similar purpose to the regularization term in SD, prevents the new solution x^{i+1} from being too far away from the old one x^i. To demonstrate this point, consider the simple example min_x {f(x) = [x]^2 : x ∈ [−4, 4]}. Assume x^0 = −3 is the initial solution. The cutting plane model at x^0 is g^0(x) = 9 − 6(x + 3). The optimal solution minimizing g^0(x) is x̂^1 = 4. If we directly take x̂^1 as the new solution, as SILP does, we find it is significantly worse than x^0 in terms of [x]^2. The level method alleviates this problem by projecting x^0 = −3 onto the level set L^0 = {x : g^0(x) ≤ 0.9[x^0]^2 + 0.1 g^0(x̂^1), −4 ≤ x ≤ 4}, where λ = 0.9. It is easy to verify that the projection of x^0 onto L^0 is x^1 = −2.3, which significantly reduces the objective function f(x) compared with x^0.

3.2 Extension of the Level Method to MKL

We now extend the level method, which was originally designed for optimizing non-smooth functions, to convex-concave optimization. First, since f(p, α) is convex in p and concave in α, according to the von Neumann lemma, for any optimal (saddle point) solution (p*, α*) we have

f(p, α*) ≥ f(p*, α*) ≥ f(p*, α), where f(p*, α*) = min_{p∈P} f(p, α*) = max_{α∈Q} f(p*, α).   (4)

This observation motivates us to design an MKL algorithm that iteratively updates both the lower and the upper bounds for f(p, α) in order to find the saddle point. To apply the level method, we first construct the cutting plane model. Let {p^j}^i_{j=1} denote the solutions for p obtained in the last i iterations.
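Returning to the one-dimensional example in Section 3.1, the projection onto the level set can be checked numerically; the sketch below reproduces the numbers from the text (λ = 0.9, x^0 = −3):

```python
# f(x) = x^2 on [-4, 4], initial solution x0 = -3, lambda = 0.9 (as in the text)
lam = 0.9
f = lambda x: x * x
x0 = -3.0
g0 = lambda x: f(x0) + 2 * x0 * (x - x0)     # cutting plane at x0: 9 - 6(x + 3)

x_hat = 4.0                                  # minimizer of g0 over [-4, 4]
upper = f(x0)                                # best objective value so far:   9
lower = g0(x_hat)                            # cutting-plane minimum:       -33
level = lam * upper + (1 - lam) * lower      # 0.9 * 9 + 0.1 * (-33) =     4.8

# level set L0 = {x in [-4, 4] : g0(x) <= level}; since g0 has slope -6,
# g0(x) <= level  iff  x >= x0 + (level - f(x0)) / (2 * x0)  =  -2.3
left_end = x0 + (level - f(x0)) / (2 * x0)
x1 = min(max(x0, left_end), 4.0)             # project x0 onto L0
print(x1, f(x1), f(x_hat))                   # -2.3, 5.29, 16.0 (up to rounding)
```

The projected point x^1 = −2.3 decreases f from 9 to 5.29, while the SILP-style jump to x̂^1 = 4 would have increased it to 16.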
Let α^j = arg max_{α∈Q} f(p^j, α) denote the optimal solution that maximizes f(p^j, α). We construct a cutting plane model g^i(p) as follows:

g^i(p) = max_{1≤j≤i} f(p, α^j).   (5)

We have the following proposition for the cutting plane model g^i(p).
Proposition 1. For any p ∈ P, we have (a) g^{i+1}(p) ≥ g^i(p), and (b) g^i(p) ≤ max_{α∈Q} f(p, α).

Next, we construct both the lower and the upper bounds for the optimal value f(p*, α*). We define two quantities f^i and f̄^i as follows:

f^i = min_{p∈P} g^i(p)  and  f̄^i = min_{1≤j≤i} f(p^j, α^j).   (6)

The following theorem shows that {f^j}^i_{j=1} and {f̄^j}^i_{j=1} provide a series of increasingly tight bounds for f(p*, α*).
Theorem 1. We have the following properties for {f^j}^i_{j=1} and {f̄^j}^i_{j=1}: (a) f^i ≤ f(p*, α*) ≤ f̄^i, (b) f̄^1 ≥ f̄^2 ≥ . . . ≥ f̄^i, and (c) f^1 ≤ f^2 ≤ . . . ≤ f^i.

Proof. First, since g^i(p) ≤ max_{α∈Q} f(p, α) for any p ∈ P, we have

f^i = min_{p∈P} g^i(p) ≤ min_{p∈P} max_{α∈Q} f(p, α) = f(p*, α*).

Second, since f(p^j, α^j) = max_{α∈Q} f(p^j, α), we have

f̄^i = min_{1≤j≤i} f(p^j, α^j) = min_{p∈{p^1,...,p^i}} max_{α∈Q} f(p, α) ≥ min_{p∈P} max_{α∈Q} f(p, α) = f(p*, α*).

Combining the above results, we have (a) in the theorem. It is easy to verify (b) and (c).

We furthermore define the gap Δ_i as

Δ_i = f̄^i − f^i.

The following corollary indicates that the gap Δ_i can be used to measure the sub-optimality of the solutions p^i and α^i.
Corollary 2. (a) Δ_j ≥ 0, j = 1, . . . , i; (b) Δ_1 ≥ Δ_2 ≥ . . . ≥ Δ_i; (c) |f(p^j, α^j) − f(p*, α*)| ≤ Δ_i.

It is easy to verify these three properties of Δ_i in the above corollary using the results of Theorem 1. In the third step, we construct the level set L^i using the estimated bounds f̄^i and f^i as follows:

L^i = {p ∈ P : g^i(p) ≤ ℓ_i = λ f̄^i + (1 − λ) f^i},   (7)

where λ ∈ (0, 1) is a predefined constant. The new solution, denoted by p^{i+1}, is computed as the projection of p^i onto the level set L^i, which is equivalent to solving the following optimization problem:

p^{i+1} = arg min_p {‖p − p^i‖²_2 : p ∈ P, f(p, α^j) ≤ ℓ_i, j = 1, . . . , i}.   (8)

Although the projection is a quadratic programming problem, it can often be solved efficiently because its solution is likely to be the projection onto one of the hyperplanes of the polyhedron L^i. In other words, only very few of the linear constraints of L^i are active; most of them are inactive. This sparse nature usually leads to a significant speedup of the QP, similar to the situation for SVM solvers. As we argued in the last subsection, by means of the projection, we on the one hand ensure that p^{i+1} is not very far away from p^i, and on the other hand ensure that significant progress is made in terms of g^i(p) when the solution is updated from p^i to p^{i+1}. Note that the projection step in the level method saves the effort of searching for the optimal step size in SD, which is computationally expensive, as will be revealed later.
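Steps (6)-(8) can be sketched on a toy instance: given stored cutting planes c_j + g_j⊤p and bound values f̄^i and f^i, form the level ℓ_i and project p^i onto L^i. The plane coefficients and bound values below are made up for illustration, and a general-purpose NLP solver (scipy's SLSQP) stands in for the dedicated QP solver (MOSEK) used in the paper's experiments:

```python
import numpy as np
from scipy.optimize import minimize

def project_to_level_set(p_i, intercepts, gradients, level):
    """Eq. (8): p_{i+1} = argmin ||p - p_i||_2^2  s.t.  p in the simplex and
    f(p, alpha_j) = c_j + g_j^T p <= level for every stored cutting plane."""
    cons = [{'type': 'eq', 'fun': lambda p: p.sum() - 1.0}]
    for c_j, g_j in zip(intercepts, gradients):
        cons.append({'type': 'ineq',   # scipy convention: fun(p) >= 0
                     'fun': lambda p, c=c_j, g=np.asarray(g_j): level - c - g @ p})
    res = minimize(lambda p: np.sum((p - p_i) ** 2), p_i, method='SLSQP',
                   bounds=[(0.0, 1.0)] * len(p_i), constraints=cons)
    return res.x

lam = 0.9
upper, lower = 0.9, 2.0 / 3.0             # illustrative values of the bounds in (6)
level = lam * upper + (1 - lam) * lower   # Eq. (7)
p_new = project_to_level_set(np.array([0.5, 0.5]),      # current p^i
                             intercepts=[1.0, 2.0],
                             gradients=[[-1.0, 0.0], [0.0, -2.0]],
                             level=level)
print(p_new)  # the nearest point of L^i to p^i; one plane is active, one is not
```

On this instance only the second plane is active at the solution, illustrating the sparsity argument above: the projection lands on a single hyperplane of the polyhedron L^i.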
We summarize the steps of the extended level method in Algorithm 2.

Algorithm 2 The Level Method for Multiple Kernel Learning
1: Initialize p^0 = e/m and i = 0
2: repeat
3:   Solve the dual problem of SVM with K = Σ_{j=1}^m p^i_j K_j to obtain the optimal solution α^i
4:   Construct the cutting plane model g^i(p) in (5)
5:   Calculate the lower bound f^i and the upper bound f̄^i in (6), and the gap Δ_i
6:   Compute the projection of p^i onto the level set L^i by solving the optimization problem in (8)
7:   Update i = i + 1
8: until Δ_i ≤ ε

Finally, we discuss the convergence behavior of the level method. In general, convergence is guaranteed because the gap Δ_i, which bounds the absolute difference between f(p*, α*) and f(p^i, α^i), monotonically decreases through the iterations. The following theorem shows the convergence rate of the level method when applied to multiple kernel learning.
Theorem 3. To obtain a solution p that satisfies the stopping criterion, i.e., |max_{α∈Q} f(p, α) − f(p*, α*)| ≤ ε, the maximum number of iterations N that the level method requires is bounded as N ≤ 2c(λ)L²/ε², where c(λ) = 1/((1 − λ)²λ(2 − λ)) and L = (1/2)√(mn) C² max_{1≤i≤m} Λ_max(K_i). The operator Λ_max(M) computes the maximum eigenvalue of the matrix M.

Due to space limitation, the proof of Theorem 3 can be found in the long version of this paper. Theorem 3 tells us that the convergence rate of the level method is O(1/ε²). It is important to note that, according to Information Based Complexity (IBC) theory, given a function family F(L) with a fixed Lipschitz constant L, O(1/ε²) is almost the optimal convergence rate that can be achieved by any optimization method based on a black-box first-order oracle.
In other words, no matter which optimization method is used, there always exists a function f(·) ∈ F(L) such that the convergence rate is O(1/ε²) as long as the optimization method is based on a black-box first-order oracle. More details can be found in [8, 6].

4 Experiments

We conduct experiments to evaluate the efficiency of the proposed algorithm for MKL in contrast with SILP and SD, the two state-of-the-art algorithms for MKL.

4.1 Experimental Setup

We follow the settings in [10] to construct the base kernel matrices, i.e.,

• Gaussian kernels with 10 different widths ({2^−3, 2^−2, . . . , 2^6}) on all features and on each single feature;
• Polynomial kernels of degree 1 to 3 on all features and on each single feature.

Table 1: The performance comparison of three MKL algorithms. Here n and m denote the size of training samples and the number of kernels, respectively.

Dataset (n, m)          Method   Time(s)            Accuracy(%)    #Kernel
Iono (n=175, m=442)     SD       33.5 ± 11.6        92.1 ± 2.0     26.9 ± 4.0
                        SILP     1161.0 ± 344.2     92.0 ± 1.9     24.4 ± 3.4
                        Level    7.1 ± 4.3          92.1 ± 1.9     25.4 ± 3.9
Breast (n=342, m=117)   SD       47.4 ± 8.9         96.6 ± 0.9     13.1 ± 1.7
                        SILP     54.2 ± 9.4         96.6 ± 0.8     10.6 ± 1.1
                        Level    4.6 ± 1.0          96.6 ± 0.8     13.3 ± 1.5
Pima (n=384, m=117)     SD       39.4 ± 8.8         76.9 ± 1.9     16.6 ± 2.2
                        SILP     62.0 ± 15.2        76.9 ± 2.1     12.0 ± 1.8
                        Level    9.1 ± 1.6          76.9 ± 2.1     17.6 ± 2.6
Sonar (n=104, m=793)    SD       60.1 ± 29.6        79.1 ± 4.5     39.8 ± 3.9
                        SILP     1964.3 ± 68.4      79.3 ± 4.2     34.2 ± 2.6
                        Level    24.9 ± 10.6        79.0 ± 4.7     38.6 ± 4.1
Wpbc (n=198, m=442)     SD       7.8 ± 2.4          77.0 ± 2.9     19.5 ± 2.8
                        SILP     142.0 ± 122.3      76.9 ± 2.8     17.2 ± 2.2
                        Level    5.3 ± 1.3          76.9 ± 2.9     20.3 ± 2.6
Heart (n=135, m=182)    SD       4.7 ± 2.8          82.2 ± 2.2     17.5 ± 1.8
                        SILP     79.2 ± 38.1        82.2 ± 2.0     15.2 ± 1.5
                        Level    2.1 ± 0.4          82.2 ± 2.1     18.6 ± 1.9
Vote (n=218, m=205)     SD       23.7 ± 9.7         95.7 ± 1.0     14.0 ± 3.6
                        SILP     26.3 ± 12.4        95.7 ± 1.0     10.6 ± 2.6
                        Level    4.1 ± 1.3          95.7 ± 1.0     13.8 ± 2.6
Wdbc (n=285, m=403)     SD       122.9 ± 38.2       96.7 ± 0.8     16.6 ± 3.2
                        SILP     146.3 ± 48.3       96.5 ± 0.9     12.9 ± 2.3
                        Level    15.5 ± 7.5         96.7 ± 0.8     15.6 ± 3.0

Each base kernel matrix is normalized to unit trace. The experiments are conducted on a PC with a 3.2GHz CPU and 2GB memory. According to the above scheme of constructing base kernel matrices, we select a batch of UCI data sets, with the cardinality and dimension allowed by the memory limit of the PC, from the UCI repository for evaluation. We repeat all the algorithms 20 times for each data set. In each run, 50% of the examples are randomly selected as the training data and the remaining data are used for testing. The training data are normalized to have zero mean and unit variance, and the test data are then normalized using the mean and variance of the training data. The regularization parameter C in SVM is set to 100, as our focus is to evaluate the computational time, as justified in [10]. For a fair comparison among the MKL algorithms, we adopt the same stopping criterion for all three algorithms under comparison: we adopt the duality gap criterion used in [10], i.e., max_{1≤i≤m} (α ∘ y)⊤K_i(α ∘ y) − (α ∘ y)⊤(Σ_{j=1}^m p_j K_j)(α ∘ y), and stop the algorithm when the criterion is less than 0.01 or the number of iterations is larger than 500. We empirically initialize the parameter λ to 0.9 and increase it to 0.99 when the ratio Δ_i/ℓ_i is less than 0.01 for all experiments, since a larger λ accelerates the projection when the solution is close to the optimal one. We use the SimpleMKL toolbox [10] to implement the SILP and SD methods. The linear programming in the SILP method and the auxiliary subproblems in the level method are solved using a general optimization toolbox, MOSEK (http://www.mosek.com). The toolbox for the level method can be downloaded from http://www.cse.cuhk.edu.hk/~zlxu/toolbox/level_mkl.html.

4.2 Experimental Results

We report the following performance measures: prediction accuracy, training time, and the average number of kernels selected. From Table 1, we observe that all algorithms achieve almost the same prediction accuracy under the same stopping criterion. This is not surprising because all algorithms essentially solve the same optimization problem. Regarding computational efficiency, we observe that the time cost of the SILP approach is the highest among the three MKL algorithms. For datasets "Iono" and "Sonar", the SILP method consumes more than 30 times the computational cycles of the other two methods. We also observe that the level method is the most efficient of the three methods in comparison. To obtain a better picture of the computational efficiency of the proposed level method, we compute the time-saving ratio, as shown in Table 2. We observe that the level method saves 91.9% of computational time on average when compared with the SILP method, and 70.3% when compared with the SD method.

In order to see more details of each optimization algorithm, we plot the logarithm values of the MKL objective function to base 10 against time in Figure 1. Due to space limitation, we randomly choose only three datasets, "Iono", "Breast", and "Pima", as examples. It is interesting to find that the level method converges overwhelmingly faster than the other two methods.
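The duality-gap stopping criterion described in the setup above can be written in a few lines of numpy; the kernels, α, y, and p below are synthetic, chosen only to exercise the formula:

```python
import numpy as np

def mkl_duality_gap(p, alpha, y, kernels):
    """Stopping criterion from [10]:
       max_i (a o y)^T K_i (a o y)  -  (a o y)^T (sum_j p_j K_j) (a o y);
    the algorithms stop once this drops below 0.01."""
    ay = alpha * y
    per_kernel = np.array([ay @ K @ ay for K in kernels])
    combined = sum(pj * Kj for pj, Kj in zip(p, kernels))
    return per_kernel.max() - ay @ combined @ ay

# synthetic instance: two PSD kernels on 4 points
rng = np.random.RandomState(1)
A, B = rng.randn(4, 4), rng.randn(4, 4)
kernels = [A @ A.T, B @ B.T]
y = np.array([1.0, -1.0, 1.0, -1.0])
alpha = rng.rand(4)

# with all weight on the kernel maximizing (a o y)^T K_i (a o y), the gap is 0
ay = alpha * y
best = int(np.argmax([ay @ K @ ay for K in kernels]))
p = np.zeros(2)
p[best] = 1.0
gap = mkl_duality_gap(p, alpha, y, kernels)
print(gap)
```

For any p in the simplex the criterion is non-negative, since the combined term is a convex combination of the per-kernel terms; it vanishes exactly when p puts all weight on the maximizing kernels.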
The efficiency of the level method arises from two aspects: (a) the cutting plane model utilizes the computational results of all iterations and therefore boosts the search efficiency, and (b) the projection to the level sets ensures the stability of the new solution. A detailed analysis of the SD method reveals that a large number of function evaluations are consumed in order to compute the optimal stepsize via a line search. Note that in convex-concave optimization, every function evaluation in the line search of SD requires solving an SVM problem. As an example, we found that for dataset "Iono", although SD and the level method require similar numbers of iterations, SD calls the SVM solver 1231 times on average, while the level method only calls it 47 times. For the SILP method, the high computational cost is mainly due to the oscillation of solutions. This instability leads to very slow convergence when the solution is close to the optimal one, as indicated by the long tail of SILP in Figure 1. The instability of SILP is further confirmed by the examination of kernel weights, as shown below.

To understand the evolution of kernel weights (i.e., p), we plot the evolution curves of the five largest kernel weights for datasets "Iono", "Breast", and "Pima" in Figure 2. We observe that the values of p computed by the SILP method are the most unstable, due to oscillation of the solutions to the cutting plane models. Although the unstable-solution problem is to some degree alleviated by the SD method, we still clearly observe that p fluctuates significantly through iterations. In contrast, for the proposed level method, the values of p change smoothly through iterations. We believe that the stability of the level method is mainly due to the accurate estimation of bounds as well as the regularization of the projection to the level sets.
This observation also sheds light on why the level method can be more efficient than the SILP and the SD methods.

Table 2: Time-saving ratio of the level method over the SILP and the SD methods

                 Iono   Breast  Pima   Sonar  Wpbc   Heart  Vote   Wdbc   Average
Level/SD (%)     78.9   90.4    77.0   58.7   54.7   32.5   82.8   87.4   70.3
Level/SILP (%)   99.4   91.6    85.4   98.7   97.3   88.7   84.5   89.4   91.9

Figure 1: Evolution of objective values over time (seconds) for datasets "Iono", "Breast", and "Pima". The objective values are plotted on a logarithm scale (base 10) for better comparison. Only parts of the evolution curves are plotted for SILP due to their long tails.

5 Conclusion and Future Work

In this paper, we propose an extended level method to efficiently solve the multiple kernel learning problem. In particular, the level method overcomes the drawbacks of both the SILP method and the SD method for MKL.
Unlike the SD method, which only utilizes the gradient information of the current solution, the level method utilizes the gradients of all the solutions obtained in past iterations; meanwhile, unlike the SILP method, which updates the solution based only on the cutting plane model, the level method introduces a projection step to regularize the updated solution. It is the employment of the projection step that guarantees finding an updated solution that, on the one hand, is close to the existing one and, on the other hand, significantly reduces the objective function. Our experimental results have shown that the level method is able to greatly reduce the computational time of MKL over both the SD method and the SILP method. For future work, we plan to find a scheme to adaptively set the value of λ in the level method and to apply the level method to other tasks, such as one-class classification, multi-class classification, and regression.

Acknowledgement
The work was supported by the National Science Foundation (IIS-0643494), the National Institute of Health (1R01GM079688-01), and the Research Grants Council of Hong Kong (CUHK4150/07E and CUHK4125/07).

References

[1] F. R. Bach. Consistency of the group Lasso and multiple kernel learning.
Journal of Machine Learning Research, 9:1179-1225, 2008.
Figure 2: The evolution curves of the five largest kernel weights for datasets "Iono", "Breast", and "Pima" computed by the three MKL algorithms.

[2] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.

[3] J. Bonnans, J. Gilbert, C. Lemaréchal, and C. Sagastizábal. Numerical Optimization: Theoretical and Practical Aspects. Springer-Verlag, Berlin, 2nd ed., 2006.

[4] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola. On kernel-target alignment. In NIPS 13, pages 367-373, 2001.

[5] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 2004.

[6] C. Lemaréchal, A. Nemirovski, and Y. Nesterov. New variants of bundle methods. Mathematical Programming, 69(1), 1995.

[7] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6, 2005.

[8] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons Ltd, 1983.

[9] C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6, 2005.

[10] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Technical Report HAL-00218338, INRIA, 2008.

[11] A. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. In NIPS 20, pages 1377-1384, 2007.

[12] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 2006.

[13] J. Ye, J. Chen, and S. Ji. Discriminant kernel and regularization parameter learning via semidefinite programming.
In ICML, 2007.\n\n[14] A. Zien and C. S. Ong. Multiclass multiple kernel learning. In ICML, 2007.\n\n\f", "award": [], "sourceid": 143, "authors": [{"given_name": "Zenglin", "family_name": "Xu", "institution": null}, {"given_name": "Rong", "family_name": "Jin", "institution": null}, {"given_name": "Irwin", "family_name": "King", "institution": null}, {"given_name": "Michael", "family_name": "Lyu", "institution": null}]}