A DC-programming algorithm for kernel selection

ABSTRACT
We address the problem of learning a kernel for a given supervised learning task. Our approach consists of searching within the convex hull of a prescribed set of basic kernels for one that minimizes a convex regularization functional. A distinctive feature of this approach compared with others in the literature is that the number of basic kernels may be infinite; we require only that they be continuously parameterized. For example, the basic kernels could be isotropic Gaussians with variance in a prescribed interval, or even Gaussians parameterized by multiple continuous parameters. Our work builds upon a formulation involving a minimax optimization problem and a recently proposed greedy algorithm for learning the kernel. Although this optimization problem is not convex, it belongs to the larger class of DC (difference of convex functions) programs. We therefore apply recent results from DC optimization theory to derive a new algorithm for learning the kernel. Experimental results on benchmark data sets show that this algorithm outperforms a previously proposed method.
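To make the setup concrete, the sketch below illustrates learning a convex combination of Gaussian kernels at toy scale. It is not the DC algorithm of this paper: as simplifications, it discretizes the continuous variance interval into a finite grid, uses kernel ridge regression (square loss) as the underlying learner, and updates the kernel weights with a conditional-gradient (Frank-Wolfe) step in the spirit of the greedy algorithm mentioned above. All names here (`gaussian_gram`, `learn_kernel_weights`, `sigma2_grid`, `mu`) are illustrative, not taken from the paper.

```python
import numpy as np

def gaussian_gram(X, sigma2):
    """Gram matrix of the isotropic Gaussian kernel with variance sigma2."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma2))

def learn_kernel_weights(X, y, sigma2_grid, mu=0.1, n_iter=100):
    """Minimize Q(lam) = mu * y^T (K(lam) + mu*I)^{-1} y over the simplex,
    where K(lam) = sum_j lam_j * G_j, by conditional-gradient steps."""
    grams = [gaussian_gram(X, s2) for s2 in sigma2_grid]
    m, n = len(grams), X.shape[0]
    lam = np.full(m, 1.0 / m)                       # uniform starting weights
    for t in range(n_iter):
        K = sum(l * G for l, G in zip(lam, grams))
        c = np.linalg.solve(K + mu * np.eye(n), y)  # ridge coefficients
        # dQ/dlam_j = -mu * c^T G_j c, so the best descent vertex of the
        # simplex is the basic kernel with the largest value of c^T G_j c.
        scores = np.array([c @ G @ c for G in grams])
        vertex = np.zeros(m)
        vertex[np.argmax(scores)] = 1.0
        gamma = 2.0 / (t + 2.0)                     # standard FW step size
        lam = (1.0 - gamma) * lam + gamma * vertex
    return lam, c
```

For instance, `learn_kernel_weights(X, y, np.logspace(-2, 2, 20))` would search over twenty Gaussian variances between 10^-2 and 10^2. With the square loss, the regularization functional reduces to Q(lam) = mu * y^T (K(lam) + mu*I)^{-1} y, which is convex in the kernel, so each greedy step adds weight to a single basic kernel; the DC algorithm studied in the paper instead handles the non-convex minimax problem over the continuous kernel parameters directly, without fixing a grid in advance.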