ABSTRACT
We describe and analyze two stochastic methods for l1-regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration, while the second method updates the entire weight vector but uses only a single training example at each iteration. In both methods, the feature or example is chosen uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the problem size is large. We demonstrate the advantage of the stochastic methods in experiments on synthetic and natural data sets.
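As a rough illustration of the first method (updating a single, uniformly chosen feature per iteration), the following is a minimal sketch of stochastic coordinate descent for the Lasso with squared loss. The function names, the unit step size, and the column-scaling assumption (||X[:, j]||^2 / m <= 1) are illustrative choices of ours, not the paper's exact algorithm or constants.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink x toward zero by t (the proximal step for the l1 penalty)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def stochastic_coordinate_descent_lasso(X, y, lam, n_iters=10000, seed=0):
    """Sketch: minimize (1/2m)||Xw - y||^2 + lam * ||w||_1 by updating one
    uniformly random coordinate per iteration.

    Assumes the columns of X are scaled so that ||X[:, j]||^2 / m <= 1,
    which makes a unit step size valid for each coordinate.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    residual = X @ w - y              # maintained incrementally
    for _ in range(n_iters):
        j = rng.integers(d)           # feature chosen uniformly at random
        g = X[:, j] @ residual / m    # partial derivative of the squared loss
        w_j_new = soft_threshold(w[j] - g, lam)
        residual += (w_j_new - w[j]) * X[:, j]
        w[j] = w_j_new
    return w
```

Maintaining the residual incrementally keeps the cost of each iteration at O(m), which is what makes per-coordinate updates cheap relative to a full (deterministic) pass over all features.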
- Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31, 167--175. Google ScholarDigital Library
- Bottou, L. (Web Page). Stochastic gradient descent examples. http://leon.bottou.org/projects/sgd.Google Scholar
- Bottou, L., & Bousquet, O. (2008). The tradeoffs of large scale learning. Advances in Neural Information Processing Systems 20 (pp. 161--168).Google Scholar
- Bottou, L., & LeCunn, Y. (2005). On-line learning for very large datasets. Appl. Stoch. Model. Bus. and Ind., 21, 137--151. Google ScholarDigital Library
- Clarkson, K. (2008). Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms (pp. 922--931). Google ScholarDigital Library
- Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge University Press. Google ScholarDigital Library
- Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the l 1-ball for learning in high dimensions. Proceedings of the 25th International Conference on Machine Learning (pp. 272--279). Google ScholarDigital Library
- Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Ann. Statist., 32, 407--499.Google ScholarCross Ref
- Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval Res. Logist. Quart., 3, 95--110.Google ScholarCross Ref
- Friedman, J., Hastie, T., & Tibshirani, R. (2008). Regularized paths for generalized linear models via coordinate descent (Technical Report). Department of Statistics, Stanford University.Google Scholar
- Genkin, A., Lewis, D., & Madigan, D. (2007). Large-scale Bayesian logistic regression for text categorization. Technometrics, 49, 291--304.Google ScholarCross Ref
- Gentile, C. (2003). The robustness of the p-norm algorithms. Machine Learning, 53, 265--299. Google ScholarDigital Library
- Grove, A. J., Littlestone, N., & Schuurmans, D. (2001). General convergence results for linear discriminant updates. Machine Learning, 43, 173--210. Google ScholarDigital Library
- Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1--64. Google ScholarDigital Library
- Koh, K., Kim, S., & Boyd, S. (2007). An interior-point method for large-scale l 1-regularized logistic regression. Journal of Machine Learning Research, 8, 1519--1555. Google ScholarDigital Library
- Langford, J., Li, L., & Zhang, T. (2009). Sparse online learning via truncated gradient. Advances in Neural Information Processing Systems 21 (pp. 905--912).Google Scholar
- Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285--318. Google ScholarCross Ref
- Luo, Z., & Tseng, P. (1992). On the convergence of coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl., 72, 7--35. Google ScholarDigital Library
- Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Proceedings of the 24th International Conference on Machine Learning (pp. 807--814). Google ScholarDigital Library
- Shalev-Shwartz, S., & Srebro, N. (2008). SVM optimization: Inverse dependence on training set size. Proceedings of the 25th International Conference on Machine Learning (pp. 928--935). Google ScholarDigital Library
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., 58, 267--288.Google ScholarCross Ref
- Tseng, P., & Yun, S. (2009). A block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. J. Optim. Theory Appl., 140, 513--535.Google ScholarDigital Library
- Wu, T. T., & Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2, 224--244.Google ScholarCross Ref
- Zhang, T. (2003). Sequential greedy approximation for certain convex optimization problems. IEEE Transaction on Information Theory, 49, 682--691. Google ScholarDigital Library
- Zhang, T., & Oles, F. J. (2001). Text categorization based on regularized linear classification methods. Information Retrieval, 4, 5--31. Google ScholarDigital Library