ABSTRACT
A new "herding" algorithm is proposed which directly converts observed moments into a sequence of pseudo-samples. The pseudo-samples respect the moment constraints and may be used to estimate (unobserved) quantities of interest. The procedure allows us to sidestep the usual approach of first learning a joint model (which is intractable) and then sampling from that model (which can easily get stuck in a local mode). Moreover, the algorithm is fully deterministic, avoiding random number generation) and does not need expensive operations such as exponentiation.