DOI: 10.1145/3289600.3291011

Asynchronous Training of Word Embeddings for Large Text Corpora

Published: 30 January 2019

Abstract

Word embeddings are a powerful approach for analyzing language and have been widely adopted in numerous information retrieval and text mining tasks. Training embeddings over huge corpora is computationally expensive because the input is typically processed sequentially and parameters are updated synchronously. Distributed architectures proposed for asynchronous training either focus on scaling vocabulary size and dimensionality or suffer from expensive synchronization latencies. In this paper, we propose a scalable approach that instead partitions the input space, so that training scales to massive text corpora without sacrificing the quality of the embeddings. Our training procedure involves no parameter synchronization except a final sub-model merge phase, which typically executes in a few minutes. Our distributed training scales seamlessly with corpus size: models trained by our procedure achieve comparable, and in some cases up to 45% better, performance on a variety of NLP benchmarks while requiring only 1/10 of the time taken by the baseline approach. Finally, we show that our method is robust to words missing from individual sub-models and can effectively reconstruct their representations.
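
The training recipe described in the abstract (partition the corpus, train independent sub-models with no cross-worker synchronization, merge the sub-models in a single final phase) can be illustrated with a short sketch. This is not the authors' implementation: the per-partition trainer relies on gensim's Word2Vec (assuming the gensim 4.x API), the helper names train_submodel, procrustes_align and merge are illustrative, and the merge step uses a simple orthogonal Procrustes alignment followed by averaging, whereas the paper's own merge is based on alternating linear regression (see the author tags below).

    # Illustrative sketch, not the paper's implementation: train one word2vec
    # sub-model per corpus partition with no synchronization, then align and
    # merge the sub-models in a single final phase.
    import numpy as np
    from gensim.models import Word2Vec  # assumption: gensim 4.x API

    def train_submodel(sentences, dim=100):
        """Train an independent sub-model on one corpus partition (illustrative helper)."""
        model = Word2Vec(sentences, vector_size=dim, window=5, min_count=5,
                         sg=1, workers=4, epochs=5)
        return {w: model.wv[w] for w in model.wv.key_to_index}

    def procrustes_align(src, ref):
        """Rotate the vectors in src so they best match ref on the shared vocabulary."""
        shared = sorted(set(src) & set(ref))
        A = np.stack([src[w] for w in shared])   # source sub-model
        B = np.stack([ref[w] for w in shared])   # reference sub-model
        U, _, Vt = np.linalg.svd(A.T @ B)        # orthogonal Procrustes solution
        R = U @ Vt
        return {w: v @ R for w, v in src.items()}

    def merge(submodels):
        """Align every sub-model to the first one and average the aligned vectors."""
        reference, rest = submodels[0], submodels[1:]
        aligned = [reference] + [procrustes_align(m, reference) for m in rest]
        vocab = set().union(*aligned)
        # A word missing from some sub-models inherits the average of its
        # vectors from the aligned sub-models that do contain it.
        return {w: np.mean([m[w] for m in aligned if w in m], axis=0)
                for w in vocab}

    # Each partition can be trained on a separate machine; the only coordination
    # point is the final merge, which touches no per-worker training state.
    # partitions = [sentences_0, sentences_1, ...]  # lists of tokenized sentences
    # merged = merge([train_submodel(p) for p in partitions])

In this sketch, a word absent from some sub-models simply takes the average of its vectors from the aligned sub-models that do contain it, which is one simple way to remain robust to missing words; the paper's reconstruction procedure may differ.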


Cited By

  • (2025) High Performance Computing for Auto Supervised Machine Learning Training: Parallel-Distributed Implementation of the Word2Vec Algorithm for Training Word Embeddings. High Performance Computing, pp. 36-51. https://doi.org/10.1007/978-3-031-80084-9_3. Online publication date: 14-Feb-2025.
  • (2022) HET. Proceedings of the VLDB Endowment, 15(2), 312-320. https://doi.org/10.14778/3489496.3489511. Online publication date: 4-Feb-2022.
  • (2019) Generating Distributed Representation of User Movement for Extracting Detour Spots. Proceedings of the 11th International Conference on Management of Digital EcoSystems, pp. 250-255. https://doi.org/10.1145/3297662.3365826. Online publication date: 12-Nov-2019.


Published In

WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining
January 2019
874 pages
ISBN:9781450359405
DOI:10.1145/3289600
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 January 2019

Author Tags

  1. alternating linear regression
  2. scalable training
  3. word embeddings

Qualifiers

  • Research-article

Funding Sources

  • ALEXANDRIA
  • SoBigData

Conference

WSDM '19

Acceptance Rates

WSDM '19 Paper Acceptance Rate: 84 of 511 submissions, 16%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%

Article Metrics

  • Downloads (Last 12 months): 6
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 19 Feb 2025
