DOI: 10.1145/3289600.3291011

Asynchronous Training of Word Embeddings for Large Text Corpora

Published: 30 January 2019

Abstract

Word embeddings are a powerful approach for analyzing language and have been widely adopted in numerous information retrieval and text mining tasks. Training embeddings over huge corpora is computationally expensive because the input is typically processed sequentially and parameters are updated synchronously. Distributed architectures proposed for asynchronous training either focus on scaling vocabulary size and dimensionality or suffer from expensive synchronization latencies. In this paper, we propose a scalable approach that instead partitions the input space, so that training scales to massive text corpora without sacrificing the quality of the embeddings. Our training procedure involves no parameter synchronization except a final sub-model merge phase, which typically executes in a few minutes. Our distributed training scales seamlessly with corpus size: models trained by our procedure achieve comparable, and in some cases up to 45% better, performance on a variety of NLP benchmarks while requiring only 1/10 of the time taken by the baseline approach. Finally, we show that our method is robust to words missing from individual sub-models and can effectively reconstruct their representations.
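
The training recipe described in the abstract (partition the corpus, train independent sub-models with no cross-worker synchronization, merge the sub-models in a single final phase) can be illustrated with a short sketch. This is not the authors' implementation: the per-partition trainer relies on gensim's Word2Vec (assuming the gensim 4.x API), the helper names train_submodel, procrustes_align and merge are illustrative, and the merge step uses a simple orthogonal Procrustes alignment followed by averaging, whereas the paper's own merge is based on alternating linear regression (see the author tags below).

    # Illustrative sketch, not the paper's implementation: train one word2vec
    # sub-model per corpus partition with no synchronization, then align and
    # merge the sub-models in a single final phase.
    import numpy as np
    from gensim.models import Word2Vec  # assumption: gensim 4.x API

    def train_submodel(sentences, dim=100):
        """Train an independent sub-model on one corpus partition (illustrative helper)."""
        model = Word2Vec(sentences, vector_size=dim, window=5, min_count=5,
                         sg=1, workers=4, epochs=5)
        return {w: model.wv[w] for w in model.wv.key_to_index}

    def procrustes_align(src, ref):
        """Rotate the vectors in src so they best match ref on the shared vocabulary."""
        shared = sorted(set(src) & set(ref))
        A = np.stack([src[w] for w in shared])   # source sub-model
        B = np.stack([ref[w] for w in shared])   # reference sub-model
        U, _, Vt = np.linalg.svd(A.T @ B)        # orthogonal Procrustes solution
        R = U @ Vt
        return {w: v @ R for w, v in src.items()}

    def merge(submodels):
        """Align every sub-model to the first one and average the aligned vectors."""
        reference, rest = submodels[0], submodels[1:]
        aligned = [reference] + [procrustes_align(m, reference) for m in rest]
        vocab = set().union(*aligned)
        # A word missing from some sub-models inherits the average of its
        # vectors from the aligned sub-models that do contain it.
        return {w: np.mean([m[w] for m in aligned if w in m], axis=0)
                for w in vocab}

    # Each partition can be trained on a separate machine; the only coordination
    # point is the final merge, which touches no per-worker training state.
    # partitions = [sentences_0, sentences_1, ...]  # lists of tokenized sentences
    # merged = merge([train_submodel(p) for p in partitions])

In this sketch, a word absent from some sub-models simply takes the average of its vectors from the aligned sub-models that do contain it, which is one simple way to remain robust to missing words; the paper's reconstruction procedure may differ.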


Cited By

  • (2025) High Performance Computing for Auto Supervised Machine Learning Training: Parallel-Distributed Implementation of the Word2Vec Algorithm for Training Word Embeddings. High Performance Computing, pp. 36-51. https://doi.org/10.1007/978-3-031-80084-9_3. Online publication date: 14-Feb-2025.
  • (2022) HET. Proceedings of the VLDB Endowment, 15(2), 312-320. https://doi.org/10.14778/3489496.3489511. Online publication date: 4-Feb-2022.
  • (2019) Generating Distributed Representation of User Movement for Extracting Detour Spots. Proceedings of the 11th International Conference on Management of Digital EcoSystems, pp. 250-255. https://doi.org/10.1145/3297662.3365826. Online publication date: 12-Nov-2019.


Published In

WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining
January 2019
874 pages
ISBN:9781450359405
DOI:10.1145/3289600
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 January 2019

Author Tags

  1. alternating linear regression
  2. scalable training
  3. word embeddings

Qualifiers

  • Research-article

Funding Sources

  • ALEXANDRIA
  • SoBigData

Conference

WSDM '19

Acceptance Rates

WSDM '19 Paper Acceptance Rate: 84 of 511 submissions, 16%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%

Article Metrics

  • Downloads (Last 12 months): 6
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 19 Feb 2025
