DOI: 10.1145/2736277.2741627
research-article

Statistically Significant Detection of Linguistic Change

Published: 18 May 2015

ABSTRACT

We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word's meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time.

We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book Ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium.
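The pipeline outlined in the abstract (per-period embeddings, alignment into one coordinate system, a distance-based time series, then change point detection) can be illustrated with a minimal sketch. The abstract does not specify the alignment procedure or the significance test, so this sketch substitutes orthogonal Procrustes for the alignment and a permutation test on the largest mean shift for the detection step; both are assumptions, not the authors' method. All function names (`align`, `displacement_series`, `mean_shift_change_point`) are hypothetical, and each snapshot is assumed to be a NumPy array with one row per word in a shared vocabulary order.

```python
import numpy as np


def align(source, target):
    """Rotate `source` onto `target` with orthogonal Procrustes.

    Assumption: a single orthogonal map is enough to bring two embedding
    spaces into one coordinate system. The abstract does not say this.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return source @ (u @ vt)


def displacement_series(snapshots, word_index):
    """Cosine distance of one word from its position in the first snapshot."""
    base = snapshots[0]
    start = base[word_index]
    series = []
    for emb in snapshots:
        vec = align(emb, base)[word_index]
        cos = vec @ start / (np.linalg.norm(vec) * np.linalg.norm(start))
        series.append(1.0 - cos)
    return np.array(series)


def mean_shift_change_point(series, n_permutations=1000, seed=0):
    """Return (index, p_value) for the largest upward mean shift.

    The p-value is the fraction of shuffled series whose best mean shift
    is at least as large as the observed one (a permutation test, used
    here as a stand-in for the paper's significance procedure).
    """
    rng = np.random.default_rng(seed)

    def best_shift(x):
        # Mean after the split minus mean before it, for every split point.
        shifts = [x[j:].mean() - x[:j].mean() for j in range(1, len(x))]
        j = int(np.argmax(shifts))
        return j + 1, shifts[j]

    index, observed = best_shift(series)
    null = [best_shift(rng.permutation(series))[1] for _ in range(n_permutations)]
    p_value = float(np.mean(np.array(null) >= observed))
    return index, p_value
```

Given a list of embedding matrices ordered by time period, `displacement_series(snapshots, i)` yields the time series for word `i`, and `mean_shift_change_point` reports the time step where its displacement most plausibly jumps, together with an estimate of how likely such a jump is under random reordering.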

Published in: WWW '15: Proceedings of the 24th International Conference on World Wide Web
May 2015
1460 pages
ISBN: 9781450334693

Copyright © 2015. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

Publisher: International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland


Acceptance rates: WWW '15 paper acceptance rate: 131 of 929 submissions (14%). Overall acceptance rate: 1,899 of 8,196 submissions (23%).
