DOI: 10.1145/2513549.2513551
research-article
Open Access

Mining semantics for culturomics: towards a knowledge-based approach

Published: 28 October 2013

ABSTRACT

The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars over recent years. However, initial studies based on these data have been criticized for not referring to relevant work in linguistics and language technology. This paper provides some ideas, thoughts and first steps towards a new culturomics initiative, based this time on Swedish data, which pursues a more knowledge-based approach than previous work in this emerging field. The amount of new Swedish text produced daily, together with older texts being digitized in cultural heritage projects, is growing at an accelerating rate. The volume of text available in digital form has grown far beyond the capacity of human readers, leaving automated semantic processing as the only realistic option for accessing and using the information it contains. The aim of our recently initiated research program is to advance the state of the art in language technology resources and methods for semantic processing of big Swedish text, focusing on theoretical and methodological advances in extracting and correlating information from large volumes of Swedish text using a combination of knowledge-based and statistical methods.

Published in

UnstructureNLP '13: Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing
October 2013, 74 pages
ISBN: 9781450324151
DOI: 10.1145/2513549

Copyright © 2013 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

UnstructureNLP '13 paper acceptance rate: 9 of 12 submissions, 75%. Overall acceptance rate: 9 of 12 submissions, 75%.
