ABSTRACT
The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars in recent years. However, initial studies based on these data have been criticized for not referring to relevant work in linguistics and language technology. This paper presents ideas and first steps towards a new culturomics initiative, based this time on Swedish data, which pursues a more knowledge-based approach than previous work in this emerging field. The amount of new Swedish text produced daily, together with older texts being digitized in cultural heritage projects, is growing at an accelerating rate. The volume of text available in digital form has grown far beyond the capacity of human readers, leaving automated semantic processing as the only realistic option for accessing and using the information it contains. The aim of our recently initiated research program is to advance the state of the art in language technology resources and methods for semantic processing of Big Swedish text, with a focus on theoretical and methodological advances in extracting and correlating information from large volumes of Swedish text using a combination of knowledge-based and statistical methods.
Index Terms
- Mining semantics for culturomics: towards a knowledge-based approach