Mining semantics for culturomics: towards a knowledge-based approach

Authors:
Lars Borin (University of Gothenburg, Gothenburg, Sweden)
Devdatt Dubhashi (Chalmers University of Technology, Gothenburg, Sweden)
Markus Forsberg (University of Gothenburg, Gothenburg, Sweden)
Richard Johansson (University of Gothenburg, Gothenburg, Sweden)
Dimitrios Kokkinakis (University of Gothenburg, Gothenburg, Sweden)
Pierre Nugues (Lund University, Lund, Sweden)

UnstructureNLP '13: Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing, October 2013, Pages 3–10. https://doi.org/10.1145/2513549.2513551

Published: 28 October 2013

ABSTRACT

The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars in recent years. However, initial studies based on these data have been criticized for not referring to relevant work in linguistics and language technology. This paper presents ideas and first steps towards a new culturomics initiative, this time based on Swedish data, which pursues a more knowledge-based approach than previous work in this emerging field. The amount of new Swedish text produced daily, together with older texts being digitized in cultural heritage projects, is growing at an accelerating rate. The volume of text available in digital form has grown far beyond the capacity of human readers, leaving automated semantic processing as the only realistic option for accessing and using the information these texts contain. The aim of our recently initiated research program is to advance the state of the art in language technology resources and methods for semantic processing of Big Swedish text, with a focus on theoretical and methodological advances in extracting and correlating information from large volumes of Swedish text using a combination of knowledge-based and statistical methods.


Published in
UnstructureNLP '13: Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing, October 2013, 74 pages
ISBN: 9781450324151
DOI: 10.1145/2513549
General Chairs: Xiaozhong Liu (Indiana University, USA), Miao Chen (Indiana University, USA), Ying Ding (Indiana University, USA), Min Song (Yonsei University, South Korea)
Copyright © 2013 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
language technology
semantic parsing
Acceptance Rates
UnstructureNLP '13 paper acceptance rate: 9 of 12 submissions, 75%. Overall acceptance rate: 9 of 12 submissions, 75%.