
History by Diversity: Helping Historians search News Archives

Published: 13 March 2016

Abstract

Longitudinal corpora such as newspaper archives are of immense value to historical research, and time, an important factor for historians, strongly influences their search behavior in these archives. When searching for articles published over time, historians prefer to retrieve documents that cover the important aspects from important points in time, which differs from standard search behavior. To support this search strategy, we introduce the notion of a Historical Query Intent to explicitly model a historian's search task, and we define an aspect-time diversification problem over news archives.
We present a novel algorithm, HistDiv, that explicitly models the aspects and important time windows based on a historian's information-seeking behavior. By incorporating temporal priors based on publication times and temporal expressions, we diversify along both the aspect and temporal dimensions. We test our methods by constructing a test collection based on The New York Times Collection, with a workload of 30 manually assessed queries of historical intent. We find that HistDiv outperforms all competitors in subtopic recall with a slight loss in precision. We also present the results of a qualitative user study to determine whether this drop in precision is detrimental to user experience. Our results show that users still preferred HistDiv's ranking.
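To make the evaluation idea concrete, the following is a minimal sketch of subtopic recall at rank k, together with a generic greedy re-ranker that trades relevance against newly covered subtopics. It is not the HistDiv algorithm, whose details are not given in this abstract. The function names, the `trade_off` parameter, and the choice to represent a subtopic as an (aspect, time window) pair are assumptions made purely for illustration.

```python
from typing import Dict, Hashable, List, Set


def subtopic_recall(ranking: List[str],
                    doc_subtopics: Dict[str, Set[Hashable]],
                    k: int) -> float:
    """Fraction of all known subtopics covered by the top-k documents."""
    all_subtopics = set().union(*doc_subtopics.values())
    if not all_subtopics:
        return 0.0
    covered: Set[Hashable] = set()
    for doc in ranking[:k]:
        covered |= doc_subtopics.get(doc, set())
    return len(covered) / len(all_subtopics)


def greedy_diversify(relevance: Dict[str, float],
                     doc_subtopics: Dict[str, Set[Hashable]],
                     k: int,
                     trade_off: float = 0.5) -> List[str]:
    """Generic greedy re-ranking (illustrative only, not HistDiv).

    At each step, pick the document with the best mix of relevance and
    number of subtopics not yet covered by the selected documents.
    """
    selected: List[str] = []
    covered: Set[Hashable] = set()
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def gain(doc: str) -> float:
            new = len(doc_subtopics.get(doc, set()) - covered)
            return (1 - trade_off) * relevance[doc] + trade_off * new
        # Sort first so that ties are broken deterministically by doc id.
        best = max(sorted(candidates), key=gain)
        selected.append(best)
        covered |= doc_subtopics.get(best, set())
        candidates.remove(best)
    return selected


# Hypothetical toy data: subtopics are (aspect, time window) pairs.
relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.5, "d4": 0.4}
doc_subtopics = {
    "d1": {("economy", "1980s")},
    "d2": {("economy", "1980s")},
    "d3": {("politics", "1990s")},
    "d4": {("economy", "1990s"), ("politics", "1980s")},
}

by_relevance = sorted(relevance, key=relevance.get, reverse=True)
diversified = greedy_diversify(relevance, doc_subtopics, k=2)
print(subtopic_recall(by_relevance, doc_subtopics, k=2))  # 0.25
print(subtopic_recall(diversified, doc_subtopics, k=2))   # 0.75
```

On this toy data the diversified top two cover three of the four aspect-time pairs, while the relevance-only top two cover one. This mirrors, in miniature, the trade described above: higher subtopic recall at the cost of placing a less relevant document first.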

Published In

CHIIR '16: Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval
March 2016
400 pages
ISBN:9781450337519
DOI:10.1145/2854946

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. diversity
  2. historian
  3. historical query intent
  4. historical search task
  5. joint diversification
  6. news archives
  7. temporal diversity
  8. test collection

Qualifiers

  • Research-article

Conference

CHIIR '16: Conference on Human Information Interaction and Retrieval
March 13 - 17, 2016
Carrboro, North Carolina, USA

Acceptance Rates

CHIIR '16 Paper Acceptance Rate 23 of 58 submissions, 40%;
Overall Acceptance Rate 55 of 163 submissions, 34%

Cited By

  • (2024) Data Augmentation for Sample Efficient and Robust Document Ranking. ACM Transactions on Information Systems 42:5 (1-29). DOI: 10.1145/3634911. Online publication date: 29-Apr-2024
  • (2023) Is this news article still relevant? Ranking by contemporary relevance in archival search. International Journal on Digital Libraries 25:2 (197-216). DOI: 10.1007/s00799-023-00377-y. Online publication date: 28-Jul-2023
  • (2023) Present Causal Relationship Retrieval for Historical Analogy. Culture and Computing (536-547). DOI: 10.1007/978-3-031-34732-0_41. Online publication date: 9-Jul-2023
  • (2022) User Access Models to Event-Centric Information. Companion Proceedings of the Web Conference 2022 (329-333). DOI: 10.1145/3487553.3524193. Online publication date: 25-Apr-2022
  • (2021) Semantically Time Tracking of Events from Web Documents. Proceedings of the Brazilian Symposium on Multimedia and the Web (141-144). DOI: 10.1145/3470482.3479627. Online publication date: 5-Nov-2021
  • (2021) Event Occurrence Date Estimation based on Multivariate Time Series Analysis over Temporal Document Collections. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (398-407). DOI: 10.1145/3404835.3462885. Online publication date: 11-Jul-2021
  • (2021) Estimating Contemporary Relevance of Past News. 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) (70-79). DOI: 10.1109/JCDL52503.2021.00019. Online publication date: Sep-2021
  • (2021) StoryTracker: A Semantic-Oriented Tool for Automatic Tracking Events by Web Documents. Computational Science and Its Applications – ICCSA 2021 (126-140). DOI: 10.1007/978-3-030-86970-0_10. Online publication date: 11-Sep-2021
  • (2021) Full-Text and URL Search Over Web Archives. The Past Web (71-84). DOI: 10.1007/978-3-030-63291-5_7. Online publication date: 1-Jul-2021
  • (2020) Analyzing history-related posts in twitter. International Journal on Digital Libraries. DOI: 10.1007/s00799-020-00296-2. Online publication date: 28-Oct-2020
