DOI: 10.1145/3132847.3132852
research-article

Growing Story Forest Online from Massive Breaking News

Published: 06 November 2017

ABSTRACT

We describe our experience of implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has distinct requirements in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. To address these challenges, we propose Story Forest, a set of online schemes that automatically cluster streaming documents into events while connecting related events in growing trees to tell evolving stories. We conducted an extensive evaluation on 60 GB of real-world Chinese news data and through detailed pilot user experience studies; our ideas are not language-dependent and can easily be extended to other languages. The results demonstrate the superior capability of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers, compared to multiple existing algorithm frameworks.
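
The two requirements in the abstract (cluster streaming documents into events, and grow story trees without restructuring what already exists) can be illustrated with a toy sketch. This is not the paper's algorithm: the abstract does not specify one, so the Jaccard similarity over keyword sets, the two thresholds, and every name below (Event, StoryForest, event_threshold, story_threshold) are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two keyword sets, in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass(eq=False)
class Event:
    keywords: Set[str]
    docs: List[str] = field(default_factory=list)
    parent: Optional["Event"] = None
    children: List["Event"] = field(default_factory=list)


class StoryForest:
    """Toy online structure: existing nodes are never moved or re-clustered."""

    def __init__(self, event_threshold: float = 0.6, story_threshold: float = 0.2):
        # Both thresholds are hypothetical values chosen for the example.
        self.event_threshold = event_threshold
        self.story_threshold = story_threshold
        self.events: List[Event] = []
        self.roots: List[Event] = []  # one root per story tree

    def add(self, doc_id: str, keywords: Set[str]) -> Event:
        # Find the most similar existing event, if any.
        best = max(self.events, key=lambda e: jaccard(e.keywords, keywords), default=None)
        score = jaccard(best.keywords, keywords) if best is not None else 0.0

        if best is not None and score >= self.event_threshold:
            # Same event: absorb the document and widen the keyword set.
            best.docs.append(doc_id)
            best.keywords |= keywords
            return best

        # New event: attach under a related event, or start a new story tree.
        event = Event(keywords=set(keywords), docs=[doc_id])
        if best is not None and score >= self.story_threshold:
            event.parent = best
            best.children.append(event)
        else:
            self.roots.append(event)
        self.events.append(event)
        return event


forest = StoryForest()
forest.add("d1", {"quake", "city", "rescue"})
forest.add("d2", {"quake", "city", "rescue", "teams"})
forest.add("d3", {"quake", "city", "aid", "donations"})
forest.add("d4", {"election", "vote", "result"})
print(len(forest.roots), len(forest.events))
```

In this sketch a new document either joins an existing event, becomes a child event in an existing tree, or starts a new tree. Earlier nodes keep their position, which mirrors the stated goal of a consistent viewing experience.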

    • Published in

      CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
      November 2017
      2604 pages
      ISBN: 9781450349185
      DOI: 10.1145/3132847

      Copyright © 2017 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      CIKM '17 Paper Acceptance Rate: 171 of 855 submissions, 20%
      Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
