ABSTRACT
We describe our experience of implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has requirements distinct from those of previous studies on topic detection and tracking (TDT) and on event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. To address these challenges, we propose Story Forest, a set of online schemes that automatically cluster streaming documents into events while connecting related events in growing trees to tell evolving stories. We conducted an extensive evaluation, through detailed pilot user experience studies, on 60 GB of real-world Chinese news data. Our ideas are not language-dependent and can easily be extended to other languages. The results demonstrate the superior capability of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers, compared to multiple existing algorithm frameworks.
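To make the two requirements above concrete, the following is a minimal toy sketch of online event clustering with append-only story trees. It is not the Story Forest algorithm itself, which the abstract does not specify. The class name `ToyStoryForest`, the use of Jaccard similarity over keyword sets, and both threshold values are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, field


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two keyword sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass
class Event:
    keywords: set                                   # union of member documents' keywords
    docs: list = field(default_factory=list)        # ids of documents in this event
    children: list = field(default_factory=list)    # later events attached to this one


class ToyStoryForest:
    """Hypothetical append-only forest: existing nodes are never moved or re-parented."""

    def __init__(self, event_threshold: float = 0.5, story_threshold: float = 0.2):
        # Both thresholds are illustrative assumptions, not values from the paper.
        self.event_threshold = event_threshold
        self.story_threshold = story_threshold
        self.roots = []    # one root event per story tree
        self.events = []   # all events, in creation order

    def add_document(self, doc_id: str, keywords) -> Event:
        keywords = set(keywords)

        # Find the most similar existing event.
        best, best_sim = None, 0.0
        for ev in self.events:
            sim = jaccard(keywords, ev.keywords)
            if sim > best_sim:
                best, best_sim = ev, sim

        # Very similar: the document reports an event we already know.
        if best is not None and best_sim >= self.event_threshold:
            best.docs.append(doc_id)
            best.keywords |= keywords
            return best

        # Otherwise create a new event and only ever append it:
        # under a related event (same story) or as the root of a new story.
        new_event = Event(keywords=keywords, docs=[doc_id])
        if best is not None and best_sim >= self.story_threshold:
            best.children.append(new_event)
        else:
            self.roots.append(new_event)
        self.events.append(new_event)
        return new_event
```

Because new events are only appended, a tree that a reader has already seen keeps its shape as further documents arrive, which is the consistency property the second requirement asks for. A production system would need far richer event and story similarity measures than a single keyword overlap score.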
- Growing Story Forest Online from Massive Breaking News