ABSTRACT
The analysis of social media data poses several challenges: first, the data sets are very large; second, they change constantly; and third, they are heterogeneous, consisting of text, images, geographic locations, and social connections. In this article, we focus on detecting events from text and location information, and introduce an analysis method that scales with respect to both volume and velocity. Through efficient normalization, our event detection also addresses the problems that arise from differences in the adoption of social media across cultures, languages, and countries.
We introduce an algorithm that processes vast amounts of data with a scalable online approach based on the SigniTrend event detection system. It identifies unusual geo-textual patterns in the data stream without requiring the user to specify constraints in advance, such as hashtags to track. In contrast to earlier work, we are able to monitor every word at every location with a fixed amount of memory, compare the observed values to statistics from earlier data, and report significant deviations with minimal delay. Thus, the algorithm is capable of reporting "Breaking News" in real time.
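The idea of tracking every word at every location in fixed memory can be illustrated with a small sketch. This is not the authors' implementation: the class name `HashedTrendDetector`, the single CRC32 hash, the table size, the smoothing factor `alpha`, the bias term `beta`, and the alert threshold are all assumptions chosen for illustration. Each (word, location) pair is hashed into a fixed-size table that holds an exponentially weighted moving mean and variance, and a pair is reported when its current count deviates strongly from those statistics.

```python
import math
import zlib


class HashedTrendDetector:
    """Illustrative fixed-memory trend detector (all parameters are assumptions)."""

    def __init__(self, buckets=1024, alpha=0.2, beta=1.0, threshold=3.0):
        self.buckets = buckets      # table size: memory use is independent of vocabulary
        self.alpha = alpha          # smoothing factor of the moving statistics
        self.beta = beta            # bias term that damps scores of rare pairs
        self.threshold = threshold  # minimum score required to report a pair
        self.mean = [0.0] * buckets
        self.var = [0.0] * buckets
        self.epochs = 0

    def _bucket(self, word, cell):
        # Hash the (word, location) pair into the fixed-size table.
        key = f"{word}\x00{cell}".encode("utf-8")
        return zlib.crc32(key) % self.buckets

    def process_epoch(self, counts):
        """counts maps (word, cell) -> frequency in the current epoch.

        Returns a list of (word, cell, score) for pairs that deviate
        significantly from the statistics of earlier epochs.
        """
        per_bucket = {}
        alerts = []
        for (word, cell), count in counts.items():
            b = self._bucket(word, cell)
            per_bucket[b] = per_bucket.get(b, 0.0) + count
            # Score against statistics learned from earlier epochs only.
            score = (count - self.mean[b]) / (math.sqrt(self.var[b]) + self.beta)
            if self.epochs > 0 and score >= self.threshold:
                alerts.append((word, cell, score))

        # Update the moving mean and variance of every bucket; buckets that
        # saw nothing in this epoch decay towards zero.
        for b in range(self.buckets):
            delta = per_bucket.get(b, 0.0) - self.mean[b]
            self.mean[b] += self.alpha * delta
            self.var[b] = (1.0 - self.alpha) * (self.var[b] + self.alpha * delta * delta)
        self.epochs += 1
        return alerts


detector = HashedTrendDetector()
history = [detector.process_epoch({("rain", "cell-a"): 5}) for _ in range(10)]
latest = detector.process_epoch({("rain", "cell-a"): 5, ("quake", "cell-b"): 50})
```

In this toy run, the steady "rain" counts never trigger an alert, while the sudden "quake" burst does. Hash collisions inflate a bucket's mean, so in this sketch they make detection more conservative rather than producing extra alerts.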
Location is modeled using unsupervised geometric discretization and supervised administrative hierarchies, which permits detecting events at the city, regional, and global levels at the same time. We demonstrate the usefulness of the approach with several real-world use cases on Twitter data.
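One way to picture the two location models is to turn each coordinate into several location tokens, one per spatial granularity. The following is a minimal sketch under stated assumptions: the function names, the token format, and the grid cell sizes in degrees are hypothetical, and the administrative path is assumed to come from an external reverse geocoder rather than being computed here.

```python
import math


def grid_tokens(lat, lon, cell_sizes=(0.5, 2.0, 8.0)):
    """Discretize a coordinate into grid cells at several resolutions.

    The cell sizes (in degrees) are illustrative assumptions.
    """
    tokens = []
    for size in cell_sizes:
        row = math.floor((lat + 90.0) / size)
        col = math.floor((lon + 180.0) / size)
        tokens.append(f"grid:{size}:{row}:{col}")
    return tokens


def location_tokens(lat, lon, admin_path=()):
    """Combine grid cells with administrative names, e.g. country to city.

    admin_path is assumed to be supplied by a reverse geocoder.
    """
    return grid_tokens(lat, lon) + [f"admin:{name}" for name in admin_path]


tokens = location_tokens(48.15, 11.58, ("Germany", "Bavaria", "Munich"))
```

Pairing each word of a message with every one of these tokens lets a single detector observe the same word at fine and coarse granularities simultaneously.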