DOI: 10.1145/2949689.2949699 · SSDBM Conference Proceedings
research-article

SPOTHOT: Scalable Detection of Geo-spatial Events in Large Textual Streams

Published: 18 July 2016

ABSTRACT

The analysis of social media data poses several challenges: first, the data sets are very large; second, they change constantly; and third, they are heterogeneous, consisting of text, images, geographic locations, and social connections. In this article, we focus on detecting events consisting of text and location information, and introduce an analysis method that scales with respect to both volume and velocity. We also address the problems arising from differences in the adoption of social media across cultures, languages, and countries by applying efficient normalization in our event detection.

We introduce an algorithm capable of processing vast amounts of data using a scalable online approach based on the SigniTrend event detection system. It identifies unusual geo-textual patterns in the data stream without requiring the user to specify any constraints in advance, such as hashtags to track. In contrast to earlier work, we are able to monitor every word at every location with just a fixed amount of memory, compare the values to statistics from earlier data, and immediately report significant deviations with minimal delay. Thus, this algorithm is capable of reporting "Breaking News" in real time.
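
The idea of tracking every word at every location in fixed memory can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single hash function (CRC32) over a word and location key, an exponentially weighted moving average and variance per bucket, and a z-score-style deviation measure. The class name `HashedTrendTable`, the parameters `bits`, `alpha`, and `bias`, and the key format are all hypothetical choices made for illustration.

```python
import math
import zlib


class HashedTrendTable:
    """Fixed-size table of moving mean/variance, indexed by hashed (word, cell) keys.

    Illustrative sketch only: one hash function, so colliding keys share statistics.
    """

    def __init__(self, bits=16, alpha=0.1, bias=1e-3):
        self.size = 1 << bits          # memory is fixed, independent of vocabulary size
        self.alpha = alpha             # learning rate of the moving average (assumed)
        self.bias = bias               # small constant to damp scores of rare keys (assumed)
        self.mean = [0.0] * self.size
        self.var = [0.0] * self.size

    def _bucket(self, word, cell):
        # Hash the combined key into one of the fixed buckets.
        key = f"{word}\x00{cell}".encode("utf-8")
        return zlib.crc32(key) % self.size

    def score(self, word, cell, freq):
        """Deviation of the current relative frequency from past statistics."""
        i = self._bucket(word, cell)
        std = math.sqrt(self.var[i])
        return (freq - max(self.mean[i], self.bias)) / (std + self.bias)

    def update(self, word, cell, freq):
        """Fold the current epoch's relative frequency into the moving statistics."""
        i = self._bucket(word, cell)
        delta = freq - self.mean[i]
        self.mean[i] += self.alpha * delta
        # Incremental exponentially weighted variance update.
        self.var[i] = (1.0 - self.alpha) * (self.var[i] + self.alpha * delta * delta)
```

In such a setup, each epoch would first call `score` for every observed word and location pair, report the pairs whose score exceeds a chosen threshold, and then call `update`. Scoring before updating keeps the current epoch from masking its own deviation.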

Location is modeled using unsupervised geometric discretization and supervised administrative hierarchies, which permits detecting events at city, regional, and global levels at the same time. The usefulness of the approach is demonstrated in several real-world use cases based on Twitter data.
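
As a rough illustration of combining geometric discretization with an administrative hierarchy, the sketch below maps a coordinate to one grid-cell token per resolution and appends tokens for administrative regions. The cell sizes, the token format, and the function names `grid_cells` and `location_tokens` are assumptions made for this example, not details taken from the paper. The `admin_path` argument stands in for the output of a reverse geocoder.

```python
import math


def grid_cells(lat, lon, sizes_deg=(0.5, 2.0, 8.0)):
    """Return one grid-cell token per resolution for a latitude/longitude pair.

    The cell sizes (in degrees) are arbitrary illustrative values.
    """
    tokens = []
    for size in sizes_deg:
        # Shift to non-negative ranges, then bucket into square cells.
        row = math.floor((lat + 90.0) / size)
        col = math.floor((lon + 180.0) / size)
        tokens.append(f"grid:{size}:{row}:{col}")
    return tokens


def location_tokens(lat, lon, admin_path=()):
    """Combine grid tokens with administrative tokens (e.g. city, state, country).

    admin_path is assumed to come from an external reverse geocoder.
    """
    return grid_cells(lat, lon) + [f"admin:{name}" for name in admin_path]
```

Because every message yields tokens at several granularities, a word can be paired with each token and tracked separately. A local event then stands out in a fine cell or city token, while a widespread event stands out in a coarse cell or country token.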

