skip to main content
10.5555/1109557.1109683acmconferencesArticle/Chapter ViewAbstractPublication PagessodaConference Proceedingsconference-collections
Article

The hunting of the bump: on maximizing statistical discrepancy

Published: 22 January 2006 Publication History

Abstract

Anomaly detection has important applications in biosurveilance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely, finding clusters in the measured data may not actually represent true anomalies. These clusters may likely be the clusters of the baseline distribution. Hence, a discrepancy function is often used to examine how different measured data is to baseline data within a region. An anomalous region is thus defined to be one with high discrepancy.In this paper, we present algorithms for maximizing statistical discrepancy functions over the space of axis-parallel rectangles. We give provable approximation guarantees, both additive and relative, and our methods apply to any convex discrepancy function. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy; roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one needs only maximize a linear discrepancy function over the same set of shapes.We derive general discrepancy functions for data generated from a one- parameter exponential family. This generalizes the widely-used Kulldorff scan statistic for data from a Poisson distribution. We present an algorithm running in O(1/ε n2 log2n) that computes the maximum discrepancy rectangle to within additive error ε, for the Kulldorff scan statistic. Similar results hold for relative error and for discrepancy functions for data coming from Gaussian, Bernoulli, and gamma distributions. Prior to our work, the best known algorithms were exact and ran in time O(n4).

References

[1]
Cormode, G., Korn, F., Muthukrishnan, S., and Srivastava, D. Diamond in the rough: finding hierarchical heavy hitters in multi-dimensional data. In SIGMOD (2004), pp. 155--166.
[2]
Cormode, G., and Muthukrishnan, S. What's hot and what's not: tracking most frequent items dynamically. In PODS (2003), pp. 296--306.
[3]
Cressie, N. Statistics for spatial data, 2nd ed. John Wiley, 1993.
[4]
Dobkin, D., and Eppstein, D. Computing the discrepancy. Symposium on Computational Geometry 9 (1993).
[5]
Dobkin, D. P., Gunopulos, D., and Maass, W. Computing the maximum bichromatic discrepancy, with applications to computer graphics and machine learning. NeuroCOLT Technical Report Series NC-TR-95-008 (March 1995).
[6]
Friedman, J. H., and Fisher, N. I. Bump hunting in high-dimensional data. Statistics and Computing 9, 2 (April 1999), 123--143.
[7]
Good, P. Permutation tests - a practical guide to resampling methods for testing hypotheses. Springer-Verlag, 2nd edition, New York, 2000.
[8]
Iyengar, V. On detecting space-time clusters. KDD (2004), 587--592.
[9]
J. Glaz, and Balakrishnan, N. Scan statistics and applications. Birkhauser, Boston, 1999.
[10]
J. Glaz, J. Naus, and S. Wallenstein. Scan statistics. Springer-Verlag, New York, 2001.
[11]
Lipson, D., Aumann, Y., Ben-Dor, A., Linial, N., and Yakhini, Z. Efficient calculation of interval scores for dna copy number data analysis. RECOMB (2005), 83--100.
[12]
M. Dwass. Modified randomization tests for nonparametric hypotheses. Annals of Mathematical Statistics 28 (1957), 181--187.
[13]
M. Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods 26 (1997), 1481--1496.
[14]
M. Kulldorff. Prospective time-periodic geographical disease surveillance using a scan statistic. Journal of the Royal statistical society, series A 164 (2001), 61--72.
[15]
Neill, D. B., and Moore, A. W. A fast multiresolution method for detection of significant spatial disease clusters. Advances in Neural Information Processing Systems 10 (2004), 651--658.
[16]
Neill, D. B., and Moore, A. W. Rapid detection of significant spatial clusters. In KDD (2004).

Cited By

View all
  • (2022)Statistically-Robust Clustering Techniques for Mapping Spatial Hotspots: A SurveyACM Computing Surveys10.1145/348789355:2(1-38)Online publication date: 18-Jan-2022
  • (2020)Hunting multiple bumps in graphsProceedings of the VLDB Endowment10.14778/3377369.337737513:5(656-669)Online publication date: 1-Jan-2020
  • (2016)Scalable spatial scan statistics through samplingProceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems10.1145/2996913.2996939(1-10)Online publication date: 31-Oct-2016
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SODA '06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
January 2006
1261 pages
ISBN:0898716055

Sponsors

Publisher

Society for Industrial and Applied Mathematics

United States

Publication History

Published: 22 January 2006

Check for updates

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 411 of 1,322 submissions, 31%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)0
Reflects downloads up to 03 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2022)Statistically-Robust Clustering Techniques for Mapping Spatial Hotspots: A SurveyACM Computing Surveys10.1145/348789355:2(1-38)Online publication date: 18-Jan-2022
  • (2020)Hunting multiple bumps in graphsProceedings of the VLDB Endowment10.14778/3377369.337737513:5(656-669)Online publication date: 1-Jan-2020
  • (2016)Scalable spatial scan statistics through samplingProceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems10.1145/2996913.2996939(1-10)Online publication date: 31-Oct-2016
  • (2015)CNVScanProceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics10.1145/2808719.2808754(335-344)Online publication date: 9-Sep-2015
  • (2014)Event detection in activity networksProceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining10.1145/2623330.2623674(1176-1185)Online publication date: 24-Aug-2014
  • (2010)Identifying, attributing and describing spatial burstsProceedings of the VLDB Endowment10.14778/1920841.19209783:1-2(1091-1102)Online publication date: 1-Sep-2010
  • (2010)A Model-Agnostic Framework for Fast Spatial Anomaly DetectionACM Transactions on Knowledge Discovery from Data10.1145/1857947.18579524:4(1-30)Online publication date: 1-Oct-2010
  • (2009)A LRT framework for fast spatial anomaly detectionProceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining10.1145/1557019.1557116(887-896)Online publication date: 28-Jun-2009
  • (2009)On burstiness-aware search for document sequencesProceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining10.1145/1557019.1557075(477-486)Online publication date: 28-Jun-2009
  • (2009)Region-restricted clustering for geographic data miningComputational Geometry: Theory and Applications10.1016/j.comgeo.2008.08.00342:3(231-240)Online publication date: 1-Apr-2009
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media