Controlling False Discoveries During Interactive Data Exploration

Authors:
Zheguang Zhao, Lorenzo De Stefani, Emanuel Zgraggen, Carsten Binnig, Eli Upfal, Tim Kraska

Brown University, Providence, RI, USA

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data, May 2017, Pages 527–540. https://doi.org/10.1145/3035918.3064019

Published: 09 May 2017

ABSTRACT

Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. They allow users to examine many hypotheses (often visually) and to draw inferences with simple interactions, and thus run into the problem known in statistics as the "multiple hypothesis testing error." In this work, we propose a solution that integrates the control of multiple hypothesis testing into interactive data exploration systems. A key insight is that existing procedures for controlling false discoveries, such as those that control the false discovery rate (FDR), are not directly applicable to interactive data exploration. We therefore discuss a set of new control procedures that are better suited to this task and integrate them into our system, QUDE. Through extensive experiments on both real-world and synthetic data sets, we demonstrate how QUDE can help experts and novice users alike to control false discoveries efficiently.
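To make the "multiple hypothesis testing error" concrete, the following is a minimal sketch and is not taken from the paper or from QUDE. It assumes independent tests whose null hypotheses are all true, and the function names are hypothetical. It shows how the chance of at least one false discovery grows with the number of tests, together with the classic Benjamini-Hochberg step-up procedure, which needs the full list of p-values at once. That batch requirement is one plausible reason such procedures fit poorly with exploration sessions in which hypotheses arrive one at a time.

```python
from typing import List, Sequence


def familywise_error(alpha: float, m: int) -> float:
    """Probability of at least one false rejection when m independent
    true-null hypotheses are each tested at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def benjamini_hochberg(p_values: Sequence[float], alpha: float) -> List[int]:
    """Benjamini-Hochberg step-up procedure (batch setting).

    Returns the sorted indices of the rejected hypotheses. All p-values
    must be known up front, which is the assumption that interactive
    exploration typically violates.
    """
    m = len(p_values)
    # Indices of the hypotheses ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value is below its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])
```

For example, `familywise_error(0.05, 20)` is roughly 0.64, so a user who inspects twenty unrelated charts at the usual 5% level is more likely than not to see at least one spurious "finding" under these assumptions.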

Published in
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN: 9781450341974
DOI: 10.1145/3035918
General Chairs: Rada Chirkova (North Carolina State University, USA), Jun Yang (Duke University, USA)
Program Chair: Dan Suciu (University of Washington, USA)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
alpha investing
bonferroni
data analytics
false discovery control
false discovery rate
family-wise error rate
hypothesis testing
interactive data exploration
multiple comparisons problem
multiple hypothesis error
visualization
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%