Visual Interactive Creation, Customization, and Analysis of Data Quality Metrics

Published: 29 May 2018

Abstract

During data preprocessing, analysts spend a significant part of their time and effort profiling the quality of the data along with cleansing and transforming the data for further analysis. While quality metrics—ranging from general to domain-specific measures—support assessment of the quality of a dataset, there are hardly any approaches to visually support the analyst in customizing and applying such metrics. Yet, visual approaches could facilitate users’ involvement in data quality assessment. We present MetricDoc, an interactive environment for assessing data quality that provides customizable, reusable quality metrics in combination with immediate visual feedback. Moreover, we provide an overview visualization of these quality metrics along with error visualizations that facilitate interactive navigation of the data to determine the causes of quality issues present in the data. In this article, we describe the architecture, design, and evaluation of MetricDoc, which underwent several design cycles, including heuristic evaluation and expert reviews as well as a focus group with data quality, human-computer interaction, and visual analytics experts.
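The idea of a customizable, reusable quality metric can be illustrated with a minimal sketch. It is not MetricDoc's implementation or API: every name below (`metric_score`, `failing_rows`, `is_present`, `make_range_check`) and the sample column are hypothetical. The sketch assumes a metric can be modelled as a check applied to each value of a column, with the score being the share of values that pass.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical illustration, not MetricDoc's API: a metric is a check applied
# to every value of a column; its score is the share of values that pass.
Check = Callable[[Optional[str]], bool]


def metric_score(values: Sequence[Optional[str]], check: Check) -> float:
    """Return the fraction of values passing the check (1.0 for an empty column)."""
    if not values:
        return 1.0
    passed = sum(1 for value in values if check(value))
    return passed / len(values)


def failing_rows(values: Sequence[Optional[str]], check: Check) -> List[int]:
    """Return the row indices that fail the check, for navigating to the causes."""
    return [index for index, value in enumerate(values) if not check(value)]


def is_present(value: Optional[str]) -> bool:
    """Completeness check: the value is neither missing nor blank."""
    return value is not None and value.strip() != ""


def make_range_check(low: float, high: float) -> Check:
    """Build a reusable plausibility check with customizable bounds."""
    def check(value: Optional[str]) -> bool:
        if not is_present(value):
            return False
        try:
            return low <= float(value) <= high
        except ValueError:
            # Non-numeric entries count as failures.
            return False
    return check


# Made-up sample column with missing, implausible, and non-numeric entries.
ages = ["34", "", "29", None, "210", "abc"]
age_check = make_range_check(0, 120)

completeness = metric_score(ages, is_present)
plausibility = metric_score(ages, age_check)
missing_rows = failing_rows(ages, is_present)
implausible_rows = failing_rows(ages, age_check)
```

Returning the failing row indices next to the score mirrors the abstract's point that a metric should lead the analyst back to the records that cause a quality issue, not only summarize the column.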

Supplementary Material

bors (bors.zip)
Supplemental movie, appendix, image, and software files for Visual Interactive Creation, Customization, and Analysis of Data Quality Metrics

Published In

Journal of Data and Information Quality, Volume 10, Issue 1
Challenge Paper and Research Papers
March 2018
74 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3229521
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Received: 01 July 2018
Published: 29 May 2018
Accepted: 01 February 2018
Revised: 01 January 2018
Published in JDIQ Volume 10, Issue 1

Author Tags

  1. Data profiling
  2. data quality metrics
  3. visual exploration

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Centre for Visual Analytics Science and Technology (CVAST)
  • Austrian Federal Ministry of Science, Research, and Economy in the exceptional Laura Bassi Centres of Excellence initiative
