ABSTRACT
Poor data quality leads to unreliable results in any kind of data processing and has a profound economic impact. Although tools exist to help users with data cleansing, support for the specifics of time-oriented data is rather poor. Yet the time dimension has very specific characteristics that introduce quality problems different from those of other kinds of data. We present TimeCleanser, an interactive Visual Analytics system that supports the cleansing of time-oriented data. To help users deal with these special characteristics and quality problems, TimeCleanser combines semi-automatic quality checks, visualizations, and directly editable data tables. An evaluation of TimeCleanser with a focus group (two target users, one developer, and two Human-Computer Interaction experts) shows that our proposed method (a) is suited to detecting hidden quality problems in time-oriented data and (b) facilitates the complex task of data cleansing.
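To make the idea of a quality check specific to time-oriented data more concrete, the following is a minimal sketch in Python. It is not TimeCleanser's implementation; the function name `check_timestamps`, the problem labels, and the sample readings are all hypothetical and chosen only for illustration. It flags three kinds of problems that arise only because the data has a time dimension: duplicate timestamps, gaps larger than an expected sampling step, and entries that are out of chronological order.

```python
from datetime import datetime, timedelta


def check_timestamps(timestamps, expected_step):
    """Return (index, problem) tuples for a sequence of datetime values.

    Hypothetical helper for illustration; not part of TimeCleanser.
    """
    problems = []
    seen = set()
    for i, ts in enumerate(timestamps):
        # A timestamp that already occurred earlier is reported as a duplicate.
        if ts in seen:
            problems.append((i, "duplicate timestamp"))
        seen.add(ts)
        if i == 0:
            continue
        delta = ts - timestamps[i - 1]
        if delta < timedelta(0):
            # The entry lies before its predecessor in time.
            problems.append((i, "out of order"))
        elif delta > expected_step:
            # More time has passed than the expected sampling step allows.
            problems.append((i, "gap"))
    return problems


# Hypothetical hourly readings containing one problem of each kind.
readings = [
    datetime(2014, 4, 17, 8, 0),
    datetime(2014, 4, 17, 9, 0),
    datetime(2014, 4, 17, 9, 0),   # same timestamp as the previous entry
    datetime(2014, 4, 17, 12, 0),  # three hours after the previous entry
    datetime(2014, 4, 17, 11, 0),  # earlier than the previous entry
]
issues = check_timestamps(readings, timedelta(hours=1))
for index, problem in issues:
    print(index, problem)
```

A check like this only reports candidate problems; in a semi-automatic workflow of the kind the abstract describes, the user would still decide whether each flagged entry is a genuine error and how to correct it.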
Index Terms
- TimeCleanser: a visual analytics approach for data cleansing of time-oriented data