ABSTRACT
Stream data are often dirty, for example, owing to unreliable sensor reading, or erroneous extraction of stock prices. Most stream data cleaning approaches employ a smoothing filter, which may seriously alter the data without preserving the original information. We argue that the cleaning should avoid changing those originally correct/clean data, a.k.a. the minimum change principle in data cleaning. To capture the knowledge about what is clean, we consider the (widely existing) constraints on the speed of data changes, such as fuel consumption per hour, or daily limit of stock prices.
Guided by these semantic constraints, in this paper, we propose SCREEN, the first constraint-based approach for cleaning stream data. It is notable that existing data repair techniques clean (a sequence of) data as a whole and fail to support stream computation. To this end, we have to relax the global optimum over the entire sequence to the local optimum in a window.
Rather than the commonly observed NP-hardness of general data repairing problems, our major contributions include: (1) polynomial time algorithm for global optimum, (2) linear time algorithm towards local optimum under an efficient Median Principle,(3) support on out-of-order arrivals of data points, and(4) adaptive window size for balancing repair accuracy and efficiency.
Experiments on real datasets demonstrate that SCREEN can show significantly higher repair accuracy than the existing approaches such as smoothing.
- P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD Conference, pages 143--154, 2005. Google ScholarDigital Library
- G. E. P. Box. Statistics for experimenters: an introduction to design, data analysis, and model building. 1978.Google Scholar
- D. R. Brillinger. Time series: data analysis and theory, volume 36. Siam, 2001. Google ScholarDigital Library
- X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469, 2013. Google ScholarDigital Library
- P. M. Fischer, K. S. Esmaili, and R. J. Miller. Stream schema: providing and exploiting static metadata for data stream processing. In EDBT, pages 207--218, 2010. Google ScholarDigital Library
- Full Version. http://ise.thss.tsinghua.edu.cn/sxsong/doc/screen.pdf.Google Scholar
- E. S. Gardner Jr. Exponential smoothing: The state of the art--part ii. International Journal of Forecasting, 22(4):637--666, 2006.Google ScholarCross Ref
- L. Golab, H. J. Karloff, F. Korn, A. Saha, and D. Srivastava. Sequential dependencies. PVLDB, 2(1):574--585, 2009. Google ScholarDigital Library
- C. A. R. Hoare. Quicksort. Comput. J., 5(1):10--15, 1962.Google ScholarCross Ref
- S. R. Jeffery, M. N. Garofalakis, and M. J. Franklin. Adaptive cleaning for rfid data streams. In VLDB, pages 163--174, 2006. Google ScholarDigital Library
- H. Z. Jerrold. Biostatistical analysis. Biostatistical analysis, 1999.Google Scholar
- N. Karmarkar. A new polynomial-time algorithm for linear programming. In STOC, pages 302--311, 1984. Google ScholarDigital Library
- E. J. Keogh, S. Chu, D. M. Hart, and M. J. Pazzani. An online algorithm for segmenting time series. In ICDM, pages 289--296, 2001. Google ScholarDigital Library
- S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009. Google ScholarDigital Library
- X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2):97--108, 2012. Google ScholarDigital Library
- M. Liu, M. Li, D. Golovnya, E. A. Rundensteiner, and K. T. Claypool. Sequence pattern query processing over out-of-order event streams. In ICDE, pages 784--795, 2009. Google ScholarDigital Library
- A. Lopatenko and L. Bravo. Efficient approximation algorithms for repairing inconsistent databases. In ICDE, pages 216--225, 2007.Google ScholarCross Ref
Index Terms
- SCREEN: Stream Data Cleaning under Speed Constraints
Recommendations
Stream Data Cleaning under Speed and Acceleration Constraints
Stream data are often dirty, for example, owing to unreliable sensor reading or erroneous extraction of stock prices. Most stream data cleaning approaches employ a smoothing filter, which may seriously alter the data without preserving the original ...
Constraint-Variance Tolerant Data Repairing
SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataIntegrity constraints, guiding the cleaning of dirty data, are often found to be imprecise as well. Existing studies consider the inaccurate constraints that are oversimplified, and thus refine the constraints via inserting more predicates (attributes). ...
On-chain repairing for multi-party data migration
AbstractBlockchain can be used to solve the problem of mutual trust between different institutions. However, when migrating data from a traditional system to a blockchain system, the order of data transactions is difficult to determine and the data ...
Comments