skip to main content
10.1145/2723372.2723730acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

SCREEN: Stream Data Cleaning under Speed Constraints

Published:27 May 2015Publication History

ABSTRACT

Stream data are often dirty, for example, owing to unreliable sensor reading, or erroneous extraction of stock prices. Most stream data cleaning approaches employ a smoothing filter, which may seriously alter the data without preserving the original information. We argue that the cleaning should avoid changing those originally correct/clean data, a.k.a. the minimum change principle in data cleaning. To capture the knowledge about what is clean, we consider the (widely existing) constraints on the speed of data changes, such as fuel consumption per hour, or daily limit of stock prices.

Guided by these semantic constraints, in this paper, we propose SCREEN, the first constraint-based approach for cleaning stream data. It is notable that existing data repair techniques clean (a sequence of) data as a whole and fail to support stream computation. To this end, we have to relax the global optimum over the entire sequence to the local optimum in a window.

Rather than the commonly observed NP-hardness of general data repairing problems, our major contributions include: (1) polynomial time algorithm for global optimum, (2) linear time algorithm towards local optimum under an efficient Median Principle,(3) support on out-of-order arrivals of data points, and(4) adaptive window size for balancing repair accuracy and efficiency.

Experiments on real datasets demonstrate that SCREEN can show significantly higher repair accuracy than the existing approaches such as smoothing.

References

  1. P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD Conference, pages 143--154, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. G. E. P. Box. Statistics for experimenters: an introduction to design, data analysis, and model building. 1978.Google ScholarGoogle Scholar
  3. D. R. Brillinger. Time series: data analysis and theory, volume 36. Siam, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. P. M. Fischer, K. S. Esmaili, and R. J. Miller. Stream schema: providing and exploiting static metadata for data stream processing. In EDBT, pages 207--218, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Full Version. http://ise.thss.tsinghua.edu.cn/sxsong/doc/screen.pdf.Google ScholarGoogle Scholar
  7. E. S. Gardner Jr. Exponential smoothing: The state of the art--part ii. International Journal of Forecasting, 22(4):637--666, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  8. L. Golab, H. J. Karloff, F. Korn, A. Saha, and D. Srivastava. Sequential dependencies. PVLDB, 2(1):574--585, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. C. A. R. Hoare. Quicksort. Comput. J., 5(1):10--15, 1962.Google ScholarGoogle ScholarCross RefCross Ref
  10. S. R. Jeffery, M. N. Garofalakis, and M. J. Franklin. Adaptive cleaning for rfid data streams. In VLDB, pages 163--174, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. H. Z. Jerrold. Biostatistical analysis. Biostatistical analysis, 1999.Google ScholarGoogle Scholar
  12. N. Karmarkar. A new polynomial-time algorithm for linear programming. In STOC, pages 302--311, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. E. J. Keogh, S. Chu, D. M. Hart, and M. J. Pazzani. An online algorithm for segmenting time series. In ICDM, pages 289--296, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2):97--108, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Liu, M. Li, D. Golovnya, E. A. Rundensteiner, and K. T. Claypool. Sequence pattern query processing over out-of-order event streams. In ICDE, pages 784--795, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Lopatenko and L. Bravo. Efficient approximation algorithms for repairing inconsistent databases. In ICDE, pages 216--225, 2007.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. SCREEN: Stream Data Cleaning under Speed Constraints

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
      May 2015
      2110 pages
      ISBN:9781450327589
      DOI:10.1145/2723372

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 27 May 2015

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      SIGMOD '15 Paper Acceptance Rate106of415submissions,26%Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader