DOI: 10.1145/2882903.2915214
research-article

Robust and Noise Resistant Wrapper Induction

Published: 14 June 2016

ABSTRACT

Wrapper induction is the problem of automatically inferring a query from annotated web pages that share a template. The query should accurately select not only the annotated content but also other content that follows the same template. Beyond accurately matching the template, we consider two additional requirements: (1) wrappers should be robust against a large class of changes to the web pages, and (2) the induction process should be noise resistant, i.e., tolerate slightly erroneous (e.g., machine-generated) samples. Key to our approach is a query language that is powerful enough to permit accurate selection, but limited enough to force noisy samples to be generalized into wrappers that select the likely intended items. We introduce such a language as a subset of XPath and show that, even for such a restricted language, inducing optimal queries according to a suitable scoring is infeasible. Nevertheless, our wrapper induction framework infers highly robust and noise-resistant queries. We evaluate the queries on snapshots of web pages that change over time, as provided by the Internet Archive, and show that the induced queries are as robust as human-made queries. The queries often survive hundreds, and sometimes thousands, of days, with many changes to the relative position of the selected nodes (including changes at the template level). This is due to the few and discriminative anchor (intermediately selected) nodes of the generated queries. The queries are highly resistant against positive noise (up to 50%) and negative noise (up to 20%).
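
The robustness requirement can be illustrated with a small sketch. It is not the paper's query language or induction algorithm: the two page snapshots, the class names, and both queries are hypothetical, and the limited XPath subset of Python's standard `xml.etree.ElementTree` merely stands in for a wrapper language. The sketch contrasts a purely positional query with one that relies on a single discriminative anchor.

```python
import xml.etree.ElementTree as ET

# Hypothetical snapshot of a product page (well-formed XML for simplicity).
SNAPSHOT_OLD = """
<html><body>
  <div class="nav">Home</div>
  <div class="product">
    <span class="title">Widget</span>
    <span class="price">9.99</span>
  </div>
</body></html>
"""

# Hypothetical later snapshot: an ad was inserted and the product block
# was wrapped in an extra container, so node positions have shifted.
SNAPSHOT_NEW = """
<html><body>
  <div class="nav">Home</div>
  <div class="ad">Buy now</div>
  <div class="main">
    <div class="product">
      <span class="title">Widget</span>
      <span class="price">10.49</span>
    </div>
  </div>
</body></html>
"""

# A positional query that depends on the exact layout of the old snapshot.
POSITIONAL_QUERY = "./body/div[2]/span[2]"

# A query that depends only on one discriminative anchor (the class attribute).
ANCHORED_QUERY = ".//span[@class='price']"


def select_text(page: str, query: str) -> list[str]:
    """Evaluate the query on the page and return the text of the selected nodes."""
    root = ET.fromstring(page.strip())
    return [node.text for node in root.findall(query)]


if __name__ == "__main__":
    for name, query in [("positional", POSITIONAL_QUERY), ("anchored", ANCHORED_QUERY)]:
        print(name, select_text(SNAPSHOT_OLD, query), select_text(SNAPSHOT_NEW, query))
```

In this toy setting both queries select the price on the old snapshot, but only the anchored query still selects it after the layout change; the positional query selects nothing.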


Published in: SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, June 2016, 2300 pages. ISBN: 9781450335317. DOI: 10.1145/2882903

          Copyright © 2016 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 785 of 4,003 submissions, 20%
