research-article

Robust and Noise Resistant Wrapper Induction

Authors:
Tim Furche (Oxford University, Oxford, United Kingdom)
Jinsong Guo (Oxford University, Oxford, United Kingdom)
Sebastian Maneth (University of Edinburgh, Edinburgh, United Kingdom)
Christian Schallhart (Oxford University, Oxford, United Kingdom)

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, June 2016, Pages 773–784
https://doi.org/10.1145/2882903.2915214

Published: 14 June 2016

ABSTRACT

Wrapper induction is the problem of automatically inferring a query from annotated web pages of the same template. This query should select not only the annotated content accurately but also other content following the same template. Beyond accurately matching the template, we consider two additional requirements: (1) wrappers should be robust against a large class of changes to the web pages, and (2) the induction process should be noise resistant, i.e., tolerate slightly erroneous (e.g., machine-generated) samples. Key to our approach is a query language that is powerful enough to permit accurate selection, but limited enough to force noisy samples to be generalized into wrappers that select the likely intended items. We introduce such a language as a subset of XPath and show that, even for such a restricted language, inducing optimal queries according to a suitable scoring is infeasible. Nevertheless, our wrapper induction framework infers highly robust and noise-resistant queries. We evaluate the queries on snapshots of web pages that change over time, as provided by the Internet Archive, and show that the induced queries are as robust as human-made queries. The queries often survive hundreds, sometimes thousands, of days, despite many changes to the relative position of the selected nodes (including changes at the template level). This is due to the few and discriminative anchor (intermediately selected) nodes of the generated queries. The queries are highly resistant against positive noise (up to 50%) and negative noise (up to 20%).
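To make the notion of robustness concrete, the following minimal sketch contrasts a purely positional query with one built around a discriminative anchor node. It is an illustration only, not the paper's induction algorithm or its query language: the two page snapshots, the queries `BRITTLE` and `ROBUST`, and the helper `select` are all hypothetical, and the XPath used is the limited subset supported by Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Two invented snapshots of the same page template (assumption, not paper data).
OLD = """
<html><body>
  <div id="header">Shop</div>
  <div id="main">
    <span class="price">10</span>
    <span class="price">12</span>
  </div>
</body></html>
"""

# Later snapshot: a banner was inserted and the content moved one level deeper.
NEW = """
<html><body>
  <div id="banner">Sale</div>
  <div id="header">Shop</div>
  <div id="wrap">
    <div id="main">
      <span class="price">10</span>
      <span class="price">15</span>
    </div>
  </div>
</body></html>
"""

# Positional query: depends on the exact position of the container div.
BRITTLE = "./body/div[2]/span"
# Anchor-based query: depends only on one discriminative intermediate node.
ROBUST = ".//div[@id='main']/span[@class='price']"


def select(page: str, query: str) -> list:
    """Return the text of all nodes selected by `query` on `page`."""
    root = ET.fromstring(page.strip())
    return [node.text for node in root.findall(query)]


if __name__ == "__main__":
    for name, query in (("brittle", BRITTLE), ("robust", ROBUST)):
        print(name, select(OLD, query), select(NEW, query))
```

Both queries select the same items on the old snapshot, but only the anchor-based query still selects the prices after the template change, which is the kind of behavior the abstract attributes to queries with few, discriminative anchor nodes.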

Index Terms

Robust and Noise Resistant Wrapper Induction
1. Information systems
  1. Data management systems
    1. Information integration
      1. Wrappers (data mining)
    2. Query languages
      1. XML query languages
        XPath
  2. World Wide Web
    1. Web mining
      1. Data extraction and integration
        Deep web

Published in
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)
Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
XPath
wrapper
wrapper induction
wrapper maintenance
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%