ABSTRACT
Wrapper induction is the problem of automatically inferring a query from annotated web pages of the same template. This query should accurately select not only the annotated content but also other content following the same template. Beyond accurately matching the template, we consider two additional requirements: (1) wrappers should be robust against a large class of changes to the web pages, and (2) the induction process should be noise-resistant, i.e., tolerate slightly erroneous (e.g., machine-generated) samples. Key to our approach is a query language that is powerful enough to permit accurate selection, but limited enough to force noisy samples to be generalized into wrappers that select the likely intended items. We introduce such a language as a subset of XPath and show that even for such a restricted language, inducing optimal queries according to a suitable scoring is infeasible. Nevertheless, our wrapper induction framework infers highly robust and noise-resistant queries. We evaluate the queries on snapshots of web pages that change over time, as provided by the Internet Archive, and show that the induced queries are as robust as human-made queries. The queries often survive hundreds, sometimes thousands, of days, despite many changes to the relative position of the selected nodes (including changes at the template level). This is due to the few and discriminative anchor (intermediately selected) nodes of the generated queries. The queries are highly resistant against positive noise (up to 50%) and negative noise (up to 20%).
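To make the robustness requirement concrete, the following minimal sketch contrasts a position-based query with one that relies on a single discriminative anchor. It is an illustration only, not the paper's query language or induction algorithm: the two page snapshots, the class names, and both queries are invented for this example, and the queries use the limited XPath subset supported by Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Two hypothetical snapshots of the same page template (invented for illustration).
OLD = """<html><body>
<div class="nav">Home</div>
<div class="item"><span class="title">Lamp</span><span class="price">20</span></div>
</body></html>"""

# Later snapshot: a banner was added and an extra element was inserted
# before the price, shifting the relative positions of the nodes.
NEW = """<html><body>
<div class="banner">Sale</div>
<div class="nav">Home</div>
<div class="item"><span class="title">Lamp</span><em>new</em><span class="price">25</span></div>
</body></html>"""

# Position-based query: depends on the exact child positions along the path.
BRITTLE = "./body/div[2]/span[2]"
# Anchor-based query: depends on one discriminative attribute value.
ROBUST = ".//span[@class='price']"


def select(page: str, query: str) -> list:
    """Return the text of every node that the query selects on the page."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(query)]


if __name__ == "__main__":
    for name, query in (("brittle", BRITTLE), ("robust", ROBUST)):
        print(name, select(OLD, query), select(NEW, query))
```

On the old snapshot both queries select the price. On the new snapshot the position-based query selects nothing, because the second `div` is now the navigation bar, while the anchor-based query still finds the price.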
Index Terms
- Robust and Noise Resistant Wrapper Induction