poster

Collective extraction from heterogeneous web lists

Authors: Ashwin Machanavajjhala, Arun Shankar Iyer, Philip Bohannon, Srujana Merugu

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining

Pages 445-454

https://doi.org/10.1145/1935826.1935894

Published: 09 February 2011

Abstract

Automatic extraction of structured records from inconsistently formatted lists on the web is challenging: different lists present disparate sets of attributes with variations in the ordering of attributes; many lists contain additional attributes and noise that can confuse the extraction process; and formatting within a list may be inconsistent due to missing attributes or manual formatting on some sites.

We present a novel solution to this extraction problem that is based on (i) collective extraction from multiple lists simultaneously and (ii) careful exploitation of a small database of seed entities. Our approach accounts for layout homogeneity within individual lists, content redundancy across some snippets from different sources, and the noisy attribute rendering process. We experimentally evaluate variants of this algorithm on real-world data sets and show that our approach is a promising direction for extraction from noisy lists, requiring only mild and thus inexpensive supervision, which makes it suitable for extraction from the tail of the web.
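
As a rough illustration of two ideas named in the abstract (seed entities and layout homogeneity within a list), the toy sketch below labels the columns of pre-split list rows by majority vote of seed matches and then applies those labels to every row. It is not the paper's algorithm: the seed values, the function names `label_columns` and `extract`, and the assumption that rows arrive already split into fields are all hypothetical, and the sketch does not model cross-list content redundancy or the noisy rendering process.

```python
from collections import Counter

# Hypothetical seed database: attribute name -> known values (lowercased).
SEEDS = {
    "name": {"blue door cafe", "luigi's"},
    "city": {"austin", "boston"},
}

def label_columns(rows, seeds=SEEDS):
    """Assign one attribute per column position by majority vote of seed matches."""
    votes = {}  # column index -> Counter of attribute names
    for row in rows:
        for i, field in enumerate(row):
            for attr, values in seeds.items():
                if field.strip().lower() in values:
                    votes.setdefault(i, Counter())[attr] += 1
    # Layout homogeneity assumption: a column holds the same attribute in every row.
    return {i: c.most_common(1)[0][0] for i, c in votes.items()}

def extract(rows, seeds=SEEDS):
    """Apply the voted column labels to every row, including rows with no seed match."""
    labels = label_columns(rows, seeds)
    return [
        {labels[i]: field.strip() for i, field in enumerate(row) if i in labels}
        for row in rows
    ]

# Two made-up lists with different attribute orders; the second has an extra noise column.
list_a = [("Blue Door Cafe", "Austin"), ("Taco Hut", "Dallas")]
list_b = [("Boston", "Luigi's", "open late"), ("Denver", "Noodle Bar", "cash only")]
records = [extract(rows) for rows in (list_a, list_b)]
```

In this sketch a single seed match per column is enough to label rows that contain no seed entity at all, and columns that never match a seed are dropped as noise. Each list is labeled independently here; a collective approach would additionally share evidence across lists.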

Index Terms

Information systems → Information systems applications → Data mining

Published In

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining

February 2011

870 pages

ISBN:9781450304931

DOI:10.1145/1935826

General Chair: Irwin King (CUHK, Hong Kong)

Program Chairs: Wolfgang Nejdl (L3S and University of Hannover, Germany), Hang Li (Microsoft Research Asia, China)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WSDM'11: Fourth ACM International Conference on Web Search and Data Mining

February 9-12, 2011

Hong Kong, China

Acceptance Rates

WSDM '11 Paper Acceptance Rate 83 of 372 submissions, 22%;

Overall Acceptance Rate 498 of 2,863 submissions, 17%
