DOI: 10.1145/1935826.1935894
poster

Collective extraction from heterogeneous web lists

Published: 09 February 2011

Abstract

Automatic extraction of structured records from inconsistently formatted lists on the web is challenging: different lists present disparate sets of attributes with variations in the ordering of attributes; many lists contain additional attributes and noise that can confuse the extraction process; and formatting within a list may be inconsistent due to missing attributes or manual formatting on some sites.
We present a novel solution to this extraction problem that is based on (i) collective extraction from multiple lists simultaneously and (ii) careful exploitation of a small database of seed entities. Our approach addresses the layout homogeneity within the individual lists, content redundancy across some snippets from different sources, and the noisy attribute rendering process. We experimentally evaluate variants of this algorithm on real-world data sets and show that our approach is a promising direction for extraction from noisy lists, requiring mild and thus inexpensive supervision suitable for extraction from the tail of the web.
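
To make the two ingredients named in the abstract concrete, the following is a minimal illustrative sketch, not the paper's algorithm (the author tags point to hidden Markov models and collective Bayesian models, which are not reproduced here). It assumes each list has already been split into delimited fields, and it uses a tiny hypothetical seed database (`SEEDS`) to vote on which column position holds which attribute. Because layout is assumed to be homogeneous within a list, the inferred layout is then applied to rows that contain no seed entity at all. All names and data in the block are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical seed database: a few known values per attribute (assumption).
SEEDS = {
    "name": {"Blue Bistro", "Cafe Luna"},
    "city": {"Austin", "Boston"},
}

def split_rows(lines, sep="|"):
    """Split raw list items into trimmed fields (assumes a known delimiter)."""
    return [[field.strip() for field in line.split(sep)] for line in lines]

def infer_layout(rows, seeds):
    """Vote, per column position, on the attribute it holds using seed matches."""
    votes = defaultdict(Counter)
    for row in rows:
        for pos, field in enumerate(row):
            for attr, values in seeds.items():
                if field in values:
                    votes[pos][attr] += 1
    # Layout homogeneity assumption: one attribute per position within a list.
    return {pos: counter.most_common(1)[0][0] for pos, counter in votes.items()}

def extract(rows, layout):
    """Apply the inferred layout to every row, including rows with no seed match."""
    return [
        {attr: row[pos] for pos, attr in layout.items() if pos < len(row)}
        for row in rows
    ]

# Two lists with different attribute orderings; the first has a noise column.
list_a = split_rows([
    "Blue Bistro | Austin | open late",
    "Taco Hut | Boston | cash only",
    "Green Fork | Denver | vegan",
])
list_b = split_rows([
    "Boston | Cafe Luna",
    "Austin | Red Door",
])

records = []
for rows in (list_a, list_b):
    layout = infer_layout(rows, SEEDS)
    records.extend(extract(rows, layout))

for record in records:
    print(record)
```

In this toy setting the seed matches are enough to recover a different column order for each list, and the unmatched third column of the first list is simply ignored as noise. A real system would also have to handle missing attributes, inconsistent delimiters, and approximate matching against the seeds, which this sketch does not attempt.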

    Published In

    WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
    February 2011
    870 pages
    ISBN:9781450304931
    DOI:10.1145/1935826
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. collective Bayesian models
    2. hidden Markov models
    3. incremental
    4. information extraction

    Qualifiers

    • Poster

    Acceptance Rates

    WSDM '11 Paper Acceptance Rate 83 of 372 submissions, 22%;
    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Cited By

    • (2020) A Similarity Function for HTML Lists. Proceedings of the Brazilian Symposium on Multimedia and the Web, 309-316. DOI: 10.1145/3428658.3430963. Online publication date: 30-Nov-2020
    • (2018) Navigating the Data Lake with DATAMARAN. Proceedings of the 2018 International Conference on Management of Data, 943-958. DOI: 10.1145/3183713.3183746. Online publication date: 27-May-2018
    • (2016) Joint repairs for web wrappers. 2016 IEEE 32nd International Conference on Data Engineering (ICDE), 1146-1157. DOI: 10.1109/ICDE.2016.7498320. Online publication date: May-2016
    • (2015) TEGRA. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 1713-1728. DOI: 10.1145/2723372.2723725. Online publication date: 27-May-2015
    • (2014) Trinity. IEEE Transactions on Knowledge and Data Engineering, 26:6, 1544-1556. DOI: 10.1109/TKDE.2013.161. Online publication date: 1-Jun-2014
    • (2013) Knowledge harvesting in the big-data era. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 933-938. DOI: 10.1145/2463676.2463724. Online publication date: 22-Jun-2013
    • (2013) A Survey on Region Extractors from Web Documents. IEEE Transactions on Knowledge and Data Engineering, 25:9, 1960-1981. DOI: 10.1109/TKDE.2012.135. Online publication date: 1-Sep-2013
    • (2013) Knowledge harvesting from text and Web sources. Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013), 1250-1253. DOI: 10.1109/ICDE.2013.6544916. Online publication date: 8-Apr-2013
    • (2013) TEX. Knowledge-Based Systems, 39, 109-123. DOI: 10.1016/j.knosys.2012.10.009. Online publication date: 1-Feb-2013
    • (2012) An analysis of structured data on the web. Proceedings of the VLDB Endowment, 5:7, 680-691. DOI: 10.14778/2180912.2180920. Online publication date: 1-Mar-2012
