DOI:10.1145/1772690.1772826
poster

Exploiting content redundancy for web information extraction

Published: 26 April 2010

Abstract

We propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. We start by populating a seed database with records extracted from a few initial sites. We then identify values within the pages of each new site that match attribute values contained in the seed set of records. To filter out noisy attribute value matches, we exploit the fact that attribute values occur at fixed positions within template-based sites. We develop an efficient Apriori-style algorithm to systematically enumerate attribute position configurations with sufficient matching values across pages. Finally, we conduct an extensive experimental study with real-life web data to demonstrate the effectiveness of our extraction approach.
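The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes each page is a dict from a position label (for example a simplified path) to a text value, that seed records are dicts from attribute name to value, and that matching is exact string equality. The names `page_supports`, `frequent_configs`, `min_support`, and all sample data are hypothetical. A configuration is a set of (attribute, position) pairs; a page supports it when a single seed record agrees with the page at every listed position, and configurations are grown level by level with Apriori-style pruning.

```python
from itertools import combinations


def page_supports(config, page, seed):
    """True if one seed record agrees with the page on every (attr, pos) pair."""
    return any(
        all(attr in rec and page.get(pos) == rec[attr] for attr, pos in config)
        for rec in seed
    )


def support(config, pages, seed):
    """Number of pages that support the configuration."""
    return sum(page_supports(config, page, seed) for page in pages)


def frequent_configs(pages, seed, min_support):
    """Level-wise enumeration of configurations with support >= min_support."""
    attrs = sorted({a for rec in seed for a in rec})
    positions = sorted({p for page in pages for p in page})

    # Level 1: single (attribute, position) pairs.
    level = {}
    for a in attrs:
        for p in positions:
            cfg = frozenset([(a, p)])
            s = support(cfg, pages, seed)
            if s >= min_support:
                level[cfg] = s

    result = {}
    while level:
        result.update(level)
        next_level = {}
        for c1, c2 in combinations(level, 2):
            cand = c1 | c2
            if len(cand) != len(c1) + 1 or cand in next_level:
                continue
            # Each attribute and each position may be used at most once.
            if len({a for a, _ in cand}) != len(cand):
                continue
            if len({p for _, p in cand}) != len(cand):
                continue
            # Apriori pruning: every smaller sub-configuration must be frequent.
            if any(cand - {pair} not in level for pair in cand):
                continue
            s = support(cand, pages, seed)
            if s >= min_support:
                next_level[cand] = s
        level = next_level
    return result


# Hypothetical seed database and pages from one new template-based site.
seed = [
    {"title": "Dune", "author": "Frank Herbert"},
    {"title": "Emma", "author": "Jane Austen"},
    {"title": "Ulysses", "author": "James Joyce"},
]
pages = [
    {"/h1": "Dune", "/div[1]": "Frank Herbert", "/div[2]": "Emma"},
    {"/h1": "Emma", "/div[1]": "Jane Austen", "/div[2]": "Dune"},
    {"/h1": "Ulysses", "/div[1]": "James Joyce", "/div[2]": "Free shipping"},
    {"/h1": "Walden", "/div[1]": "Henry Thoreau", "/div[2]": "Ulysses"},
]

result = frequent_configs(pages, seed, min_support=3)
# Prefer the largest configuration, breaking ties by support.
best = max(result, key=lambda c: (len(c), result[c]))
```

In this toy data, titles of other books also appear at `/div[2]`, so the pair (title, `/div[2]`) looks frequent when considered alone. It drops out once it has to agree with the author position on the same seed record, which is one way to read the abstract's point about filtering noisy matches through fixed positions.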

    Published In

    WWW '10: Proceedings of the 19th international conference on World wide web
    April 2010
    1407 pages
    ISBN:9781605587998
    DOI:10.1145/1772690

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. content redundancy
    2. information extraction

    Qualifiers

    • Poster

    Conference

    WWW '10
    WWW '10: The 19th International World Wide Web Conference
    April 26 - 30, 2010
    Raleigh, North Carolina, USA

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Article Metrics

    • Downloads (Last 12 months): 1
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 05 Mar 2025

    Cited By

    • (2019) RED: Redundancy-Driven Data Extraction from Result Pages? The World Wide Web Conference (605-615). DOI:10.1145/3308558.3313529. Online publication date: 13-May-2019
    • (2018) Employing Semantic Context for Sparse Information Extraction Assessment. ACM Transactions on Knowledge Discovery from Data 12:5 (1-36). DOI:10.1145/3201407. Online publication date: 27-Jun-2018
    • (2018) Big Data Linkage for Product Specification Pages. Proceedings of the 2018 International Conference on Management of Data (67-81). DOI:10.1145/3183713.3183757. Online publication date: 27-May-2018
    • (2017) R-Extractor: A Method for Data Extraction from Template-Based Entity-Pages. 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC) (778-787). DOI:10.1109/COMPSAC.2017.202. Online publication date: Jul-2017
    • (2017) Orion: A Cypher-Based Web Data Extractor. Database and Expert Systems Applications (275-289). DOI:10.1007/978-3-319-64468-4_21. Online publication date: 1-Aug-2017
    • (2016) Robust and Noise Resistant Wrapper Induction. Proceedings of the 2016 International Conference on Management of Data (773-784). DOI:10.1145/2882903.2915214. Online publication date: 26-Jun-2016
    • (2016) MAVE. IEEE Transactions on Knowledge and Data Engineering 28:9 (2393-2406). DOI:10.1109/TKDE.2016.2573302. Online publication date: 1-Sep-2016
    • (2015) IBEX. Proceedings of the 18th International Workshop on Web and Databases (13-19). DOI:10.1145/2767109.2767116. Online publication date: 31-May-2015
    • (2014) DIADEM. Proceedings of the VLDB Endowment 7:14 (1845-1856). DOI:10.14778/2733085.2733091. Online publication date: 1-Oct-2014
    • (2014) An analysis of duplicate on web extracted objects. Proceedings of the 23rd International Conference on World Wide Web (1279-1284). DOI:10.1145/2567948.2579708. Online publication date: 7-Apr-2014
