DOI: 10.1145/2187980.2187991
research-article

Data extraction from web pages based on structural-semantic entropy

Published: 16 April 2012

ABSTRACT

Most of today's web content is designed for human consumption, which makes it difficult for software tools to access readily. Even web content that is automatically generated from back-end databases is usually presented without the original structural information. In this paper, we present an automated information extraction algorithm that extracts the relevant attribute-value pairs from product descriptions across different sites. To locate the data of interest on web pages, we use a notion called structural-semantic entropy, which measures the density of occurrence of relevant information on the DOM tree representation of a web page. Our approach is less labor-intensive and insensitive to changes in web-page format. Experimental results on a large number of real-life web page collections are encouraging and confirm the feasibility of the approach. Because it can associate the attributes of records with their respective values, the approach has been successfully applied to detecting false drug advertisements on the web.
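
The abstract does not define structural-semantic entropy, so the following is only a rough illustration of the general idea, not the paper's algorithm. It assumes that each leaf of a DOM-like tree can be tagged with an attribute label by simple keyword matching, and that the score of a node is the Shannon entropy of the labels found beneath it. The `Node` class, the `LEXICON` keywords, and the tie-breaking rule in `best_node` are all hypothetical choices made for this sketch.

```python
from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical keyword lexicon: attribute name -> trigger substrings.
LEXICON = {
    "price": {"price", "$"},
    "brand": {"brand", "manufacturer"},
    "weight": {"weight", "kg"},
}


@dataclass
class Node:
    """A minimal stand-in for a DOM node (assumption, not a real DOM API)."""
    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


def label(text: str) -> str | None:
    """Return the first attribute whose keyword occurs in the text, if any."""
    lowered = text.lower()
    for attribute, keywords in LEXICON.items():
        if any(keyword in lowered for keyword in keywords):
            return attribute
    return None


def leaf_labels(node: Node) -> list[str]:
    """Collect attribute labels of all labelled leaves under a node."""
    if not node.children:
        found = label(node.text)
        return [found] if found else []
    labels: list[str] = []
    for child in node.children:
        labels.extend(leaf_labels(child))
    return labels


def entropy(node: Node) -> float:
    """Shannon entropy (in bits) of the attribute labels under a node."""
    counts = Counter(leaf_labels(node))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_node(root: Node) -> Node:
    """Pick the highest-entropy node, preferring deeper nodes on ties."""
    best, best_key = root, (-1.0, -1)
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        key = (round(entropy(node), 9), depth)
        if key > best_key:
            best, best_key = node, key
        stack.extend((child, depth + 1) for child in node.children)
    return best


# Toy page: a navigation bar plus a product description block.
product = Node("ul", children=[
    Node("li", "Price: $12.99"),
    Node("li", "Brand: Acme"),
    Node("li", "Weight: 2 kg"),
])
nav = Node("div", children=[Node("a", "Home"), Node("a", "About")])
page = Node("body", children=[nav, product])

print(best_node(page).tag, entropy(product))
```

In this toy page the navigation block contains no labelled leaves and scores zero, while the product list mixes three different attributes and scores highest, so it is selected as the data-rich region. A real system would need a far richer labelling step than substring matching.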


        • Published in

          WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web
          April 2012
          1250 pages
          ISBN: 9781450312301
          DOI: 10.1145/2187980

          Copyright © 2012 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
