Data extraction from web pages based on structural-semantic entropy

Authors:
Xiaoqing Zheng, Fudan University, Shanghai, China
Yiling Gu, Fudan University, Shanghai, China
Yinsheng Li, Fudan University, Shanghai, China

WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web, April 2012, Pages 93–102, https://doi.org/10.1145/2187980.2187991

Published: 16 April 2012

ABSTRACT

Most of today's web content is designed for human consumption, which makes it difficult for software tools to access it readily. Even web content that is automatically generated from back-end databases is usually presented without the original structural information. In this paper, we present an automated information extraction algorithm that can extract the relevant attribute-value pairs from product descriptions across different sites. A notion called structural-semantic entropy, which measures the density of occurrence of relevant information on the DOM tree representation of web pages, is used to locate the data of interest. Our approach is less labor-intensive and is insensitive to changes in web-page format. Experimental results on a large number of real-life web page collections are encouraging and confirm the feasibility of the approach. Because it can associate the attributes of records with their respective values, the approach has also been successfully applied to detect false drug advertisements on the web.
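
The abstract does not give the formula for structural-semantic entropy, so the following is only an illustrative sketch of the general idea of scoring DOM subtrees by how relevant labels are distributed over their leaves. It assumes that each leaf has already been tagged with a hypothetical semantic role (for example by keyword matching) and uses plain Shannon entropy over those roles; the `Node` class, the `role` field, and both function names are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Simplified DOM node (hypothetical); `role` is set only on labeled leaves."""
    tag: str
    role: Optional[str] = None  # assumed semantic label, e.g. "name" or "price"
    children: List["Node"] = field(default_factory=list)


def leaf_roles(node: Node) -> List[str]:
    """Collect the role of every leaf under `node`; unlabeled leaves count as 'none'."""
    if not node.children:
        return [node.role or "none"]
    roles: List[str] = []
    for child in node.children:
        roles.extend(leaf_roles(child))
    return roles


def role_entropy(node: Node) -> float:
    """Shannon entropy, in bits, of the role distribution over the leaves under `node`."""
    counts = Counter(leaf_roles(node))
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A subtree mixing two different roles versus a subtree of unlabeled links.
product = Node("div", children=[Node("span", role="name"), Node("span", role="price")])
nav = Node("ul", children=[Node("li"), Node("li"), Node("li")])

print(role_entropy(product))  # two equally frequent roles
print(role_entropy(nav))      # a single role ("none") only
```

In this toy setup, a subtree whose leaves carry several different roles scores higher than one whose leaves are all unlabeled, which is one simple way to separate a product description block from navigation markup. The measure defined in the paper may differ.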

Index Terms

Data extraction from web pages based on structural-semantic entropy
1. Applied computing
  1. Operations research
    1. Decision analysis
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
  2. Information systems applications
    1. Decision support systems

Published in
WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web
April 2012, 1250 pages
ISBN: 9781450312301
DOI: 10.1145/2187980
General Chairs: Alain Mille (Université de Lyon, France), Fabien Gandon (INRIA, France), Jacques Misselis (HP, France)
Program Chairs: Michael Rabinovich (Case Western Reserve University, USA), Steffen Staab (University of Koblenz-Landau, Germany)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
false advertisement detection
structural-semantic entropy
web information extraction
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%