ABSTRACT
Most of today's web content is designed for human consumption, which makes it difficult for software tools to access it readily. Even web content that is automatically generated from back-end databases is usually presented without the original structural information. In this paper, we present an automated information extraction algorithm that extracts the relevant attribute-value pairs from product descriptions across different sites. To locate the data of interest on web pages, we use a notion called structural-semantic entropy, which measures the density of occurrence of relevant information on the DOM tree representation of a web page. Our approach is insensitive to changes in web-page format and is less labor-intensive. Experimental results on a large number of real-life web page collections are encouraging and confirm the feasibility of the approach. Because it can associate the attributes of records with their respective values, the approach has also been successfully applied to detect false drug advertisements on the web.
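The abstract does not give the formula for structural-semantic entropy, so the following is only a minimal sketch of one plausible reading, not the authors' definition. It assumes that leaf text is tagged with semantic roles by a hypothetical keyword lexicon (`ROLE_KEYWORDS`), that a DOM node is scored by the Shannon entropy of the roles found beneath it, and that the data region is the deepest node with the highest score. The `Node` class, the lexicon, and the selection rule are all illustrative assumptions.

```python
import math
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical lexicon: semantic role -> trigger substrings (an assumption,
# not taken from the paper).
ROLE_KEYWORDS = {
    "price": ("price", "$"),
    "brand": ("brand", "manufacturer"),
    "weight": ("weight", "oz", "kg"),
}


@dataclass
class Node:
    """A simplified stand-in for a DOM node."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def leaf_roles(node):
    """Return the semantic roles matched by the leaf text under `node`."""
    if not node.children:
        lowered = node.text.lower()
        return [role for role, words in ROLE_KEYWORDS.items()
                if any(word in lowered for word in words)]
    roles = []
    for child in node.children:
        roles.extend(leaf_roles(child))
    return roles


def entropy(node):
    """Shannon entropy (in bits) of the role distribution under `node`."""
    counts = Counter(leaf_roles(node))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_region(root):
    """Pick the node with the highest entropy, preferring deeper nodes on ties."""
    def key(node, depth):
        # Rounding keeps floating-point noise from breaking ties arbitrarily.
        return (round(entropy(node), 9), depth)

    best, best_key = root, key(root, 0)
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        node_key = key(node, depth)
        if node_key > best_key:
            best, best_key = node, node_key
        stack.extend((child, depth + 1) for child in node.children)
    return best


# A toy page: a navigation block with no relevant roles and a product table.
page = Node("body", children=[
    Node("div", children=[Node("a", text="Home"), Node("a", text="Contact")]),
    Node("table", children=[
        Node("tr", children=[Node("td", text="Brand: Acme")]),
        Node("tr", children=[Node("td", text="Price: $9.99")]),
        Node("tr", children=[Node("td", text="Weight: 2 kg")]),
    ]),
])

print(best_region(page).tag)
```

On this toy page the `table` subtree is selected: it carries the same role mix as the whole `body` but is a tighter enclosure, while the navigation block contributes no roles at all. A real system would need a richer lexicon and a score that also accounts for structural regularity.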
Index Terms
- Data extraction from web pages based on structural-semantic entropy