research-article

Automatic extraction of clickable structured web contents for name entity queries

Authors:

Yi-Chin Tu

WWW '10: Proceedings of the 19th international conference on World wide web

Pages 991 - 1000

https://doi.org/10.1145/1772690.1772791

Published: 26 April 2010

Abstract

Today the major web search engines answer queries by showing ten result snippets, which users must inspect to identify relevant results. In this paper we investigate how to extract structured information from the web in order to answer queries directly by showing the contents being searched for. We treat users' search trails (i.e., post-search browsing behaviors) as implicit labels on the relevance between web contents and user queries. Based on such labels we use an information extraction approach to build wrappers and extract structured information. An important observation is that many web sites contain pages for name entities of certain categories (e.g., AOL Music contains a page for each musician), and these pages share the same format. This makes it possible to build wrappers from a small number of implicit labels and use them to extract structured information from many web pages for different name entities. We propose STRUCLICK, a fully automated system for extracting structured information for queries containing name entities of certain categories. It can identify important web sites from web search logs, build wrappers from users' search trails, filter out bad wrappers built from random user clicks, and combine structured information from different web sites for each query. Compared with existing approaches to information extraction, STRUCLICK can assign semantics to extracted data without any human labeling or supervision. We perform comprehensive experiments, which show that STRUCLICK achieves high accuracy and good scalability.
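The idea of building a wrapper from a few implicit click labels and reusing it across same-format pages can be illustrated with a toy sketch. This is not the STRUCLICK algorithm, whose details the abstract does not give. It assumes pages are already parsed into a hypothetical `{tag_path: text}` mapping, treats a "wrapper" as a single tag path, and uses an illustrative `min_support` threshold to discard paths backed by too few clicks. All names and data below are made up for illustration.

```python
from collections import Counter


def induce_wrapper(pages, clicks, min_support=2):
    """Pick the tag path that most often holds the clicked text.

    pages:  {page_id: {tag_path: text}}  (hypothetical pre-parsed pages)
    clicks: [(page_id, clicked_text)]    (implicit labels from search trails)
    Returns a tag path, or None if no path reaches min_support.
    """
    votes = Counter()
    for page_id, clicked_text in clicks:
        for path, text in pages.get(page_id, {}).items():
            if text == clicked_text:
                votes[path] += 1
    if not votes:
        return None
    path, support = votes.most_common(1)[0]
    # Paths backed by too few clicks are treated as noise and dropped.
    return path if support >= min_support else None


def apply_wrapper(pages, path):
    """Extract the text at `path` from every page that contains it."""
    return {pid: fields[path] for pid, fields in pages.items() if path in fields}


# Three pages from one hypothetical site, all sharing the same template.
pages = {
    "artist/a": {"html/body/h1": "Artist A", "html/body/ul/li[1]": "Song One"},
    "artist/b": {"html/body/h1": "Artist B", "html/body/ul/li[1]": "Song Two"},
    "artist/c": {"html/body/h1": "Artist C", "html/body/ul/li[1]": "Song Three"},
}
# Two consistent clicks on song entries plus one stray click on a heading.
clicks = [
    ("artist/a", "Song One"),
    ("artist/b", "Song Two"),
    ("artist/a", "Artist A"),
]

wrapper = induce_wrapper(pages, clicks)
extracted = apply_wrapper(pages, wrapper)
```

In this toy setup the two consistent clicks outvote the stray one, and the resulting path also extracts content from the page that received no clicks at all, which is the sense in which a small number of labels can cover many pages.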

Index Terms

Automatic extraction of clickable structured web contents for name entity queries
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published In

WWW '10: Proceedings of the 19th international conference on World wide web

April 2010

1407 pages

ISBN:9781605587998

DOI:10.1145/1772690

General Chairs:
Michael Rappa, North Carolina State University, USA
Paul Jones, University of North Carolina at Chapel Hill, USA

Program Chairs:
Juliana Freire, University of Utah, USA
Soumen Chakrabarti, Indian Institute of Technology, India

Copyright © 2010 International World Wide Web Conference Committee (IW3C2).

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WWW '10

WWW '10: The 19th International World Wide Web Conference

April 26 - 30, 2010

Raleigh, North Carolina, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

Jiang J, Liu J, Lin C, Cheng P, Bailey J, Moffat A, Aggarwal C, de Rijke M, Kumar R, Murdock V, Sellis T, Yu J (2015). Improving Ranking Consistency for Web Search by Leveraging a Knowledge Base and Search Logs. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1441-1450. Online publication date: 17-Oct-2015.
https://dl.acm.org/doi/10.1145/2806416.2806479
Saha Roy R, Katare R, Ganguly N, Laxman S, Choudhury M (2015). Discovering and understanding word level user intent in Web search queries. Web Semantics: Science, Services and Agents on the World Wide Web, 30:C, pp. 22-38. Online publication date: 1-Jan-2015.
https://dl.acm.org/doi/10.1016/j.websem.2014.07.010
Saha Roy R, Suresh A, Ganguly N, Choudhury M, Schwabe D, Almeida V, Glaser H, Baeza-Yates R, Moon S (2013). Place value. Proceedings of the 22nd International Conference on World Wide Web, pp. 153-154. Online publication date: 13-May-2013.
https://dl.acm.org/doi/10.1145/2487788.2487862
Gupta M, Han J (2011). Heterogeneous network-based trust analysis. ACM SIGKDD Explorations Newsletter, 13:1, pp. 54-71. Online publication date: 31-Aug-2011.
https://dl.acm.org/doi/10.1145/2031331.2031341
Yin X, Tan W, Liu C, Sadagopan S, Ramamritham K, Kumar A, Ravindra M, Bertino E, Kumar R (2011). FACTO. Proceedings of the 20th international conference on World wide web, pp. 507-516. Online publication date: 28-Mar-2011.
https://dl.acm.org/doi/10.1145/1963405.1963477
Yin X, Tan W, Sadagopan S, Ramamritham K, Kumar A, Ravindra M, Bertino E, Kumar R (2011). Semi-supervised truth discovery. Proceedings of the 20th international conference on World wide web, pp. 217-226. Online publication date: 28-Mar-2011.
https://dl.acm.org/doi/10.1145/1963405.1963439
Roy R, Katare R, Ganguly N, Laxman S, Choudhury M (n.d.). Discovering and Understanding Word Level User Intent in Web Search Queries. SSRN Electronic Journal.
https://doi.org/10.2139/ssrn.3199173
