DOI: 10.1145/1772690.1772791

Automatic extraction of clickable structured web contents for name entity queries

Published: 26 April 2010

Abstract

Today the major web search engines answer queries by showing ten result snippets, which users must inspect to identify relevant results. In this paper we investigate how to extract structured information from the web in order to answer queries directly by showing the contents being searched for. We treat users' search trails (i.e., post-search browsing behaviors) as implicit labels on the relevance between web contents and user queries. Based on such labels, we use an information extraction approach to build wrappers and extract structured information. An important observation is that many web sites contain pages for name entities of certain categories (e.g., AOL Music contains a page for each musician), and these pages have the same format. This makes it possible to build wrappers from a small number of implicit labels and use them to extract structured information from many web pages for different name entities. We propose STRUCLICK, a fully automated system for extracting structured information for queries containing name entities of certain categories. It can identify important web sites from web search logs, build wrappers from users' search trails, filter out bad wrappers built from random user clicks, and combine structured information from different web sites for each query. Compared with existing approaches to information extraction, STRUCLICK can assign semantics to extracted data without any human labeling or supervision. We perform comprehensive experiments, which show that STRUCLICK achieves high accuracy and good scalability.
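
The idea of building wrappers from implicit click labels and filtering out those built from random clicks can be illustrated with a minimal sketch. This is not the STRUCLICK algorithm itself, whose details are not given in the abstract. It assumes a hypothetical page representation as a list of (tag path, text) pairs, treats a wrapper as a bare tag path, and uses a simple click-count threshold (`min_support`) as a stand-in for the paper's wrapper filtering. All names and data below are invented for illustration.

```python
from collections import Counter

def build_wrappers(pages, clicks, min_support=2):
    """Return tag paths whose elements were clicked at least min_support times.

    pages:  {page_id: [(tag_path, text), ...]}  (hypothetical representation)
    clicks: [(page_id, clicked_text), ...]      (implicit labels from search trails)
    """
    support = Counter()
    for page_id, clicked_text in clicks:
        for tag_path, text in pages.get(page_id, []):
            if text == clicked_text:
                support[tag_path] += 1
    # Paths supported by only a stray click are treated as bad wrappers and dropped.
    return {path for path, count in support.items() if count >= min_support}

def extract(page, wrappers):
    """Apply the learned tag paths to a page that shares the same format."""
    return [text for tag_path, text in page if tag_path in wrappers]

# Two same-format pages from one site, plus three observed clicks (invented data).
pages = {
    "artist/a": [("div.songs/li/a", "Song A1"),
                 ("div.songs/li/a", "Song A2"),
                 ("div.footer/a", "Privacy")],
    "artist/b": [("div.songs/li/a", "Song B1"),
                 ("div.footer/a", "Privacy")],
}
clicks = [("artist/a", "Song A1"), ("artist/b", "Song B1"), ("artist/a", "Privacy")]

wrappers = build_wrappers(pages, clicks)

# A page for a different entity that received no clicks at all.
unseen = [("div.songs/li/a", "Song C1"),
          ("div.songs/li/a", "Song C2"),
          ("div.footer/a", "Privacy")]
songs = extract(unseen, wrappers)
```

In this toy setting the single click on the footer link falls below the threshold, so only the song-list path survives and is then applied to the unclicked page. This mirrors the abstract's point that a few labels on same-format pages can generalize to many entities.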

    Published In

    WWW '10: Proceedings of the 19th international conference on World wide web
    April 2010
    1407 pages
    ISBN:9781605587998
    DOI:10.1145/1772690

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. information extraction
    2. web search

    Qualifiers

    • Research-article

    Conference

    WWW '10: The 19th International World Wide Web Conference
    April 26 - 30, 2010
    Raleigh, North Carolina, USA

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Article Metrics

    • Downloads (Last 12 months): 5
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 17 Feb 2025

    Cited By

    • (2015) Improving Ranking Consistency for Web Search by Leveraging a Knowledge Base and Search Logs. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1441-1450. DOI: 10.1145/2806416.2806479. Online publication date: 17-Oct-2015
    • (2015) Discovering and understanding word level user intent in Web search queries. Web Semantics: Science, Services and Agents on the World Wide Web, 30:C, pp. 22-38. DOI: 10.1016/j.websem.2014.07.010. Online publication date: 1-Jan-2015
    • (2013) Place value. Proceedings of the 22nd International Conference on World Wide Web, pp. 153-154. DOI: 10.1145/2487788.2487862. Online publication date: 13-May-2013
    • (2011) Heterogeneous network-based trust analysis. ACM SIGKDD Explorations Newsletter, 13:1, pp. 54-71. DOI: 10.1145/2031331.2031341. Online publication date: 31-Aug-2011
    • (2011) FACTO. Proceedings of the 20th international conference on World wide web, pp. 507-516. DOI: 10.1145/1963405.1963477. Online publication date: 28-Mar-2011
    • (2011) Semi-supervised truth discovery. Proceedings of the 20th international conference on World wide web, pp. 217-226. DOI: 10.1145/1963405.1963439. Online publication date: 28-Mar-2011
    • (n.d.) Discovering and Understanding Word Level User Intent in Web Search Queries. SSRN Electronic Journal. DOI: 10.2139/ssrn.3199173
