Article

Clustering web documents with tables for information extraction

Authors: Kostyantyn Shchekotykhin, Dietmar Jannach, Gerhard Friedrich

K-CAP '07: Proceedings of the 4th international conference on Knowledge capture

Pages 169 - 170

https://doi.org/10.1145/1298406.1298438

Published: 28 October 2007

Abstract

One common approach to extracting high-quality knowledge from Web sources is to exploit the redundancy of the published information. A Web mining system therefore not only has to search for relevant Web pages but must also determine whether two pages describe the same entity, in order to extract as much knowledge as possible about it. Statistical clustering techniques have been shown to be generally well suited to this task, because they group documents that are expected to contain similar information. However, when data is given in tabular form (for instance, a typical way of describing items in online shops), existing document clustering algorithms perform poorly, because documents containing tabular descriptions typically share a large common set of tokens even though they describe different entities. In this paper we therefore propose a new document clustering approach that exploits hyperlinks and document metadata to extract candidates for entity names. These candidate names are subsequently used to cluster the documents and are refined further in the process; the refined names are finally used to determine whether two documents describe the same entity. A detailed evaluation of our approach in two popular example domains showed high accuracy in terms of precision and recall (F-measure > 0.9).
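The problem described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pages, the field names `title`, `anchor`, and `body`, the 0.5 threshold, and the use of Jaccard token overlap are all assumptions made for illustration, and the greedy grouping is a simple stand-in for the clustering step. The sketch only shows why token overlap on tabular bodies separates entities poorly, while candidate names taken from metadata and link text separate them more clearly.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical data: 'title' stands in for document metadata, 'anchor' for the
# text of a hyperlink pointing to the page, 'body' for the tabular description.
SPEC = "Sensor resolution: {} megapixels; Optical zoom: {}; Weight: {} g; Display size: {} inches"
pages = {
    "shop-a/a570": {"title": "Canon PowerShot A570 IS",
                    "anchor": "PowerShot A570 IS",
                    "body": SPEC.format("7.1", "4x", "175", "2.5")},
    "shop-b/a570": {"title": "Canon PowerShot A570 IS Digital Camera",
                    "anchor": "Canon A570 IS",
                    "body": SPEC.format("7.1", "4x", "175", "2.5")},
    "shop-a/l11": {"title": "Nikon Coolpix L11",
                   "anchor": "Coolpix L11",
                   "body": SPEC.format("6.0", "3x", "125", "2.4")},
}


def body_similarity(p, q):
    """Token overlap of the tabular descriptions (the baseline that struggles)."""
    return jaccard(tokens(p["body"]), tokens(q["body"]))


def name_similarity(p, q):
    """Token overlap of candidate names built from title and anchor text."""
    name_p = tokens(p["title"] + " " + p["anchor"])
    name_q = tokens(q["title"] + " " + q["anchor"])
    return jaccard(name_p, name_q)


def same_entity(p, q, threshold=0.5):
    """Assumed decision rule: candidate names overlap by at least `threshold`."""
    return name_similarity(p, q) >= threshold


def group_by_name(pages, threshold=0.5):
    """Greedy one-pass grouping: add each page to the first group that already
    contains a page with a matching name, otherwise start a new group.
    Groups are never merged afterwards, so this is a simplification."""
    groups = []
    for url, page in pages.items():
        for group in groups:
            if any(same_entity(page, pages[other], threshold) for other in group):
                group.append(url)
                break
        else:
            groups.append([url])
    return groups


if __name__ == "__main__":
    a, b, c = pages["shop-a/a570"], pages["shop-b/a570"], pages["shop-a/l11"]
    print("body overlap, different cameras:", round(body_similarity(a, c), 2))
    print("name overlap, different cameras:", round(name_similarity(a, c), 2))
    print("name overlap, same camera:", round(name_similarity(a, b), 2))
    print(group_by_name(pages))
```

With these toy inputs, the two different cameras share more than half of their body tokens because the attribute labels are identical, whereas their candidate names share none, so only the two pages for the same camera end up in one group.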

Index Terms

Clustering web documents with tables for information extraction
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Information

Published In

K-CAP '07: Proceedings of the 4th international conference on Knowledge capture

October 2007

216 pages

ISBN:9781595936431

DOI:10.1145/1298406

Editors: Derek Sleeman (University of Aberdeen, UK), Ken Barker (University of Texas at Austin)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tag

X-means algorithm

Qualifiers

Article

Conference

K-CAP07: International Conference on Knowledge Capture 2007

October 28 - 31, 2007

Whistler, BC, Canada

Acceptance Rates

Overall Acceptance Rate 55 of 198 submissions, 28%

Bibliometrics

Total Citations: 4
Total Downloads: 226
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 16 Feb 2025

Cited By

Shaker M, Ibrahim H, Mustapha A, Abdullah L, Kotsis G, Taniar D, Pardede E (2009). Information extraction from web tables. Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services, pages 470-476. DOI: 10.1145/1806338.1806426. Online publication date: 14-Dec-2009.
https://dl.acm.org/doi/10.1145/1806338.1806426

Jannach D, Shchekotykhin K, Friedrich G (2009). Automated ontology instantiation from tabular web sources - The AllRight system. Web Semantics: Science, Services and Agents on the World Wide Web, 7:3, pages 136-153. DOI: 10.1016/j.websem.2009.04.002. Online publication date: 1-Sep-2009.
https://dl.acm.org/doi/10.1016/j.websem.2009.04.002

Shchekotykhin K, Jannach D, Friedrich G (2009). xCrawl: a high-recall crawling method for Web mining. Knowledge and Information Systems, 25:2, pages 303-326. DOI: 10.1007/s10115-009-0266-3. Online publication date: 18-Nov-2009.
https://doi.org/10.1007/s10115-009-0266-3

Jannach D, Shchekotykhin K, Friedrich G (year not given). Automated Ontology Instantiation from Tabular Web Sources - The AllRight System. SSRN Electronic Journal. DOI: 10.2139/ssrn.3199423.
https://doi.org/10.2139/ssrn.3199423
