AutoFeed: an unsupervised learning system for generating webfeeds
Source: International Conference On Knowledge Capture
Proceedings of the 3rd international conference on Knowledge capture
Banff, Alberta, Canada
SESSION: Information extraction
Pages: 3 - 10
Year of Publication: 2005
ISBN:1-59593-163-5
Authors
Bora Gazen  Fetch Technologies, El Segundo, CA
Steven Minton  Fetch Technologies, El Segundo, CA
Sponsors
ACM: Association for Computing Machinery
SIGART: ACM Special Interest Group on Artificial Intelligence
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1088622.1088625

ABSTRACT

Our goal is to automatically extract data from semi-structured web sites. Previously, researchers have developed two types of supervised learning approaches for extracting web data: methods that create precise, site-specific extraction rules and methods that learn less-precise site-independent extraction rules. In either case, significant training is required. In this paper, we describe a third, more ambitious approach, where we use unsupervised learning to analyze sites and discover their structure. Our method relies on a set of heterogeneous "experts", each of which is capable of identifying certain types of generic structure. Each expert represents its discoveries as "hints". Based on these hints, our system clusters the pages and identifies semi-structured data that can be extracted. To identify a good clustering, we use a probabilistic model of the hint-generation process. The paper describes our formulation of the fully-automatic web-extraction problem, our clustering approach, and our results on a set of experiments.
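
The expert/hint/clustering pipeline summarized above can be illustrated with a toy sketch. This is not the authors' system: the two experts, the pairwise agreement model, the parameter `p_agree`, and every function name below are assumptions made purely for illustration. Each hypothetical expert emits hints (sets of pages it believes share structure), and a candidate clustering is scored by how likely those hints would be if each expert agreed with the clustering on any given pair of pages with probability `p_agree`.

```python
import math
import re
from itertools import combinations

# Hypothetical expert 1: pages whose URLs differ only in digits likely share a template.
def url_pattern_expert(pages):
    groups = {}
    for url in pages:
        key = re.sub(r"\d+", "#", url)  # collapse digit runs into a placeholder
        groups.setdefault(key, set()).add(url)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical expert 2: pages with the same sequence of opening tags likely share a template.
def tag_signature_expert(pages):
    groups = {}
    for url, html in pages.items():
        key = tuple(re.findall(r"<(\w+)", html))  # opening tag names only
        groups.setdefault(key, set()).add(url)
    return [g for g in groups.values() if len(g) > 1]

# Toy probabilistic model (an assumption, not the paper's model): for every pair
# of pages, each expert's hints agree with the clustering with probability p_agree.
def log_likelihood(clustering, hints_by_expert, p_agree=0.9):
    cluster_of = {page: i for i, cluster in enumerate(clustering) for page in cluster}
    score = 0.0
    for hints in hints_by_expert:
        for a, b in combinations(sorted(cluster_of), 2):
            hinted = any(a in h and b in h for h in hints)
            same = cluster_of[a] == cluster_of[b]
            score += math.log(p_agree if hinted == same else 1.0 - p_agree)
    return score

# Made-up example site with two item pages and one unrelated page.
pages = {
    "/item/1": "<html><body><table><tr><td>A</td></tr></table></body></html>",
    "/item/2": "<html><body><table><tr><td>B</td></tr></table></body></html>",
    "/about": "<html><body><p>About</p></body></html>",
}
hints_by_expert = [expert(pages) for expert in (url_pattern_expert, tag_signature_expert)]

# Candidate clusterings are enumerated by hand here; a real system would search for them.
candidates = [
    [{"/item/1", "/item/2"}, {"/about"}],
    [{"/item/1", "/item/2", "/about"}],
    [{"/item/1"}, {"/item/2"}, {"/about"}],
]
best = max(candidates, key=lambda c: log_likelihood(c, hints_by_expert))
print(best)
```

In this sketch, scoring every pair of pages (rather than only hinted pairs) is what keeps the model from trivially preferring a single cluster containing all pages.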


