AutoFeed: an unsupervised learning system for generating webfeeds
Source: International Conference On Knowledge Capture
Proceedings of the 3rd international conference on Knowledge capture
Banff, Alberta, Canada
SESSION: Information extraction
Pages: 3 - 10
Year of Publication: 2005
ISBN: 1-59593-163-5
ABSTRACT
Our goal is to automatically extract data from semi-structured web sites. Previously, researchers have developed two types of supervised learning approaches for extracting web data: methods that create precise, site-specific extraction rules and methods that learn less-precise site-independent extraction rules. In either case, significant training is required. In this paper, we describe a third, more ambitious approach, where we use unsupervised learning to analyze sites and discover their structure. Our method relies on a set of heterogeneous "experts", each of which is capable of identifying certain types of generic structure. Each expert represents its discoveries as "hints". Based on these hints, our system clusters the pages and identifies semi-structured data that can be extracted. To identify a good clustering, we use a probabilistic model of the hint-generation process. The paper describes our formulation of the fully automatic web-extraction problem, our clustering approach, and our results on a set of experiments.
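The abstract does not specify the experts or the probabilistic model, so the following is only a toy sketch of the general idea: experts emit hints, and a clustering is chosen to make those hints likely. Everything in it is an assumption made for illustration, including the two experts (`url_expert`, `layout_expert`), the pairwise hint format, the probabilities `p_in` and `p_out`, and the greedy merge search. It should not be read as the method used in the paper.

```python
import math
import re
from itertools import combinations


def url_expert(pages):
    """Hypothetical expert: hints that pages sharing a URL template belong together."""
    hints = set()
    for a, b in combinations(sorted(pages), 2):
        # Replace digit runs so "/item/1" and "/item/37" share the template "/item/#".
        if re.sub(r"\d+", "#", a) == re.sub(r"\d+", "#", b):
            hints.add((a, b))
    return hints


def layout_expert(pages):
    """Hypothetical expert: hints that pages with identical tag sequences belong together."""
    hints = set()
    for a, b in combinations(sorted(pages), 2):
        if pages[a] == pages[b]:
            hints.add((a, b))
    return hints


def log_likelihood(clusters, pairs, hint_sets, p_in=0.8, p_out=0.1):
    """Assumed model: each expert emits a hint for a page pair with probability
    p_in if the pair shares a cluster and p_out otherwise."""
    label = {page: i for i, cluster in enumerate(clusters) for page in cluster}
    total = 0.0
    for hints in hint_sets:
        for pair in pairs:
            p = p_in if label[pair[0]] == label[pair[1]] else p_out
            total += math.log(p if pair in hints else 1.0 - p)
    return total


def greedy_cluster(pages, hint_sets):
    """Start from singleton clusters and merge two clusters while the score improves."""
    pairs = list(combinations(sorted(pages), 2))
    clusters = [frozenset([p]) for p in sorted(pages)]
    best = log_likelihood(clusters, pairs, hint_sets)
    improved = True
    while improved:
        improved = False
        for c1, c2 in combinations(clusters, 2):
            candidate = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
            score = log_likelihood(candidate, pairs, hint_sets)
            if score > best:
                clusters, best, improved = candidate, score, True
                break
    return set(clusters)


# Made-up site: URL path -> flattened tag sequence.
pages = {
    "/item/1": "html body h1 table",
    "/item/2": "html body h1 table",
    "/item/37": "html body h1 table",
    "/about": "html body p p",
    "/contact": "html body p form",
}
hint_sets = [url_expert(pages), layout_expert(pages)]
result = greedy_cluster(pages, hint_sets)
```

On this made-up site both experts agree that the three item pages belong together, so the search groups them and leaves the other two pages as singletons. A greedy search is used here only for brevity and can get stuck in local optima on real data.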