Fully automatic wrapper generation for search engines
Source
International World Wide Web Conference
Proceedings of the 14th international conference on World Wide Web
Chiba, Japan
SESSION: Data extraction
Pages: 66 - 75
Year of Publication: 2005
ISBN: 1-59593-046-9
Authors
Hongkun Zhao, SUNY at Binghamton, Binghamton, NY
Weiyi Meng, SUNY at Binghamton, Binghamton, NY
Zonghuan Wu, Univ. of Louisiana at Lafayette, Lafayette, LA
Vijay Raghavan, Univ. of Louisiana at Lafayette, Lafayette, LA
Clement Yu, University of Illinois at Chicago, Chicago, IL
Bibliometrics
Downloads (6 Weeks): 24, Downloads (12 Months): 242, Citation Count: 14
ABSTRACT
When a query is submitted to a search engine, the search engine returns a dynamically generated result page containing the result records, each of which usually consists of a link to, and/or a snippet of, a retrieved Web page. Such a result page often also contains information irrelevant to the query, such as advertisements and information related to the site hosting the search engine. In this paper, we present a technique for automatically producing wrappers that extract search result records from the dynamically generated result pages returned by search engines. Automatic search result record extraction is important for many applications that need to interact with search engines, such as the automatic construction and maintenance of metasearch engines and deep Web crawling. The novel aspect of the proposed technique is that it utilizes both the visual content features of the result page as displayed in a browser and the tag structure of the page's HTML source file. Experimental results indicate that this technique can achieve very high extraction accuracy.
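To make the idea of a wrapper concrete, the following is a minimal sketch and not the technique proposed in the paper. It uses only HTML tag structure (no visual features) and assumes that result records are the links sharing the most frequent tag path on the page. The names `RecordCollector`, `extract_records`, and `SAMPLE_PAGE`, and the sample markup itself, are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class RecordCollector(HTMLParser):
    """Collects [tag_path, href, text] for every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.links = []      # one [tag_path, href, text] entry per link
        self.current = None  # link entry currently receiving text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.current = ["/".join(self.stack), href, ""]
            self.links.append(self.current)

    def handle_endtag(self, tag):
        if tag == "a":
            self.current = None
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.current is not None:
            self.current[2] += data


def extract_records(html):
    """Return (href, text) pairs for the largest group of links that
    share one tag path. Assumption: that group is the result list."""
    parser = RecordCollector()
    parser.feed(html)
    groups = defaultdict(list)
    for path, href, text in parser.links:
        groups[path].append((href, text.strip()))
    if not groups:
        return []
    return max(groups.values(), key=len)


# Hypothetical result page: navigation, three results, one advertisement.
SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<ul>
<li><a href="http://a.example/1">First result</a><p>snippet one</p></li>
<li><a href="http://a.example/2">Second result</a><p>snippet two</p></li>
<li><a href="http://a.example/3">Third result</a><p>snippet three</p></li>
</ul>
<div class="ad"><a href="http://ads.example">Buy now</a></div>
</body></html>
"""

records = extract_records(SAMPLE_PAGE)
```

On the sample page, the three links under `html/body/ul/li/a` outnumber the two under `html/body/div/a`, so the navigation and advertisement links are left out. A heuristic this simple would fail on pages where irrelevant links outnumber the results, which is one reason to combine tag structure with other evidence, such as the visual features the abstract mentions.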
CITED BY 14
Yiyao Lu, Zonghuan Wu, Hongkun Zhao, Weiyi Meng, King-Lup Liu, Vijay Raghavan, Clement Yu, MySearchView: a customized metasearch engine generator, Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 11-14, 2007, Beijing, China
Shuyi Zheng, Ruihua Song, Ji-Rong Wen, Di Wu, Joint optimization of wrapper generation and template detection, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
King-Lup Liu, Weiyi Meng, Jing Qiu, Clement Yu, Vijay Raghavan, Zonghuan Wu, Yiyao Lu, Hai He, Hongkun Zhao, AllInOneNews: development and evaluation of a large-scale news metasearch engine, Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 11-14, 2007, Beijing, China
Jun Zhu, Bo Zhang, Zaiqing Nie, Ji-Rong Wen, Hsiao-Wuen Hon, Webpage understanding: an integrated approach, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying Ma, Simultaneous record detection and attribute labeling in web data extraction, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl, Bernhard Pollak, Towards domain-independent information extraction from web tables, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada