Fully automatic wrapper generation for search engines
Full text: PDF (316 KB)
Source: International World Wide Web Conference
Proceedings of the 14th International Conference on World Wide Web
Chiba, Japan
SESSION: Data extraction
Pages: 66-75
Year of Publication: 2005
ISBN: 1-59593-046-9
Authors
Hongkun Zhao  SUNY at Binghamton, Binghamton, NY
Weiyi Meng  SUNY at Binghamton, Binghamton, NY
Zonghuan Wu  Univ. of Louisiana at Lafayette, Lafayette, LA
Vijay Raghavan  Univ. of Louisiana at Lafayette, Lafayette, LA
Clement Yu  University of Illinois at Chicago, Chicago, IL
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1060745.1060760

ABSTRACT

When a query is submitted to a search engine, the search engine returns a dynamically generated result page containing the result records, each of which usually consists of a link to a retrieved Web page, a snippet of it, or both. Such a result page often also contains information irrelevant to the query, such as advertisements and information related to the hosting site of the search engine. In this paper, we present a technique for automatically producing wrappers that can be used to extract search result records from dynamically generated result pages returned by search engines. Automatic search result record extraction is important for many applications that need to interact with search engines, such as the automatic construction and maintenance of metasearch engines and deep Web crawling. The novel aspect of the proposed technique is that it utilizes both the visual content features of the result page as displayed in a browser and the HTML tag structures of the HTML source file of the result page. Experimental results indicate that this technique can achieve very high extraction accuracy.
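
To make the idea of separating result records from surrounding page content concrete, the following is a minimal sketch that uses tag structure only. It is not the authors' algorithm: it ignores the visual features the abstract mentions, and it rests on the simplifying assumption that result records are the links sharing the most frequent tag path on the page. The names `RecordCollector`, `extract_records`, and `SAMPLE_PAGE` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class RecordCollector(HTMLParser):
    """Groups links by their tag path (illustrative helper, not the paper's method)."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open tags from the root to the current element
        self.links = defaultdict(list)   # tag path of an <a> -> [(href, anchor text), ...]
        self._current = None             # [path, href, text parts] of the <a> being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = ["/".join(self.stack), href, []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            path, href, parts = self._current
            self.links[path].append((href, "".join(parts).strip()))
            self._current = None
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def extract_records(html):
    """Return (href, text) pairs for the links on the most frequent tag path."""
    collector = RecordCollector()
    collector.feed(html)
    collector.close()
    if not collector.links:
        return []
    # Assumption: the tag path that repeats most often holds the result records.
    best_path = max(collector.links, key=lambda p: len(collector.links[p]))
    return collector.links[best_path]


SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<ul>
<li><a href="http://a.example/1">First result</a><br>snippet one</li>
<li><a href="http://a.example/2">Second result</a><br>snippet two</li>
<li><a href="http://a.example/3">Third result</a><br>snippet three</li>
</ul>
<div id="ad"><a href="/ad">Sponsored</a></div>
</body></html>
"""

if __name__ == "__main__":
    for href, text in extract_records(SAMPLE_PAGE):
        print(href, text)
```

On the sample page, the three links under `html/body/ul/li/a` outnumber the two under `html/body/div/a`, so the navigation and advertisement links are left out. A heuristic this simple fails when advertisements repeat more often than results, which is the kind of case where additional evidence such as visual layout would be needed.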


