Webpage understanding: an integrated approach
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
San Jose, California, USA
SESSION: Research track papers
Pages: 903-912
Year of Publication: 2007
ISBN: 978-1-59593-609-7
Authors
Jun Zhu, Tsinghua University
Bo Zhang, Tsinghua University
Zaiqing Nie, Microsoft Research Asia
Ji-Rong Wen, Microsoft Research Asia
Hsiao-Wuen Hon, Microsoft Research Asia
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1281192.1281288

ABSTRACT

Recent work has shown that layout and tag-tree structure are effective for segmenting webpages and labeling HTML elements. However, how to segment and label the text content inside HTML elements remains an open problem. Because much of the text on a webpage consists of fragments that are not strictly grammatical, traditional natural language processing techniques, which typically expect grammatical sentences, are not directly applicable. In this paper, we examine how to use layout and tag-tree structure in a principled way to help understand text content on webpages. We propose to segment and label the page structure and the text content of a webpage in a joint discriminative probabilistic model. In this model, semantic labels of the page structure can be leveraged to help understand text content, and semantic labels of the text phrases can be used in page structure understanding tasks such as data record detection. Integrating page structure and text content understanding thus yields a unified approach to webpage understanding. Experimental results on research homepage extraction show the feasibility and promise of our approach.
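
The idea of labeling page structure and text content jointly, so that each can inform the other, can be illustrated with a toy sketch. This is not the model from the paper: the label sets, the scores, the compatibility table, and the exhaustive search are all assumptions made up for illustration. It only shows how a joint score over an element label and its token labels can reach a different decision than labeling each part independently.

```python
from itertools import product

# Hypothetical evidence scores (assumed for illustration, not from the paper).
# Layout evidence for one HTML element slightly prefers "contact".
BLOCK_EVIDENCE = {"contact": 0.6, "publication": 0.5}

# Text evidence for each token inside that element.
TOKEN_EVIDENCE = [
    {"name": 0.4, "title": 0.5, "other": 0.0},
    {"name": 0.1, "title": 0.9, "other": 0.0},
]

# Hypothetical compatibility between an element label and a token label.
COMPAT = {("contact", "name"): 1.0, ("publication", "title"): 1.0}


def independent_decode(block_evidence, token_evidence):
    """Label the element and each token separately, ignoring interactions."""
    block = max(block_evidence, key=block_evidence.get)
    tokens = [max(scores, key=scores.get) for scores in token_evidence]
    return block, tokens


def joint_decode(block_evidence, token_evidence):
    """Pick the (element label, token labels) pair with the highest joint score.

    Exhaustive search is used only because the example is tiny.
    """
    token_labels = list(token_evidence[0].keys())
    best_score, best_block, best_tokens = None, None, None
    for block in block_evidence:
        for labels in product(token_labels, repeat=len(token_evidence)):
            score = block_evidence[block]
            for scores, label in zip(token_evidence, labels):
                # Text evidence plus structure/text compatibility.
                score += scores[label] + COMPAT.get((block, label), 0.0)
            if best_score is None or score > best_score:
                best_score, best_block, best_tokens = score, block, list(labels)
    return best_block, best_tokens


if __name__ == "__main__":
    print("independent:", independent_decode(BLOCK_EVIDENCE, TOKEN_EVIDENCE))
    print("joint:      ", joint_decode(BLOCK_EVIDENCE, TOKEN_EVIDENCE))
```

In this toy setting the independent decoder follows the weak layout evidence and labels the element "contact", while the joint decoder lets the strong "title" evidence in the text pull the element label to "publication".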

