ABSTRACT
Recent work has shown the effectiveness of leveraging layout and tag-tree structure for segmenting webpages and labeling HTML elements. However, how to effectively segment and label the text contents inside HTML elements is still an open problem. Because much of the text on a webpage consists of fragments that are not strictly grammatical, traditional natural language processing techniques, which typically expect grammatical sentences, are not directly applicable. In this paper, we examine how to use layout and tag-tree structure in a principled way to help understand text contents on webpages. We propose to segment and label the page structure and the text content of a webpage in a joint discriminative probabilistic model. In this model, semantic labels of page structure can be leveraged to help text content understanding, and semantic labels of the text phrases can be used in page structure understanding tasks such as data record detection. Thus, combining page structure and text content understanding leads to an integrated solution to webpage understanding. Experimental results on research homepage extraction show the feasibility and promise of our approach.