ABSTRACT
In recent years, many algorithms for the Web have been developed that work with information units distinct from individual web pages. These include segments of web pages and aggregations of web pages into web communities. Such logical information units improve a variety of web algorithms and provide the building blocks for the construction of organized information spaces such as digital libraries. In this paper, we focus on a type of logical information unit called "compound documents". We argue that the ability to identify compound documents can improve information retrieval, automatic metadata generation, and navigation on the Web. We propose a unified framework for identifying the boundaries of compound documents, which combines both structural and content features of constituent web pages. The framework is based on a combination of machine learning and clustering algorithms, with the former supervising the latter. We also propose a new method for evaluating the quality of clusterings, based on a user behavior model. Experiments on a collection of educational web sites show that our approach can reliably identify most of the compound documents on these sites.
REVIEW
"Xiaoya Tang : Reviewer"
Most current research on Web page clustering is based on link structure analysis using predefined criteria, such as user queries. The authors propose an approach to grouping Web pages based on features of both link structure and content, with the