As we may perceive: inferring logical documents from hypertext
Source: Conference on Hypertext and Hypermedia
Proceedings of the sixteenth ACM conference on Hypertext and hypermedia
Salzburg, Austria
SESSION: Quantifying and computing with structure
Pages: 66-74
Year of Publication: 2005
ISBN:1-59593-168-6
Authors
Pavel Dmitriev  Cornell University, Ithaca, NY
Carl Lagoze  Cornell University, Ithaca, NY
Boris Suchkov  Cornell University, Ithaca, NY
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 0,   Downloads (12 Months): 47,   Citation Count: 1
DOI: http://doi.acm.org/10.1145/1083356.1083370

ABSTRACT

In recent years, many algorithms for the Web have been developed that work with information units distinct from individual web pages. These units include segments of web pages and aggregations of web pages into web communities. Such logical information units improve a variety of web algorithms and provide the building blocks for the construction of organized information spaces such as digital libraries. In this paper, we focus on a type of logical information unit called a "compound document". We argue that the ability to identify compound documents can improve information retrieval, automatic metadata generation, and navigation on the Web. We propose a unified framework for identifying the boundaries of compound documents, which combines both structural and content features of constituent web pages. The framework is based on a combination of machine learning and clustering algorithms, with the former supervising the latter. We also propose a new method for evaluating the quality of clusterings, based on a user behavior model. Experiments on a collection of educational web sites show that our approach can reliably identify most of the compound documents on these sites.


REVIEW

Reviewer: Xiaoya Tang

Most current research on Web page clustering is based on link structure analysis using predefined criteria, such as user queries. The authors propose an approach to grouping Web pages based on features of both link structure and content, with the ...
