Hierarchical topic segmentation of websites
Source: Conference on Knowledge Discovery in Data
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Philadelphia, PA, USA
SESSION: Research track papers
Pages: 257 - 266
Year of Publication: 2006
ISBN: 1-59593-339-5
Authors
Ravi Kumar, Yahoo! Research, Sunnyvale, CA
Kunal Punera, University of Texas at Austin, Austin, TX
Andrew Tomkins, Yahoo! Research, Sunnyvale, CA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1150402.1150433

ABSTRACT

In this paper, we consider the problem of identifying and segmenting topically cohesive regions in the URL tree of a large website. Each page of the website is assumed to have a topic label or a distribution on topic labels generated using a standard classifier. We develop a set of cost measures characterizing the benefit accrued by introducing a segmentation of the site based on the topic labels. We propose a general framework to use these measures for describing the quality of a segmentation; we also provide an efficient algorithm to find the best segmentation in this framework. Extensive experiments on human-labeled data confirm the soundness of our framework and suggest that a judicious choice of cost measures allows the algorithm to perform surprisingly accurate topical segmentations.



