ABSTRACT
In this paper, we consider the problem of identifying and segmenting topically cohesive regions in the URL tree of a large website. Each page of the website is assumed to have a topic label or a distribution on topic labels generated using a standard classifier. We develop a set of cost measures characterizing the benefit accrued by introducing a segmentation of the site based on the topic labels. We propose a general framework to use these measures for describing the quality of a segmentation; we also provide an efficient algorithm to find the best segmentation in this framework. Extensive experiments on human-labeled data confirm the soundness of our framework and suggest that a judicious choice of cost measures allows the algorithm to perform surprisingly accurate topical segmentations.