ABSTRACT
In this paper, we discuss the application of a recent hierarchic clustering algorithm to the automatic classification of files of documents. Whereas most hierarchic clustering algorithms involve the generation and updating of an inter-object dissimilarity matrix, this new algorithm is based upon a series of nearest neighbor searches. Such an approach is appropriate to several clustering methods, including Ward's method which has been shown to perform well in experimental studies of hierarchic document clustering. A description is given of heuristics which can increase the efficiency of the new algorithm when it is used to cluster three document collections by Ward's method.
- BENT80.Bentley, J.L., Wetde, B.W. and YAK), A.C. (1980), "OptLma# ExpeQ#ed-Ttme A1Borithms for Closest Point FYoble#s ," ACH Transaatlons on Ha them at iQ al Soft,are, 6, 563-580. Google ScholarDigital Library
- CROF77.Croft, W.B. (1977), "Clustering Large Files of Documents Usln8 the 5i,81 e-L#nk Hethod," Journal of the #merlcan Soetety for Infoneaf.:Lon S#tence, 28, 3" 1-3#q.Google Scholar
- CROF80.Croft, W.B. (1980), "A Hodel of (:luster Searching Using Classsi~1cation," information S st.#,,.#_#, 5, 189-1 95.Google Scholar
- DAYE84.Day, W.B.E and Edelabrunner, H. (198#), "Efftolent Algorithms for #glomerattve Hlerarchioal ClusterlnE Hethoda,e Journal of Classification, 1, 7-2q.Google Scholar
- DEFA77.} De lays, O. (1977), "# F#ft~ lent A1gorlthm for a Ccmplete Link Hethod," Journa______11, 20, 93-95.Google Scholar
- FRIE77.Friedman, J.H., Bentley, J.L. and Finkel, R.A. (I 977), "b ~1gorlth# for Fird1.i Bes# Hatches In Lo#arlthmt~ Expected Time," A(;M Transaetlons on l#thel#a1#2C#L# #oft#lare. 3, 209-- 226. Google ScholarDigital Library
- GOFF69.Goff#en, W. (1969), "# Indlreot Hethod of Znfomation Retrieval," #nformatlon $toral#e and Retr i ev al, q, 363-3 73.Google Scholar
- GRIF84.Grifflths, A. , Robinson, L.A. and WilIett, P. (198JI), "Hierarehle Agglomerative C1 ustering #Mthods for #utomatto #eument Classification ," Journal of Documentation, qO, 175-205.Google Scholar
- GRIF86.Grlffiths, A. , Luekhurst, H.C. and W111ett, P. (1 986), "Uslng Inter-I#eument Similarity Information in Ibeum ent Retrieval Systems," Journal of the American Soelety for Information Science, 37, 3-i I. Google ScholarDigital Library
- JARD71a.Jard ine, N. and Sib son, R. (1971), Ha%hematlcal Taxonomy New York: Wiley.Google Scholar
- JARD71b.Jardine, tl. and Van RlJsbergen, C.J. (1971), "The Use of Hierarchic Clustering in Information Retrieval," Information #orage and Retrieval, 7, 217-2q0.Google ScholarCross Ref
- KITT78.Kittler, J. (1978), "A Method for Determining K-Nearest NeIEhbotrs ," K ybernetes, 7, 313-315.Google ScholarCross Ref
- LANC67.Lance, G.N. and Williams, W.T. (1967), "A General Theory of Classificatory Sorting Strat#les. I. Hierarchical Systems," Jo ur na__#l, 9, 373-3 80.Google Scholar
- MURT82.HurtaKh, F. (1982), "A Very Fast, Exaat Nearest NelEhbour Algorithm for Use in Information Retrieval," Information Technology: Research and Development, I, 275-283.Google Scholar
- MURT83.Murtagh, F. (1983), "A SUrvey of Peoent Advances in Hierarchical Clustering Algorithms," Com#uter Journal, 26, 354-359.Google ScholarCross Ref
- MURT84a.rlurtagh, F. (198q), "Complexities of Hierarchic Clustering Algorithms: State of the Art," Computational Statistlcs Quarterly, 1, 101-1 13.Google Scholar
- MURT84b.MurtsEh, F. (1984), "A Revlew of Fast Techniques for Nearest Neighbour Searching," in ~(#F3TAT 198q, Vienna: Physloa-Verl#.Google Scholar
- NORE77.Moreault, T., F#I1, H. and MeGill, H.J. (1977), "Autcmatle Ranked Output from Boolean Searches in SIRE," Journal of the #eriean Soelety for Information Science, 28, 333#339.Google Scholar
- PERR83.Perry, S.A. and Willett, P. (1983), "A Review of the Use of Inverted Files for Best Match Retr lev al in In form ation Retr iev al Systems," Journal of Infomatlon Science 6, 59- 66.Google ScholarDigital Library
- POGU84.Pogue, C.A. and Willett, P. (198q),-An Evaluatt.on of Doeumen# Retrieval from Serial Files Using the ICL I# sir Ib u# ed #ray Processor ," Online Review 8, 569-58q.Google Scholar
- SALT71.Salton, G. (1971), The S1ART Retrieval System, New Jersey: Prentlee-4lall.Google Scholar
- SALT81.Salton, G. and Bergmark, D. (i 981 ), "Parall el Computations in Information Retrleval," I#ture Motes in Computer Selenee, 111, 328-3 q2. Google ScholarDigital Library
- SALT83.Salton, G. and #GilI, H.J. (1983), XntroduQtlon to t#dern Information Retrleval, #ew Yet'k: +#raw-mill. Google ScholarDigital Library
- SIBS73.Sibson, R. (1973), "BLINK: an Optimally Efflclent A#gorlthm for the Slngle-bink C#usber Method ," Computer Journal, 1 6, #0-3#.Google Scholar
- SMEA81.Smeaton, A.F. and Van RIJ#bergen, C.J. (1981), "The Nearem# Neighbour Problem in Information Retrieval. An Algorithm Uslr# Upperbourds," ACH S.IGIR Forum, 16, 83-8?. Google ScholarDigital Library
- VANR74.Van R#j#beraen, C.J. (1974), "Further Experiments with Hierarehie Clustering in Document Retrieval ," Information Storage and Retrieval, 10, 1-1#.Google ScholarCross Ref
- VANR75.Van Pdjsbergen, C.J. and Croft, W.B. (1975), "Document Clustering: an Evaluation of some Experiments with the Cran field 1gO0 Collection," Information StorsEe and Retrieval, 11, 171-182.Google Scholar
- VANR79.Van Rijsbergen, c.a. (1979), Information Retrieval London: BUtteraprth. Google ScholarDigital Library
- WEIS81.Weiss, S.F. (1981), "A Probablllstie AIEorithm for Nearest Nei#hbouc Searching," in Oddy, R.N, et al. (eds.) Information Retrieval Research# London: Butterwprth. Google ScholarDigital Library
- Hierarchic document classification using Ward's clustering method
Recommendations
Evaluation of hierarchical clustering algorithms for document datasets
CIKM '02: Proceedings of the eleventh international conference on Information and knowledge managementFast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical ...
Semi-supervised agglomerative hierarchical clustering with ward method using clusterwise tolerance
MDAI'11: Proceedings of the 8th international conference on Modeling decisions for artificial intelligenceThis paper presents a new semi-supervised agglomerative hierarchical clustering algorithm with ward method using clusterwise tolerance. Recently, semi-supervised clustering has been remarked and studied in many research fields. In semi-supervised ...
Comments