Hierarchic document classification using Ward's clustering method

Authors:
A. El-Hamdouchi

Sheffield University, Western Bank, Sheffield, S10 2TN, UK

Sheffield University, Western Bank, Sheffield, S10 2TN, UK
View Profile

,
P. Willett

Sheffield University, Western Bank, Sheffield, S10 2TN, UK

Sheffield University, Western Bank, Sheffield, S10 2TN, UK
View Profile

SIGIR '86: Proceedings of the 9th annual international ACM SIGIR conference on Research and development in information retrievalSeptember 1986Pages 149–156https://doi.org/10.1145/253168.253200

Published:01 September 1986Publication History

SIGIR '86: Proceedings of the 9th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 149–156

ABSTRACT

In this paper, we discuss the application of a recent hierarchic clustering algorithm to the automatic classification of files of documents. Whereas most hierarchic clustering algorithms involve the generation and updating of an inter-object dissimilarity matrix, this new algorithm is based upon a series of nearest neighbor searches. Such an approach is appropriate to several clustering methods, including Ward's method which has been shown to perform well in experimental studies of hierarchic document clustering. A description is given of heuristics which can increase the efficiency of the new algorithm when it is used to cluster three document collections by Ward's method.

References

BENT80.Bentley, J.L., Wetde, B.W. and YAK), A.C. (1980), "OptLma# ExpeQ#ed-Ttme A1Borithms for Closest Point FYoble#s ," ACH Transaatlons on Ha them at iQ al Soft,are, 6, 563-580. Google ScholarDigital Library
CROF77.Croft, W.B. (1977), "Clustering Large Files of Documents Usln8 the 5i,81 e-L#nk Hethod," Journal of the #merlcan Soetety for Infoneaf.:Lon S#tence, 28, 3" 1-3#q.Google Scholar
CROF80.Croft, W.B. (1980), "A Hodel of (:luster Searching Using Classsi~1cation," information S st.#,,.#_#, 5, 189-1 95.Google Scholar
DAYE84.Day, W.B.E and Edelabrunner, H. (198#), "Efftolent Algorithms for #glomerattve Hlerarchioal ClusterlnE Hethoda,e Journal of Classification, 1, 7-2q.Google Scholar
DEFA77.} De lays, O. (1977), "# F#ft~ lent A1gorlthm for a Ccmplete Link Hethod," Journa______11, 20, 93-95.Google Scholar
FRIE77.Friedman, J.H., Bentley, J.L. and Finkel, R.A. (I 977), "b ~1gorlth# for Fird1.i Bes# Hatches In Lo#arlthmt~ Expected Time," A(;M Transaetlons on l#thel#a1#2C#L# #oft#lare. 3, 209-- 226. Google ScholarDigital Library
GOFF69.Goff#en, W. (1969), "# Indlreot Hethod of Znfomation Retrieval," #nformatlon $toral#e and Retr i ev al, q, 363-3 73.Google Scholar
GRIF84.Grifflths, A. , Robinson, L.A. and WilIett, P. (198JI), "Hierarehle Agglomerative C1 ustering #Mthods for #utomatto #eument Classification ," Journal of Documentation, qO, 175-205.Google Scholar
GRIF86.Grlffiths, A. , Luekhurst, H.C. and W111ett, P. (1 986), "Uslng Inter-I#eument Similarity Information in Ibeum ent Retrieval Systems," Journal of the American Soelety for Information Science, 37, 3-i I. Google ScholarDigital Library
JARD71a.Jard ine, N. and Sib son, R. (1971), Ha%hematlcal Taxonomy New York: Wiley.Google Scholar
JARD71b.Jardine, tl. and Van RlJsbergen, C.J. (1971), "The Use of Hierarchic Clustering in Information Retrieval," Information #orage and Retrieval, 7, 217-2q0.Google ScholarCross Ref
KITT78.Kittler, J. (1978), "A Method for Determining K-Nearest NeIEhbotrs ," K ybernetes, 7, 313-315.Google ScholarCross Ref
LANC67.Lance, G.N. and Williams, W.T. (1967), "A General Theory of Classificatory Sorting Strat#les. I. Hierarchical Systems," Jo ur na__#l, 9, 373-3 80.Google Scholar
MURT82.HurtaKh, F. (1982), "A Very Fast, Exaat Nearest NelEhbour Algorithm for Use in Information Retrieval," Information Technology: Research and Development, I, 275-283.Google Scholar
MURT83.Murtagh, F. (1983), "A SUrvey of Peoent Advances in Hierarchical Clustering Algorithms," Com#uter Journal, 26, 354-359.Google ScholarCross Ref
MURT84a.rlurtagh, F. (198q), "Complexities of Hierarchic Clustering Algorithms: State of the Art," Computational Statistlcs Quarterly, 1, 101-1 13.Google Scholar
MURT84b.MurtsEh, F. (1984), "A Revlew of Fast Techniques for Nearest Neighbour Searching," in ~(#F3TAT 198q, Vienna: Physloa-Verl#.Google Scholar
NORE77.Moreault, T., F#I1, H. and MeGill, H.J. (1977), "Autcmatle Ranked Output from Boolean Searches in SIRE," Journal of the #eriean Soelety for Information Science, 28, 333#339.Google Scholar
PERR83.Perry, S.A. and Willett, P. (1983), "A Review of the Use of Inverted Files for Best Match Retr lev al in In form ation Retr iev al Systems," Journal of Infomatlon Science 6, 59- 66.Google ScholarDigital Library
POGU84.Pogue, C.A. and Willett, P. (198q),-An Evaluatt.on of Doeumen# Retrieval from Serial Files Using the ICL I# sir Ib u# ed #ray Processor ," Online Review 8, 569-58q.Google Scholar
SALT71.Salton, G. (1971), The S1ART Retrieval System, New Jersey: Prentlee-4lall.Google Scholar
SALT81.Salton, G. and Bergmark, D. (i 981 ), "Parall el Computations in Information Retrleval," I#ture Motes in Computer Selenee, 111, 328-3 q2. Google ScholarDigital Library
SALT83.Salton, G. and #GilI, H.J. (1983), XntroduQtlon to t#dern Information Retrleval, #ew Yet'k: +#raw-mill. Google ScholarDigital Library
SIBS73.Sibson, R. (1973), "BLINK: an Optimally Efflclent A#gorlthm for the Slngle-bink C#usber Method ," Computer Journal, 1 6, #0-3#.Google Scholar
SMEA81.Smeaton, A.F. and Van RIJ#bergen, C.J. (1981), "The Nearem# Neighbour Problem in Information Retrieval. An Algorithm Uslr# Upperbourds," ACH S.IGIR Forum, 16, 83-8?. Google ScholarDigital Library
VANR74.Van R#j#beraen, C.J. (1974), "Further Experiments with Hierarehie Clustering in Document Retrieval ," Information Storage and Retrieval, 10, 1-1#.Google ScholarCross Ref
VANR75.Van Pdjsbergen, C.J. and Croft, W.B. (1975), "Document Clustering: an Evaluation of some Experiments with the Cran field 1gO0 Collection," Information StorsEe and Retrieval, 11, 171-182.Google Scholar
VANR79.Van Rijsbergen, c.a. (1979), Information Retrieval London: BUtteraprth. Google ScholarDigital Library
WEIS81.Weiss, S.F. (1981), "A Probablllstie AIEorithm for Nearest Nei#hbouc Searching," in Oddy, R.N, et al. (eds.) Information Retrieval Research# London: Butterwprth. Google ScholarDigital Library

Hierarchic document classification using Ward's clustering method
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning

Recommendations

Evaluation of hierarchical clustering algorithms for document datasets
CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical ...
Read More
Semi-supervised agglomerative hierarchical clustering with ward method using clusterwise tolerance
MDAI'11: Proceedings of the 8th international conference on Modeling decisions for artificial intelligence

This paper presents a new semi-supervised agglomerative hierarchical clustering algorithm with ward method using clusterwise tolerance. Recently, semi-supervised clustering has been remarked and studied in many research fields. In semi-supervised ...
Read More
Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGIR '86: Proceedings of the 9th annual international ACM SIGIR conference on Research and development in information retrieval
September 1986
283 pages
ISBN:0897911873
DOI:10.1145/253168
Chairman:
Luigi Rossi Bernardi
CNR, Italy
,
Editor:
Fausto Rabitti
IEI-CNR, Italy
Copyright © 1986 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 September 1986
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate792of3,983submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 33
  Total Citations
  View Citations
- 2,012
  Total Downloads
- Downloads (Last 12 months)69
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Hierarchic document classification using Ward's clustering method

SIGIR '86: Proceedings of the 9th annual international ACM SIGIR conference on Research and development in information retrieval

ABSTRACT

References

Cited By

Recommendations

Evaluation of hierarchical clustering algorithms for document datasets

Semi-supervised agglomerative hierarchical clustering with ward method using clusterwise tolerance

Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval