Scatter/Gather: a cluster-based approach to browsing large document collections

Authors:
Douglass R. Cutting

Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA

David R. Karger

Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA and Stanford University

Jan O. Pedersen

Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA

John W. Tukey

Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA and Princeton University


SIGIR '92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, June 1992, Pages 318–329, https://doi.org/10.1145/133160.133214

Published: 01 June 1992

ABSTRACT

Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval.

We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.
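
The browsing loop described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the clustering procedures, so a simple one-pass seed assignment stands in for them here, chosen only because its cost grows linearly with the number of documents for a fixed number of groups. The function names (`vectorize`, `cosine`, `scatter`, `gather`), the choice of the first k documents as seeds, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def vectorize(text):
    """Turn a document into a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scatter(docs, k):
    """Partition docs into at most k groups with one pass over the collection.

    Seeds are simply the first k documents (an assumption made for brevity);
    each document joins the group of its most similar seed, so the cost is
    O(k * n) similarity computations.
    """
    seeds = [vectorize(d) for d in docs[:k]]
    groups = [[] for _ in seeds]
    for doc in docs:
        vec = vectorize(doc)
        best = max(range(len(seeds)), key=lambda i: cosine(vec, seeds[i]))
        groups[best].append(doc)
    return [g for g in groups if g]


def gather(groups, chosen):
    """Merge the groups the user selected into a smaller subcollection."""
    return [doc for i in chosen for doc in groups[i]]


# Hypothetical toy corpus, used only to exercise the loop.
corpus = [
    "stock market prices fall",
    "football team wins match",
    "market traders sell stock",
    "team coach praises football players",
    "bond prices and stock prices rise",
]
groups = scatter(corpus, k=2)          # scatter: split the collection
focused = gather(groups, chosen=[0])   # gather: keep the groups of interest
regrouped = scatter(focused, k=2)      # scatter again for a finer view
```

Each round narrows the collection: the user picks the groups that look relevant, the selected documents are merged, and the merged set is clustered again. No query is ever typed, which is the sense in which clustering serves as the primary operation rather than as a search aid.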

Index Terms

1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval models and ranking
2. Theory of computation
  1. Design and analysis of algorithms

Published in
SIGIR '92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
June 1992
352 pages
ISBN: 0897915232
DOI: 10.1145/133160
Chairman: Edward Fox (VPI & State Univ., Blacksburg, VA)
Editors: Nicholas Belkin (Rutgers Univ., New Brunswick, NJ), Peter Ingwersen, Annelise Mark Pejtersen
Copyright © 1992 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%

Article Metrics
- Total Citations: 935
- Total Downloads: 5,128
- Downloads (Last 12 months): 149
- Downloads (Last 6 weeks): 13