ABSTRACT
Building content-based search tools for feature-rich data has been a challenging problem because feature-rich data such as audio recordings, digital images, and sensor data are inherently noisy and high dimensional. Comparing noisy data requires comparisons based on similarity instead of exact matches, and thus searching for noisy data requires similarity search instead of exact search.

The Ferret toolkit is designed to help system builders quickly construct content-based similarity search systems for feature-rich data types. The key component of the toolkit is a content-based similarity search engine for generic, multi-feature object representations. To solve the similarity search problem in high-dimensional spaces, we have developed approximation methods inspired by recent theoretical results on dimension reduction. The search engine constructs sketches from feature vectors as highly compact data structures for matching, filtering and ranking data objects. The toolkit also includes several other components to help system builders address search system infrastructure issues. We have implemented the toolkit and used it to successfully construct content-based similarity search systems for four data types: audio recordings, digital photos, 3D shape models and genomic microarray data.
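The sketch-then-filter-then-rank idea can be illustrated with a minimal Python example. This is only a sketch under stated assumptions, not the toolkit's implementation: the abstract does not specify how Ferret builds its sketches, so the code substitutes the well-known random-hyperplane (sign random projection) construction, and it assumes Euclidean distance for the final ranking. All function names and parameters (`make_hyperplanes`, `sketch`, `search`, `n_candidates`, `k`) are hypothetical.

```python
import random


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random Gaussian hyperplanes in `dim` dimensions (assumed scheme)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def sketch(vec, planes):
    """Compress a feature vector into an integer holding one sign bit per hyperplane."""
    bits = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two sketches."""
    return bin(a ^ b).count("1")


def euclidean(u, v):
    """Exact distance on the original feature vectors (an assumed choice)."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5


def search(query, dataset, sketches, planes, n_candidates=10, k=3):
    """Filter by sketch Hamming distance, then rank the survivors exactly."""
    q = sketch(query, planes)
    # Filtering step: cheap comparison on compact sketches only.
    candidates = sorted(
        range(len(dataset)), key=lambda i: hamming(q, sketches[i])
    )[:n_candidates]
    # Ranking step: exact distance, computed only for the few candidates.
    return sorted(candidates, key=lambda i: euclidean(query, dataset[i]))[:k]


if __name__ == "__main__":
    rng = random.Random(1)
    dim, bits = 32, 64
    dataset = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
    planes = make_hyperplanes(dim, bits)
    sketches = [sketch(v, planes) for v in dataset]
    # A noisy copy of object 17 stands in for a "similar but not identical" query.
    query = [x + rng.gauss(0, 0.01) for x in dataset[17]]
    print(search(query, dataset, sketches, planes))
```

The design point this illustrates is that each object is reduced to a few bytes, so the filtering pass touches far less data than a scan over full feature vectors, and the more expensive exact comparison runs only on the short candidate list.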
Ferret: a toolkit for content-based similarity search of feature-rich data. Proceedings of the 2006 EuroSys conference.