DOI: 10.1145/1274971.1274975
Article

Scalability of the Nutch search engine

Published: 17 June 2007

Abstract

Nutch is an open source search engine that is gaining popularity in the commercial world. The Nutch architecture lends itself to a wide range of parallelization techniques. Multiple backend servers can be used both to partition the corpus of search data, thus increasing the rate of queries serviced, and to increase the size of the search data while preserving the service rate. Alternatively, multiple search engines can operate in parallel, further increasing the query rate. In this paper, we analyze the performance and scalability of various configurations of Nutch. The configurations were implemented as part of the Commercial Scale Out project at IBM Research and were used to investigate the applicability of scale-out architectures in commercial environments. We conclude that Nutch is highly scalable, although the different configurations exhibit different performance characteristics.




Published In

ICS '07: Proceedings of the 21st annual international conference on Supercomputing
June 2007
315 pages
ISBN:9781595937681
DOI:10.1145/1274971
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Lucene
  2. Nutch
  3. query
  4. scale-out
  5. search

Qualifiers

  • Article

Conference

ICS07: International Conference on Supercomputing
June 17 - 21, 2007
Seattle, Washington

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%


Cited By

  • (2021) Scalability and Robustness Testing for Open Source Web Crawlers. 2021 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunication Engineering, pp. 197-201. DOI: 10.1109/ECTIDAMTNCON51128.2021.9425701. Online publication date: 3-Mar-2021
  • (2018) Model-driven optimal resource scaling in cloud. Software and Systems Modeling (SoSyM) 17:2, pp. 509-526. DOI: 10.1007/s10270-017-0584-y. Online publication date: 1-May-2018
  • (2017) Integrated Hadoop Cloud Framework (IHCF). Indian Journal of Science and Technology 10:10, pp. 1-8. DOI: 10.17485/ijst/2017/v10i10/107943. Online publication date: 2-Feb-2017
  • (2016) Efficient parallelised search engine based on virtual cluster. International Journal of Computational Science and Engineering 12:1, pp. 53-57. DOI: 10.1504/IJCSE.2016.074557. Online publication date: 1-Feb-2016
  • (2014) Modeling the Impact of Workload on Cloud Resource Scaling. Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing, pp. 310-317. DOI: 10.1109/SBAC-PAD.2014.16. Online publication date: 22-Oct-2014
  • (2013) ODYS. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 313-324. DOI: 10.1145/2463676.2465316. Online publication date: 22-Jun-2013
  • (2013) Accelerating text mining workloads in a MapReduce-based distributed GPU environment. Journal of Parallel and Distributed Computing 73:2, pp. 198-206. DOI: 10.1016/j.jpdc.2012.10.001. Online publication date: 1-Feb-2013
  • (2011) An innovative approach to the development of E-government search services. Proceedings of the Second international conference on Electronic government and the information systems perspective, pp. 41-55. DOI: 10.5555/2033665.2033671. Online publication date: 29-Aug-2011
  • (2011) Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing. IEEE Transactions on Knowledge and Data Engineering 23:9, pp. 1312-1327. DOI: 10.1109/TKDE.2011.103. Online publication date: 1-Sep-2011
  • (2011) An Innovative Approach to the Development of E-Government Search Services. Electronic Government and the Information Systems Perspective, pp. 41-55. DOI: 10.1007/978-3-642-22961-9_4. Online publication date: 2011
