skip to main content
10.1145/1996092.1996101acmconferencesArticle/Chapter ViewAbstractPublication PageshpdcConference Proceedingsconference-collections
research-article

Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching

Published:08 June 2011Publication History

ABSTRACT

The last decade has seen a surge of interest in large-scale data-parallel processing engines. While these engines share many features in common with parallel databases, they make a set of different trade-offs. In consequence many of the lessons learned for programming parallel databases have to be re-learned in the new environment. In this paper we show a case study of parallelizing an example large-scale application (offer matching, a core part of online shopping) on an example MapReduce-based distributed computation engine (DryadLINQ). We focus on the challenges raised by the nature of large data sets and data skew and show how they can be addressed effectively within this computation framework by optimizing the computation to adapt to the nature of the data. In particular we describe three different strategies for performing distributed joins and show how the platform language allows us to implement optimization strategies at the application level, without system support. We show that this flexibility in the programming model allows for a highly effective system, providing a measured speedup of more than 100 on 64 machines (256 cores), and an estimated speedup of 200 on 1280 machines (5120 cores)of matching 4 million offers.

References

  1. Apache Hadoop.Google ScholarGoogle Scholar
  2. P. B. R. R. Arnab Nandi, Cong Yu. Distributed cube materialization on holistic measures. Hanover, Germany, 2011.Google ScholarGoogle Scholar
  3. M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intel ligent Systems, 18:16--23, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Chaiken, P.-øA. L. Bob Jenkins, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. February 2008.Google ScholarGoogle Scholar
  5. C. Chambers, A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and Nathan. FlumeJava: Easy, efficient data-parallel pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. L. Chu, H. Tang, T. Yang, and K. Shen. Optimizing data aggregation for cluster-based internet services. In Symposium on Principles and practice of paral lel programming (PPoPP), pages 119--130. ACM, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. pages 201--212, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. San Francisco, CA, December 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In Proceedings of the 18th International Conference on Very Large Data Bases, VLDB '92, pages 27--40, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. I. Forrester Research. http://techcrunch.com/2010/03/08/forrester-forecast-online-retail-sales-will-grow-to-250-billion-by-2014/, 2011.Google ScholarGoogle Scholar
  11. G. Fulgoni. The 2010 u.s. digital year in review. Technical report, Comscore, 2010.Google ScholarGoogle Scholar
  12. Y. Gu and R. Grossman. Sector and Sphere: The design and implementation of a high performance data cloud. Philosophical Transactions of the Royal Society, 367(1897):2429--2445, June 2009.Google ScholarGoogle ScholarCross RefCross Ref
  13. K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In Proceedings of the 17th International Conference on Very Large Data Bases, VLDB '91, pages 525--535, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In European Conference on Computer Systems (EuroSys), pages 59--72, Lisbon, Portugal, March 21-23 2007. also as MSR-TR-2006-140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In J. N. Matthews and T. E. Anderson, editors, ACM Symposium on Operating Systems Principles (SOSP), pages 261--276, Big Sky, Montana, USA, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. A. Kannan, I. E.Givoni, R. Agrawal, and A. Fuxman. Matching unstructured product offers to structured product descriptions. Technical Report MSR-TR-2010-172, Microsoft. Engineering (ICDE), pages 996--1005, Long Beach, California, March 1--6, 2010 2010.Google ScholarGoogle Scholar
  17. Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In Proceedings of the 1st ACM symposium on Cloud computing, SoCC '10, pages 75--86, New York, NY, USA, 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '00, pages 169--178, New York, NY, USA, 2000. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. A. Monge and C. Elkan. The field matching problem: Algorithms and applications. In In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267--270, 1996.Google ScholarGoogle Scholar
  20. Nielsen. Global trends in online shopping: A nielsen global consumer report. Technical report, June 2010.Google ScholarGoogle Scholar
  21. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In ACM SIGMOD International Conference on Management of Data (Industrial Track) (SIGMOD), Vancouver, Canada, June 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive -- a petabyte scale data warehouse using Hadoop. In International Conference on Data Engineering (ICDE), pages 996--1005, Long Beach, California, March 1-6, 2010 2010.Google ScholarGoogle ScholarCross RefCross Ref
  23. C. B. Walton, A. G. Dale, and R. M. Jenevein. A taxonomy and performance model of data skew effects in parallel joins. In Proceedings of the 17th International Conference on Very Large Data Bases, VLDB '91, pages 537--548, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. W. E. Winkler, W. E. Winkler, and N. P. Overview of record linkage and current research directions. Technical report, Bureau of the Census, 2006.Google ScholarGoogle Scholar
  25. J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proceedings of the Seventh International Conference on Data Engineering, pages 200--209, Washington, DC, USA, 1991. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. page 14, San Diego, CA, December 8-10 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, and E. Gillum. Botgraph: large scale spamming botnet detection. In Proceedings of the 6th USENIX symposium on Networked systems design and implementation, pages 321--334, Berkeley, CA, USA, 2009. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        MapReduce '11: Proceedings of the second international workshop on MapReduce and its applications
        June 2011
        82 pages
        ISBN:9781450307000
        DOI:10.1145/1996092

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 8 June 2011

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Upcoming Conference

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader