skip to main content
10.1145/2809890.2809893acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

R-Apriori: An Efficient Apriori based Algorithm on Spark

Published:18 October 2015Publication History

ABSTRACT

Association rule mining remains a very popular and effective method to extract meaningful information from large datasets. It tries to find possible associations between items in large transaction based datasets. In order to create these associations, frequent patterns have to be generated. The "Apriori" algorithm along with its set of improved variants, which were one of the earliest proposed frequent pattern generation algorithms still remain a preferred choice due to their ease of implementation and natural tendency to be parallelized.

While many efficient single-machine methods for Apriori exist, the massive amount of data available these days is far beyond the capacity of a single machine. Hence, there is a need to scale across multiple machines to meet the demands of this ever-growing data. MapReduce is a popular fault-tolerant framework for distributed applications. Nevertheless, heavy disk I/O at each MapReduce operation hinders the implementation of efficient iterative data mining algorithms, such as Apriori, on MapReduce platforms.

A newly proposed in-memory distributed dataflow platform called Spark overcomes the disk I/O bottlenecks in MapReduce. Therefore, Spark presents an ideal platform for distributed Apriori. However, in the implementation of Apriori, the most computationally expensive task is the generation of candidate sets having all possible pairs for singleton frequent items and comparing each pair with every transaction record. Here, we propose a new approach which dramatically reduces this computational complexity by eliminating the candidate generation step and avoiding costly comparisons.

We conduct in-depth experiments to gain insight into the effectiveness, efficiency and scalability of our approach. Our studies show that our approach outperforms the classical Apriori and state-of-the-art on Spark by many times for different datasets.

References

  1. Anna L. Buczak and Christopher M. Gifford. Fuzzy Association Rule Mining for Community Crime Pattern Discovery. In ISI-KDD 2010, ACM, USA, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Apache hadoop. http://hadoop.apache.org/2013.Google ScholarGoogle Scholar
  3. Datasets. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php.Google ScholarGoogle Scholar
  4. FIMI Datasets. http://fimi.ua.ac.be/data/Google ScholarGoogle Scholar
  5. Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. In 2014 IEEE 28th International Parallel & Distributed Processing Symposium Workshops, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Honglie Yu, Jun Wen, and Hongmei Wang. An Improved Apriori Algorithm Based On the Boolean Matrix and Hadoop. In International Conference on Advanced in Control Engineering and Information Science (CEIS), pp. 1827--1831, 2011.Google ScholarGoogle Scholar
  7. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI. USENIX Association, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Han, H. Pei and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proc. Conf. on the Management of Data (SIGMOD'00, Dallas, TX), ACM Press, New York, NY, USA 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Jian Guo. Research on Improved Apriori Algorithm Based on Coding and MapReduce. In 10th Web Information System and Application Conference, 294--299, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Lan Vu and Gita Alaghband. Novel Parallel Method for Mining Frequent Patterns on Multi-core Shared Memory Systems. In ACM conference, Denver USA, 49--54, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Latifur Khan, Mamoun Awad and Bhavani Thuraisingham. A new intrusion detection system using support vector machines and hierarchical clustering. In VLDB Journal 2007, pp: 507--521, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Li, N., Zeng, L., He, Q. & Shi, Z. Parallel Implementation of Apriori Algorithm Based on MapReduce. In Proc. of the 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing (SNPD '12), Kyoto, IEEE: 236--241, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Lin, M., Lee, P. & Hsueh, S. Apriori-based Frequent Itemset Mining Algorithms on MapReduce. In Proc. of the 16th International Conference on Ubiquitous Information Management and Communication (ICUIMC '12), New York, NY, USA, ACM: Article No. 76, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Lin, M., Lee, P. & Hsueh, S. Apriori-based Frequent Itemset Mining Algorithms on MapReduce. In Proceedings of the 16th International Conference on Ubiquitous Information Management and Communication (ICUIMC '12). New York, NY, USA, ACM: Article No. 76, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, USENIX Association Berkeley, CA, USA, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara and Wei Li. New algorithms for fast discovery of association rules. Technical Report 651, Computer Science Department, University of Rochester, Rochester, NY 14627. 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Ning Li, Li Zang, Qing He and Zhongzhi Shi. Parallel Implementation of Apriori Algorithm Based on MapReduce. In International Journal of Networked and Distributed Computing, Vol. 1, No. 2 (April 2013), pp. 89--96. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proceeding of the 20th International Conference on VLDB, pp. 478--499, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. VLDB, pages 487--499, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Tong Wang, Cynthia Rudin, Daniel Wagner and Rich Sevieri. Learning to Detect Patterns of Crime. In Springer, MIT, USA, 2013.Google ScholarGoogle Scholar
  21. Yang, X. Y., Liu, Z. & Fu, Y. MapReduce as a Programming Model for Association Rules Algorithm on Hadoop. In Proceedings of the 3rd International Conference on Information Sciences and Interaction Sciences (ICIS '10), Chengdu, China, IEEE: 99--102, 2010.Google ScholarGoogle Scholar
  22. Yeal Amsterdamer, Yeal Grossman, Tova Milo and Pierre Senellart. Crowd Mining. In SIGMOD'13, USA, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Yeal Amsterdamer, Yeal Grossman, Tova Milo and Pierre Senellart. CrowdMiner: Mining association Rules from the crowd.In Proceedings of VLDB Endowment, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Zahra Farzanyar and Nick Cercone. Efficient Mining of Frequent Itemsets in Social Network Data based on Mapreduce Framework. In 2013 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, 1183--1188, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. R-Apriori: An Efficient Apriori based Algorithm on Spark

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        PIKM '15: Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management
        October 2015
        54 pages
        ISBN:9781450337823
        DOI:10.1145/2809890

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 October 2015

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        PIKM '15 Paper Acceptance Rate5of16submissions,31%Overall Acceptance Rate25of62submissions,40%

        Upcoming Conference

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader