ABSTRACT
Association rule mining remains a very popular and effective method to extract meaningful information from large datasets. It tries to find possible associations between items in large transaction based datasets. In order to create these associations, frequent patterns have to be generated. The "Apriori" algorithm along with its set of improved variants, which were one of the earliest proposed frequent pattern generation algorithms still remain a preferred choice due to their ease of implementation and natural tendency to be parallelized.
While many efficient single-machine methods for Apriori exist, the massive amount of data available these days is far beyond the capacity of a single machine. Hence, there is a need to scale across multiple machines to meet the demands of this ever-growing data. MapReduce is a popular fault-tolerant framework for distributed applications. Nevertheless, heavy disk I/O at each MapReduce operation hinders the implementation of efficient iterative data mining algorithms, such as Apriori, on MapReduce platforms.
A newly proposed in-memory distributed dataflow platform called Spark overcomes the disk I/O bottlenecks in MapReduce. Therefore, Spark presents an ideal platform for distributed Apriori. However, in the implementation of Apriori, the most computationally expensive task is the generation of candidate sets having all possible pairs for singleton frequent items and comparing each pair with every transaction record. Here, we propose a new approach which dramatically reduces this computational complexity by eliminating the candidate generation step and avoiding costly comparisons.
We conduct in-depth experiments to gain insight into the effectiveness, efficiency and scalability of our approach. Our studies show that our approach outperforms the classical Apriori and state-of-the-art on Spark by many times for different datasets.
- Anna L. Buczak and Christopher M. Gifford. Fuzzy Association Rule Mining for Community Crime Pattern Discovery. In ISI-KDD 2010, ACM, USA, 2010. Google ScholarDigital Library
- Apache hadoop. http://hadoop.apache.org/2013.Google Scholar
- Datasets. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php.Google Scholar
- FIMI Datasets. http://fimi.ua.ac.be/data/Google Scholar
- Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. In 2014 IEEE 28th International Parallel & Distributed Processing Symposium Workshops, 2014. Google ScholarDigital Library
- Honglie Yu, Jun Wen, and Hongmei Wang. An Improved Apriori Algorithm Based On the Boolean Matrix and Hadoop. In International Conference on Advanced in Control Engineering and Information Science (CEIS), pp. 1827--1831, 2011.Google Scholar
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI. USENIX Association, 2004. Google ScholarDigital Library
- J. Han, H. Pei and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proc. Conf. on the Management of Data (SIGMOD'00, Dallas, TX), ACM Press, New York, NY, USA 2000. Google ScholarDigital Library
- Jian Guo. Research on Improved Apriori Algorithm Based on Coding and MapReduce. In 10th Web Information System and Application Conference, 294--299, 2013. Google ScholarDigital Library
- Lan Vu and Gita Alaghband. Novel Parallel Method for Mining Frequent Patterns on Multi-core Shared Memory Systems. In ACM conference, Denver USA, 49--54, 2013. Google ScholarDigital Library
- Latifur Khan, Mamoun Awad and Bhavani Thuraisingham. A new intrusion detection system using support vector machines and hierarchical clustering. In VLDB Journal 2007, pp: 507--521, 2007. Google ScholarDigital Library
- Li, N., Zeng, L., He, Q. & Shi, Z. Parallel Implementation of Apriori Algorithm Based on MapReduce. In Proc. of the 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing (SNPD '12), Kyoto, IEEE: 236--241, 2012. Google ScholarDigital Library
- Lin, M., Lee, P. & Hsueh, S. Apriori-based Frequent Itemset Mining Algorithms on MapReduce. In Proc. of the 16th International Conference on Ubiquitous Information Management and Communication (ICUIMC '12), New York, NY, USA, ACM: Article No. 76, 2012. Google ScholarDigital Library
- Lin, M., Lee, P. & Hsueh, S. Apriori-based Frequent Itemset Mining Algorithms on MapReduce. In Proceedings of the 16th International Conference on Ubiquitous Information Management and Communication (ICUIMC '12). New York, NY, USA, ACM: Article No. 76, 2012. Google ScholarDigital Library
- Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, USENIX Association Berkeley, CA, USA, 2010. Google ScholarDigital Library
- Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara and Wei Li. New algorithms for fast discovery of association rules. Technical Report 651, Computer Science Department, University of Rochester, Rochester, NY 14627. 1997. Google ScholarDigital Library
- Ning Li, Li Zang, Qing He and Zhongzhi Shi. Parallel Implementation of Apriori Algorithm Based on MapReduce. In International Journal of Networked and Distributed Computing, Vol. 1, No. 2 (April 2013), pp. 89--96. Google ScholarDigital Library
- R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proceeding of the 20th International Conference on VLDB, pp. 478--499, 1994. Google ScholarDigital Library
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. VLDB, pages 487--499, 1994. Google ScholarDigital Library
- Tong Wang, Cynthia Rudin, Daniel Wagner and Rich Sevieri. Learning to Detect Patterns of Crime. In Springer, MIT, USA, 2013.Google Scholar
- Yang, X. Y., Liu, Z. & Fu, Y. MapReduce as a Programming Model for Association Rules Algorithm on Hadoop. In Proceedings of the 3rd International Conference on Information Sciences and Interaction Sciences (ICIS '10), Chengdu, China, IEEE: 99--102, 2010.Google Scholar
- Yeal Amsterdamer, Yeal Grossman, Tova Milo and Pierre Senellart. Crowd Mining. In SIGMOD'13, USA, 2013. Google ScholarDigital Library
- Yeal Amsterdamer, Yeal Grossman, Tova Milo and Pierre Senellart. CrowdMiner: Mining association Rules from the crowd.In Proceedings of VLDB Endowment, 2013. Google ScholarDigital Library
- Zahra Farzanyar and Nick Cercone. Efficient Mining of Frequent Itemsets in Social Network Data based on Mapreduce Framework. In 2013 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, 1183--1188, 2013. Google ScholarDigital Library
Index Terms
- R-Apriori: An Efficient Apriori based Algorithm on Spark
Recommendations
Association Rule Mining using Apriori for Large and Growing Datasets under Hadoop
ICNCC '17: Proceedings of the 2017 VI International Conference on Network, Communication and ComputingThe time consumed by Apriori algorithm to process transactional databases for frequent item set mining grows exponentially with the dimensions in the database. The standard rule mining procedure based on Apriori is suboptimal with respect to large ...
A New Algorithm for Frequent Itemsets Mining Based on Apriori and FP-Tree
GCIS '09: Proceedings of the 2009 WRI Global Congress on Intelligent Systems - Volume 02Frequent item sets mining plays an important role in association rules mining. The apriori algorithm and the FP-growth algorithm are the most famous algorithms, existing frequent item sets mining algorithms are almost improved based on the two ...
A Spark-based Incremental Algorithm for Frequent Itemset Mining
BDIOT '18: Proceedings of the 2018 2nd International Conference on Big Data and Internet of ThingsAssociation rule mining plays an important role in many areas, including market basket analysis, intrusion detection, bioinformatics and so on. As an efficient approach of finding frequent itemset among large datasets, several parallel Apriori-based ...
Comments