research-article

R-Apriori: An Efficient Apriori based Algorithm on Spark

Authors:
Sanjay Rathee

Indian Institute of Technology Mandi, Mandi, India

Indian Institute of Technology Mandi, Mandi, India
View Profile

,
Manohar Kaul

Technische Universität Berlin, Berlin, Germany

Technische Universität Berlin, Berlin, Germany
View Profile

,
Arti Kashyap

Indian Institute of Technology Mandi, Mandi, India

Indian Institute of Technology Mandi, Mandi, India
View Profile

PIKM '15: Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge ManagementOctober 2015Pages 27–34https://doi.org/10.1145/2809890.2809893

Published:18 October 2015Publication History

PIKM '15: Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management

Pages 27–34

ABSTRACT

Association rule mining remains a very popular and effective method to extract meaningful information from large datasets. It tries to find possible associations between items in large transaction based datasets. In order to create these associations, frequent patterns have to be generated. The "Apriori" algorithm along with its set of improved variants, which were one of the earliest proposed frequent pattern generation algorithms still remain a preferred choice due to their ease of implementation and natural tendency to be parallelized.

While many efficient single-machine methods for Apriori exist, the massive amount of data available these days is far beyond the capacity of a single machine. Hence, there is a need to scale across multiple machines to meet the demands of this ever-growing data. MapReduce is a popular fault-tolerant framework for distributed applications. Nevertheless, heavy disk I/O at each MapReduce operation hinders the implementation of efficient iterative data mining algorithms, such as Apriori, on MapReduce platforms.

A newly proposed in-memory distributed dataflow platform called Spark overcomes the disk I/O bottlenecks in MapReduce. Therefore, Spark presents an ideal platform for distributed Apriori. However, in the implementation of Apriori, the most computationally expensive task is the generation of candidate sets having all possible pairs for singleton frequent items and comparing each pair with every transaction record. Here, we propose a new approach which dramatically reduces this computational complexity by eliminating the candidate generation step and avoiding costly comparisons.

We conduct in-depth experiments to gain insight into the effectiveness, efficiency and scalability of our approach. Our studies show that our approach outperforms the classical Apriori and state-of-the-art on Spark by many times for different datasets.

References

Anna L. Buczak and Christopher M. Gifford. Fuzzy Association Rule Mining for Community Crime Pattern Discovery. In ISI-KDD 2010, ACM, USA, 2010. Google ScholarDigital Library
Apache hadoop. http://hadoop.apache.org/2013.Google Scholar
Datasets. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php.Google Scholar
FIMI Datasets. http://fimi.ua.ac.be/data/Google Scholar
Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. In 2014 IEEE 28th International Parallel & Distributed Processing Symposium Workshops, 2014. Google ScholarDigital Library
Honglie Yu, Jun Wen, and Hongmei Wang. An Improved Apriori Algorithm Based On the Boolean Matrix and Hadoop. In International Conference on Advanced in Control Engineering and Information Science (CEIS), pp. 1827--1831, 2011.Google Scholar
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI. USENIX Association, 2004. Google ScholarDigital Library
J. Han, H. Pei and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proc. Conf. on the Management of Data (SIGMOD'00, Dallas, TX), ACM Press, New York, NY, USA 2000. Google ScholarDigital Library
Jian Guo. Research on Improved Apriori Algorithm Based on Coding and MapReduce. In 10th Web Information System and Application Conference, 294--299, 2013. Google ScholarDigital Library
Lan Vu and Gita Alaghband. Novel Parallel Method for Mining Frequent Patterns on Multi-core Shared Memory Systems. In ACM conference, Denver USA, 49--54, 2013. Google ScholarDigital Library
Latifur Khan, Mamoun Awad and Bhavani Thuraisingham. A new intrusion detection system using support vector machines and hierarchical clustering. In VLDB Journal 2007, pp: 507--521, 2007. Google ScholarDigital Library
Li, N., Zeng, L., He, Q. & Shi, Z. Parallel Implementation of Apriori Algorithm Based on MapReduce. In Proc. of the 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing (SNPD '12), Kyoto, IEEE: 236--241, 2012. Google ScholarDigital Library
Lin, M., Lee, P. & Hsueh, S. Apriori-based Frequent Itemset Mining Algorithms on MapReduce. In Proc. of the 16th International Conference on Ubiquitous Information Management and Communication (ICUIMC '12), New York, NY, USA, ACM: Article No. 76, 2012. Google ScholarDigital Library
Lin, M., Lee, P. & Hsueh, S. Apriori-based Frequent Itemset Mining Algorithms on MapReduce. In Proceedings of the 16th International Conference on Ubiquitous Information Management and Communication (ICUIMC '12). New York, NY, USA, ACM: Article No. 76, 2012. Google ScholarDigital Library
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, USENIX Association Berkeley, CA, USA, 2010. Google ScholarDigital Library
Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara and Wei Li. New algorithms for fast discovery of association rules. Technical Report 651, Computer Science Department, University of Rochester, Rochester, NY 14627. 1997. Google ScholarDigital Library
Ning Li, Li Zang, Qing He and Zhongzhi Shi. Parallel Implementation of Apriori Algorithm Based on MapReduce. In International Journal of Networked and Distributed Computing, Vol. 1, No. 2 (April 2013), pp. 89--96. Google ScholarDigital Library
R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proceeding of the 20th International Conference on VLDB, pp. 478--499, 1994. Google ScholarDigital Library
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. VLDB, pages 487--499, 1994. Google ScholarDigital Library
Tong Wang, Cynthia Rudin, Daniel Wagner and Rich Sevieri. Learning to Detect Patterns of Crime. In Springer, MIT, USA, 2013.Google Scholar
Yang, X. Y., Liu, Z. & Fu, Y. MapReduce as a Programming Model for Association Rules Algorithm on Hadoop. In Proceedings of the 3rd International Conference on Information Sciences and Interaction Sciences (ICIS '10), Chengdu, China, IEEE: 99--102, 2010.Google Scholar
Yeal Amsterdamer, Yeal Grossman, Tova Milo and Pierre Senellart. Crowd Mining. In SIGMOD'13, USA, 2013. Google ScholarDigital Library
Yeal Amsterdamer, Yeal Grossman, Tova Milo and Pierre Senellart. CrowdMiner: Mining association Rules from the crowd.In Proceedings of VLDB Endowment, 2013. Google ScholarDigital Library
Zahra Farzanyar and Nick Cercone. Efficient Mining of Frequent Itemsets in Social Network Data based on Mapreduce Framework. In 2013 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, 1183--1188, 2013. Google ScholarDigital Library

Index Terms

R-Apriori: An Efficient Apriori based Algorithm on Spark
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
  2. Information systems applications

Recommendations

Association Rule Mining using Apriori for Large and Growing Datasets under Hadoop
ICNCC '17: Proceedings of the 2017 VI International Conference on Network, Communication and Computing

The time consumed by Apriori algorithm to process transactional databases for frequent item set mining grows exponentially with the dimensions in the database. The standard rule mining procedure based on Apriori is suboptimal with respect to large ...
Read More
A New Algorithm for Frequent Itemsets Mining Based on Apriori and FP-Tree
GCIS '09: Proceedings of the 2009 WRI Global Congress on Intelligent Systems - Volume 02

Frequent item sets mining plays an important role in association rules mining. The apriori algorithm and the FP-growth algorithm are the most famous algorithms, existing frequent item sets mining algorithms are almost improved based on the two ...
Read More
A Spark-based Incremental Algorithm for Frequent Itemset Mining
BDIOT '18: Proceedings of the 2018 2nd International Conference on Big Data and Internet of Things

Association rule mining plays an important role in many areas, including market basket analysis, intrusion detection, bioinformatics and so on. As an efficient approach of finding frequent itemset among large datasets, several parallel Apriori-based ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PIKM '15: Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management
October 2015
54 pages
ISBN:9781450337823
DOI:10.1145/2809890
Program Chairs:
Mouna Kacimi
Free University of Bozen-Bolzano, Italy
,
Nicoleta Preda
Versailles University, France
,
Maya Ramanath
Indian Institute of Technology Delhi, India
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 October 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
apriori
frequent itemset mining
hadoop
mapreduce
r-apriori
spark
Qualifiers
- research-article
Conference

Acceptance Rates
PIKM '15 Paper Acceptance Rate5of16submissions,31%Overall Acceptance Rate25of62submissions,40%
More
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 4
  Total Citations
  View Citations
- 1,005
  Total Downloads
- Downloads (Last 12 months)41
- Downloads (Last 6 weeks)10
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

R-Apriori: An Efficient Apriori based Algorithm on Spark

PIKM '15: Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Association Rule Mining using Apriori for Large and Growing Datasets under Hadoop

A New Algorithm for Frequent Itemsets Mining Based on Apriori and FP-Tree

A Spark-based Incremental Algorithm for Frequent Itemset Mining

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

R-Apriori: An Efficient Apriori based Algorithm on Spark

PIKM '15: Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Association Rule Mining using Apriori for Large and Growing Datasets under Hadoop

A New Algorithm for Frequent Itemsets Mining Based on Apriori and FP-Tree

A Spark-based Incremental Algorithm for Frequent Itemset Mining

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media