ABSTRACT
To improve the effectiveness of pair-wise similarity computation, state-of-the-art approaches assign objects to multiple overlapping clusters. This introduces redundant pair comparisons when similar objects share more than one cluster. We propose an approach that eliminates such redundant comparisons and that can be easily integrated into existing MapReduce implementations. We evaluate the approach on a real cloud infrastructure and show its effectiveness for all degrees of redundancy.
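The abstract does not spell out how the redundant comparisons are eliminated. As a rough illustration only, the sketch below simulates one possible rule in plain Python: each object carries the list of all its cluster keys, and a reducer compares a pair only when its own key is the smallest key the two objects share. The rule, the function names (`map_phase`, `shuffle`, `reduce_phase`, `redundancy_free_pairs`), and the sample data are assumptions made for illustration, not the paper's implementation.

```python
from itertools import combinations


def map_phase(objects):
    """Emit (cluster_key, (object_id, all_cluster_keys)) once per cluster of each object."""
    for obj_id, clusters in objects.items():
        keys = tuple(sorted(clusters))
        for key in keys:
            yield key, (obj_id, keys)


def shuffle(pairs):
    """Group mapper output by cluster key, as the MapReduce framework would."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups


def reduce_phase(key, members):
    """Yield a pair only if `key` is the smallest cluster key both objects share (assumed rule)."""
    for (a, keys_a), (b, keys_b) in combinations(sorted(members), 2):
        smallest_common = min(set(keys_a) & set(keys_b))
        if smallest_common == key:
            yield (a, b)


def redundancy_free_pairs(objects):
    """Run the simulated job and return every pair that would actually be compared."""
    groups = shuffle(map_phase(objects))
    result = []
    for key in sorted(groups):
        result.extend(reduce_phase(key, groups[key]))
    return result


# Hypothetical input: o1 and o2 share both clusters A and B; o3 is only in B.
objects = {"o1": {"A", "B"}, "o2": {"A", "B"}, "o3": {"B"}}
pairs = redundancy_free_pairs(objects)
print(pairs)
```

Without the check, the pair (o1, o2) would be compared in both cluster A and cluster B; with it, the pair is compared only in A. Because each reducer decides locally from the key lists attached to the objects, a rule of this kind needs no coordination between reducers, which is the sort of property that would make it easy to add to an existing MapReduce job.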
Index Terms
- Don't match twice: redundancy-free similarity computation with MapReduce