research-article

Don't match twice: redundancy-free similarity computation with MapReduce

Authors:
Lars Kolb, University of Leipzig
Andreas Thor, University of Leipzig
Erhard Rahm, University of Leipzig

DanaC '13: Proceedings of the Second Workshop on Data Analytics in the Cloud, June 2013, Pages 1–5, https://doi.org/10.1145/2486767.2486768

Published: 23 June 2013

ABSTRACT

To improve the effectiveness of pair-wise similarity computation, state-of-the-art approaches assign objects to multiple overlapping clusters. This introduces redundant pair comparisons when similar objects share more than one cluster. We propose an approach that eliminates such redundant comparisons and that can be easily integrated into existing MapReduce implementations. We evaluate the approach on a real cloud infrastructure and show its effectiveness for all degrees of redundancy.
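
The abstract does not spell out how redundant comparisons are avoided, so the following is only a minimal sketch of one plausible rule, simulated in plain Python rather than on a real MapReduce cluster. It assumes that each object carries the full set of its cluster keys into the reduce phase, and that a pair is compared only in the cluster whose key is the smallest one the two objects share. The function names (`map_phase`, `shuffle`, `reduce_phase`, `candidate_pairs`) and the sample data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(objects):
    """Emit (cluster_key, (object_id, all_cluster_keys)) once per cluster of each object."""
    for obj_id, keys in objects.items():
        key_set = frozenset(keys)
        for key in key_set:
            yield key, (obj_id, key_set)


def shuffle(pairs):
    """Group map output by cluster key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, members, dedupe=True):
    """Yield the object pairs that should be compared inside one cluster."""
    ordered = sorted(members, key=lambda member: member[0])
    for (a, keys_a), (b, keys_b) in combinations(ordered, 2):
        # Assumed rule: only the cluster with the smallest shared key
        # is responsible for comparing this pair.
        if dedupe and min(keys_a & keys_b) != key:
            continue
        yield (a, b)


def candidate_pairs(objects, dedupe=True):
    """Run the simulated map, shuffle, and reduce steps and collect all pairs."""
    result = []
    for key, members in shuffle(map_phase(objects)).items():
        result.extend(reduce_phase(key, members, dedupe))
    return result


# Hypothetical input: object id -> keys of the overlapping clusters it belongs to.
objects = {
    "o1": ["a", "b"],
    "o2": ["a", "b"],
    "o3": ["b", "c"],
    "o4": ["c"],
}

naive = candidate_pairs(objects, dedupe=False)   # (o1, o2) appears in clusters a and b
unique = candidate_pairs(objects, dedupe=True)   # each pair appears exactly once
```

Because the check needs only the two key sets that already travel with the objects, each reducer can decide locally whether it is responsible for a pair, without coordinating with other reducers.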

Index Terms

1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability
      1. Distributed retrieval
      2. Peer-to-peer retrieval
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage

Published in
DanaC '13: Proceedings of the Second Workshop on Data Analytics in the Cloud
June 2013
49 pages
ISBN: 9781450322027
DOI: 10.1145/2486767
Conference Chairs:
Kostas Tzoumas, Technische Universität Berlin
Shivnath Babu, Duke University
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
MapReduce
pair-wise similarity computation
redundancy
Acceptance Rates
DanaC '13 paper acceptance rate: 9 of 16 submissions (56%). Overall acceptance rate: 19 of 34 submissions (56%).