skip to main content
10.1145/3297280.3297330acmconferencesArticle/Chapter ViewAbstractPublication PagessacConference Proceedingsconference-collections
research-article

High-performance probabilistic record linkage via multi-dimensional homomorphisms

Published: 08 April 2019 Publication History

Abstract

Probabilistic Record Linkage (PRL) identifies data records referring to the same real-world entity, e.g., in a database. PRL is increasingly used in epidemiology centers, intelligence agencies, and universities. However, PRL is a time-consuming task, which limits its applicability for large data sets in real-world applications.
We address the problem of accelerating PRL by parallelizing it for modern high-performance architectures, such as multi-core CPU and many-core GPU. Our approach relies on the formalism of Multi-Dimensional Homomorphisms (MDHs) - a class of functions with a generic parallel implementation in OpenCL. The schema allows for automatic optimization for a particular target hardware architecture by exploiting the auto-tuning approach. Our experiments show that we achieve significantly better performance on both CPU and GPU - speedups of up to 80 times - as compared to the parallel implementation of PRL that is currently used by EKR - the largest cancer registry in Europa.

References

[1]
Murilo Boratto, Pedro Alonso, Clicia Pinto, Pedro Melo, Marcos Barreto, and Spiros Denaxas. 2018. Exploring hybrid parallel systems for probabilistic record linkage. The Journal of Supercomputing (2018).
[2]
Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
[3]
A. K. Elmagarmid et al. 2007. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering 16 pp
[4]
Ivan P. Fellegi et al. 1969. A Theory for Record Linkage. J. Amer. Statist. Assoc., 1183--1210.
[5]
Benedikt Forchhammer et al. 2013. Duplicate Detection on GPUs. HPI Future SOC Lab 70, 3 pp.
[6]
Sergei Gorlatch. 1999. Extracting and implementing list homomorphisms in parallel program development. Science of Computer Programming 33, 27 pp.
[7]
Sergei Gorlatch and Murray Cole. 2011. Parallel Skeletons. Encyclopedia of Parallel Computing, 1417--1422.
[8]
K Hentschel et al. 2008. Das Krebsregister-Manual der Gesellschaft der epidemiologischen Krebsregister in Deutschland e.V. Zuckschwerdt Verlag.
[9]
Intel. 2014. OpenCL Optimization Guide. https://software.intel.com/sites/default/files/managed/72/2c/gfxOptimizationGuide.pdf
[10]
Kazuhiko Komatsu et al. 2010. Evaluating Performance and Portability of OpenCL Programs. In Workshop on Automatic Performance Tuning. 15 pp.
[11]
Victor W. Lee et al. 2010. Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. SIGARCH Comput. Archit. News, 451--460.
[12]
Axel-Cyrille Ngonga Ngomo et al. 2013. When to Reach for the Cloud: Using Parallel Hardware for Link Discovery. In Extended Semantic Web Conference. Springer, 275--289.
[13]
John Nickolls et al. 2008. Scalable Parallel Programming with CUDA (ACM SIGGRAPH). 14 pp.
[14]
NVIDIA. 2009. NVIDIA OpenCL Best Practices Guide. https://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf
[15]
Ari Rasch and Sergei Gorlatch. 2018. ATF: A Generic, Directive-Based Auto-Tuning Framework. Concurrency and Computation: Practice and Experience, 16 pp.
[16]
Ari Rasch and Sergei Gorlatch. 2018. Multi-dimensional Homomorphisms and Their Implementation in OpenCL. International Journal of Parallel Programming (2018), 101--119.
[17]
Ziad Sehili et al. 2015. Privacy Preserving Record Linkage with PPJoin. In Datenbanksysteme für Business, Technologie und Web (BTW 2015). Gesellschaft für Informatik e.V., 85--104.
[18]
John E Stone et al. 2010. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science & Engineering.
[19]
Dinusha Vatsalan et al. 2017. Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges. Springer, 851--895.
[20]
X. Zhang et al. 2013. A Privacy Leakage Upper Bound Constraint-Based Approach for Cost-Effective Privacy Preserving of Intermediate Data Sets in Cloud. IEEE Transactions on Parallel and Distributed Systems, 1192--1202.
[21]
Xuyun Zhang et al. 2013. An efficient quasi-identifier index based approach for privacy preservation over incremental data sets on cloud. J. Comput. System Sci., 542--555.

Cited By

View all
  • (2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/366564346:3(1-74)Online publication date: 10-Oct-2024
  • (2023)(De/Re)-Compositions Expressed Systematically via MDH-Based SchedulesProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580269(61-72)Online publication date: 17-Feb-2023
  • (2021)Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF)ACM Transactions on Architecture and Code Optimization10.1145/342709318:1(1-26)Online publication date: 20-Jan-2021

Index Terms

  1. High-performance probabilistic record linkage via multi-dimensional homomorphisms

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing
      April 2019
      2682 pages
      ISBN:9781450359337
      DOI:10.1145/3297280
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 08 April 2019

      Permissions

      Request permissions for this article.

      Check for updates

      Qualifiers

      • Research-article

      Conference

      SAC '19
      Sponsor:

      Acceptance Rates

      Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

      Upcoming Conference

      SAC '25
      The 40th ACM/SIGAPP Symposium on Applied Computing
      March 31 - April 4, 2025
      Catania , Italy

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)3
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 19 Feb 2025

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/366564346:3(1-74)Online publication date: 10-Oct-2024
      • (2023)(De/Re)-Compositions Expressed Systematically via MDH-Based SchedulesProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580269(61-72)Online publication date: 17-Feb-2023
      • (2021)Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF)ACM Transactions on Architecture and Code Optimization10.1145/342709318:1(1-26)Online publication date: 20-Jan-2021

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Figures

      Tables

      Media

      Share

      Share

      Share this Publication link

      Share on social media