DOI: 10.1145/2566486.2568034
research-article

Deduplicating a places database

Published: 07 April 2014

Abstract

We consider the problem of resolving duplicates in a database of places, where a place is defined as any entity that has a name and a physical location. When auxiliary attributes such as phone number and full address are not available, deduplication based solely on names and approximate location becomes an exceptionally challenging problem that requires both domain knowledge and local geographical knowledge. For example, the pair "Newpark Mall Gap Outlet" and "Newpark Mall Sears Outlet" has a high string similarity, but determining that the two are different requires the domain knowledge that they represent two different store names in the same mall. Similarly, in most parts of the world, a local business called "Central Park Cafe" might simply be referred to as "Central Park", except in New York, where the keyword "Cafe" in the name becomes important to differentiate it from the famous park in the city.
In this paper, we present a language model that can encapsulate both domain knowledge and local geographical knowledge. We also present unsupervised techniques that can learn such a model from a database of places. Finally, we present deduplication techniques based on such a model, and we demonstrate, using real datasets, that our techniques are much more effective than simple TF-IDF-based models in resolving duplicates. Our techniques are used in production at Facebook for deduplicating the Places database.
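
To make the TF-IDF baseline mentioned in the abstract concrete, here is a minimal sketch of name matching by TF-IDF-weighted cosine similarity. It is not the language model proposed in the paper. The toy corpus, the whitespace tokenizer, the smoothed IDF formula, and every function name below are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def tokenize(name):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return name.lower().split()


def build_idf(names):
    """Smoothed inverse document frequency, treating each place name as a document."""
    n = len(names)
    df = Counter()
    for name in names:
        df.update(set(tokenize(name)))  # count each token at most once per name
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf_vector(name, idf):
    """Sparse TF-IDF vector; tokens unseen in the corpus get the largest known weight."""
    unseen = max(idf.values(), default=1.0)
    counts = Counter(tokenize(name))
    return {tok: tf * idf.get(tok, unseen) for tok, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def name_similarity(a, b, idf):
    return cosine(tfidf_vector(a, idf), tfidf_vector(b, idf))


# Hypothetical toy corpus: the example names from the abstract plus two fillers.
TOY_NAMES = [
    "Newpark Mall Gap Outlet",
    "Newpark Mall Sears Outlet",
    "Central Park Cafe",
    "Central Park",
    "Joe's Pizza",
    "Main Street Bakery",
]
IDF = build_idf(TOY_NAMES)

mall_score = name_similarity("Newpark Mall Gap Outlet", "Newpark Mall Sears Outlet", IDF)
cafe_score = name_similarity("Central Park Cafe", "Central Park", IDF)
print(f"mall pair: {mall_score:.2f}, cafe pair: {cafe_score:.2f}")
```

On this toy corpus both pairs receive fairly high scores, even though the mall pair names two different stores. The weights also come from one global table, so the "Central Park Cafe" pair gets the same score whether or not the places are in New York, which is the kind of gap the abstract attributes to missing domain and local geographical knowledge.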


Cited By

  • (2023) A Comparison of Cartographic and Toponymic Databases in a Multilingual Environment: A Methodology for Detecting Redundancies Using ETL and GIS Tools. ISPRS International Journal of Geo-Information, 12:2 (70). DOI: 10.3390/ijgi12020070. Online publication date: 18-Feb-2023
  • (2023) GeoDD: End-to-End Spatial Data De-duplication System. Data Science and Algorithms in Systems (717-727). DOI: 10.1007/978-3-031-21438-7_60. Online publication date: 4-Jan-2023
  • (2021) Spatial-temporal analysis of retail and services using Facebook Places data: a case study in Brno, Czech Republic. Annals of GIS, 28:2 (127-145). DOI: 10.1080/19475683.2021.1921846. Online publication date: 29-Apr-2021

    Published In

    WWW '14: Proceedings of the 23rd international conference on World wide web
    April 2014
    926 pages
    ISBN:9781450327442
    DOI:10.1145/2566486

    Sponsors

    • IW3C2: International World Wide Web Conference Committee

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. entity matching
    2. geographic information systems
    3. record linkage
    4. term weighting

    Qualifiers

    • Research-article

    Conference

    WWW '14
    Sponsor:
    • IW3C2

    Acceptance Rates

    WWW '14 Paper Acceptance Rate 84 of 645 submissions, 13%;
    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Article Metrics

    • Downloads (last 12 months): 9
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 08 Mar 2025

    Cited By

    • (2021) A novel similarity measure for spatial entity resolution based on data granularity model: Managing inconsistencies in place descriptions. Applied Intelligence. DOI: 10.1007/s10489-020-01959-y. Online publication date: 31-Jan-2021
    • (2021) Linking place records using multi-view encoders. Neural Computing and Applications. DOI: 10.1007/s00521-021-05932-9. Online publication date: 27-Mar-2021
    • (2020) Boosting toponym interlinking by paying attention to both machine and deep learning. Proceedings of the Sixth International ACM SIGMOD Workshop on Managing and Mining Enriched Geo-Spatial Data (1-5). DOI: 10.1145/3403896.3403970. Online publication date: 14-Jun-2020
    • (2020) Mining Events through Activity Title Extraction and Venue Coupling. 2020 International Conference on Technologies and Applications of Artificial Intelligence (TAAI) (136-141). DOI: 10.1109/TAAI51410.2020.00033. Online publication date: Dec-2020
    • (2020) Harnessing Heterogeneous Big Geospatial Data. Handbook of Big Geospatial Data (459-473). DOI: 10.1007/978-3-030-55462-0_17. Online publication date: 17-Dec-2020
    • (2020) Learning Advanced Similarities and Training Features for Toponym Interlinking. Advances in Information Retrieval (111-125). DOI: 10.1007/978-3-030-45439-5_8. Online publication date: 14-Apr-2020
    • (2019) Learning Domain Specific Models for Toponym Interlinking. Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (504-507). DOI: 10.1145/3347146.3359339. Online publication date: 5-Nov-2019