Incorporating string transformations in record matching
Source
International Conference on Management of Data
Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data
Vancouver, Canada
DEMONSTRATION SESSION: Group 1
Pages 1231-1234
Year of Publication: 2008
ISBN: 978-1-60558-102-6
Authors
Arvind Arasu  Microsoft Research, Redmond, WA, USA
Surajit Chaudhuri  Microsoft Research, Redmond, WA, USA
Kris Ganjam  Microsoft Research, Redmond, WA, USA
Raghav Kaushik  Microsoft Research, Redmond, WA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1376616.1376742

ABSTRACT

Today's record matching infrastructure offers no flexible way to account for synonyms such as "Robert" and "Bob", which refer to the same name, or for more general string transformations such as abbreviations. We expand the problem of record matching to take such user-defined string transformations as input. These transformations, coupled with an underlying similarity function, define the similarity between two strings. We demonstrate the effectiveness of this approach via a fuzzy match operation that looks up an input record against a table of records, with an additional table of transformations as input. We demonstrate improved record matching quality and efficient retrieval based on an index structure that is cognizant of transformations.
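
The idea of combining a transformation table with an underlying similarity function can be illustrated with a small sketch. This is not the authors' implementation: it assumes token-level rules applied left to right, Jaccard similarity over token sets as the underlying function, and a brute-force scan in place of the transformation-aware index. The names `RULES`, `variants`, `transformed_similarity`, and `fuzzy_lookup` are hypothetical.

```python
from itertools import product

# Hypothetical transformation table: (left-hand token, right-hand token).
# Rules are single-token and applied in one direction only (an assumption).
RULES = {("bob", "robert"), ("st", "street")}


def variants(text, rules):
    """Return every token set reachable by optionally rewriting each token."""
    options = []
    for tok in text.lower().split():
        alts = {tok}
        alts.update(rhs for lhs, rhs in rules if lhs == tok)
        options.append(sorted(alts))
    return {frozenset(choice) for choice in product(*options)}


def jaccard(a, b):
    """Underlying similarity function (assumed): Jaccard over token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def transformed_similarity(s, t, rules):
    """Best underlying similarity over all transformed variants of s and t."""
    return max(
        jaccard(a, b) for a in variants(s, rules) for b in variants(t, rules)
    )


def fuzzy_lookup(query, table, rules, threshold=0.8):
    """Brute-force lookup of a query against a list of record strings."""
    scored = ((transformed_similarity(query, rec, rules), rec) for rec in table)
    return sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)
```

Under these assumptions, "Bob Smith" and "Robert Smith" score 1/3 with plain Jaccard but 1.0 once the rule "bob" to "robert" is applied, so `fuzzy_lookup("Bob Smith", ["Robert Smith", "Alice Smith"], RULES)` returns only the "Robert Smith" record. Enumerating variants grows exponentially with the number of matching rules, which is one reason an index that is aware of transformations matters for efficient retrieval.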


