A hit-miss model for duplicate detection in the WHO drug safety database
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Chicago, Illinois, USA
SESSION: Industry/government track paper
Pages: 459 - 468  
Year of Publication: 2005
ISBN:1-59593-135-X
Authors
G. Niklas Norén  WHO Collaborating Centre for International Drug Monitoring, Uppsala, Sweden & Stockholm University, Stockholm, Sweden
Roland Orre  NeuroLogic Sweden AB, Stockholm, Sweden
Andrew Bate  WHO Collaborating Centre for International Drug Monitoring, Uppsala, Sweden
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1081870.1081923

ABSTRACT

The WHO Collaborating Centre for International Drug Monitoring in Uppsala, Sweden, maintains and analyses the world's largest database of reports on suspected adverse drug reaction incidents that occur after drugs are introduced on the market. As in other post-marketing drug safety data sets, the presence of duplicate records is an important data quality problem, and the detection of duplicates in the WHO drug safety database remains a formidable challenge, especially since the reports are anonymised before being submitted to the database. However, to our knowledge no work has been published on methods for duplicate detection in post-marketing drug safety data. In this paper, we propose a method for probabilistic duplicate detection based on the hit-miss model for statistical record linkage described by Copas & Hilton. We present two new generalisations of the standard hit-miss model: a hit-miss mixture model for errors in numerical record fields and a new method to handle correlated record fields. We demonstrate the effectiveness of the hit-miss model for duplicate detection in the WHO drug safety database both at identifying the most likely duplicate for a given record (94.7% accuracy) and at discriminating duplicates from random matches (63% recall with 71% precision). The proposed method allows for more efficient data cleaning in post-marketing drug safety data sets, and perhaps in other applications throughout the KDD community.
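
To make the scoring idea concrete, the following is a minimal sketch of a simplified hit-miss weight for categorical fields. It is not the authors' implementation: it assumes no blank values, independent fields, and a single per-field miss probability, and it omits the paper's mixture model for numerical fields and its treatment of correlated fields. Under these assumptions each record's value is either a "hit" (a copy of the true value) or a "miss" (a random draw from the field's value distribution), so the likelihood ratio of duplicate versus random pair is c for a mismatch and (1 - c(1 - p)) / p for a match on a value with relative frequency p, where c = 1 - (1 - a)^2 and a is the miss probability. The function names, field names, frequencies and miss probabilities below are hypothetical.

```python
import math


def field_weight(x, y, freq, miss_prob):
    """Log-likelihood ratio (duplicate vs. random pair) for one field.

    Simplified hit-miss model (assumption): no blanks, and each record's
    value is a miss with probability `miss_prob`, drawn from `freq`.
    """
    # c = probability that at least one of the two values is a miss
    c = 1.0 - (1.0 - miss_prob) ** 2
    if x == y:
        p = freq[x]
        # Agreement on a rare value (small p) earns a larger weight.
        return math.log(1.0 - c * (1.0 - p)) - math.log(p)
    # Disagreement: a negative weight that does not depend on the values.
    return math.log(c)


def match_score(rec_a, rec_b, freqs, miss_probs):
    """Sum the per-field weights, treating fields as independent."""
    return sum(
        field_weight(rec_a[f], rec_b[f], freqs[f], miss_probs[f])
        for f in freqs
    )


# Hypothetical value frequencies and miss probabilities, for illustration only.
freqs = {
    "sex": {"F": 0.5, "M": 0.5},
    "drug": {"common_drug": 0.2, "rare_drug": 0.01, "other": 0.79},
}
miss_probs = {"sex": 0.05, "drug": 0.1}

a = {"sex": "F", "drug": "rare_drug"}
b = {"sex": "F", "drug": "rare_drug"}
c_rec = {"sex": "M", "drug": "common_drug"}

print(match_score(a, b, freqs, miss_probs))      # positive: likely duplicates
print(match_score(a, c_rec, freqs, miss_probs))  # negative: likely unrelated
```

Pairs whose total score exceeds a chosen threshold would be flagged as suspected duplicates; how that threshold is chosen is outside the scope of this sketch.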


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1. A. Bate, M. Lindquist, I. R. Edwards, S. Olsson, R. Orre, A. Lansner, and R. M. De Freitas. A Bayesian neural network method for adverse drug reaction signal generation. European Journal of Clinical Pharmacology, 54:315--321, 1998.
2. T. Belin and D. Rubin. A method for calibrating false-match rates in record linkage. Journal of the American Statistical Association, 90:694--707, 1995.
4. M. Bilenko and R. J. Mooney. On evaluation and training-set construction for duplicate detection. In Proceedings of the KDD-2003 workshop on data cleaning, record linkage and object consolidation, pages 7--12, 2003.
5. E. A. Bortnichak, R. P. Wise, M. E. Salive, and H. H. Tilson. Proactive safety surveillance. Pharmacoepidemiology and Drug Safety, 10:191--196, 2001.
6. A. D. Brinker and J. Beitz. Spontaneous reports of thrombocytopenia in association with quinine: clinical attributes and timing related to regulatory action. American Journal of Hematology, 70:313--317, 2002.
7. J. Copas and F. Hilton. Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society: Series A, 153(3):287--320, 1990.
8. I. R. Edwards. Adverse drug reactions: finding the needle in the haystack. British Medical Journal, 315(7107):500, 1997.
9. I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183--1210, 1969.
11. M. Lindquist. Data quality management in pharmacovigilance. Drug Safety, 27(12):857--870, 2004.
12. A. E. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Research Issues on Data Mining and Knowledge Discovery, 1997.
13. H. B. Newcombe. Record linkage: the design of efficient systems for linking records into individual family histories. American Journal of Human Genetics, 19:335--359, 1967.
14. J. N. Nkanza and W. Walop. Vaccine associated adverse event surveillance (VAEES) and quality assurance. Drug Safety, 27:951--952, 2004.
16. M. D. Rawlins. Spontaneous reporting of adverse drug reactions. II: Uses. British Journal of Clinical Pharmacology, 26(1):7--11, 1988.



REVIEW

"John A. Fulcher : Reviewer"

Data cleaning is an essential first step in the knowledge discovery in databases (KDD) process. Apart from the removal of noise, another critical preprocessing task is the removal of duplicate records from the databases in question. The applicatio  more...
