An Ensemble Model for Diabetes Diagnosis in Large-scale and Imbalanced Dataset

Authors:
Xun Wei, University of Science and Technology of China, Hefei, Anhui PRC
Fan Jiang, University of Science and Technology of China, Hefei, Anhui PRC
Feng Wei, University of Science and Technology of China, Hefei, Anhui PRC
Jiekui Zhang, Anhui Jingqi Network Technology, Inc., Hefei, Anhui PRC
Weiwei Liao, DEI, Universita di Bologna, Bologna, Italy
Shaoyin Cheng, University of Science and Technology of China, Hefei, Anhui PRC

CF'17: Proceedings of the Computing Frontiers Conference, May 2017, Pages 71–78. https://doi.org/10.1145/3075564.3075576

Published: 15 May 2017

ABSTRACT

Diabetes is an increasingly serious health challenge worldwide, with prevalence rising every year, especially in developing countries. The vast majority of cases are type 2 diabetes, and it has been reported that about 80% of type 2 diabetes complications can be prevented or delayed by timely detection. In this paper, we propose an ensemble model to accurately diagnose diabetes on a large-scale, imbalanced dataset. The dataset used in our work covers millions of people from one province in China from 2009 to 2015 and is highly skewed. Results on this real-world dataset show that our method is promising for diabetes diagnosis, achieving a sensitivity of 91.00%, an F3 score of 58.24%, and a G-mean of 86.69%.
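The abstract reports sensitivity, F3, and G-mean but does not define them here. As a minimal sketch, assuming the standard definitions (F-beta with beta = 3, and G-mean as the geometric mean of sensitivity and specificity), the three metrics can be computed from confusion-matrix counts as follows. The function name `imbalance_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def imbalance_metrics(tp, fn, fp, tn, beta=3.0):
    """Return (sensitivity, F-beta, G-mean) from confusion-matrix counts.

    Assumes the standard definitions; the paper may define them differently.
    """
    # Recall on the positive (diabetic) class.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Recall on the negative (non-diabetic) class.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # F-beta weights recall beta times as heavily as precision.
    b2 = beta ** 2
    denom = b2 * precision + sensitivity
    f_beta = (1 + b2) * precision * sensitivity / denom if denom else 0.0

    # G-mean drops sharply when either class is poorly recognised.
    g_mean = math.sqrt(sensitivity * specificity)
    return sensitivity, f_beta, g_mean


# Hypothetical counts for illustration only (not the paper's data).
sens, f3, gmean = imbalance_metrics(tp=90, fn=10, fp=200, tn=800)
print(f"sensitivity={sens:.4f}  F3={f3:.4f}  G-mean={gmean:.4f}")
```

With beta = 3, a missed positive costs far more than a false alarm, which is one reason such metrics are often preferred over plain accuracy on highly skewed screening data.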

Index Terms

An Ensemble Model for Diabetes Diagnosis in Large-scale and Imbalanced Dataset
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning algorithms
      1. Ensemble methods

Published in
CF'17: Proceedings of the Computing Frontiers Conference
May 2017
450 pages
ISBN: 9781450344876
DOI: 10.1145/3075564
General Chair: Roberto Giorgi, University of Siena, IT
Program Chairs: Michela Becchi, North Carolina State University; Francesca Palumbo, University of Sassari, IT
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Diabetes diagnosis
Ensemble model
Imbalanced data
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
CF'17 Paper Acceptance Rate: 43 of 87 submissions, 49%
Overall Acceptance Rate: 240 of 680 submissions, 35%