ABSTRACT
Diabetes is an increasingly serious health challenge worldwide, with prevalence rising every year, especially in developing countries. The vast majority of diabetes cases are type 2 diabetes, and it has been indicated that about 80% of type 2 diabetes complications can be prevented or delayed by timely detection. In this paper, we propose an ensemble model to precisely diagnose diabetes on a large-scale, imbalanced dataset. The dataset used in our work covers millions of people from one province in China from 2009 to 2015 and is highly skewed. Results on this real-world dataset show that our method is promising for diabetes diagnosis, achieving a high sensitivity, F3, and G-mean of 91.00%, 58.24%, and 86.69%, respectively.
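For readers unfamiliar with the three reported metrics, the sketch below computes them from a binary confusion matrix. It is an illustration only, not the authors' evaluation code. It assumes the usual definitions: F3 is the F-beta score with beta = 3 (weighting recall more heavily than precision), and G-mean is the geometric mean of sensitivity and specificity. The function name `imbalance_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def imbalance_metrics(tp, fn, fp, tn, beta=3.0):
    """Sensitivity, F-beta and G-mean from confusion-matrix counts.

    Assumed definitions (not confirmed by the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      F_beta      = (1 + beta^2) * P * R / (beta^2 * P + R)
      G-mean      = sqrt(sensitivity * specificity)
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    b2 = beta ** 2
    denom = b2 * precision + sensitivity
    f_beta = (1 + b2) * precision * sensitivity / denom if denom else 0.0

    g_mean = math.sqrt(sensitivity * specificity)
    return {"sensitivity": sensitivity, "f_beta": f_beta, "g_mean": g_mean}


if __name__ == "__main__":
    # Hypothetical counts for a skewed test set (not from the paper).
    metrics = imbalance_metrics(tp=90, fn=10, fp=100, tn=800)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

With beta = 3, a missed diabetic case (false negative) costs far more than a false alarm, which is one plausible reason a model can report high sensitivity alongside a comparatively modest F3 on a heavily skewed dataset.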
Index Terms
- An Ensemble Model for Diabetes Diagnosis in Large-scale and Imbalanced Dataset