DOI: 10.1145/3075564.3075576
research-article

An Ensemble Model for Diabetes Diagnosis in Large-scale and Imbalanced Dataset

Published: 15 May 2017

ABSTRACT

Diabetes is an increasingly serious health challenge worldwide, with prevalence rising every year, especially in developing countries. The vast majority of cases are type 2 diabetes, and it has been reported that about 80% of type 2 diabetes complications can be prevented or delayed by timely detection. In this paper, we propose an ensemble model to accurately diagnose diabetes on a large-scale, imbalanced dataset. The dataset used in our work covers millions of people from one province in China from 2009 to 2015 and is highly skewed. Results on this real-world dataset show that our method is promising for diabetes diagnosis, achieving high sensitivity, F3, and G-mean of 91.00%, 58.24%, and 86.69%, respectively.
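For readers unfamiliar with the three reported metrics, the sketch below computes them from a binary confusion matrix using their commonly used definitions: sensitivity is the recall on the positive (diabetic) class, F3 is the F-beta score with beta = 3, and G-mean is taken here as the geometric mean of sensitivity and specificity. The abstract does not state the exact formulas the authors used, so these definitions are an assumption, and the function name `imbalance_metrics` and the example counts are purely illustrative, not data from the paper.

```python
import math


def imbalance_metrics(tp, fn, fp, tn, beta=3.0):
    """Sensitivity, F-beta and G-mean from confusion-matrix counts.

    Hypothetical helper: standard textbook definitions are assumed,
    since the abstract does not spell out the paper's own formulas.
    """
    # Sensitivity (recall): share of true positives that are detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of true negatives that are correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Precision: share of positive predictions that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # F-beta weights recall beta times as heavily as precision.
    b2 = beta * beta
    denom = b2 * precision + sensitivity
    f_beta = (1 + b2) * precision * sensitivity / denom if denom else 0.0

    # G-mean: geometric mean of the per-class recalls.
    g_mean = math.sqrt(sensitivity * specificity)

    return {"sensitivity": sensitivity, "f_beta": f_beta, "g_mean": g_mean}


# Illustrative counts only (not taken from the paper's dataset).
metrics = imbalance_metrics(tp=90, fn=10, fp=400, tn=9500)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With beta = 3 the score rewards recall far more than precision, which is one plausible reason an F3 value can sit well below sensitivity on a highly skewed dataset: many false positives lower precision even when few diabetic cases are missed.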

      • Published in

        CF'17: Proceedings of the Computing Frontiers Conference
        May 2017
        450 pages
        ISBN:9781450344876
        DOI:10.1145/3075564

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        CF'17 Paper Acceptance Rate: 43 of 87 submissions, 49%. Overall Acceptance Rate: 240 of 680 submissions, 35%.
