Cascading K-means Clustering and K-Nearest Neighbor Classifier for Categorization of Diabetic Patients
Asha Gowda Karegowda1, M.A. Jayaram2, A.S. Manjunath3
1Asha Gowda Karegowda, Dept. of Master of Computer Applications, Siddaganga Institute of Technology, Tumkur, India.
2M.A. Jayaram, Dept. of Master of Computer Applications, Siddaganga Institute of Technology, Tumkur, India.
3A.S. Manjunath, Dept. of Computer Science and Engg., Siddaganga Institute of Technology, Tumkur, India.
Manuscript received on January 17, 2012. | Revised Manuscript received on February 05, 2012. | Manuscript published on February 29, 2012. | PP: 147-151 | Volume-1 Issue-3, February 2012. | Retrieval Number: C0211021312/2011©BEIESP

© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Medical data mining is the process of extracting hidden patterns from medical data. This paper presents a hybrid model for classifying the Pima Indian diabetic database (PIDD). The model consists of three stages. In the first stage, K-means clustering is used to identify and eliminate incorrectly classified instances. In the second stage, a genetic algorithm (GA) and correlation-based feature selection (CFS) are used in a cascaded fashion to select relevant features: the GA performs a global search over attribute subsets, and CFS evaluates the fitness of each candidate subset. In the third stage, fine-tuned classification is performed using K-nearest neighbor (KNN), which takes as inputs the correctly clustered instances from the first stage and the feature subset identified in the second stage. Experimental results indicate that cascading K-means clustering and KNN, together with the feature subset identified by GA_CFS, enhances the classification accuracy of KNN. The proposed model obtained a classification accuracy of 96.68% on the diabetic dataset.
Keywords: Genetic algorithm, Correlation-based feature selection, K-nearest neighbor, K-means clustering, Pima Indian Diabetics.
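
The cascade described in the abstract can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say how incorrectly classified instances are identified, so the sketch assumes an instance is dropped when its class label disagrees with the majority label of its K-means cluster. The GA_CFS stage is not implemented; it is represented by a hypothetical, hand-picked list of feature indices (`selected`). The data are a synthetic toy set, not PIDD, and all function names are illustrative.

```python
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def seed(points, k):
    """Deterministic farthest-first seeding (an assumption; any seeding works)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(points, k, iters=100):
    """Plain Lloyd's K-means; returns the cluster index of every point."""
    centroids = seed(points, k)
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def filter_by_cluster(X, y, k=2):
    """Stage 1 (assumed rule): keep instances whose label matches the
    majority label of the cluster they fall into."""
    assign = kmeans(X, k)
    majority = {
        c: Counter(lbl for lbl, a in zip(y, assign) if a == c).most_common(1)[0][0]
        for c in set(assign)
    }
    return [i for i, (lbl, a) in enumerate(zip(y, assign)) if majority[a] == lbl]


def knn_predict(train_X, train_y, query, k=3):
    """Stage 3: majority vote among the k nearest training instances."""
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def project(rows, features):
    """Restrict every row to the given feature indices."""
    return [[r[j] for j in features] for r in rows]


# Synthetic toy data (NOT PIDD): two blobs, with instances 3 and 7 carrying
# labels that disagree with the rest of their blob.
X = [[1.0, 1.0, 0.2], [1.2, 0.8, 0.9], [0.9, 1.1, 0.5], [1.1, 1.3, 0.1],
     [5.0, 5.0, 0.7], [5.2, 4.8, 0.3], [4.9, 5.1, 0.8], [5.1, 5.3, 0.4]]
y = [0, 0, 0, 1, 1, 1, 1, 0]

keep = filter_by_cluster(X, y, k=2)   # stage 1: drop inconsistent instances
selected = [0, 1]                     # stage 2 stand-in: hypothetical GA_CFS output
train_X = project([X[i] for i in keep], selected)
train_y = [y[i] for i in keep]

pred_low = knn_predict(train_X, train_y, [1.0, 1.2], k=3)
pred_high = knn_predict(train_X, train_y, [5.0, 5.2], k=3)
print(keep, pred_low, pred_high)
```

On this toy set the filter discards the two instances whose labels conflict with their cluster, and KNN is then trained only on the remaining six instances restricted to the selected features. A fuller version would replace `selected` with the output of an actual GA search scored by a CFS merit function.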