A Hybrid Optimization of Supervised Learning Models using Information Gain-Based Feature Selection

Authors

  • Novia Hasdyna
  • Rozzi Kesuma Dinata

DOI:

https://doi.org/10.47839/ijc.24.1.3890

Keywords:

Hybrid, Optimization, Information Gain, Supervised Learning, K-NN, SVM, Naïve Bayes

Abstract

This study aims to improve the performance of supervised learning models for dermatology data classification through a hybrid approach that combines Information Gain-based feature selection with three established supervised learning algorithms: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Naive Bayes. Using the Dermatology dataset from the UCI Machine Learning Repository, which consists of 366 instances with 34 numeric attributes and 6 class labels, the research identifies the attributes with the lowest Information Gain values, including Family History, Eosinophils in the infiltrate, and Hyperkeratosis. These attributes are removed to reduce the dimensionality of the dataset, with the goal of speeding up computation and improving model performance. The study then evaluates the impact of this dimensionality reduction on the performance of the three algorithms. Experimental results show a significant improvement in the performance of the supervised learning models. Specifically, with the KNN algorithm the resulting models achieve a True Positive Rate (TPR) of up to 82.52%, a True Negative Rate (TNR) of 98.81%, a Positive Predictive Value (PPV) of 33.55%, a Negative Predictive Value (NPV) of 98.78%, and an accuracy of 96.29%. SVM and Naive Bayes also show significant improvements in model performance.
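The feature-selection step and the reported evaluation metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, and the choice to treat every attribute value as a discrete category are assumptions made for illustration (a continuous attribute would need binning first). The abstract also does not state how per-class rates are aggregated across the 6 classes, so the sketch computes them for a single class in a one-vs-rest fashion.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(Y) - sum_v P(X = v) * H(Y | X = v) for a discrete feature X."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def drop_lowest_gain(columns, labels, n_drop):
    """Return the feature names kept after dropping the n_drop lowest-IG features.

    `columns` maps a feature name to its list of values (hypothetical layout).
    """
    ranked = sorted(columns, key=lambda name: information_gain(columns[name], labels))
    dropped = set(ranked[:n_drop])
    return [name for name in columns if name not in dropped]


def _ratio(numerator, denominator):
    """Safe division: NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def one_vs_rest_rates(y_true, y_pred, positive):
    """TPR, TNR, PPV, NPV and accuracy, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "TPR": _ratio(tp, tp + fn),
        "TNR": _ratio(tn, tn + fp),
        "PPV": _ratio(tp, tp + fp),
        "NPV": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, len(pairs)),
    }


# Toy data (made up for illustration, not taken from the Dermatology dataset).
toy_labels = ["a", "a", "b", "b"]
toy_columns = {
    "informative": [0, 0, 1, 1],  # perfectly separates the two classes
    "noise": [0, 1, 0, 1],        # carries no information about the class
}
kept = drop_lowest_gain(toy_columns, toy_labels, n_drop=1)
print("kept features:", kept)
print(one_vs_rest_rates(toy_labels, ["a", "b", "b", "b"], positive="a"))
```

In a full pipeline under the same assumptions, the kept feature names would be used to subset the dataset before training KNN, SVM, and Naive Bayes, and the rates would be computed on held-out predictions for each class.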

Published

2025-03-31

How to Cite

Hasdyna, N., & Dinata, R. K. (2025). A Hybrid Optimization of Supervised Learning Models using Information Gain-Based Feature Selection. International Journal of Computing, 24(1), 178-189. https://doi.org/10.47839/ijc.24.1.3890

Section

Articles