Examining Techniques for Handling Imbalanced Datasets in Educational Data Mining Systems
Keywords: educational data mining, machine learning, imbalanced datasets, prediction, student grade
Educational data mining research has contributed to developing policies that improve student learning at different levels of educational institutions. One of the common challenges in building accurate classification and prediction systems is the imbalanced class distribution of the collected data. This study investigates data-level and algorithm-level techniques for handling class imbalance. Six classifiers from each technique are used to explore their effectiveness at handling the imbalanced data problem while predicting students’ graduation grades from their performance at the first stage. The classifiers are tested with k-fold cross-validation before and after applying the data-level and algorithm-level techniques, and evaluated using several metrics, including accuracy, precision, recall, and F1-score. The results show that the classifiers do not perform well on the imbalanced dataset and that performance can be improved by applying these techniques, although the degree of improvement varies from one technique to another. Additionally, statistical hypothesis testing confirmed that there were no statistically significant differences between the classifiers of the two techniques.
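To illustrate the data-level approach mentioned above, one of the simplest resampling techniques is random oversampling, which duplicates minority-class examples until the class counts are balanced. The sketch below is a minimal, self-contained illustration in plain Python; it is not the authors' implementation, and the toy pass/fail data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Data-level technique: duplicate randomly chosen minority-class
    samples until every class matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):  # no-op for the majority class
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

# Imbalanced toy dataset: 6 'pass' students vs. 2 'fail' students
X = [[70], [75], [80], [85], [90], [95], [40], [45]]
y = ['pass'] * 6 + ['fail'] * 2

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now contain 6 samples
```

Oversampling like this is applied only to the training folds of the k-fold cross-validation; resampling before splitting would leak duplicated minority samples into the test folds and inflate the reported metrics.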
License

International Journal of Computing is an open access journal. Authors who publish with this journal agree to the following terms:
• Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
• Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
• Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.