Ensemble-based Disease Outbreak Detection: Comparative Analysis of Health News Information Retrieval Techniques

Authors

  • Manju Joy
  • M. Krishnaveni

DOI:

https://doi.org/10.47839/ijc.23.4.3547

Keywords:

Ensemble learning, Epidemic surveillance, Outbreak detection, Text mining, Natural language processing

Abstract

Kerala was the first state in India to report a COVID-19 infection, in January 2020, in a medical student who had returned from Wuhan, China. More recently, in June 2022, Kerala also reported India's first case of monkeypox. News websites often publish articles that report disease occurrences and provide live updates on outbreaks. Data gathered from such online resources makes early detection of outbreaks possible, a potential the research community has already recognized. Because web pages link to reports on a wide range of themes, accurately categorizing news articles by their headlines and retrieving only the health news is a tedious operation. This paper therefore proposes a novel and efficient news retrieval technique based on machine learning classification with an ensemble learning approach. The technique identifies reports of disease occurrences on web pages, focusing specifically on the health context of Kerala. It is compared with baseline information retrieval methods: keyword-based, phrase-based, and content-based latent semantic analysis.
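To make the idea of ensemble-based headline classification concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name the base learners, so this example assumes a simple majority (hard) vote over three hypothetical voters: a keyword matcher, a phrase matcher, and a small multinomial Naive Bayes model with Laplace smoothing. The training headlines, keyword set, phrase list, and all function names are invented for illustration and do not come from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical toy headlines (1 = disease report, 0 = other news).
# Invented for illustration; not data from the paper.
TRAIN = [
    ("Nipah virus case confirmed in Kozhikode", 1),
    ("Dengue fever cases rise after monsoon", 1),
    ("Health department reports new monkeypox infection", 1),
    ("Cholera outbreak reported in coastal village", 1),
    ("State budget allocates funds for roads", 0),
    ("Football team wins league title", 0),
    ("New film breaks box office records", 0),
    ("Heavy traffic expected during festival weekend", 0),
]

# Assumed keyword and phrase lists for the two rule-based voters.
KEYWORDS = frozenset({"virus", "fever", "outbreak", "infection", "dengue"})
PHRASES = ("cases rise", "case confirmed", "outbreak reported")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_vote(headline):
    """Vote 1 if any token is a disease keyword, else 0."""
    return int(any(tok in KEYWORDS for tok in tokenize(headline)))


def phrase_vote(headline):
    """Vote 1 if any multi-word phrase occurs in the normalized headline."""
    text = " ".join(tokenize(headline))
    return int(any(phrase in text for phrase in PHRASES))


class NaiveBayes:
    """Multinomial Naive Bayes over headline tokens with add-one smoothing."""

    def fit(self, samples):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter()
        for text, label in samples:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def predict(self, headline):
        total = sum(self.class_counts.values())
        scores = {}
        for label in (0, 1):
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(headline):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


def ensemble_predict(headline, model):
    """Hard voting: label the headline 1 if at least two of three voters agree."""
    votes = [keyword_vote(headline), phrase_vote(headline), model.predict(headline)]
    return int(sum(votes) >= 2)


if __name__ == "__main__":
    nb = NaiveBayes().fit(TRAIN)
    for h in ("Dengue outbreak reported in Kochi", "Football team wins festival match"):
        print(h, "->", ensemble_predict(h, nb))
```

With three voters, a headline is retrieved as health news only when at least two agree, which limits the effect of any single voter's mistakes. A production system would more likely combine several trained classifiers over TF-IDF features; that choice is also an assumption here, since the abstract does not specify it.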

Published

2024-07-01

How to Cite

Joy, M., & Krishnaveni, M. (2024). Ensemble-based Disease Outbreak Detection: Comparative Analysis of Health News Information Retrieval Techniques. International Journal of Computing, 23(4), 274-280. https://doi.org/10.47839/ijc.23.4.3547

Section

Articles