Detection of Windows Portable Executable Malware using NLP Techniques and Proxy-server

Authors

  • Maksym Mishchenko
  • Mariia Dorosh

DOI:

https://doi.org/10.47839/ijc.23.4.3765

Keywords:

cybersecurity, NLP, word2vec, proxy-server, machine learning, Windows Portable Executable, malware, ssdeep, LAN

Abstract

This paper investigates the effectiveness of detecting viruses in Windows Portable Executable (PE) files using NLP, machine learning, and a computer network proxy. The selected classification performance metrics are the accuracy and F1 score of virus type classification for a given file and the average time spent analyzing a file. To classify viruses, a static analysis of the Optional Header Directories section of the PE file is conducted. The list of imported libraries is vectorized using the word2vec model and passed to the Random Forest Classifier, Support Vector Machine, and Multilayer Perceptron models for classification. The best result is achieved by the Random Forest Classifier model, with a training mean accuracy of 94% and an F1 score of 0.94. To determine the effectiveness of virus file detection, a local area network (LAN) of three computers and a proxy server is configured. Experiments on detecting malicious files through the proxy show a request time of 2.3 seconds for Support Vector Machine, 2.28 seconds for Multilayer Perceptron, and 2.6 seconds for Random Forest Classifier. To reduce this delay, an ssdeep-based cache is introduced, which lowers the delay to 2.1 seconds for Random Forest Classifier and 2.15 seconds for Multilayer Perceptron. The F1 score obtained on the proxy evaluation data confirmed, and exceeded, the F1 score obtained on the training dataset. This supports the feasibility of using a proxy server and NLP techniques to detect Windows Portable Executable malware.
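
The caching idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names VerdictCache, toy_digest, toy_similarity, and the threshold of 90 are hypothetical, and the standard-library difflib comparison only stands in for a real fuzzy hash such as ssdeep. The sketch shows one plausible shape for such a cache: a file whose digest is close enough to a previously seen digest reuses the stored verdict, so the classifier is skipped.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def toy_digest(data: bytes) -> str:
    """Hypothetical stand-in for a fuzzy hash: hex of the first 64 bytes."""
    return data[:64].hex()


def toy_similarity(a: str, b: str) -> int:
    """Hypothetical stand-in for a fuzzy-hash comparison, scored 0 to 100."""
    ratio = SequenceMatcher(None, a, b, autojunk=False).ratio()
    return round(ratio * 100)


@dataclass
class VerdictCache:
    """Reuses a stored verdict when a new file's digest is similar enough."""

    classify: Callable[[bytes], str]  # the slow model call being cached
    digest: Callable[[bytes], str] = toy_digest
    similarity: Callable[[str, str], int] = toy_similarity
    threshold: int = 90  # assumed cut-off, not taken from the paper
    entries: List[Tuple[str, str]] = field(default_factory=list)
    hits: int = 0

    def lookup(self, data: bytes) -> str:
        d = self.digest(data)
        # Linear scan over known digests; return the first close match.
        for known, verdict in self.entries:
            if self.similarity(d, known) >= self.threshold:
                self.hits += 1
                return verdict
        # Cache miss: run the classifier and remember its verdict.
        verdict = self.classify(data)
        self.entries.append((d, verdict))
        return verdict
```

In a proxy setting, lookup would be called on each downloaded PE file before it is forwarded to the client. The linear scan keeps the sketch short; a larger deployment would likely need an index over the digests.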

Published

2025-01-12

How to Cite

Mishchenko, M., & Dorosh, M. (2025). Detection of Windows Portable Executable Malware using NLP Techniques and Proxy-server. International Journal of Computing, 23(4), 663-672. https://doi.org/10.47839/ijc.23.4.3765

Section

Articles