Using Large Language Models for Data Augmentation in Text Classification Models

Authors

  • Bohdan Pavlyshenko
  • Mykola Stasiuk

DOI:

https://doi.org/10.47839/ijc.24.1.3886

Keywords:

augmentation, multi-class text classification, large language models, transformers, BERT, ALBERT, DistilBERT, XLM-RoBERTa

Abstract

This research considers the impact of data augmentation on multi-class text classification. A diverse news dataset comprising four categories was used for training and evaluation. Several transformer models, including BERT, DistilBERT, ALBERT, and RoBERTa, were employed to classify texts across multiple categories. Based on previous research on data augmentation, four techniques were chosen: synonym replacement, antonym replacement, contextual word embedding substitution, and the LAMBADA method. To investigate the capabilities of LLMs for augmentation, three mainstream models were selected: LLaMA 3, GPT-4, and Mistral. These models represent a diverse range of architectures and training data, making it possible to assess the impact of different LLM capabilities on data augmentation performance. The performance of the transformer models was evaluated using accuracy, recall, precision, F1-score, training time, and training and validation losses. The experiments revealed that data augmentation significantly improves the performance of transformer models in text classification tasks, with LAMBADA augmentation consistently outperforming the other methods. However, model architecture and hyperparameter tuning also play a crucial role in achieving optimal results; RoBERTa, in particular, required careful hyperparameter adjustment to reach competitive performance. The obtained results have practical implications for developing NLP applications in low-resource languages, as data augmentation can help address the limitations of small datasets.
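To make the augmentation setup concrete, below is a minimal, hedged sketch of how such a pipeline might be assembled with off-the-shelf tools: classical augmenters (synonym replacement and contextual word-embedding substitution) via the nlpaug library cited in the references, plus an LLM paraphrasing step through a chat-completion API. The prompt wording, model identifiers, and helper names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (not the authors' exact pipeline): expanding a small labeled
# news dataset with (a) classical nlpaug augmenters and (b) LLM paraphrases,
# before fine-tuning a transformer classifier. Model names, prompt wording,
# and helper functions are illustrative assumptions.
import nlpaug.augmenter.word as naw
from openai import OpenAI

# (a) Classical augmenters: synonym replacement (requires the NLTK WordNet
# corpus) and contextual word-embedding substitution with BERT.
synonym_aug = naw.SynonymAug(aug_src="wordnet")
contextual_aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased", action="substitute"
)

def classic_augment(text: str) -> list[str]:
    """Return augmented variants of one training example."""
    variants = []
    for aug in (synonym_aug, contextual_aug):
        out = aug.augment(text)  # recent nlpaug versions return a list
        variants.extend(out if isinstance(out, list) else [out])
    return variants

# (b) LLM-based augmentation: ask a chat model to paraphrase the example
# while preserving its category label.
client = OpenAI()  # expects OPENAI_API_KEY in the environment

def llm_augment(text: str, label: str, n: int = 2) -> list[str]:
    prompt = (
        f"Paraphrase the following '{label}' news snippet in {n} different "
        f"ways, one per line, keeping the topic unchanged:\n{text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]

# Augmented examples keep the original label and are appended to the training
# set used to fine-tune the BERT / DistilBERT / ALBERT / RoBERTa classifiers.
train_rows = [("Stocks rallied after the central bank decision.", "business")]
augmented = [(t, lab) for text, lab in train_rows
             for t in classic_augment(text) + llm_augment(text, lab)]
```

In such a setup, the augmented rows are simply concatenated with the original training split before fine-tuning, so the evaluation metrics listed above can be compared across the unaugmented and augmented datasets.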

References

N. Patwardhan, S. Marrone, and C. Sansone, “Transformers in the real world: A survey on nlp applications,” Information, vol. 14, no. 4, p. 242, 2023. [Online]. Available: https://doi.org/10.3390/info14040242

T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “Huggingface’s transformers: State-of-the-art natural language processing,” arXiv preprint arXiv:1910.03771, 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1910.03771

T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Q. Liu and D. Schlangen, Eds. Online: Association for Computational Linguistics, Oct. 2020, pp. 38–45. [Online]. Available: https://doi.org/10.18653/v1/2020.emnlp-demos.6

A. M. Brașoveanu and R. Andonie, “Visualizing transformers for nlp: a brief survey,” in 2020 24th International Conference Information Visualisation (IV). IEEE, 2020, pp. 270–279. [Online]. Available: https://doi.org/10.1109/IV51561.2020.00051

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1810.04805

Y. Liu et al., “Roberta: A robustly optimized bert pretraining approach,” 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1907.11692

V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” 2020. [Online]. Available: https://doi.org/10.48550/arXiv.1910.01108

Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” 2020. [Online]. Available: https://doi.org/10.48550/arXiv.1909.11942

W. X. Zhao et al., “A survey of large language models,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.18223

R. Bansal et al., “Llm augmented llms: Expanding capabilities through composition,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.02412

D. Saha et al., “Llm for soc security: A paradigm shift,” IEEE Access, 2024. [Online]. Available: https://doi.org/10.1109/ACCESS.2024.3427369

H. Tang et al., “Time series forecasting with llms: Understanding and enhancing model capabilities,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.10835

M. U. Hadi et al., “Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects,” Authorea Preprints, 2023. [Online]. Available: https://doi.org/10.36227/techrxiv.23589741.v6

B. M. Pavlyshenko, “Analysis of disinformation and fake news detection using fine-tuned large language model,” arXiv preprint arXiv:2309.04704, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.04704

B. M. Pavlyshenko, “Forming predictive features of tweets for decisionmaking support,” in Lecture Notes in Computational Intelligence and Decision Making, S. Babichev and V. Lytvynenko, Eds. Cham: Springer International Publishing, 2022, pp. 479–490. [Online]. Available: https://doi.org/10.1007/978-3-030-82014-5_32

B. M. Pavlyshenko, “Methods of informational trends analytics and fake news detection on twitter,” arXiv preprint arXiv:2204.04891, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2204.04891

X.-Q. Dao and N.-B. Le, “Llms performance on vietnamese high school biology examination,” Int. J. Mod. Educ. Comp. Sci, vol. 15, pp. 14–30, 2023. [Online]. Available: https://doi.org/10.5815/ijmecs.2023.06.02

O. Duda, V. Kochan, N. Kunanets, O. Matsiuk, V. Pasichnyk, A. Sachenko, and T. Pytlenko, “Data processing in iot for smart city systems,” in 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), vol. 1. IEEE, 2019, pp. 96–99. [Online]. Available: https://doi.org/10.1109/IDAACS.2019.8924262

Z. Hu, I. Dychka, K. Potapova, and V. Meliukh, “Augmenting sentiment analysis prediction in binary text classification through advanced natural language processing models and classifiers,” Int. J. Inf. Technol. Comput. Sci, vol. 16, pp. 16–31, 2024. [Online]. Available: https://doi.org/10.5815/ijitcs.2024.02.02

K. Chang, K. Wang, N. Yang, Y. Wang, D. Jin, W. Zhu, Z. Chen, C. Li, H. Yan, Y. Zhou, Z. Zhao, Y. Cheng, Y. Pan, Y. Liu, M. Wang, S. Liang, Y. Han, H. Li, and X. Li, “Data is all you need: Finetuning llms for chip design via an automated design-data augmentation framework,” in Proceedings of the 61st ACM/IEEE Design Automation Conference, ser. DAC ’24. New York, NY, USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3649329.3657356

N. Lee, T. Wattanawong, S. Kim, K. Mangalam, S. Shen, G. Anumanchipalli, M. W. Mahoney, K. Keutzer, and A. Gholami, “Llm2llm: Boosting llms with novel iterative data enhancement,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2403.15042

C. Shorten, T. M. Khoshgoftaar, and B. Furht, “Text data augmentation for deep learning,” Journal of big Data, vol. 8, no. 1, p. 101, 2021. [Online]. Available: https://doi.org/10.1186/s40537-021-00492-0

M. Bayer, M.-A. Kaufhold, and C. Reuter, “A survey on data augmentation for text classification,” ACM Computing Surveys, vol. 55, no. 7, pp. 1–39, 2022. [Online]. Available: https://doi.org/10.1145/3544558

J. Cegin, J. Simko, and P. Brusilovsky, “Llms vs established text augmentation techniques for classification: When do the benefits outweight the costs?” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2408.16502

B. Ding et al., “Data augmentation using large language models: Data perspectives, learning paradigms and challenges,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2403.02990

G. Sahu and I. H. Laradji, “A guide to effectively leveraging llms for low-resource text summarization: Data augmentation and semi-supervised approaches,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2407.07341

F. Piedboeuf and P. Langlais, “Is ChatGPT the ultimate data augmentation algorithm?” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 15606–15615. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.1044

X. Zhang, J. J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in NIPS, 2015. [Online]. Available: https://doi.org/10.48550/arXiv.1509.01626

E. Ma, “Nlp augmentation,” https://github.com/makcedward/nlpaug, 2019.

B. Pavlyshenko and M. Stasiuk, “Data augmentation in text classification with multiple categories,” Electronics and Information Technologies, no. 25, pp. 67–80, 2024. [Online]. Available: http://dx.doi.org/10.30970/eli.25.6

A. Anaby-Tavor et al., “Do not have enough data? deep learning to the rescue!” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 7383–7390. [Online]. Available: https://doi.org/10.48550/arXiv.1911.03118

AI@Meta, “Llama 3 model card,” 2024. [Online]. Available: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

A. Q. Jiang et al., “Mistral 7b,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.06825

J. Achiam et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.08774

Published

2025-03-31

How to Cite

Pavlyshenko, B., & Stasiuk, M. (2025). Using Large Language Models for Data Augmentation in Text Classification Models. International Journal of Computing, 24(1), 148-154. https://doi.org/10.47839/ijc.24.1.3886

Section

Articles