Using Large Language Models for Data Augmentation in Text Classification Models
DOI:
https://doi.org/10.47839/ijc.24.1.3886

Keywords:
augmentation, multi-class text classification, large language models, transformers, BERT, ALBERT, DistilBERT, XLM-RoBERTa

Abstract
This research considers the impact of data augmentation on multi-class text classification. A diverse news dataset comprising four categories was used for training and evaluation. Several transformer models, including BERT, DistilBERT, ALBERT, and RoBERTa, were employed to classify text across multiple categories. Based on previous research on data augmentation, four methods were chosen: synonym replacement, antonym replacement, contextual word embedding, and the LAMBADA method. To investigate the capabilities of large language models (LLMs), three mainstream models were selected: LLaMA 3, GPT-4, and MistralAI. These models represent a diverse range of architectures and training data, allowing us to assess how different LLM capabilities affect data augmentation performance. The performance of the aforementioned transformer models was evaluated using metrics such as accuracy, recall, precision, F1-score, training time, and validation and training loss. Experiments revealed that data augmentation significantly improved the performance of transformer models in text classification tasks, with LAMBADA augmentation consistently outperforming the other methods. However, model architecture and hyperparameter tuning also played a crucial role in achieving optimal results. RoBERTa, in particular, required careful hyperparameter adjustment to reach competitive performance levels. The obtained results have practical implications for developing NLP applications in low-resource languages, as data augmentation can help address the limitations of small datasets.
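As an illustration of the word-level augmentation methods named above, the following minimal Python sketch applies synonym replacement, antonym replacement, and contextual word embedding augmentation with the nlpaug library (Ma, 2019, cited in the references); the example sentence, the bert-base-uncased checkpoint, and all parameters are illustrative assumptions rather than the exact configuration used in the study.

# Minimal sketch of three word-level augmentation methods, using nlpaug.
# The example text and the "bert-base-uncased" checkpoint are assumptions.
import nlpaug.augmenter.word as naw

text = "The central bank raised interest rates to slow down inflation."

# Synonym replacement: substitutes words with WordNet synonyms.
synonym_aug = naw.SynonymAug(aug_src="wordnet")
print(synonym_aug.augment(text))

# Antonym replacement: substitutes words with WordNet antonyms.
antonym_aug = naw.AntonymAug()
print(antonym_aug.augment(text))

# Contextual word embeddings: a masked language model (here BERT)
# proposes replacement words that fit the surrounding sentence context.
contextual_aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased", action="substitute"
)
print(contextual_aug.augment(text))

Each augmenter returns paraphrased variants of the input text, which can be added to the training set to enlarge under-represented classes before fine-tuning the classifiers.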
References
N. Patwardhan, S. Marrone, and C. Sansone, “Transformers in the real world: A survey on nlp applications,” Information, vol. 14, no. 4, p. 242, 2023. [Online]. Available: https://doi.org/10.3390/info14040242
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “Huggingface’s transformers: State-of-the-art natural language processing,” arXiv preprint arXiv:1910.03771, 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1910.03771
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Q. Liu and D. Schlangen, Eds. Online: Association for Computational Linguistics, Oct. 2020, pp. 38–45. [Online]. Available: https://doi.org/10.18653/v1/2020.emnlp-demos.6
A. M. Brașoveanu and R. Andonie, “Visualizing transformers for nlp: a brief survey,” in 2020 24th International Conference Information Visualisation (IV). IEEE, 2020, pp. 270–279. [Online]. Available: https://doi.org/10.1109/IV51561.2020.00051
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1810.04805
Y. Liu et al., “Roberta: A robustly optimized bert pretraining approach,” 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1907.11692
V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” 2020. [Online]. Available: https://doi.org/10.48550/arXiv.1910.01108
Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” 2020. [Online]. Available: https://doi.org/10.48550/arXiv.1909.11942
W. X. Zhao et al., “A survey of large language models,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.18223
R. Bansal et al., “Llm augmented llms: Expanding capabilities through composition,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.02412
D. Saha et al., “Llm for soc security: A paradigm shift,” IEEE Access, 2024. [Online]. Available: https://doi.org/10.1109/ACCESS.2024.3427369
H. Tang et al., “Time series forecasting with llms: Understanding and enhancing model capabilities,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.10835
M. U. Hadi et al., “Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects,” Authorea Preprints, 2023. [Online]. Available: https://doi.org/10.36227/techrxiv.23589741.v6
B. M. Pavlyshenko, “Analysis of disinformation and fake news detection using fine-tuned large language model,” arXiv preprint arXiv:2309.04704, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.04704
B. M. Pavlyshenko, “Forming predictive features of tweets for decision-making support,” in Lecture Notes in Computational Intelligence and Decision Making, S. Babichev and V. Lytvynenko, Eds. Cham: Springer International Publishing, 2022, pp. 479–490. [Online]. Available: https://doi.org/10.1007/978-3-030-82014-5_32
B. M. Pavlyshenko, “Methods of informational trends analytics and fake news detection on twitter,” arXiv preprint arXiv:2204.04891, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2204.04891
X.-Q. Dao and N.-B. Le, “Llms performance on vietnamese high school biology examination,” Int. J. Mod. Educ. Comp. Sci, vol. 15, pp. 14–30, 2023. [Online]. Available: https://doi.org/10.5815/ijmecs.2023.06.02
O. Duda, V. Kochan, N. Kunanets, O. Matsiuk, V. Pasichnyk, A. Sachenko, and T. Pytlenko, “Data processing in iot for smart city systems,” in 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), vol. 1. IEEE, 2019, pp. 96–99. [Online]. Available: https://doi.org/10.1109/IDAACS.2019.8924262
Z. Hu, I. Dychka, K. Potapova, and V. Meliukh, “Augmenting sentiment analysis prediction in binary text classification through advanced natural language processing models and classifiers,” Int. J. Inf. Technol. Comput. Sci, vol. 16, pp. 16–31, 2024. [Online]. Available: https://doi.org/10.5815/ijitcs.2024.02.02
K. Chang, K. Wang, N. Yang, Y. Wang, D. Jin, W. Zhu, Z. Chen, C. Li, H. Yan, Y. Zhou, Z. Zhao, Y. Cheng, Y. Pan, Y. Liu, M. Wang, S. Liang, Y. Han, H. Li, and X. Li, “Data is all you need: Finetuning llms for chip design via an automated design-data augmentation framework,” in Proceedings of the 61st ACM/IEEE Design Automation Conference, ser. DAC ’24. New York, NY, USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3649329.3657356
N. Lee, T. Wattanawong, S. Kim, K. Mangalam, S. Shen, G. Anumanchipalli, M. W. Mahoney, K. Keutzer, and A. Gholami, “Llm2llm: Boosting llms with novel iterative data enhancement,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2403.15042
C. Shorten, T. M. Khoshgoftaar, and B. Furht, “Text data augmentation for deep learning,” Journal of big Data, vol. 8, no. 1, p. 101, 2021. [Online]. Available: https://doi.org/10.1186/s40537-021-00492-0
M. Bayer, M.-A. Kaufhold, and C. Reuter, “A survey on data augmentation for text classification,” ACM Computing Surveys, vol. 55, no. 7, pp. 1–39, 2022. [Online]. Available: https://doi.org/10.1145/3544558
J. Cegin, J. Simko, and P. Brusilovsky, “Llms vs established text augmentation techniques for classification: When do the benefits outweigh the costs?” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2408.16502
B. Ding et al., “Data augmentation using large language models: Data perspectives, learning paradigms and challenges,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2403.02990
G. Sahu and I. H. Laradji, “A guide to effectively leveraging llms for low-resource text summarization: Data augmentation and semi-supervised approaches,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2407.07341
F. Piedboeuf and P. Langlais, “Is ChatGPT the ultimate data augmentation algorithm?” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 15606–15615. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.1044
X. Zhang, J. J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in NIPS, 2015. [Online]. Available: https://doi.org/10.48550/arXiv.1509.01626
E. Ma, “Nlp augmentation,” https://github.com/makcedward/nlpaug, 2019.
B. Pavlyshenko and M. Stasiuk, “Data augmentation in text classification with multiple categories,” Electronics and Information Technologies, no. 25, pp. 67–80, 2024. [Online]. Available: http://dx.doi.org/10.30970/eli.25.6
A. Anaby-Tavor et al., “Do not have enough data? deep learning to the rescue!” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 05, 2020, pp. 7383–7390. [Online]. Available: https://doi.org/10.48550/arXiv.1911.03118
AI@Meta, “Llama 3 model card,” 2024. [Online]. Available: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
A. Q. Jiang et al., “Mistral 7b,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.06825
J. Achiam et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.08774
License
International Journal of Computing is an open access journal. Authors who publish with this journal agree to the following terms:
• Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
• Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
• Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.