Fine-Tuning Large Language Models for Code-Style Analysis: The Significance of Dataset Size
DOI: https://doi.org/10.47839/ijc.24.1.3885
Keywords: code-style analysis, PEP-8, large language models, Llama 2, Llama 3, fine-tuning, dataset size, zero-shot learning, low-rank adaptation, QLoRA
Abstract
One aspect of a well-written codebase is its adherence to a particular code style, and Large Language Models (LLMs) can greatly assist in reviewing and adapting code to follow the defined conventions. Because specific code-style rules are typically not known during pre-training of the base model, additional fine-tuning is necessary. However, the number of training samples required to achieve optimal model performance is unclear. This work investigates the significance of dataset size when fine-tuning LLMs to classify Python code snippets as compliant or non-compliant with the specific PEP-8 indentation rule. We used Low-Rank Adaptation (LoRA) and its quantized variant (QLoRA) to fine-tune the Llama 2 7B and Llama 3 8B models on datasets of varying sizes, ranging from 60 to 480 training samples. Our experiments demonstrated that models fine-tuned on the larger datasets (240 and 480 samples) achieved accuracies of up to 99%, whereas those trained on the smaller datasets (60 and 120 samples) overfitted and reached lower accuracy. These findings will serve as the basis for subsequent research into the potential of LLMs to improve code readability, maintainability, and adherence to coding standards in software development. The methodology used to determine a sufficient number of training samples can also be valuable for fine-tuning LLMs in other domains where strict style or formatting conventions are required, such as legal document preparation, standardized medical reporting, or financial regulatory filings.
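For readers who want a concrete starting point, the sketch below shows one way such a compliance classifier could be fine-tuned with LoRA adapters using the Hugging Face transformers and peft libraries. It is a minimal illustration, not the authors' exact pipeline: the checkpoint identifier, hyperparameters, and dataset field names are assumptions.

```python
# Minimal LoRA fine-tuning sketch for binary classification of PEP-8 E111
# (indentation) compliance. Checkpoint id, hyperparameters, and dataset
# column names ("code", "label") are illustrative assumptions.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Llama models define no pad token

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,                 # compliant / non-compliant
    torch_dtype=torch.bfloat16,
)
model.config.pad_token_id = tokenizer.pad_token_id

# Attach low-rank adapters to the attention projections; only the adapter
# weights (and the classification head) are updated during training.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)

# Labeled code snippets; the column layout is assumed here.
dataset = load_dataset("aholovko/pep8_e111_compliance")

def tokenize(batch):
    return tokenizer(
        batch["code"], truncation=True, padding="max_length", max_length=512
    )

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pep8-e111-lora",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
    ),
    train_dataset=tokenized["train"],
)
trainer.train()
```

In the quantized variant (QLoRA), the base weights would additionally be loaded in 4-bit precision (for example via the bitsandbytes integration) before the adapters are attached, which lowers the memory footprint at a small cost in training throughput.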