Detection of Source Code Plagiarism Utilizing an Approach Based on Machine Learning

Raddam Sami Mehsen; Hiren D. Joshi

doi:10.47839/ijc.23.1.3438

Keywords:

Source code, plagiarism, machine learning, C, Python, programming assignments

Abstract

Academic institutions, which often publish papers and journals, are ideal testing grounds for the efficacy of plagiarism detection methods. Plagiarism occurs when someone uses the words of another writer without giving that writer proper credit. The proliferation of freeware text editors and the increasing availability of scientific materials online have made plagiarism detection a pressing concern, and detecting plagiarism in source code is a particularly difficult problem. Plagiarism detection algorithms for identification systems and software source code have been the subject of numerous academic investigations. The proposed method combines TF-IDF transformations with K-means clustering and achieves 99.2% accuracy in detecting plagiarism in source code, which is attributed to the clustering step grouping similar lines of code together. Its results are also significantly better than those produced by the random forest algorithm. The existing MOSS system performed worse than the proposed system with training splits of 90% and 80%. The results are compared using precision, recall, and F-measure as evaluation parameters. The proposed system is implemented in Python using Jupyter Notebook 7. A graphical user interface is also designed and implemented to give users a friendly experience.
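The combination of TF-IDF weighting and K-means clustering described in the abstract can be illustrated with a minimal sketch. The abstract does not give the tokenizer, the IDF variant, the number of clusters, or the initialisation, so everything below is an assumption: a regex tokenizer, a smoothed IDF, farthest-point initialisation, two clusters, and four hypothetical student submissions. It is not the authors' implementation and uses only the Python standard library.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Split source code into identifiers, numbers, and single symbols (assumed tokenizer)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)


def tfidf_vectors(token_lists):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1
        vec = {
            tok: (cnt / len(toks)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, cnt in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({tok: w / norm for tok, w in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def kmeans(vectors, k, iterations=20):
    """Plain K-means with deterministic farthest-point initialisation (assumed)."""
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(dict(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if not members:
                continue
            total = Counter()
            for v in members:
                for tok, w in v.items():
                    total[tok] += w
            centroids[j] = {tok: w / len(members) for tok, w in total.items()}
    return labels


# Hypothetical submissions: the second and fourth are renamed copies.
submissions = [
    "def total(values):\n    s = 0\n    for v in values:\n        s += v\n    return s",
    "def add_all(values):\n    s = 0\n    for v in values:\n        s += v\n    return s",
    "def greet(name):\n    print('Hello, ' + name.upper() + '!')",
    "def welcome(name):\n    print('Hello, ' + name.upper() + '!')",
]

labels = kmeans(tfidf_vectors([tokenize(s) for s in submissions]), k=2)
print(labels)  # submissions sharing a label are candidates for manual review
```

In this sketch, submissions that land in the same cluster are only flagged as similar; deciding whether a flagged pair is actually plagiarised would still need a threshold or a human reviewer, which the abstract does not specify.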

International Journal of Computing
