Detecting Source Code Similarities in Programming

In educational institutions, the problem of plagiarism is not confined to written essays and other text-based assessments. Source code plagiarism is also a growing concern in programming-related courses (Joy et al., 2011). The availability of “web-based solutions for programming exercises” (Joy et al., 2011, p. 125) and the ease with which code can be copied from one source to another have made it easier for students to meet assessment requirements without having to wrestle with the logic and language of programming (Ahadi & Mathieson, 2019).

Current methods of determining the academic integrity of coding submissions are plagued with challenges. The ever-rising demand for coding skills in the workplace has led to droves of students enrolling in coding-related courses (Hargrave, 2018). The sheer numbers make manual inspection of code not only laborious but also impractical. Additionally, judging the degree of similarity between two segments of code through visual examination is by no means a simple endeavour. Because marking loads are often distributed among multiple assessors, it is nearly impossible to tell whether Student A in Marker A’s load copied from Student B in Marker B’s load. Without a Turnitin equivalent, many programming educators arm themselves with an arsenal of pointed questions and interview every student, much like the viva voce for doctoral candidates. This, too, is incredibly time-consuming and perhaps even a little contentious: do we fail the shy and anxious student who lacks the eloquence to verbalise the intricate workings of their programme?

Herein lies the value of automated source code similarity detection tools. These tools align and compare each source code submission against every other submission in the cohort (Ahadi & Mathieson, 2019). Specialised algorithms within these programmes flag similarities to instructors, whose aim, then, is to differentiate incidental similarities from blatant plagiarism.
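To illustrate the general idea only (JPLAG itself tokenises each programme and applies a greedy string tiling comparison rather than raw text matching), the short Python sketch below compares every pair of submissions using a simple sequence-matching ratio. The directory layout, file extension and flagging threshold are assumptions made purely for this example.

    import itertools
    from difflib import SequenceMatcher
    from pathlib import Path

    def normalise(source: str) -> str:
        """Strip blank lines and surrounding whitespace so trivial edits do not hide copying."""
        return "\n".join(line.strip() for line in source.splitlines() if line.strip())

    # Assumes each student's submission is a single .py file inside a "submissions" folder.
    submissions = {p.name: normalise(p.read_text()) for p in Path("submissions").glob("*.py")}

    # Compare every submission against every other and flag the most similar pairs.
    for (name_a, code_a), (name_b, code_b) in itertools.combinations(submissions.items(), 2):
        similarity = SequenceMatcher(None, code_a, code_b).ratio()  # 0.0 (no overlap) to 1.0 (identical)
        if similarity > 0.8:  # arbitrary threshold; an instructor would tune this to the task
            print(f"{name_a} <-> {name_b}: {similarity:.0%} similar - review manually")

Real tools go well beyond this sketch: they normalise identifiers, ignore comments and whitespace, and detect reordered or renamed fragments, which is why dedicated programmes such as JPLAG are preferable to ad hoc text comparison.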

A commonly used and freely available source code similarity detection programme is JPLAG. Developed at the Karlsruhe Institute of Technology in Germany, JPLAG currently supports Java, C#, C, C++ and Scheme, as well as natural language text. A recent study by Mišić, Šuštran, and Protić (2016) testifies to the effectiveness of JPLAG in detecting source code plagiarism.

Professor John Thangarajah, Associate Dean (Head) of Computer Science and Software Engineering at RMIT University, has used JPLAG. Professor Thangarajah passes student coding submissions through a version of JPLAG installed locally on his computer. JPLAG then compares these programmes in pairs, computes a similarity value for each pair and highlights regions of similarity. JPLAG also produces a set of HTML pages that enables Professor Thangarajah to investigate the similarities in greater detail. He has found JPLAG effective in shortlisting submissions where plagiarism is suspected.
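For readers who wish to try a similar workflow, JPLAG is typically run from the command line over a root folder that contains one sub-folder per student submission. The exact options differ between releases, so the invocation below is indicative only and should be checked against the documentation linked at the end of this article:

    java -jar jplag.jar ./submissions -l java -r results

Here, ./submissions is the assumed root folder, -l selects the language parser (Java in this example) and -r names the output in which the pairwise similarity results are stored for later inspection.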

According to Professor Thangarajah, a secondary purpose of these tools is to educate students on how to collaborate appropriately on programming tasks. Often, ‘plagiarism’ is not the result of intentional dishonesty but of a lack of understanding of how to code together. For example, if three students are working on a programming problem, each student should be expected to ‘pull their own weight’ and assume ownership of an equal portion of the solution. Prior to combining the parts into a coherent whole, each programmer should ensure that their assigned portion functions correctly through a systematic process of debugging and rigorous testing. The problem arises when one student (perhaps the more passionate, competent or domineering group member, and not always one and the same) overenthusiastically does all the work.

It is worth mentioning that not all instances of code copying constitute plagiarism. According to Lee et al. (2012), a programme is considered plagiarised if it “has been produced from another programme without a (programmer’s) detailed understanding of the source code.” Experienced programmers, and most notably advocates of the Open Source Revolution, recognise the importance of reusing code to improve efficiency and productivity (DiBona & Ockman, 1999). It is often the novice programmer who blindly copies code without an adequate understanding of its workings who will struggle when tackling more advanced programming problems in later years of study or in industry.

For more information on JPLAG, including installation instructions, please refer to https://github.com/jplag/jplag/releases. Another tool comparable to JPLAG in effectiveness is MOSS, short for Measure Of Software Similarity (Mišić, Šuštran, & Protić, 2016). MOSS is accessible via https://theory.stanford.edu/~aiken/moss/.

References

Ahadi, A., & Mathieson, L. (2019, January). A comparison of three popular source code similarity tools for detecting student plagiarism. In Proceedings of the Twenty-First Australasian Computing Education Conference (pp. 112-117).

DiBona, C., & Ockman, S. (1999). Open sources: Voices from the open source revolution. O’Reilly Media, Inc.

Hargrave, S. (2018, October 25). Rise of the machines: Why coding is the skill you have to learn. The Guardian. https://www.theguardian.com/new-faces-of-tech/2018/oct/25/rise-of-the-machines-why-coding-is-the-skill-you-have-to-learn

Joy, M., Cosma, G., Yau, J. Y. K., & Sinclair, J. (2011). Source code plagiarism—a student perspective. IEEE Transactions on Education, 54(1), 125-132.

Lee, Y. J., Lim, J. S., Ji, J. H., Cho, H. G., & Woo, G. (2012). Plagiarism detection among source codes using adaptive methods. KSII Transactions on Internet & Information Systems, 6(6).

Mišić, M., Šuštran, Ž., & Protić, J. (2016). A comparison of software tools for plagiarism detection in programming assignments. International Journal of Engineering Education, 32(2), 738-748.