
Aether: Detecting Code Plagiarism with Transformer Embeddings

How we built an advanced code plagiarism detection system using CodeBERTa embeddings and a neural classifier, achieving an F1-score of 0.85 on real-world datasets.

January 15, 2025 · 7 min read
#ai #machine-learning #transformers #code-analysis #plagiarism-detection #nlp #research

Code plagiarism is a growing challenge in both academic and professional environments. Traditional detection tools struggle against modern obfuscation techniques—renaming variables, restructuring code, or swapping equivalent operations can easily fool lexical comparison methods.

In this post, I'll share our research on Aether, a code plagiarism detection system that uses transformer-based embeddings to capture the semantic meaning of code, not just its surface-level syntax.

Research Project

This post summarizes our academic research on code similarity detection using pre-trained language models. Check out the Aether project on GitHub for the full implementation.


The Problem with Traditional Detection

Classic tools like MOSS and JPlag rely on token comparisons: they look at the sequence of keywords, operators, and identifiers in your code. While effective against copy-paste plagiarism, they fail when facing the transformations below (a concrete example follows the list):

  • Variable renaming: Changing sum to total or count to numberOfItems
  • Code restructuring: Moving code blocks around, splitting functions
  • Equivalent substitutions: Using while instead of for, or switch instead of if-else
  • Comment modifications: Adding, removing, or changing comments
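
To make this concrete, here is an illustrative pair (written in Python for brevity; our datasets are Java). The two functions are lexically very different yet compute exactly the same result, which is precisely the kind of pair token-based tools tend to miss:

```python
# Original submission
def total(values):
    s = 0
    for v in values:
        s += v
    return s


# "Plagiarized" copy: renamed identifiers, a while loop instead of for,
# and a new comment. Lexically distant, semantically identical.
def aggregate(items):
    acc = 0
    i = 0
    while i < len(items):
        acc += items[i]
        i += 1
    return acc
```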

Detection Gap

Studies show detection rates below 60% for sophisticated plagiarism cases using traditional tools. We needed something smarter.


Our Approach: Semantic Code Understanding

Instead of comparing text, we compare meaning. The key insight is that pre-trained language models designed for code learn deep semantic representations that capture what code does, not just what it looks like.

Why CodeBERTa?

We chose CodeBERTa-small-v1 for several reasons:

| Model | Parameters | Performance | Verdict |
| --- | --- | --- | --- |
| CodeBERT | ~125M | Excellent | Too computationally expensive |
| UniXCoder | ~125M | Excellent | Similar concerns |
| CodeBERTa | ~84M | Very Good | Best balance of performance & efficiency |
| PLBART | ~140M | Good | Higher parameter count |

CodeBERTa offers:

  • 768-dimensional embeddings that capture semantic features
  • 6 Transformer layers with 12 attention heads each
  • Training on 6 million source code files across multiple languages
  • Roughly a third fewer parameters than the base-size alternatives we considered, with competitive performance
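
If you want to load the same encoder, here is a minimal sketch with Hugging Face transformers, assuming the publicly hosted huggingface/CodeBERTa-small-v1 checkpoint:

```python
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "huggingface/CodeBERTa-small-v1"  # checkpoint on the HF Hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()  # we only extract embeddings, so switch to inference mode
```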

System Architecture

Our detection system works in three stages:

(Architecture diagram: code pair → CodeBERTa embeddings → vector difference → neural classifier → plagiarism decision)

Step 1: Generate Embeddings

Each code snippet is tokenized and passed through CodeBERTa. We use the CLS token output—a special token trained to represent the entire sequence in a single vector:

```python
import torch


def get_embedding(code: str, model, tokenizer) -> torch.Tensor:
    """Extract a semantic embedding from code using CodeBERTa."""
    inputs = tokenizer(
        code,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # The CLS token at position 0 summarizes the entire sequence
    return outputs.last_hidden_state[:, 0, :]
```
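
As a quick sanity check, embedding a one-line function should yield a single 768-dimensional vector (batch dimension included):

```python
emb = get_embedding("def add(a, b): return a + b", model, tokenizer)
print(emb.shape)  # expected: torch.Size([1, 768])
```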

Step 2: Calculate Vector Difference

We experimented with several approaches to compare embeddings:

| Method | Pros | Verdict |
| --- | --- | --- |
| Cosine Similarity | Simple, fast | Loses directional info |
| Euclidean Distance | Captures magnitude | Sensitive to scale |
| Vector Difference | Preserves direction | Winner! |

The vector difference (embedding_A - embedding_B) preserves directional information about how the two code samples differ semantically. Unlike a single similarity score, it hands the classifier a full 768-dimensional signal, letting it learn nuanced patterns in how code pairs differ.
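
As a sketch, reusing get_embedding from Step 1 (code_a and code_b are placeholder source strings), the difference vector and the cosine baseline come out like this:

```python
import torch.nn.functional as F

emb_a = get_embedding(code_a, model, tokenizer)  # shape: [1, 768]
emb_b = get_embedding(code_b, model, tokenizer)

diff = emb_a - emb_b                        # 768-dim input to the classifier
cosine = F.cosine_similarity(emb_a, emb_b)  # collapses to a single scalar
```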

Step 3: Neural Classification

The magic happens in a specialized neural classifier that learns to interpret embedding differences:

```python
import torch.nn as nn


class SimilarityClassifier(nn.Module):
    """Maps a 768-dim embedding difference to a plagiarism probability."""

    def __init__(self, embedding_dim=768):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, 32),
            nn.ReLU(),
            nn.Dropout(0.6),  # aggressive dropout to fight overfitting
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Dropout(0.6),
            nn.Linear(16, 1),
            nn.Sigmoid(),  # probability in [0, 1]
        )

    def forward(self, diff):
        return self.classifier(diff)
```
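
As for training, here is a rough sketch using binary cross-entropy on labeled pairs; the optimizer, learning rate, and train_loader below are illustrative assumptions, not our exact setup:

```python
import torch

clf = SimilarityClassifier()
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)
criterion = torch.nn.BCELoss()

for diff_batch, labels in train_loader:  # labels: 1 = plagiarism, 0 = clean
    optimizer.zero_grad()
    preds = clf(diff_batch).squeeze(1)   # [batch] of probabilities
    loss = criterion(preds, labels.float())
    loss.backward()
    optimizer.step()
```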

Evaluation on Real-World Datasets

We tested on two complementary datasets to ensure robust evaluation:

IR-Plag Dataset

  • 467 Java files covering 7 introductory programming tasks
  • Six levels of plagiarism transformation (from superficial changes to deep modifications)
  • Great for testing against various obfuscation techniques
  • Provides controlled evaluation across different transformation types

ConPlag Dataset

  • 289 real plagiarism cases from competitive programming contests
  • Not artificially generated—these are actual plagiarism attempts
  • More challenging and realistic
  • Represents real-world scenarios where detection is most needed

Results: A 57% Improvement

Here's where it gets exciting. We compared three approaches across both datasets:

Baseline: Cosine Similarity Only

| Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| No Plagiarism | 0.83 | 0.83 | 0.83 |
| Plagiarism | 0.55 | 0.54 | 0.54 |

The cosine similarity baseline struggles with actual plagiarism cases—only 54% F1-score. This confirms that simple distance metrics aren't sufficient for detecting sophisticated plagiarism.
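
For reference, the baseline boils down to thresholding a single number; the 0.95 here is illustrative, not our tuned value:

```python
import torch.nn.functional as F

def is_plagiarism_cosine(emb_a, emb_b, threshold=0.95):
    """Flag a pair as plagiarism when cosine similarity exceeds a threshold."""
    return F.cosine_similarity(emb_a, emb_b).item() > threshold
```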

Our Classifier Model

| Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| No Plagiarism | 0.83 | 0.97 | 0.89 |
| Plagiarism | 0.96 | 0.77 | 0.85 |

Key Result

The neural classifier achieves 96% precision on plagiarism cases—a massive improvement! The F1-score jumps from 0.54 to 0.85, representing a 57% improvement over the baseline.

Fine-Tuning (Surprisingly Not Better)

| Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| No Plagiarism | 0.81 | 0.90 | 0.86 |
| Plagiarism | 0.89 | 0.77 | 0.82 |

Interestingly, unfreezing CodeBERTa for fine-tuning didn't improve results. The pre-trained representations are already excellent for this task—adding a specialized classifier on top is more effective than modifying the base model.
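
In PyTorch terms, keeping the encoder frozen is a two-liner; only the classifier head receives gradients:

```python
# Freeze CodeBERTa's weights; train only the SimilarityClassifier on top
for param in model.parameters():
    param.requires_grad = False
```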


Key Takeaways

  1. Semantic representations work: CodeBERTa embeddings capture code meaning effectively, even when syntax differs significantly.

  2. Train a classifier, don't just measure distance: A specialized neural classifier dramatically outperforms simple similarity metrics—our approach achieved a 57% improvement in F1-score.

  3. Smaller models can be better: CodeBERTa-small offers great results at roughly two-thirds the parameter count of larger alternatives, making it more practical for real-world deployment.

  4. Fine-tuning isn't always the answer: Sometimes the pre-trained representations are already optimal for your task. Adding a task-specific classifier can be more effective than modifying the base model.


Limitations and Future Work

Our approach has some limitations:

| Limitation | Impact | Future Work |
| --- | --- | --- |
| 512-token limit | Long files lose information | Chunking strategies, hierarchical embeddings |
| Java-focused | Limited language support | Multi-language training, cross-language detection |
| GPU required | Higher compute cost | Model distillation, quantization techniques |

Future directions include:

  • Multi-language support: Extending to Python, JavaScript, C++, and other popular languages
  • Hybrid approaches: Combining semantic analysis with AST (Abstract Syntax Tree) analysis for even better detection
  • Scalability: Handling code files that exceed the token limit through intelligent chunking (a naive sketch follows this list)
  • Real-time detection: Optimizing for faster inference in production environments
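
For the chunking direction, here is a naive sketch; this is an assumption about how it could work, not part of the current implementation. It windows the token ids and mean-pools the per-chunk CLS vectors:

```python
import torch

def get_chunked_embedding(code: str, model, tokenizer, window: int = 510):
    """Embed long files by windowing tokens and mean-pooling CLS vectors."""
    ids = tokenizer(code, add_special_tokens=False)["input_ids"]
    chunk_embs = []
    with torch.no_grad():
        for i in range(0, len(ids), window):
            chunk = ids[i:i + window]
            # Re-wrap each window with <s> ... </s> so it looks like a full input
            input_ids = torch.tensor(
                [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]]
            )
            out = model(input_ids=input_ids)
            chunk_embs.append(out.last_hidden_state[:, 0, :])
    return torch.stack(chunk_embs).mean(dim=0)  # one file-level embedding
```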

Try It Yourself

Open Source

The Aether project is open source! Check out the GitHub repository to explore the implementation, run your own experiments, or contribute.


References

This work builds on excellent research from the community:

  • Feng et al. (2020) — CodeBERT: A Pre-Trained Model for Programming and Natural Languages
  • Liu et al. (2019) — RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • Ebrahim & Joy (2024) — Comparative evaluation of code models for similarity detection
  • Flores et al. (2021) — IR-Plag: A dataset for code plagiarism detection research
  • Slobodkin & Sadovnikov (2022) — ConPlag: Real-world plagiarism cases from competitive programming

Have questions about code plagiarism detection or transformer models? Feel free to reach out—I'd love to discuss this research further!
