Aether: Detecting Code Plagiarism with Transformer Embeddings
How we built an advanced code plagiarism detection system using CodeBERTa embeddings and a neural classifier, achieving an F1-score of 0.85 on real-world datasets.

Code plagiarism is a growing challenge in both academic and professional environments. Traditional detection tools struggle against modern obfuscation techniques—renaming variables, restructuring code, or swapping equivalent operations can easily fool lexical comparison methods.
In this post, I'll share our research on Aether, a code plagiarism detection system that uses transformer-based embeddings to capture the semantic meaning of code, not just its surface-level syntax.
Research Project
This post summarizes our academic research on code similarity detection using pre-trained language models. Check out the Aether project on GitHub for the full implementation.
The Problem with Traditional Detection
Classic tools like MOSS and JPlag rely on token comparisons—they look at the sequence of keywords, operators, and identifiers in your code. While effective against copy-paste plagiarism, they fail when facing:
- Variable renaming: Changing `sum` to `total` or `count` to `numberOfItems`
- Code restructuring: Moving code blocks around, splitting functions
- Equivalent substitutions: Using `while` instead of `for`, or `switch` instead of `if-else` (see the example below)
- Comment modifications: Adding, removing, or changing comments
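To make this concrete, here is a small illustrative pair of snippets (in Python, not drawn from our datasets). They compute exactly the same thing, but renamed identifiers, a swapped loop construct, and rewritten comments leave very little lexical overlap for a token-based tool to match on:

```python
# Original submission
def compute_sum(values):
    # add up all the values
    total = 0
    for value in values:
        total += value
    return total


# "Plagiarized" variant: renamed identifiers, while instead of for, new comments
def calc(numbers):
    result = 0
    numberOfItems = 0
    while numberOfItems < len(numbers):  # walk the list manually
        result += numbers[numberOfItems]
        numberOfItems += 1
    return result
```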
Detection Gap
Studies show detection rates below 60% for sophisticated plagiarism cases using traditional tools. We needed something smarter.
Our Approach: Semantic Code Understanding
Instead of comparing text, we compare meaning. The key insight is that pre-trained language models designed for code learn deep semantic representations that capture what code does, not just what it looks like.
Why CodeBERTa?
We chose CodeBERTa-small-v1 for several reasons:
| Model | Parameters | Performance | Verdict |
|---|---|---|---|
| CodeBERT | ~125M | Excellent | Too computationally expensive |
| UniXCoder | ~125M | Excellent | Similar concerns |
| CodeBERTa | ~84M | Very Good | Best balance of performance & efficiency |
| PLBART | ~140M | Good | Higher parameter count |
CodeBERTa offers:
- 768-dimensional embeddings that capture semantic features
- 6 Transformer layers with 12 attention heads each
- Training on 6 million source code files across multiple languages
- Roughly a third fewer parameters than the alternatives we considered (84M vs. 125-140M), with competitive performance
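The model is published on the Hugging Face Hub; a minimal loading sketch (assuming the `huggingface/CodeBERTa-small-v1` checkpoint identifier) looks like this:

```python
from transformers import AutoModel, AutoTokenizer

# Assumed Hub identifier for the CodeBERTa-small-v1 checkpoint
MODEL_NAME = "huggingface/CodeBERTa-small-v1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

print(model.config.num_hidden_layers)  # 6 transformer layers
print(model.config.hidden_size)        # 768-dimensional embeddings
```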
System Architecture
Our detection system works in three stages:
Step 1: Generate Embeddings
Each code snippet is tokenized and passed through CodeBERTa. We use the CLS token output—a special token trained to represent the entire sequence in a single vector:
```python
import torch


def get_embedding(code: str, model, tokenizer) -> torch.Tensor:
    """Extract semantic embedding from code using CodeBERTa."""
    inputs = tokenizer(
        code,
        return_tensors="pt",
        truncation=True,
        max_length=512
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # CLS token is at position 0
    return outputs.last_hidden_state[:, 0, :]
```
Step 2: Calculate Vector Difference
We experimented with several approaches to compare embeddings:
| Method | Pros | Cons |
|---|---|---|
| Cosine Similarity | Simple, fast | Loses directional info |
| Euclidean Distance | Captures magnitude | Sensitive to scale |
| Vector Difference (our choice) | Preserves directional information | Needs a trained classifier to interpret |
The vector difference (embedding_A - embedding_B) preserves directional information about how the two code samples differ semantically. Rather than collapsing the comparison into a single number, it hands the downstream classifier a full 768-dimensional signal from which it can learn which kinds of semantic differences indicate plagiarism and which don't.
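A minimal sketch of the comparison, reusing `get_embedding` from Step 1 and the `model`/`tokenizer` loaded earlier (the two snippet strings are placeholders for illustration):

```python
import torch.nn.functional as F

code_a = "def add(a, b):\n    return a + b"
code_b = "def plus(x, y):\n    return x + y"

emb_a = get_embedding(code_a, model, tokenizer)  # shape: (1, 768)
emb_b = get_embedding(code_b, model, tokenizer)

cosine = F.cosine_similarity(emb_a, emb_b)  # collapses the comparison to one scalar
diff = emb_a - emb_b                        # keeps a 768-dim "direction of difference"
```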
Step 3: Neural Classification
The magic happens in a specialized neural classifier that learns to interpret embedding differences:
```python
import torch.nn as nn


class SimilarityClassifier(nn.Module):
    def __init__(self, embedding_dim=768):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, 32),
            nn.ReLU(),
            nn.Dropout(0.6),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Dropout(0.6),
            nn.Linear(16, 1),
            nn.Sigmoid()
        )

    def forward(self, diff):
        return self.classifier(diff)
```
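A minimal training sketch, assuming binary labels (1 = plagiarism, 0 = not) on pre-computed embedding differences. The optimizer, learning rate, and synthetic batches below are illustrative assumptions, not the project's actual training setup:

```python
import torch
import torch.nn as nn

classifier = SimilarityClassifier()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)  # assumed hyperparameters
criterion = nn.BCELoss()

# Stand-in for a DataLoader of (embedding difference, label) batches
batches = [(torch.randn(8, 768), torch.randint(0, 2, (8, 1)).float()) for _ in range(10)]

for epoch in range(5):
    for diffs, labels in batches:
        optimizer.zero_grad()
        loss = criterion(classifier(diffs), labels)
        loss.backward()
        optimizer.step()
```

At inference time, `classifier(emb_a - emb_b)` yields the probability that the pair is a plagiarism case.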
Evaluation on Real-World Datasets
We tested on two complementary datasets to ensure robust evaluation:
IR-Plag Dataset
- 467 Java files covering 7 introductory programming tasks
- Six levels of plagiarism transformation (from superficial changes to deep modifications)
- Great for testing against various obfuscation techniques
- Provides controlled evaluation across different transformation types
ConPlag Dataset
- 289 real plagiarism cases from competitive programming contests
- Not artificially generated—these are actual plagiarism attempts
- More challenging and realistic
- Represents real-world scenarios where detection is most needed
Results: A 57% Improvement
Here's where it gets exciting. We compared three approaches across both datasets:
Baseline: Cosine Similarity Only
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| No Plagiarism | 0.83 | 0.83 | 0.83 |
| Plagiarism | 0.55 | 0.54 | 0.54 |
The cosine similarity baseline struggles with actual plagiarism cases—only 54% F1-score. This confirms that simple distance metrics aren't sufficient for detecting sophisticated plagiarism.
Our Classifier Model
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| No Plagiarism | 0.83 | 0.97 | 0.89 |
| Plagiarism | 0.96 | 0.77 | 0.85 |
Key Result
The neural classifier achieves 96% precision on plagiarism cases—a massive improvement! The F1-score jumps from 0.54 to 0.85, representing a 57% improvement over the baseline.
Fine-Tuning (Surprisingly Not Better)
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| No Plagiarism | 0.81 | 0.90 | 0.86 |
| Plagiarism | 0.89 | 0.77 | 0.82 |
Interestingly, unfreezing CodeBERTa for fine-tuning didn't improve results. The pre-trained representations are already excellent for this task—adding a specialized classifier on top is more effective than modifying the base model.
Key Takeaways
- Semantic representations work: CodeBERTa embeddings capture code meaning effectively, even when syntax differs significantly.
- Train a classifier, don't just measure distance: A specialized neural classifier dramatically outperforms simple similarity metrics—our approach achieved a 57% improvement in F1-score.
- Smaller models can be better: CodeBERTa-small offers great results with roughly a third fewer parameters than larger alternatives, making it more practical for real-world deployment.
- Fine-tuning isn't always the answer: Sometimes the pre-trained representations are already optimal for your task. Adding a task-specific classifier can be more effective than modifying the base model.
Limitations and Future Work
Our approach has some limitations:
| Limitation | Impact | Future Work |
|---|---|---|
| 512-token limit | Long files lose information | Chunking strategies, hierarchical embeddings |
| Java-focused | Limited language support | Multi-language training, cross-language detection |
| GPU required | Higher compute cost | Model distillation, quantization techniques |
Future directions include:
- Multi-language support: Extending to Python, JavaScript, C++, and other popular languages
- Hybrid approaches: Combining semantic analysis with AST (Abstract Syntax Tree) analysis for even better detection
- Scalability: Handling code files that exceed the token limit through intelligent chunking (a rough sketch follows below)
- Real-time detection: Optimizing for faster inference in production environments
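As a rough illustration of the chunking idea, one possible strategy (an assumption on our part, not something Aether currently implements) is to embed overlapping token windows and mean-pool the results, reusing `get_embedding` from above:

```python
import torch


def get_long_embedding(code: str, model, tokenizer,
                       window: int = 510, stride: int = 256) -> torch.Tensor:
    """Embed code longer than the 512-token limit by averaging windowed embeddings."""
    token_ids = tokenizer(code, add_special_tokens=False)["input_ids"]
    starts = range(0, max(len(token_ids), 1), stride)
    chunk_embeddings = [
        get_embedding(tokenizer.decode(token_ids[start:start + window]), model, tokenizer)
        for start in starts
    ]
    return torch.stack(chunk_embeddings).mean(dim=0)  # mean-pool over chunks
```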
Try It Yourself
Open Source
The Aether project is open source! Check out the GitHub repository to explore the implementation, run your own experiments, or contribute.
References
This work builds on excellent research from the community:
- Feng et al. (2020) — CodeBERT: A Pre-Trained Model for Programming and Natural Languages
- Liu et al. (2019) — RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Ebrahim & Joy (2024) — Comparative evaluation of code models for similarity detection
- Flores et al. (2021) — IR-Plag: A dataset for code plagiarism detection research
- Slobodkin & Sadovnikov (2022) — ConPlag: Real-world plagiarism cases from competitive programming
Have questions about code plagiarism detection or transformer models? Feel free to reach out—I'd love to discuss this research further!