
Aether: Detecting Code Plagiarism with Transformer Embeddings

How we built an advanced code plagiarism detection system using CodeBERTa embeddings and a neural classifier, achieving an F1-score of 0.85 on real-world datasets.

January 15, 2025 · 7 min read
#ai #machine-learning #transformers #code-analysis #plagiarism-detection #nlp #research

Code plagiarism is a growing challenge in both academic and professional environments. Traditional detection tools struggle against modern obfuscation techniques—renaming variables, restructuring code, or swapping equivalent operations can easily fool lexical comparison methods.

In this post, I'll share our research on Aether, a code plagiarism detection system that uses transformer-based embeddings to capture the semantic meaning of code, not just its surface-level syntax.

Research Project

This post summarizes our academic research on code similarity detection using pre-trained language models. Check out the Aether project on GitHub for the full implementation.


The Problem with Traditional Detection

Classic tools like MOSS and JPlag rely on token comparisons: they look at the sequence of keywords, operators, and identifiers in your code. While effective against copy-paste plagiarism, they fail when facing the transformations below (a concrete example follows the list):

  • Variable renaming: Changing sum to total or count to numberOfItems
  • Code restructuring: Moving code blocks around, splitting functions
  • Equivalent substitutions: Using while instead of for, or switch instead of if-else
  • Comment modifications: Adding, removing, or changing comments
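
To make this concrete, here is an illustrative pair (written in Python for brevity; our datasets are Java). The two functions are lexically very different yet compute exactly the same result, which is precisely the kind of pair token-based tools tend to miss:

```python
# Original submission
def total(values):
    s = 0
    for v in values:
        s += v
    return s


# "Plagiarized" copy: renamed identifiers, a while loop instead of for,
# and a new comment. Lexically distant, semantically identical.
def aggregate(items):
    acc = 0
    i = 0
    while i < len(items):
        acc += items[i]
        i += 1
    return acc
```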

Detection Gap

Studies show detection rates below 60% for sophisticated plagiarism cases using traditional tools. We needed something smarter.


Our Approach: Semantic Code Understanding

Instead of comparing text, we compare meaning. The key insight is that pre-trained language models designed for code learn deep semantic representations that capture what code does, not just what it looks like.

Why CodeBERTa?

We chose CodeBERTa-small-v1 for several reasons:

| Model | Parameters | Performance | Verdict |
| --- | --- | --- | --- |
| CodeBERT | ~125M | Excellent | Too computationally expensive |
| UniXCoder | ~125M | Excellent | Similar concerns |
| CodeBERTa | ~84M | Very Good | Best balance of performance & efficiency |
| PLBART | ~140M | Good | Higher parameter count |

CodeBERTa offers:

  • 768-dimensional embeddings that capture semantic features
  • 6 Transformer layers with 12 attention heads each
  • Training on 6 million source code files across multiple languages
  • Roughly a third fewer parameters than the base-size alternatives we considered, with competitive performance
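
If you want to load the same encoder, here is a minimal sketch with Hugging Face transformers, assuming the publicly hosted huggingface/CodeBERTa-small-v1 checkpoint:

```python
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "huggingface/CodeBERTa-small-v1"  # checkpoint on the HF Hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()  # we only extract embeddings, so switch to inference mode
```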

System Architecture

Our detection system works in three stages:

(Architecture diagram: code pair → CodeBERTa embeddings → vector difference → neural classifier → plagiarism decision)

Step 1: Generate Embeddings

Each code snippet is tokenized and passed through CodeBERTa. We use the CLS token output—a special token trained to represent the entire sequence in a single vector:

```python
import torch


def get_embedding(code: str, model, tokenizer) -> torch.Tensor:
    """Extract a semantic embedding from code using CodeBERTa."""
    inputs = tokenizer(
        code,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # The CLS token at position 0 summarizes the entire sequence
    return outputs.last_hidden_state[:, 0, :]
```
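
As a quick sanity check, embedding a one-line function should yield a single 768-dimensional vector (batch dimension included):

```python
emb = get_embedding("def add(a, b): return a + b", model, tokenizer)
print(emb.shape)  # expected: torch.Size([1, 768])
```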

Step 2: Calculate Vector Difference

We experimented with several approaches to compare embeddings:

| Method | Pros | Verdict |
| --- | --- | --- |
| Cosine Similarity | Simple, fast | Loses directional info |
| Euclidean Distance | Captures magnitude | Sensitive to scale |
| Vector Difference | Preserves direction | Winner! |

The vector difference (embedding_A - embedding_B) preserves directional information about how the two code samples differ semantically. Unlike a single similarity score, it hands the classifier a full 768-dimensional signal, letting it learn nuanced patterns in how code pairs differ.
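
As a sketch, reusing get_embedding from Step 1 (code_a and code_b are placeholder source strings), the difference vector and the cosine baseline come out like this:

```python
import torch.nn.functional as F

emb_a = get_embedding(code_a, model, tokenizer)  # shape: [1, 768]
emb_b = get_embedding(code_b, model, tokenizer)

diff = emb_a - emb_b                        # 768-dim input to the classifier
cosine = F.cosine_similarity(emb_a, emb_b)  # collapses to a single scalar
```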

Step 3: Neural Classification

The magic happens in a specialized neural classifier that learns to interpret embedding differences:

```python
import torch.nn as nn


class SimilarityClassifier(nn.Module):
    """Maps a 768-dim embedding difference to a plagiarism probability."""

    def __init__(self, embedding_dim=768):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, 32),
            nn.ReLU(),
            nn.Dropout(0.6),  # aggressive dropout to fight overfitting
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Dropout(0.6),
            nn.Linear(16, 1),
            nn.Sigmoid(),  # probability in [0, 1]
        )

    def forward(self, diff):
        return self.classifier(diff)
```
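
As for training, here is a rough sketch using binary cross-entropy on labeled pairs; the optimizer, learning rate, and train_loader below are illustrative assumptions, not our exact setup:

```python
import torch

clf = SimilarityClassifier()
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)
criterion = torch.nn.BCELoss()

for diff_batch, labels in train_loader:  # labels: 1 = plagiarism, 0 = clean
    optimizer.zero_grad()
    preds = clf(diff_batch).squeeze(1)   # [batch] of probabilities
    loss = criterion(preds, labels.float())
    loss.backward()
    optimizer.step()
```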

Evaluation on Real-World Datasets

We tested on two complementary datasets to ensure robust evaluation:

IR-Plag Dataset

  • 467 Java files covering 7 introductory programming tasks
  • Six levels of plagiarism transformation (from superficial changes to deep modifications)
  • Great for testing against various obfuscation techniques
  • Provides controlled evaluation across different transformation types

ConPlag Dataset

  • 289 real plagiarism cases from competitive programming contests
  • Not artificially generated—these are actual plagiarism attempts
  • More challenging and realistic
  • Represents real-world scenarios where detection is most needed

Results: A 57% Improvement

Here's where it gets exciting. We compared three approaches across both datasets:

Baseline: Cosine Similarity Only

| Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| No Plagiarism | 0.83 | 0.83 | 0.83 |
| Plagiarism | 0.55 | 0.54 | 0.54 |

The cosine similarity baseline struggles with actual plagiarism cases—only 54% F1-score. This confirms that simple distance metrics aren't sufficient for detecting sophisticated plagiarism.
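
For reference, the baseline boils down to thresholding a single number; the 0.95 here is illustrative, not our tuned value:

```python
import torch.nn.functional as F

def is_plagiarism_cosine(emb_a, emb_b, threshold=0.95):
    """Flag a pair as plagiarism when cosine similarity exceeds a threshold."""
    return F.cosine_similarity(emb_a, emb_b).item() > threshold
```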

Our Classifier Model

| Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| No Plagiarism | 0.83 | 0.97 | 0.89 |
| Plagiarism | 0.96 | 0.77 | 0.85 |

Key Result

The neural classifier achieves 96% precision on plagiarism cases—a massive improvement! The F1-score jumps from 0.54 to 0.85, representing a 57% improvement over the baseline.

Fine-Tuning (Surprisingly Not Better)

| Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| No Plagiarism | 0.81 | 0.90 | 0.86 |
| Plagiarism | 0.89 | 0.77 | 0.82 |

Interestingly, unfreezing CodeBERTa for fine-tuning didn't improve results. The pre-trained representations are already excellent for this task—adding a specialized classifier on top is more effective than modifying the base model.
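
In PyTorch terms, keeping the encoder frozen is a two-liner; only the classifier head receives gradients:

```python
# Freeze CodeBERTa's weights; train only the SimilarityClassifier on top
for param in model.parameters():
    param.requires_grad = False
```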


Key Takeaways

  1. Semantic representations work: CodeBERTa embeddings capture code meaning effectively, even when syntax differs significantly.

  2. Train a classifier, don't just measure distance: A specialized neural classifier dramatically outperforms simple similarity metrics—our approach achieved a 57% improvement in F1-score.

  3. Smaller models can be better: CodeBERTa-small offers great results at roughly two-thirds the parameter count of larger alternatives, making it more practical for real-world deployment.

  4. Fine-tuning isn't always the answer: Sometimes the pre-trained representations are already optimal for your task. Adding a task-specific classifier can be more effective than modifying the base model.


Limitations and Future Work

Our approach has some limitations:

| Limitation | Impact | Future Work |
| --- | --- | --- |
| 512-token limit | Long files lose information | Chunking strategies, hierarchical embeddings |
| Java-focused | Limited language support | Multi-language training, cross-language detection |
| GPU required | Higher compute cost | Model distillation, quantization techniques |

Future directions include:

  • Multi-language support: Extending to Python, JavaScript, C++, and other popular languages
  • Hybrid approaches: Combining semantic analysis with AST (Abstract Syntax Tree) analysis for even better detection
  • Scalability: Handling code files that exceed the token limit through intelligent chunking (a naive sketch follows this list)
  • Real-time detection: Optimizing for faster inference in production environments
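
For the chunking direction, here is a naive sketch; this is an assumption about how it could work, not part of the current implementation. It windows the token ids and mean-pools the per-chunk CLS vectors:

```python
import torch

def get_chunked_embedding(code: str, model, tokenizer, window: int = 510):
    """Embed long files by windowing tokens and mean-pooling CLS vectors."""
    ids = tokenizer(code, add_special_tokens=False)["input_ids"]
    chunk_embs = []
    with torch.no_grad():
        for i in range(0, len(ids), window):
            chunk = ids[i:i + window]
            # Re-wrap each window with <s> ... </s> so it looks like a full input
            input_ids = torch.tensor(
                [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]]
            )
            out = model(input_ids=input_ids)
            chunk_embs.append(out.last_hidden_state[:, 0, :])
    return torch.stack(chunk_embs).mean(dim=0)  # one file-level embedding
```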

Try It Yourself

Open Source

The Aether project is open source! Check out the GitHub repository to explore the implementation, run your own experiments, or contribute.


References

This work builds on excellent research from the community:

  • Feng et al. (2020) — CodeBERT: A Pre-Trained Model for Programming and Natural Languages
  • Liu et al. (2019) — RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • Ebrahim & Joy (2024) — Comparative evaluation of code models for similarity detection
  • Flores et al. (2021) — IR-Plag: A dataset for code plagiarism detection research
  • Slobodkin & Sadovnikov (2022) — ConPlag: Real-world plagiarism cases from competitive programming

Have questions about code plagiarism detection or transformer models? Feel free to reach out—I'd love to discuss this research further!
