BSc FYDP · DIU · May 2026

AlgoBugs

Where do AI models fail at finding bugs? A diagnostic benchmark evaluating fault-exposure across 1,000 real C++ pairs, 8 bug categories, and 5 frontier models.

1,000
Submission Pairs
40
CF Problems
5
LLMs Evaluated
8
Bug Categories
The Context

The Problem with Benchmarks

Most coding benchmarks ask AI to write code from scratch. AlgoBugs asks a harder question: Can an AI find a hidden bug? Given a program with a logical flaw, can the AI write a test case that makes the bug visible?

The Status Quo

Current benchmarks report a single aggregate score: "The model found 25% of bugs."

But this hides everything. Did it find all the easy bugs and none of the hard ones? Can it spot arithmetic errors, or only logical ones? A single number tells us nothing about where the reasoning breaks down.

Overall Score: 25%
(But what does this mean?)
The AlgoBugs Approach

AlgoBugs breaks bug finding down into 8 distinct categories of algorithmic flaws.

Instead of a single score, we get a diagnostic profile. We can finally see exactly where models succeed (like finding off-by-one errors) and where they completely fail (like integer overflows).

Off-by-One: 25.6%
Integer Overflow: 4.3%
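To make the task concrete, here is a small invented pair — a sketch, not an entry from the dataset. A buggy and a correct version of the same toy routine: the bug survives a harmless "sample" input, and a single well-chosen test input makes the two programs disagree.

```cpp
// Illustrative sketch only (not from the dataset): a toy off-by-one pair.
#include <iostream>
#include <vector>

// Intended behaviour: sum all elements of the array.
long long sum_buggy(const std::vector<int>& a) {
    long long s = 0;
    for (std::size_t i = 0; i + 1 < a.size(); ++i) s += a[i];  // bug: skips the last element
    return s;
}

long long sum_correct(const std::vector<int>& a) {
    long long s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i];
    return s;
}

int main() {
    std::vector<int> sample   = {2, 5, 0};  // "sample test": last element is 0, both agree
    std::vector<int> exposing = {1, 2, 3};  // exposing test: outputs differ (3 vs 6)
    std::cout << sum_buggy(sample)   << " " << sum_correct(sample)   << "\n"; // 7 7
    std::cout << sum_buggy(exposing) << " " << sum_correct(exposing) << "\n"; // 3 6
}
```

A model "solves" a pair like this only if it produces an input in the spirit of the exposing one; an input that merely passes the samples reveals nothing.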
Methodology

How We Built It

AlgoBugs is built on real human mistakes. We didn't inject artificial bugs; we mined actual broken code submitted by programmers in competitions.

🌐
1. Collect
We scraped real buggy submissions from Codeforces.
Using a custom Chrome extension, we extracted 1,000 cases where a programmer submitted a broken solution (Wrong Answer), fixed it, and submitted an accepted solution.
🔍
2. Filter
Each pair was manually reviewed for quality.
We ensured each buggy solution contained only a single, localized logical fault, not a complete rewrite. We also verified that the buggy code passed the public sample tests, so the fault is a subtle one that only the right input can expose, not a trivial syntax error.
🏷️
3. Label
Every bug was classified into one of 8 types.
We categorized every bug into our taxonomy. To ensure accuracy, an independent competitive programming expert re-labeled a sample, achieving high inter-rater reliability (Cohen's κ = 0.79).
🤖
4. Evaluate
5 AI models tried to find each bug.
We provided the buggy code and problem statement to Gemini Flash, LLaMA 70B, Qwen3 Coder, DeepSeek V3, and GPT-5.1. They were asked to generate a single test input that would expose the bug.
⚖️
5. Judge
Real compilation and execution decide: did the AI find the bug?
We don't use AI to judge AI. We compiled both the correct and buggy C++ solutions locally and ran the AI's generated test case. If the buggy solution output differed from the correct solution, the bug was successfully exposed.
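Below is a minimal sketch of that differential check in C++. It assumes g++ is on the PATH; the compiler flags, the binary names, and the placeholder input file model_test.txt are illustrative, not the exact harness used in the benchmark.

```cpp
// Illustrative differential judge (a sketch, not the project's actual harness).
// Assumes g++ is available; "model_test.txt" is a placeholder name for the AI-generated input.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

static std::string read_file(const std::string& path) {
    std::ifstream in(path);
    std::ostringstream ss;
    ss << in.rdbuf();
    return ss.str();
}

int main() {
    // 1. Compile both solutions with identical flags.
    std::system("g++ -O2 -std=c++17 solution_correct.cpp -o correct");
    std::system("g++ -O2 -std=c++17 solution_buggy.cpp -o buggy");

    // 2. Run both binaries on the model-generated test input.
    std::system("./correct < model_test.txt > out_correct.txt");
    std::system("./buggy < model_test.txt > out_buggy.txt");

    // 3. Verdict: the bug is exposed iff the two outputs differ.
    bool exposed = read_file("out_correct.txt") != read_file("out_buggy.txt");
    std::cout << (exposed ? "EXPOSED" : "NOT EXPOSED") << "\n";
    return exposed ? 0 : 1;
}
```

A production harness would also enforce time limits and normalize whitespace before comparing outputs; the sketch shows only the core verdict logic.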
The Taxonomy

The 8 Types of Bugs

We categorized the 1,000 real-world bugs into 8 distinct algorithmic flaws. Different bugs require completely different types of reasoning to find.

Tier 1: Data & Type (Arithmetic Mistakes)
T1: Integer Overflow
A number gets too big for the computer to store, so it wraps around to a wrong value — like an odometer flipping back to zero (a code sketch follows the taxonomy below).
T2: Modular Arithmetic
A calculation forgets to apply the 'remainder' step, producing a number millions of digits too large.
Tier 2: Control & Logic (Logical Mistakes)
T3: Off-by-One
The program counts one step too many or too few — like building a fence with 10 sections but buying 10 posts when 11 are needed.
T4: Wrong Conditional
The program checks "is X greater than Y?" when it should check "is X greater than or equal to Y?"
T5: Algorithmic Flaw
The strategy works for most cases but fails on a tricky input — like a shortcut that breaks when the road forks.
Tier 3: State & Edge (Structural Mistakes)
T6: State / Init
The program forgets to reset its memory between rounds — like a calculator that starts with yesterday's answer still on screen.
T7: Corner Case
The program works perfectly for normal inputs but crashes or gives wrong answers for edge cases like "zero items" or "one item".
T8: I/O Format
The program reads inputs in the wrong order or prints outputs in the wrong format (like adding an extra space).
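As a concrete illustration of the category the models find hardest (T1), here is an invented overflow pair, a sketch in the spirit of the dataset rather than an actual entry. Both versions agree on small inputs; only a large n exposes the wrap-around.

```cpp
// Illustrative sketch only (not from the dataset): a typical T1 integer-overflow pair.
#include <iostream>

// Intended behaviour: compute 1 + 2 + ... + n.
// Buggy: n * (n + 1) is evaluated in 32-bit int, which overflows
// (undefined behaviour, typically wrapping) once n reaches about 46,341.
long long triangular_buggy(int n)   { return n * (n + 1) / 2; }

// Correct: promote to 64-bit before multiplying.
long long triangular_correct(int n) { return 1LL * n * (n + 1) / 2; }

int main() {
    std::cout << triangular_buggy(10)     << " " << triangular_correct(10)     << "\n"; // 55 55
    std::cout << triangular_buggy(100000) << " " << triangular_correct(100000) << "\n"; // differ
}
```

Exposing this kind of bug requires reasoning about the value ranges implied by the problem constraints rather than probing simple boundaries, which may be why T1 sits at the bottom of the results below.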

What does a dataset pair actually look like?

Each of the 1,000 pairs in AlgoBugs is a self-contained directory with exactly 4 files. Here is an example of what the AI models are provided.

📁 CF_2157C / pair_14/
📄 metadata.json
📝 problem_statement.md
✅ solution_correct.cpp
❌ solution_buggy.cpp
problem_statement.md
Given N, an array of integers... Output the maximum possible value. Input: First line contains T (number of test cases). Next lines contain N and the array. Constraints: 1 <= N <= 10^5
Results

Evaluation Results

Fault Exposure Rate (FER) = bugs exposed ÷ (bugs exposed + not exposed) × 100%. Compile errors and no-test-generated verdicts are excluded.
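For example (numbers invented purely to show the arithmetic): if a model exposes 180 bugs, fails to expose 720, and the remaining 100 pairs are excluded for compile errors or missing test cases, its FER is 180 ÷ (180 + 720) × 100% = 20%.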

Fault Exposure Rate (%) — All Models × Prompting Strategies
What does this mean? This chart shows the overall success rate of 5 different AI models using 3 different techniques (Zero-Shot, Chain-of-Thought, Few-Shot). An overall score around 20% means that for every 10 bugs, the best models can successfully write a test case to find only 2 of them. The ceiling remains low.
* Note: GPT-5.1 evaluated 870 of 1,000 pairs due to API limitations.
Zero-Shot FER (%) — Model × Bug Category
(Heatmap color scale: 0% to ~36%)
What does this mean? This is the core of AlgoBugs. Instead of looking at the overall 20% score, look across the rows. You can clearly see dark spots (low scores) in T1 and T2 (arithmetic bugs), and bright spots in T3 (off-by-one). The AI is not uniformly bad at finding bugs; it fails at specific types of mathematical reasoning while succeeding at simple boundary checks.
Average Per-Category FER (%) — Zero-Shot, All Models
Change in FER (pp) from Zero-Shot Baseline
Key Observations
Dataset

Dataset Browser

Explore the 1,000 C++ submission pairs from 40 Codeforces problems across 8 difficulty brackets. Click any problem to expand its pairs — click a pair to view the full code.

Demo

Evaluation Replay

Replays of the actual benchmark run. Each test case, output, and verdict shown was produced during the real evaluation — not generated live.

💡
What am I looking at? This replays an actual evaluation from our research. On the left: a real buggy C++ program and the correct version. On the right: pick an AI model and see what test case it generated — and whether that test exposed the bug.
Select a model to evaluate: