Where do AI models fail at finding bugs? A diagnostic benchmark evaluating fault exposure across 1,000 real C++ pairs, 8 bug categories, and 5 frontier models.
Most coding benchmarks ask AI to write code from scratch. AlgoBugs asks a harder question: Can an AI find a hidden bug? Given a program with a logical flaw, can the AI write a test case that makes the bug visible?
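To make the task concrete, here is a minimal hypothetical pair, our own illustration rather than an actual dataset entry: a correct summation next to a buggy one, and two candidate test inputs, only one of which exposes the fault.

```cpp
// Illustrative only: not a real AlgoBugs pair.
#include <iostream>
#include <vector>

// Correct reference: sums all n elements.
long long sum_correct(const std::vector<int>& a) {
    long long s = 0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i];
    return s;
}

// Buggy submission: off-by-one, the last element is never added.
long long sum_buggy(const std::vector<int>& a) {
    long long s = 0;
    for (size_t i = 0; i + 1 < a.size(); ++i) s += a[i];
    return s;
}

int main() {
    std::vector<int> hides   = {5, 7, 0};  // last element 0: both return 12
    std::vector<int> exposes = {5, 7, 9};  // last element 9: 21 vs 12
    std::cout << sum_correct(hides)   << " vs " << sum_buggy(hides)   << "\n";  // 12 vs 12
    std::cout << sum_correct(exposes) << " vs " << sum_buggy(exposes) << "\n";  // 21 vs 12
}
```

A test like {5, 7, 0} passes on both programs and therefore proves nothing; the model gets credit only when its test makes the two outputs diverge.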
Current benchmarks report a single aggregate score: "The model found 25% of bugs."
But this hides everything. Did it find all the easy bugs and none of the hard ones? Can it spot arithmetic errors, or only logical ones? A single number tells us nothing about where the reasoning breaks down.
AlgoBugs breaks bug finding down into 8 distinct categories of algorithmic flaws.
Instead of a single score, we get a diagnostic profile. We can finally see exactly where models succeed (like finding off-by-one errors) and where they completely fail (like exposing integer overflows).
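Overflow bugs are a good example of why the categories differ: they only surface at near-limit inputs, so casual tests never trigger them. A hedged sketch of the pattern, ours rather than a dataset pair:

```cpp
// Illustrative only: not a real AlgoBugs pair.
#include <cstdint>
#include <iostream>

// Buggy: w * h is evaluated in 32-bit int, overflowing (undefined behavior,
// which in practice usually wraps) before the implicit widening to 64 bits.
int64_t area_buggy(int w, int h) {
    return w * h;
}

// Correct: widen one operand first so the multiply happens in 64 bits.
int64_t area_correct(int w, int h) {
    return static_cast<int64_t>(w) * h;
}

int main() {
    // Small inputs agree, so modest tests never expose the bug...
    std::cout << area_buggy(1000, 1000) << "\n";        // 1000000 (correct)
    // ...only large inputs make the two versions diverge.
    std::cout << area_buggy(100000, 100000) << "\n";    // wrapped value, not 10^10
    std::cout << area_correct(100000, 100000) << "\n";  // 10000000000
}
```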
AlgoBugs is built on real human mistakes. We didn't inject artificial bugs; we mined actual broken code submitted by programmers in competitions.
We categorized the 1,000 real-world bugs into 8 distinct categories of algorithmic flaws. Different categories demand completely different kinds of reasoning to find.
Each of the 1,000 pairs in AlgoBugs is a self-contained directory with exactly 4 files. Here is an example of what the models are given.
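The exact file names and contents are defined by the dataset; purely as an illustration, with hypothetical names, a pair directory might look like this:

```
pair_0001/                 # file names hypothetical, for illustration only
├── statement.txt          # the problem the code is meant to solve
├── correct.cpp            # an accepted submission (the reference)
├── buggy.cpp              # a rejected submission containing the flaw
└── metadata.json          # problem id, difficulty, bug category
```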
Fault Exposure Rate (FER) = bugs exposed ÷ (bugs exposed + bugs not exposed) × 100%. Pairs that end in a compile error or a no-test-generated verdict are excluded from both the numerator and the denominator.
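In code, the metric reduces to a ratio over the non-excluded runs. A minimal sketch with made-up counts (the real tallies come from the evaluation itself):

```cpp
#include <iostream>

// FER over runs that produced a test: exposed / (exposed + not_exposed) * 100.
// Compile errors and no-test-generated runs are dropped before this point,
// so they affect neither the numerator nor the denominator.
double fault_exposure_rate(int exposed, int not_exposed) {
    int graded = exposed + not_exposed;
    return graded == 0 ? 0.0 : 100.0 * exposed / graded;
}

int main() {
    // Made-up counts for illustration: 240 exposed, 560 not exposed,
    // 200 excluded (compile error / no test) out of 1,000 pairs.
    std::cout << fault_exposure_rate(240, 560) << "%\n";  // prints 30%
}
```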
Explore the 1,000 C++ submission pairs from 40 Codeforces problems across 8 difficulty brackets. Click any problem to expand its pairs; click a pair to view the full code.
Replays of the actual benchmark run. Every test case, output, and verdict shown was produced during the real evaluation, not generated live.