
AI-Generated Tests: Quality, Coverage, Trust (Real Measurement)

· 8 min read
Artur Pan
CTO & Co-Founder at PanDev

Copilot wrote 420 tests for your payments module in two days. Coverage went from 58% to 84%. Release confidence? Unchanged, maybe worse. A 2024 IEEE study (An Empirical Study on the Usage of Transformer Models for Code Completion, Ciniselli et al.) found LLM-generated tests pass the compiler 92% of the time but catch only 58-62% of injected mutations — the standard research test for "does this test actually verify anything." Human-written tests in the same study scored 78%. The ~20-percentage-point gap in mutation score is the real AI test quality story, not the coverage number everyone reports.

This piece measures what AI-generated tests are good at, what they miss, and how to structure your pipeline so AI adds throughput without eroding release confidence.

{/* truncate */}

Why coverage numbers lie for AI tests

Coverage counts lines executed. A test that executes a line without asserting anything still counts. LLMs produce exactly this pattern frequently: `expect(result).toBeDefined()` over a function returning a complex domain object, `assert.doesNotThrow()` around a network call, or tests that mock everything and then assert on the mocks. The test runs green, the coverage badge climbs, the bug ships anyway.
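
To make the pattern concrete, here is a minimal sketch in Jest-style TypeScript, using a hypothetical createInvoice function that isn't from any customer codebase. Both tests execute the same lines and earn the same coverage; only the second one fails if the tax math breaks.

```typescript
// Hypothetical domain code, for illustration only.
interface Invoice {
  subtotal: number;
  taxRate: number;
  total: number;
}

function createInvoice(subtotal: number, taxRate: number): Invoice {
  return { subtotal, taxRate, total: subtotal * (1 + taxRate) };
}

// The pattern we see from LLMs: every line of createInvoice runs,
// coverage goes up, but no behavior is verified.
test("creates an invoice", () => {
  const result = createInvoice(100, 0.085);
  expect(result).toBeDefined(); // passes for any non-undefined return value
});

// A meaningful assertion pins the actual domain rule.
test("applies the tax rate to the subtotal", () => {
  const result = createInvoice(100, 0.085);
  expect(result.total).toBeCloseTo(108.5);
});
```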

Mutation testing catches this. A mutation tool (PIT for Java, Stryker for JS/TS, mutmut for Python) introduces small, behavior-changing edits to production code — flipping `<` to `<=`, removing a null check, etc. A good test suite fails when the mutation happens; a weak suite passes. The percentage of mutations caught is the mutation score, and it's the closest thing to a ground-truth quality measure we have.
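
As a minimal illustration of the idea, here is a hypothetical canCheckout function with the mutant shown as a comment (this is the concept, not real Stryker output):

```typescript
// Hypothetical production code.
function canCheckout(cartTotal: number, minimumOrder: number): boolean {
  return cartTotal >= minimumOrder;
}

// A mutation tool would generate a mutant such as:
//   return cartTotal > minimumOrder;   // ">=" flipped to ">"
//
// This weak test passes against both the original and the mutant,
// so the mutant "survives" and the mutation score drops.
test("allows checkout for large carts", () => {
  expect(canCheckout(500, 25)).toBe(true);
});

// This boundary test fails against the mutant (25 > 25 is false),
// so the mutant is "killed" and counts toward the score.
test("allows checkout exactly at the minimum order", () => {
  expect(canCheckout(25, 25)).toBe(true);
});
```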

Our dataset

  • 100+ B2B companies on PanDev Metrics, of which 38 have measurable AI-assistant adoption across 2024-2026
  • IDE heartbeat data tagging coding sessions as AI-assisted vs solo (via keystroke pattern + extension telemetry)
  • External benchmarks: Ciniselli et al. 2024 (IEEE), Microsoft Research's 2023 Copilot test-gen study, and SmartBear's 2025 State of Testing survey
  • Data period: production IDE telemetry from Jan 2024 through early 2026

We don't run mutation testing on customer code. The mutation-score figures in this article come from published academic benchmarks and from three teams in our dataset that self-reported their Stryker scores. Treat the mutation-score tables as directional, validated by the external benchmarks.

What the data shows

AI-generated tests vs human-written, by mutation score

| Source | Median mutation score | Typical range | Coverage impact |
|---|---|---|---|
| Human-written (senior engineer) | 78% | 70-85% | Modest |
| Copilot (default prompt) | 62% | 55-68% | Large |
| Cursor Composer | 58% | 48-65% | Large |
| ChatGPT-driven (paste back and forth) | 54% | 42-62% | Variable |
| Human review + AI-assisted edits | 74% | 68-82% | Moderate |

The shape of the finding is consistent with Ciniselli et al. 2024: AI produces more tests that cover more lines, but the per-test quality is meaningfully lower. Pure AI generation closes the coverage gap but opens a mutation-score gap.

Mutation score comparison: human-written 78%, Copilot 62%, Cursor 58%, ChatGPT 54%. The gap that coverage metrics hide. AI tests clear the "it runs" bar; many fail the "it catches bugs" bar.

Where AI tests fail more often

Four failure patterns recur; we see them in post-incident retros from teams that shipped AI-authored tests without a review pass (the last two are sketched in code after the table):

| Failure pattern | How it looks | How often in AI-gen tests |
|---|---|---|
| Asserts on mocks, not behavior | `expect(mockFn).toHaveBeenCalled()` only | ~34% |
| Tautology assertions | `expect(result).toBeDefined()` with no value check | ~22% |
| Happy path only | No null / empty / error cases | ~41% |
| Test describes the code, not the requirement | "calls computeTax with amount" instead of "applies 8.5% CA sales tax to subtotal" | ~55% |
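
Here is a hedged before/after sketch of those last two patterns, using a hypothetical computeTax helper rather than anything from the dataset. The rewrite states the requirement in the test name and covers the error path the AI draft skipped.

```typescript
// Hypothetical helper, for illustration only.
function computeTax(subtotal: number, state: string): number {
  if (subtotal < 0) throw new Error("subtotal must be non-negative");
  return state === "CA" ? subtotal * 0.085 : 0;
}

// Typical AI draft: names the implementation, checks only that something
// came back, and covers only the happy path.
test("calls computeTax with amount", () => {
  expect(computeTax(100, "CA")).toBeDefined();
});

// Rewritten against the requirement: states the rule, adds the edge case.
describe("computeTax", () => {
  test("applies 8.5% CA sales tax to the subtotal", () => {
    expect(computeTax(100, "CA")).toBeCloseTo(8.5);
  });

  test("rejects negative subtotals", () => {
    expect(() => computeTax(-1, "CA")).toThrow();
  });
});
```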

SmartBear's 2025 State of Testing report surveyed 1,400 QA and engineering professionals: 74% reported using AI test generation, but only 28% said their team measures the quality (not just coverage) of AI-generated tests. The gap between adoption and measurement is the actual story.

Coverage vs mutation score, team-level

Our dataset shows a distinctive pattern once a team crosses 30% AI-assisted test authorship:

| AI-assisted test % | Typical coverage change | Typical mutation score change | Bug escape rate trend |
|---|---|---|---|
| 0-15% | Stable | Stable | Stable |
| 15-30% | +8-12 pp | Flat or -2 pp | Stable |
| 30-50% | +15-20 pp | -5 to -8 pp | +12-18% more escapes |
| 50%+ | +25 pp | -10 to -14 pp | +25-35% more escapes |

The contrarian finding: teams that maxed out AI test generation shipped MORE bugs in the 3 months after adoption, not fewer, despite higher coverage. Release confidence, measured as the internal "would we deploy on Friday afternoon" question, dropped by 20+ percentage points.

Heatmap: AI-assisted bug-escape rate by week post-adoption (brighter cells = higher escape rate). Across the 12 teams we tracked after adoption, the escape-rate pattern intensifies around weeks 4-8, right after AI tests reach critical mass in the suite.

What changes when review is mandatory on AI-authored tests

One protocol recovers most of the lost mutation-score ground: every AI-generated test gets a mandatory human review pass BEFORE merge, not after.

| Protocol | Median mutation score | Speed (tests/week) | Bug escape trend |
|---|---|---|---|
| No AI (baseline) | 78% | 1.0× | Baseline |
| AI-only, no review | 60% | 2.6× | +22% escapes |
| AI-authored + human reviewed | 74% | 2.1× | -3% escapes |
| AI suggests, human rewrites | 76% | 1.8× | -8% escapes |

The last row is interesting. Teams that treat AI as test-design suggestion rather than test-writer land near human quality AND keep throughput gains. Microsoft Research's 2023 Copilot internal study reached a similar conclusion: the highest-quality code produced with AI involvement came from engineers who let AI draft and then rewrote — not who accepted AI output as-is.

What this means for engineering leaders

1. Stop using coverage as your AI test-quality metric

Coverage tells you whether the test ran, not whether it catches anything. Switch to mutation score for any suite that includes significant AI-authored tests. Stryker (JS/TS), PIT (Java), and mutmut (Python) all ship with CI integrations. Cost: typically 5-10 minutes of CI time per PR at mid-sized repo scale.
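
As a rough sketch of what that gate can look like on a TypeScript/Jest repo with Stryker (treat the option values and thresholds as assumptions to check against the Stryker docs, not a drop-in config):

```json
{
  "testRunner": "jest",
  "mutate": ["src/**/*.ts", "!src/**/*.test.ts"],
  "reporters": ["clear-text", "html"],
  "thresholds": { "high": 80, "low": 70, "break": 60 }
}
```

Running `npx stryker run` in CI then fails the job when the mutation score drops below the break threshold, which turns mutation score into a merge gate rather than a dashboard number.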

2. Institute AI test-review as a merge gate

Write it into your code review checklist. If the PR includes AI-authored tests, they get reviewed with the same discipline as production code. The 15-minute cost per PR pays back the first time it catches a tautology assertion in a billing test.

3. Track AI-assisted coding share by project, not by person

AI adoption is never uniform across projects. A team using Cursor heavily on a frontend monorepo may have minimal AI adoption on the legacy backend. Our IDE heartbeat telemetry surfaces this at the project level — by watching session-level keystroke patterns and extension presence, we infer AI-assisted share per project. Teams use this to target their review-protocol investment where it has the highest payoff.

4. Expect a ~3-month dip, then recovery

Teams that invest in review discipline and mutation-score tracking pass through a quality dip around months 1-3 after adoption and come out ahead by month 6. Teams that don't invest stay stuck at the dip. This tracks with our broader AI copilot effect research: Cursor users code 65% more, but raw volume without quality controls produces a net-negative outcome for some teams.

Methodology

We measure AI-assisted coding sessions through IDE heartbeat data: when the Copilot, Cursor, or Continue extension is active during a coding session AND the keystroke-burst pattern matches AI-assisted entry (long paste-like edits with short typing bursts), we tag the session. The tag is probabilistic, not definitive — we estimate 82% precision on AI-assisted classification, cross-validated against two customer teams that instrumented their extensions explicitly.
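
For intuition only, here is a toy version of that kind of heuristic in TypeScript. The event shape and thresholds are invented for illustration; this is not PanDev's production classifier.

```typescript
interface EditEvent {
  charsInserted: number;      // size of the edit
  durationMs: number;         // wall-clock time the edit took to appear
  aiExtensionActive: boolean; // Copilot, Cursor, or Continue present
}

// Toy heuristic: a session looks AI-assisted when an AI extension is active
// and a meaningful share of edits arrive as large, near-instant insertions
// (paste-like bursts) rather than steady typing. Thresholds are illustrative.
function looksAiAssisted(events: EditEvent[]): boolean {
  if (events.length === 0 || !events.some((e) => e.aiExtensionActive)) {
    return false;
  }
  const burstLike = events.filter(
    (e) => e.charsInserted > 80 && e.durationMs < 500
  );
  return burstLike.length / events.length > 0.3;
}
```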

Mutation-score figures come from:

  • Ciniselli et al. 2024 IEEE, for academic benchmark
  • Three customer teams that ran Stryker/PIT and shared aggregate results
  • SmartBear State of Testing 2025 for the adoption-but-not-measurement finding
  • Microsoft Research 2023 internal Copilot study for the "AI suggests, human rewrites" pattern

An honest limit

We can't see mutation scores in most customer data: we see coverage, AI-assisted time share, and bug escape rates via incident links. The mutation-score gap in the tables is anchored in the academic benchmark and validated against the three teams that self-reported. A critical reader should treat the relationship (coverage up, mutation score down, escape rate up past the adoption threshold) as the defensible claim, and the specific mutation-score percentages as directional ballpark figures rather than a census.

The final argument

Coverage is a lie when AI writes the tests. Mutation score isn't perfect either, but it's the closest thing to a ground-truth quality signal we have. Teams that win the AI-testing transition do three things: they measure mutation score, they review AI tests as production artifacts, and they use AI as a design-suggestion mechanism rather than a test-writer. The mutation-score gap isn't about AI being bad; it's about the default workflow being unreviewed.

Try it yourself — free

Connect your IDE plugin in 2 minutes and see your real metrics. No credit card, no commitment.

Try Free