
AI-Generated Tests: Quality, Coverage, Trust (Real Measurement)

· 8 min read
Artur Pan
CTO & Co-Founder at PanDev

Copilot wrote 420 tests for your payments module in two days. Coverage went from 58% to 84%. Release confidence? Unchanged, maybe worse. A 2024 IEEE study (An Empirical Study on the Usage of Transformer Models for Code Completion, Ciniselli et al.) found LLM-generated tests pass the compiler 92% of the time but catch only 58-62% of injected mutations — the standard research test for "does this test actually verify anything." Human-written tests in the same study scored 78%. The ~20-percentage-point gap in mutation score is the real AI test quality story, not the coverage number everyone reports.

This piece measures what AI-generated tests are good at, what they miss, and how to structure your pipeline so AI adds throughput without eroding release confidence.

{/* truncate */}

Why coverage numbers lie for AI tests

Coverage counts lines executed. A test that executes a line without asserting anything still counts. LLMs produce exactly this pattern frequently: `expect(result).toBeDefined()` over a function returning a complex domain object, `assert.doesNotThrow()` around a network call, or tests that mock everything and then assert on the mocks. The test runs green, the coverage badge climbs, the bug ships anyway.
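
To make the pattern concrete, here is a minimal sketch in Jest-style TypeScript, using a hypothetical createInvoice function that isn't from any customer codebase. Both tests execute the same lines and earn the same coverage; only the second one fails if the tax math breaks.

```typescript
// Hypothetical domain code, for illustration only.
interface Invoice {
  subtotal: number;
  taxRate: number;
  total: number;
}

function createInvoice(subtotal: number, taxRate: number): Invoice {
  return { subtotal, taxRate, total: subtotal * (1 + taxRate) };
}

// The pattern we see from LLMs: every line of createInvoice runs,
// coverage goes up, but no behavior is verified.
test("creates an invoice", () => {
  const result = createInvoice(100, 0.085);
  expect(result).toBeDefined(); // passes for any non-undefined return value
});

// A meaningful assertion pins the actual domain rule.
test("applies the tax rate to the subtotal", () => {
  const result = createInvoice(100, 0.085);
  expect(result.total).toBeCloseTo(108.5);
});
```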

Mutation testing catches this. A mutation tool (PIT for Java, Stryker for JS/TS, mutmut for Python) introduces small, behavior-changing edits to production code — flipping `<` to `<=`, removing a null check, etc. A good test suite fails when the mutation happens; a weak suite passes. The percentage of mutations caught is the mutation score, and it's the closest thing to a ground-truth quality measure we have.
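
As a minimal illustration of the idea, here is a hypothetical canCheckout function with the mutant shown as a comment (this is the concept, not real Stryker output):

```typescript
// Hypothetical production code.
function canCheckout(cartTotal: number, minimumOrder: number): boolean {
  return cartTotal >= minimumOrder;
}

// A mutation tool would generate a mutant such as:
//   return cartTotal > minimumOrder;   // ">=" flipped to ">"
//
// This weak test passes against both the original and the mutant,
// so the mutant "survives" and the mutation score drops.
test("allows checkout for large carts", () => {
  expect(canCheckout(500, 25)).toBe(true);
});

// This boundary test fails against the mutant (25 > 25 is false),
// so the mutant is "killed" and counts toward the score.
test("allows checkout exactly at the minimum order", () => {
  expect(canCheckout(25, 25)).toBe(true);
});
```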

Our dataset

  • 100+ B2B companies on PanDev Metrics, of which 38 have measurable AI-assistant adoption across 2024-2026
  • IDE heartbeat data tagging coding sessions as AI-assisted vs solo (via keystroke pattern + extension telemetry)
  • External benchmarks: Ciniselli et al. 2024 (IEEE), Microsoft Research's 2023 Copilot test-gen study, and SmartBear's 2025 State of Testing survey
  • Data period: production IDE telemetry from Jan 2024 through early 2026

We don't run mutation testing on customer code. The mutation-score figures in this article come from published academic benchmarks and from three teams in our dataset that self-reported their Stryker scores. Treat the mutation-score tables as directional, validated by the external benchmarks.

What the data shows

AI-generated tests vs human-written, by mutation score

| Source | Median mutation score | Typical range | Coverage impact |
|---|---|---|---|
| Human-written (senior engineer) | 78% | 70-85% | Modest |
| Copilot (default prompt) | 62% | 55-68% | Large |
| Cursor Composer | 58% | 48-65% | Large |
| ChatGPT-driven (paste back and forth) | 54% | 42-62% | Variable |
| Human review + AI-assisted edits | 74% | 68-82% | Moderate |

The shape of the finding is consistent with Ciniselli et al. 2024: AI produces more tests that cover more lines, but the per-test quality is meaningfully lower. Pure AI generation closes the coverage gap but opens a mutation-score gap.

Mutation score comparison: human-written 78%, Copilot 62%, Cursor 58%, ChatGPT 54%. The gap that coverage metrics hide. AI tests clear the "it runs" bar; many fail the "it catches bugs" bar.

Where AI tests fail more often

Four failure patterns recur; we see them in post-incident retros from teams that shipped AI-authored tests without a review pass (the last two are sketched in code after the table):

| Failure pattern | How it looks | How often in AI-gen tests |
|---|---|---|
| Asserts on mocks, not behavior | `expect(mockFn).toHaveBeenCalled()` only | ~34% |
| Tautology assertions | `expect(result).toBeDefined()` with no value check | ~22% |
| Happy path only | No null / empty / error cases | ~41% |
| Test describes the code, not the requirement | "calls computeTax with amount" instead of "applies 8.5% CA sales tax to subtotal" | ~55% |
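
Here is a hedged before/after sketch of those last two patterns, using a hypothetical computeTax helper rather than anything from the dataset. The rewrite states the requirement in the test name and covers the error path the AI draft skipped.

```typescript
// Hypothetical helper, for illustration only.
function computeTax(subtotal: number, state: string): number {
  if (subtotal < 0) throw new Error("subtotal must be non-negative");
  return state === "CA" ? subtotal * 0.085 : 0;
}

// Typical AI draft: names the implementation, checks only that something
// came back, and covers only the happy path.
test("calls computeTax with amount", () => {
  expect(computeTax(100, "CA")).toBeDefined();
});

// Rewritten against the requirement: states the rule, adds the edge case.
describe("computeTax", () => {
  test("applies 8.5% CA sales tax to the subtotal", () => {
    expect(computeTax(100, "CA")).toBeCloseTo(8.5);
  });

  test("rejects negative subtotals", () => {
    expect(() => computeTax(-1, "CA")).toThrow();
  });
});
```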

SmartBear's 2025 State of Testing report surveyed 1,400 QA and engineering professionals: 74% reported using AI test generation, but only 28% said their team measures the quality (not just coverage) of AI-generated tests. The gap between adoption and measurement is the actual story.

Coverage vs mutation score, team-level

Our dataset shows a distinctive pattern once a team crosses 30% AI-assisted test authorship:

| AI-assisted test % | Typical coverage change | Typical mutation score change | Bug escape rate trend |
|---|---|---|---|
| 0-15% | Stable | Stable | Stable |
| 15-30% | +8-12 pp | Flat or -2 pp | Stable |
| 30-50% | +15-20 pp | -5 to -8 pp | +12-18% more escapes |
| 50%+ | +25 pp | -10 to -14 pp | +25-35% more escapes |

The contrarian finding: teams that maxed out AI test generation shipped MORE bugs in the 3 months after adoption, not fewer, despite higher coverage. Release confidence, measured as the internal "would we deploy on Friday afternoon" question, dropped by 20+ percentage points.

Heatmap: AI-assisted bug-escape rate by week post-adoption (brighter cells = higher escape rate). Across the 12 teams we tracked after adoption, the escape-rate pattern intensifies around weeks 4-8, right after AI tests reach critical mass in the suite.

What changes when review is mandatory on AI-authored tests

One protocol recovers most of the lost mutation-score ground: every AI-generated test gets a mandatory human review pass BEFORE merge, not after.

| Protocol | Median mutation score | Speed (tests/week) | Bug escape trend |
|---|---|---|---|
| No AI (baseline) | 78% | 1.0× | Baseline |
| AI-only, no review | 60% | 2.6× | +22% escapes |
| AI-authored + human reviewed | 74% | 2.1× | -3% escapes |
| AI suggests, human rewrites | 76% | 1.8× | -8% escapes |

The last row is interesting. Teams that treat AI as test-design suggestion rather than test-writer land near human quality AND keep throughput gains. Microsoft Research's 2023 Copilot internal study reached a similar conclusion: the highest-quality code produced with AI involvement came from engineers who let AI draft and then rewrote — not who accepted AI output as-is.

What this means for engineering leaders

1. Stop using coverage as your AI test-quality metric

Coverage tells you whether the test ran, not whether it catches anything. Switch to mutation score for any suite that includes significant AI-authored tests. Stryker (JS/TS), PIT (Java), and mutmut (Python) all ship with CI integrations. Cost: typically 5-10 minutes of CI time per PR at mid-sized repo scale.
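
As a rough sketch of what that gate can look like on a TypeScript/Jest repo with Stryker (treat the option values and thresholds as assumptions to check against the Stryker docs, not a drop-in config):

```json
{
  "testRunner": "jest",
  "mutate": ["src/**/*.ts", "!src/**/*.test.ts"],
  "reporters": ["clear-text", "html"],
  "thresholds": { "high": 80, "low": 70, "break": 60 }
}
```

Running `npx stryker run` in CI then fails the job when the mutation score drops below the break threshold, which turns mutation score into a merge gate rather than a dashboard number.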

2. Institute AI test-review as a merge gate

Write it into your code review checklist. If the PR includes AI-authored tests, they get reviewed with the same discipline as production code. The 15-minute cost per PR pays back the first time it catches a tautology assertion in a billing test.

3. Track AI-assisted coding share by project, not by person

AI adoption is never uniform across projects. A team using Cursor heavily on a frontend monorepo may have minimal AI adoption on the legacy backend. Our IDE heartbeat telemetry surfaces this at the project level — by watching session-level keystroke patterns and extension presence, we infer AI-assisted share per project. Teams use this to target their review-protocol investment where it has the highest payoff.

4. Expect a ~3-month dip, then recovery

Teams that invest in review discipline and mutation-score tracking pass through a quality dip around months 1-3 after adoption and come out ahead by month 6. Teams that don't invest stay stuck at the dip. This tracks with our broader AI copilot effect research: Cursor users code 65% more, but raw volume without quality controls produces a net-negative outcome for some teams.

Methodology

We measure AI-assisted coding sessions through IDE heartbeat data: when the Copilot, Cursor, or Continue extension is active during a coding session AND the keystroke-burst pattern matches AI-assisted entry (long paste-like edits with short typing bursts), we tag the session. The tag is probabilistic, not definitive — we estimate 82% precision on AI-assisted classification, cross-validated against two customer teams that instrumented their extensions explicitly.
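
For intuition only, here is a toy version of that kind of heuristic in TypeScript. The event shape and thresholds are invented for illustration; this is not PanDev's production classifier.

```typescript
interface EditEvent {
  charsInserted: number;      // size of the edit
  durationMs: number;         // wall-clock time the edit took to appear
  aiExtensionActive: boolean; // Copilot, Cursor, or Continue present
}

// Toy heuristic: a session looks AI-assisted when an AI extension is active
// and a meaningful share of edits arrive as large, near-instant insertions
// (paste-like bursts) rather than steady typing. Thresholds are illustrative.
function looksAiAssisted(events: EditEvent[]): boolean {
  if (events.length === 0 || !events.some((e) => e.aiExtensionActive)) {
    return false;
  }
  const burstLike = events.filter(
    (e) => e.charsInserted > 80 && e.durationMs < 500
  );
  return burstLike.length / events.length > 0.3;
}
```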

Mutation-score figures come from:

  • Ciniselli et al. 2024 IEEE, for academic benchmark
  • Three customer teams that ran Stryker/PIT and shared aggregate results
  • SmartBear State of Testing 2025 for the adoption-but-not-measurement finding
  • Microsoft Research 2023 internal Copilot study for the "AI suggests, human rewrites" pattern

An honest limit

We can't see mutation scores in most customer data: we see coverage, AI-assisted time share, and bug escape rates via incident links. The mutation-score gap in the tables is anchored in the academic benchmark and validated against the three teams that self-reported. A critical reader should treat the relationship (coverage up, mutation score down, escape rate up past the adoption threshold) as the defensible claim, and the specific mutation-score percentages as directional ballpark figures rather than a census.

The final argument

Coverage is a lie when AI writes the tests. Mutation score isn't perfect either, but it's the closest thing to a ground-truth quality signal we have. Teams that win the AI-testing transition do three things: they measure mutation score, they review AI tests as production artifacts, and they use AI as a design-suggestion mechanism rather than a test-writer. The mutation-score gap isn't about AI being bad; it's about the default workflow being unreviewed.

Try it yourself — free

Connect your IDE plugin in 2 minutes and see your real metrics. No credit card, no commitment.

Try Free