The system under test changed. The thinking didn't.

There’s a version of the AI-in-testing story that goes like this: the model writes the test, the CI pipeline runs it, and the QA engineer goes off to do something more interesting. Test coverage goes up. Bugs go down. Everyone wins.

It’s a seductive story. It’s also missing the hard part.

The thing that actually broke

[TODO: Opening personal anchor — the moment when you realized the shift wasn’t about test generation but about test strategy]

When I started paying attention to how teams were actually integrating LLM-based test tooling, the failure mode wasn’t what I expected. The tests ran. Coverage went up. And then, quietly, the system got less reliable — not more.

The reason: the teams had automated the output while leaving the input unchanged. They’d given AI the job of writing tests against a specification. What they hadn’t changed was how they thought about what to specify.

Quality is a modeling problem

A test is not a quality mechanism. It’s evidence that a model of behavior matches observed behavior at a point in time. The test is only as good as the model.

This matters because:

LLMs are very good at expressing a model as code. Given a clear behavioral description, a well-prompted model produces reasonable test coverage quickly.
LLMs are mediocre at building the model. They don’t know which edge cases are load-bearing. They don’t know about the implicit contract between System A and System B that exists only in the memory of the engineer who built them two years ago.
The skills that catch bad models are different from the skills that write test code. And they’re the skills that are easiest to let atrophy when AI is handling the syntax.
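
One way to picture that split is to keep the model as plain data and the test code as a loop over it. The sketch below is illustrative only: `ingest`, the event fields, and every expectation in `MODEL` are hypothetical, standing in for a System B that consumes events from a System A. The loop is the part a well-prompted LLM writes easily. The second and third rows are the part that has to come from someone who knows the contract.

```python
from dataclasses import dataclass
from typing import Optional


def ingest(event: dict) -> dict:
    """Hypothetical System B intake for events emitted by System A."""
    amount = event["amount_cents"]
    if type(amount) is not int or amount < 0:
        raise ValueError("amount_cents must be a non-negative integer")
    # Assumed implicit contract: a missing currency means a domestic order.
    return {"amount_cents": amount, "currency": event.get("currency", "USD")}


@dataclass(frozen=True)
class Expectation:
    description: str
    event: dict
    expected: Optional[dict]  # None means the event must be rejected
    source: str               # where this piece of the model came from


# The model of behavior, kept separate from the code that checks it.
MODEL = [
    Expectation(
        "typical charge passes through unchanged",
        {"amount_cents": 1250, "currency": "EUR"},
        {"amount_cents": 1250, "currency": "EUR"},
        source="visible in the code",
    ),
    Expectation(
        "System A omits currency for domestic orders",
        {"amount_cents": 1250},
        {"amount_cents": 1250, "currency": "USD"},
        source="engineer's memory",
    ),
    Expectation(
        "refunds never arrive as negative charges",
        {"amount_cents": -1250},
        None,
        source="engineer's memory",
    ),
]


def check(model: list) -> list:
    """Return the descriptions of expectations the system does not meet."""
    failures = []
    for exp in model:
        try:
            actual = ingest(exp.event)
        except ValueError:
            actual = None
        if actual != exp.expected:
            failures.append(exp.description)
    return failures
```

Calling `check(MODEL)` returns an empty list while the hypothetical system matches the model. The `source` field is the design choice worth noticing: it records which expectations could have been read off the code and which could not.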

[TODO: If you have data on test coverage vs defect escape rates from your own teams or public studies — this is the right place for it. Do not invent numbers.]

What systems thinkers do differently

[TODO: 2–3 paragraphs on the concrete practice — how you approach risk modeling, what questions you ask, how you distinguish between tests that reveal information and tests that just increase a coverage metric]

The teams that use AI well for testing share a common habit: they spend more time on test strategy after adopting the tooling, not less, because the bottleneck shifted. It used to be: can we afford to write these tests? Now it’s: do we know what to test?

That’s a harder question. It requires understanding failure modes. It requires talking to users. It requires the kind of thinking that doesn’t compress into a prompt.

The skills that compound

The engineers I’ve watched build genuinely reliable systems with AI-assisted testing have developed a specific muscle: they can read a test suite and know, quickly, whether it’s testing the system or testing the implementation.

Testing the system means: if this behavior changes in production, this test fails.

Testing the implementation means: if someone refactors this function, this test fails — even if nothing changed from the user’s perspective.

AI-generated tests, without guidance, skew heavily toward the second kind, because the implementation is what’s visible in the code, while the system is what’s visible in production.
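
A minimal sketch of the two kinds of test, using a hypothetical `Cart` class and an assumed pricing rule (10% off orders of 100.00 or more). None of these names come from a real codebase. `RefactoredCart` inlines the discount helper without changing any price a user would see, which is exactly the situation where the two checks disagree.

```python
from unittest import mock


class Cart:
    """Hypothetical cart: orders of 100.00 or more get 10% off."""

    def __init__(self, prices):
        self.prices = list(prices)

    def _apply_discount(self, subtotal, rate):
        return subtotal * (1 - rate)

    def total(self):
        subtotal = sum(self.prices)
        if subtotal >= 100:
            return round(self._apply_discount(subtotal, 0.10), 2)
        return round(subtotal, 2)


class RefactoredCart(Cart):
    """Same pricing, but the discount helper is inlined."""

    def total(self):
        subtotal = sum(self.prices)
        rate = 0.10 if subtotal >= 100 else 0.0
        return round(subtotal * (1 - rate), 2)


def check_system(cart_cls):
    """Tests the system: fails only if the price a user sees changes."""
    return (
        cart_cls([60.0, 40.0]).total() == 90.0
        and cart_cls([60.0, 39.99]).total() == 99.99
    )


def check_implementation(cart_cls):
    """Tests the implementation: fails once the helper is no longer called."""
    with mock.patch.object(
        cart_cls, "_apply_discount", return_value=90.0
    ) as helper:
        cart_cls([60.0, 40.0]).total()
    return helper.call_count == 1
```

Both classes satisfy `check_system`. Only the original `Cart` satisfies `check_implementation`, so that check would fail on a refactor that changed nothing from the user’s perspective.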

[TODO: Close the loop — what you help teams with, and why this framing is the basis for the consulting/speaking work]

This essay was adapted from talks I’ve given on quality strategy in AI-era engineering teams. If your team is navigating this transition, get in touch.