This week I paired with a dev on my team who was stuck in a familiar place:

“works on my machine, but not in CI.”

If you've been there, you know it's a thief of joy. Hours disappear. Confidence drops. All momentum is lost.

The Setup

Here was the situation:

  • testing a Windows-only Electron app

  • Playwright tests

  • Running in an Azure DevOps pipeline

  • Executed on a beefy self-managed Windows VM

  • Screenshots captured on failure

In theory, this should be enough to debug a failure: look at the screenshot, check the stack trace, fix the issue. In practice... not even close.

The first test in this new tranche of smoke tests was failing consistently and leaving behind four artifacts that didn't agree with each other.

  • JUnit XML (stack trace)

  • A screenshot of the app (which looked to be in the correct state)

  • A blank screenshot (at a resolution different from the app's)

  • A markdown file called "error context" saying there was a modal present

None of these lined up.

The test itself was simple: start from a clean state, perform an action, then open a compendium page that displays the result. There should be no dialog and no blank page.

I had a choice to make:

  • help fix the test

  • or build a machine that helps fix the test

I chose the latter.

Mark 1

I knew Playwright had more data at its disposal than we were capturing. We just had to turn it on. Enter trace logs.

I configured the smoke suite to output and retain them on failure so they would show up in the artifact output along with the other artifacts. This would help us better understand what happened in the test (and some of the events from the underlying app) leading up to the failure.
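
A minimal sketch of that config change, using Playwright's built-in `'retain-on-failure'` trace option (only the relevant setting shown):

```typescript
// playwright.config.ts (sketch; only the trace setting shown)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Record a trace for every test, but keep it only when the test fails,
    // so it shows up alongside the other failure artifacts.
    trace: 'retain-on-failure',
  },
});
```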

To consume this downstream, I created an agent that would pull data out of ADO through a combination of the ADO API and the ADO MCP Server. Since this was an agent, I wanted to go all in on the MCP server, but some of the data I needed was only accessible from the API (specifically, the bit that made it easy to find just the failing jobs in a pipeline run).

The end result was something akin to a bash pipe strategy: a test run creates output that can be "piped" to the next step — the agent, in this case (run manually for now). The agent pulled down the artifacts and pipeline metadata, transformed them into a JSON payload, and fed it to a local model for enrichment to see if it could detect a root cause and give a confidence score.
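
The "find just the failing jobs" part maps to the Build Timeline endpoint in Azure DevOps's public REST docs, which returns a flat list of records for a run. The `TimelineRecord` fields below are the minimal subset I'd expect to need, and the fetch call in the comment is an untested sketch with placeholder org/project/PAT values:

```typescript
// Minimal slice of an Azure DevOps Build Timeline record.
interface TimelineRecord {
  type: string;     // 'Job', 'Task', 'Stage', ...
  name: string;
  result?: string;  // 'succeeded', 'failed', ...
}

// Pick out just the failed jobs from a pipeline run's timeline.
function failingJobs(records: TimelineRecord[]): TimelineRecord[] {
  return records.filter(r => r.type === 'Job' && r.result === 'failed');
}

// Untested sketch of the API call itself (org, project, buildId, pat are placeholders):
// const res = await fetch(
//   `https://dev.azure.com/${org}/${project}/_apis/build/builds/${buildId}/timeline?api-version=7.1`,
//   { headers: { Authorization: `Basic ${Buffer.from(':' + pat).toString('base64')}` } },
// );
// const { records } = await res.json();
// console.log(failingJobs(records).map(j => j.name));
```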

This got me closer to what happened.

But not why.

Mark 2

So I added more context.

I figured if the agent was going to fix the test, it needed to understand the test.

I generated a lightweight abstract syntax tree (AST) from the test code using ts-morph. This added details about the test file, the failing line, the assertion, and any helper functions used by the test. It was generated as part of the test suite teardown.

I updated the agent to pull in this information as well so it got rolled up into the final JSON payload that was already being constructed for the other artifacts. At this point I started calling this agent payload a "repair context".

I fed it into a local coding agent that had access to the test source code and asked it for a patch.

The results were:

  • incorrect

  • irrelevant

  • laughably bad

I tried a few variations. Same result. So I went for a walk.

The Epiphany of The Staircase

I kept coming back to the same thought:

the agent just needs more context

But I couldn't quite wrap my head around what context was missing. I also remembered that, up to this point, I had purposefully not looked at the test source code. I didn't want to bias the agent.

But the current approach wasn't working. So I looked through the test source code.

On the first pass through I clocked a few things:

  • The tests are run sequentially

  • The tests reuse the application state (e.g., one global setup and teardown)

  • The test that was failing challenged this paradigm
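
In Playwright config terms, the suite's shape was roughly this (a sketch of the observed pattern; the file names are invented):

```typescript
// playwright.config.ts (sketch of the observed pattern, file names invented)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  workers: 1,                             // tests run sequentially
  fullyParallel: false,
  globalSetup: './global-setup.ts',       // one shared app state for the whole run...
  globalTeardown: './global-teardown.ts', // ...torn down once at the end
});
```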

My goal at this stage wasn't to double-click on the suite's design decisions or to judge them. But there were some anti-patterns that could explain the failure, or could have contributed to noise that papered over the actual issue. Going back to my original strategy of building the machine that fixes the failing test, it became clear what the next step was.

Mark 3

Instead of adding more raw data, I added structure.

I wrote:

  • a contract describing how the test suite works, and common anti-patterns to avoid

  • a repair prompt explaining how to use this contract and the repair context

With that in place, the agent started producing better patches when I pasted in the repair context. Not perfect—but better. I was able to get past the original failure and immediately hit the next one.

Outro

The goal now is to put in reps with this agent. Ideally with speed. To do that I need to make this into a friendlier workflow. More on that in a future post.
