In a previous post, I looked at how to write better instructions for Agents in Excel: how small shifts in framing can dramatically improve the quality of what the agent produces.
Here I want to push that idea further, because in real Excel work the hard part often isn’t building the model. It’s trusting it.
Auditability has always been a challenge in Excel. And when an agent can rewrite formulas or restructure sheets in seconds, that fragility only compounds.
So instead of asking, “How do I audit what the agent just did?” this post asks a more useful question: How do we define correctness before anything is built?
Using a sample sales dataset, we’ll build a monthly regional model for revenue, margin, commission, net profit, and a short-term forecast. We’ll see how to embed validation, reconciliation, and explicit behavioral rules into the agent’s instructions from the outset, so the model can evolve without breaking silently.
Download the exercise file to follow along:
Starting with a purpose
The typical first prompt to Agent Mode is something like “Build a revenue model from this table.” It’s directional, but it leaves too much unsaid. The agent gets a destination without any guidance on how to protect the journey.
A small but meaningful shift is to give it a clearer purpose that includes responsibility for integrity:
# PURPOSE
Build and maintain a monthly regional revenue, gross margin, commission, net profit, and short-term forecast model based on the raw_sales table.
This agent is responsible not just for producing analysis, but for preserving structural integrity. Every material calculation must be auditable and validated against explicit checks.
Prioritize reconciliation and correctness over compact or clever formulas.
That wording changes the agent’s posture. It’s no longer just generating output. It’s being asked to act as a steward of the model’s health. In practice, that one adjustment dramatically reduces the odds of clever-but-fragile implementations.
Define what is “correct” before anything gets built
This is where the spirit of test-driven instruction starts to feel useful in an Excel context. Rather than hoping the logic stays solid after changes, we spell out the non-negotiables upfront, before a single formula is written.
# GENERAL GUIDELINES
- No non-trivial calculation exists without at least one corresponding validation check.
- All aggregated totals must reconcile back to the raw_sales source.
- Margin % is always weighted (Total Margin / Total Revenue), never a simple average of row-level margins.
- Commission tier logic includes explicit boundary test cases.
- A dedicated “tests” sheet exists and stays current.
- If any validation fails, stop immediately and explain why.
By laying these expectations out explicitly, auditability stops being a post-mortem activity. It becomes part of the model’s architecture. When I’ve helped teams adopt even a lightweight version of this, they notice the difference almost immediately: fewer “where did that number come from?” surprises.
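To make a couple of those rules concrete, here is a minimal sketch of the reconciliation and weighted-margin checks that could sit behind the tests sheet, written as Python in Excel-style pandas code. The column names and values are illustrative stand-ins, not taken from the exercise file.

import pandas as pd

# Illustrative rows; with Python in Excel the real data would come from
# a table reference such as xl("raw_sales[#All]", headers=True).
raw = pd.DataFrame({
    "Region":  ["East", "East", "West", "West"],
    "Revenue": [1000.0, 2000.0, 1500.0, 500.0],
    "Margin":  [200.0, 500.0, 300.0, 50.0],
})

# Modeled aggregation (a stand-in for the monthly regional model)
modeled = raw.groupby("Region", as_index=False)[["Revenue", "Margin"]].sum()

# Reconciliation: aggregated totals must tie back to the raw_sales source
revenue_diff = modeled["Revenue"].sum() - raw["Revenue"].sum()
margin_diff = modeled["Margin"].sum() - raw["Margin"].sum()

# Invariant: margin % is weighted (Total Margin / Total Revenue),
# never a simple average of row-level margin percentages
weighted_margin = raw["Margin"].sum() / raw["Revenue"].sum()
reported_margin = modeled["Margin"].sum() / modeled["Revenue"].sum()

checks = {
    "revenue_reconciles": abs(revenue_diff) < 1e-9,
    "margin_reconciles": abs(margin_diff) < 1e-9,
    "margin_pct_is_weighted": abs(reported_margin - weighted_margin) < 1e-9,
}
for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")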
Guide the agent with sensible constraints
Agents, like people, can produce more consistent, maintainable work when the implementation space is narrowed thoughtfully. We can do that without micromanaging every detail:
# SKILLS
- Use structured Excel Tables for both raw and modeled data.
- Rely on SUMIFS or PivotTables for monthly aggregation by Region.
- Build reconciliation checks that compare raw totals to modeled totals.
- Include explicit commission boundary test rows.
- Use Python in Excel for dataset-wide assertions when it makes sense.
- Preserve a baseline snapshot sheet before any structural change.
These aren’t arbitrary rules. They’re the patterns that tend to survive real-world use: structured references reduce reference errors, reconciliation catches drift early, boundary tests expose off-by-one mistakes, and snapshots give you a fighting chance at before/after comparisons. The agent follows them not because it’s forced, but because they’re framed as the preferred way to do the work responsibly.
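To illustrate the boundary-test idea, here is a small Python sketch. The tier thresholds and rates are hypothetical placeholders rather than the ones in the exercise file; what matters is the shape of the test: values just below, at, and just above each boundary.

def commission_rate(revenue: float) -> float:
    # Hypothetical tiers for illustration only:
    #   below 10,000 -> 2%, from 10,000 up to 50,000 -> 4%, 50,000 and above -> 6%
    if revenue < 10_000:
        return 0.02
    if revenue < 50_000:
        return 0.04
    return 0.06

# Explicit boundary test cases: just below, at, and just above each threshold
boundary_cases = {
    9_999.99: 0.02,
    10_000.00: 0.04,
    49_999.99: 0.04,
    50_000.00: 0.06,
}
for revenue, expected in boundary_cases.items():
    actual = commission_rate(revenue)
    status = "PASS" if actual == expected else "FAIL"
    print(f"revenue={revenue:>10,.2f}  expected={expected:.0%}  actual={actual:.0%}  {status}")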
Enforce a validation loop
The piece that ties it all together is requiring the agent to follow a disciplined cycle every time it touches the model:
# WORKFLOW: ANALYSIS AND MODEL EVOLUTION
## Step 1: Clarify the analytical question
- Identify what the user is actually asking (e.g., variance explanation, forecast, scenario test).
- State the assumptions required to answer the question.
- Do not proceed until assumptions are explicit.
## Step 2: Confirm structural integrity before change
- Recalculate the model.
- Confirm all tests on the "tests" sheet PASS.
- If any fail, stop and explain before introducing new logic.
## Step 3: Extend or adapt the model
- Build necessary aggregation, decomposition, or forecast logic.
- If new structural calculations are introduced, add corresponding validation checks.
- For any new derived metric, define what “correct” means.
## Step 4: Validate after change
- Recalculate the workbook.
- Re-check reconciliation, invariants, and boundary tests.
- Report PASS/FAIL status clearly.
## Step 5: Deliver the analytical result
- Provide the requested explanation, forecast, or scenario output.
- Clearly separate analytical conclusions from structural validation results.
- If limitations materially affect interpretation, state them explicitly.
It’s a simple loop:
Validate → Modify → Re-validate → Report
but it embeds test-driven discipline into the agent’s behavior. When teams start using this pattern, the agent stops feeling like a black box. It starts feeling like a collaborator that shows its work.
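If you were to script that discipline rather than describe it in prose, it might look something like this rough Python sketch, where the named checks and the change callable are placeholders for whatever the model actually needs:

def run_tests(tests):
    # Evaluate every named check and return overall status plus the detail
    results = {name: check() for name, check in tests.items()}
    return all(results.values()), results

def evolve_model(tests, apply_change):
    # Validate: confirm structural integrity before touching anything
    ok, results = run_tests(tests)
    if not ok:
        raise RuntimeError(f"Pre-change validation failed: {results}")

    # Modify: extend or adapt the model
    apply_change()

    # Re-validate: re-run reconciliation, invariants, and boundary tests
    ok, results = run_tests(tests)

    # Report: PASS/FAIL status, stated clearly before any analysis is delivered
    for name, passed in results.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
    if not ok:
        raise RuntimeError("Post-change validation failed; stop and explain before delivering results.")

# Example usage with trivial placeholder checks
evolve_model(
    tests={"revenue_reconciles": lambda: True, "margin_is_weighted": lambda: True},
    apply_change=lambda: None,
)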
Constrain what the agent is allowed to do
The power of an agent doesn’t just come from what it can do. It also comes from what it is not allowed to do.
That’s the role of the limitations section.
# LIMITATIONS
- Do not modify or delete the "tests" sheet without explicit instruction.
- Do not silently adjust logic if validation fails.
- If a requested change conflicts with established invariants, flag the conflict.
- If assumptions materially affect results, ask before proceeding.
In real Excel work, the biggest messes often come from someone “just adjusting” a formula to make a number tie out, or quietly deleting a check that’s getting in the way.
That’s exactly what we don’t want an agent to do.
Encode the expected behavior
The examples section is there to make the behavior concrete. It shows the agent what “good” looks like in real interactions.
# EXAMPLES
User: "Why did net profit decline in September?"
Assistant:
1. States assumptions (margin definition, commission application).
2. Confirms all reconciliation and invariant tests PASS before analysis.
3. Builds month-over-month decomposition.
4. Validates totals still reconcile after adding helper calculations.
5. Reports PASS status.
6. Provides structured explanation of revenue, cost, and commission drivers.
User: "Which region had the largest margin deterioration this year?"
Assistant:
1. Confirms weighted margin invariant exists.
2. Aggregates margin by region.
3. Validates reconciliation and weighted margin logic.
4. Reports PASS status.
5. Presents ranked regional comparison.
User: "Forecast next quarter’s revenue based on recent trends."
Assistant:
1. States linear trend assumption.
2. Confirms forecast non-negative constraint exists.
3. Builds forecast calculation.
4. Re-checks reconciliation and invariants.
5. Reports PASS status.
6. Provides forecast output and notes assumption sensitivity.
All three examples reinforce the same behavioral pattern:
- State assumptions
- Perform the analysis
- Re-check reconciliation and invariants
- Report PASS status
- Deliver the result
That repetition is intentional. Over time, that structure matters more than the specific question being asked. It ensures the agent does not skip directly to interpretation.
The full instruction set is reproduced below. Go ahead and run it in Agent Mode. It may take a few moments.
And notice what’s happening here: this isn’t “Copilot, write me a formula” or “summarize this table.” This is closer to giving an analyst a spec and saying “build the model, keep it auditable, and prove it still behaves correctly every time you touch it.”
What changes in use
With these test-driven instructions in place, the interaction with the agent feels different.
Notice what happens before any analysis begins. The agent doesn’t jump straight to forecasting or profit explanations. It documents what already exists, confirms that a baseline snapshot was preserved, and explicitly calls out a required datatype correction on the Month column. That pause is intentional. If Month remains text instead of a true date serial, every downstream behavior—actual vs forecast flags, rolling windows, row counts—becomes unreliable.
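A dataset-wide assertion for that datatype issue can be tiny. Here is a Python sketch; the Month values are made up, and how the column gets pulled from the workbook is an assumption rather than something the agent is required to do this way.

import pandas as pd

# Stand-in for the raw_sales Month column; in Python in Excel this might be
# pulled with something like xl("raw_sales[Month]").
month = pd.Series(pd.to_datetime(["2024-07-01", "2024-08-01", "2024-09-01"]))

# The check the agent adds to the tests sheet: Month must already be a true
# date/datetime column; text that merely looks like a date is not good enough.
if not pd.api.types.is_datetime64_any_dtype(month):
    raise TypeError("Month is stored as text; convert it to a true date before modeling")
print("Month datatype check: PASS")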

After the correction, the real shift shows up on the tests sheet.
You can see reconciliation checks for Revenue, Margin, and Cost with Expected, Actual, Diff, and Pass columns. The differences are zero. PASS is TRUE. There are invariant checks: weighted margin calculations, margin bounded between 0 and 1, commission rates within tier bounds. Even the forecast logic is tested. Row counts must match the defined window, forecast revenue must be non-negative, and the first forecast month must equal the trailing 3-month average. Every one of these returns PASS.
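Those forecast checks translate into very small assertions. Here is a rough Python sketch with made-up revenue figures, assuming a three-month window and a forecast seeded from the trailing 3-month average as described above.

import pandas as pd

# Made-up monthly revenue actuals; the real series lives in the model sheet.
actuals = pd.Series([120_000.0, 130_000.0, 125_000.0, 140_000.0])

# Hypothetical forecast: the next quarter seeded from the trailing 3-month average.
window = 3
trailing_avg = actuals.tail(window).mean()
forecast = [trailing_avg] * window

checks = {
    "row_count_matches_window": len(forecast) == window,
    "forecast_revenue_non_negative": all(v >= 0 for v in forecast),
    "first_month_equals_trailing_avg": abs(forecast[0] - trailing_avg) < 1e-9,
}
for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")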

Now look at the September net profit question.
The agent’s response doesn’t just deliver a narrative. It first confirms that all structural integrity checks are passing. It separates “Quick answer” from “Integrity / validation status.” It states the assumptions used for commission tiers and the forecast method. Only after confirming that the model is healthy does it interpret the change from August to September.

Instead of acting like an eager analyst racing to conclusions, the agent behaves like a cautious collaborator. It verifies that revenue and margin reconcile back to raw_sales. It confirms that weighted margin logic hasn’t drifted. It checks that the commission tier logic is still within bounds. Then—and only then—it answers the business question.
The agentic shift
For a long time, most Excel models were trusted because the person who built them understood them. That was usually enough. The logic lived in someone’s head, and as long as they were careful, the workbook held together.
That approach doesn’t hold once agents are involved. When an agent can restructure sheets, rewrite formulas, and extend logic quickly, the model can drift without anyone noticing. It can still look tidy and coherent while something fundamental has shifted.
The shift here is simple but important: trust no longer comes from personal familiarity with the workbook. It comes from whether the model continues to satisfy clearly defined expectations. Reconciliation checks, invariants, boundary tests, and stop conditions aren’t extras. They are what make automation usable in serious work.
Used casually, Agent Mode is just a faster way to generate output. Used with an explicit correctness contract, it becomes part of a governed system. That difference is what determines whether automation reduces risk or quietly compounds it.
If you’re interested in building Excel models that remain trustworthy even under automation and AI, this is exactly the kind of systems-first thinking I focus on in my trainings and workshops:
