The Contract That Changed How My Agents Think

Every agent system eventually hits the same wall: the LLM outputs look reasonable, but the code that comes out is unfocused. The functions are wrong. The edge cases are missing. The tests don’t match the implementation. You look at the output and you can’t quite blame the model, because the instructions were vague.

That was the Planner problem in the aco-system. After months of running the full pipeline (PM → Planner → Architect → Developer → QA), it became clear that the Planner was the bottleneck. Not because it was slow, but because its outputs left too much undefined. A task description that says “implement user authentication” gives the Developer agent nothing to anchor on. It has to make dozens of micro-decisions that the Planner should have made.

The fix, which landed on April 12, was deceptively simple: make the Planner fill out a strict contract for every task.

What a Task Contract Looks Like

Before the rewrite, a Planner output looked like this:

## Task 3: [implement] User authentication
- Create a login endpoint using JWT
- Handle invalid credentials gracefully
- Store passwords as hashes

That’s readable. It’s also uselessly vague for an autonomous agent. What endpoint? What library? What’s the exact function signature? What counts as “gracefully”? What hash algorithm?

The new contract requires nine fields for every task:

title — action-oriented, prefixed [implement] or [test]
description — full implementation approach
file_path — exact file to create or modify
function_signature — exact function signature
acceptance_criteria — numbered, each mapping to one test case
test_strategy — specific framework and mock approach
technical_notes — library choices, patterns, constraints
estimate_hours — 1–16h
dependencies — task IDs this task depends on
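The nine fields above map naturally onto a typed record. The following is a minimal sketch, not the aco-system's actual schema: the class name `TaskContract`, the `problems()` helper, and the choice to treat every text field as required are assumptions made for illustration. Only the field names, the `[implement]`/`[test]` prefixes, and the 1–16h range come from the list.

```python
from dataclasses import dataclass, field

ALLOWED_PREFIXES = ("[implement]", "[test]")


@dataclass
class TaskContract:
    """Hypothetical container for the nine contract fields."""

    title: str
    description: str
    file_path: str
    function_signature: str
    acceptance_criteria: list[str]
    test_strategy: str
    technical_notes: str
    estimate_hours: float
    dependencies: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return readable problems; an empty list means nothing is missing."""
        found = []
        if not self.title.startswith(ALLOWED_PREFIXES):
            found.append("title must start with [implement] or [test]")
        for name in ("description", "file_path", "function_signature",
                     "test_strategy", "technical_notes"):
            if not getattr(self, name).strip():
                found.append(f"{name} is empty")
        if not self.acceptance_criteria:
            found.append("acceptance_criteria needs at least one item")
        if not 1 <= self.estimate_hours <= 16:
            found.append("estimate_hours must be between 1 and 16")
        return found


# The old, vague task expressed in the new shape: the gaps become visible.
vague = TaskContract(
    title="[implement] User authentication",
    description="Create a login endpoint using JWT",
    file_path="",
    function_signature="",
    acceptance_criteria=[],
    test_strategy="",
    technical_notes="",
    estimate_hours=3,
)
for problem in vague.problems():
    print(problem)
```

An empty `dependencies` list is allowed here on the assumption that some tasks depend on nothing.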

Now the same task looks like:

## Task 3: [implement] User authentication

**file_path:** `src/auth.py`
**function_signature:** `def authenticate_user(email: str, password: str) -> dict: ...`
**description:** Create a FastAPI endpoint at POST /auth/login that validates credentials against the users table, returns a signed JWT with 24h expiry using PyJWT.
**technical_notes:** Use bcrypt via passlib.hash for password comparison. JWT secret from env AUTH_SECRET. Store expiry as Unix timestamp.
**acceptance_criteria:**
1. Valid credentials return 200 with JWT token
2. Invalid email returns 401 with {"error": "invalid_credentials"}
3. Missing fields return 422 with validation errors
4. Expired token returns 401
**test_strategy:** pytest with Respx mocking HTTP calls. Patch `env.AUTH_SECRET` with a known test secret.
**estimate_hours:** 3
**dependencies:** [task-1: user model created]

The difference is not cosmetic. The Planner now has to make the decisions — which means it has to think through the implementation before the Developer ever sees it. The Developer receives a function signature, not a direction.
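One consequence of asking for an exact signature is that it becomes machine-checkable. The sketch below uses Python's standard `ast` module to confirm that a `function_signature` value is a real function definition and to pull out its parts. The helper name `parse_signature` is hypothetical, and nothing in the article says the aco-system performs this check; it only illustrates why a signature is a firmer anchor than a prose direction.

```python
import ast


def parse_signature(signature: str) -> dict:
    """Parse a 'def name(args) -> ret: ...' string and return its parts.

    Raises ValueError if the string is not a single function definition.
    """
    try:
        module = ast.parse(signature)
    except SyntaxError as exc:
        raise ValueError(f"not valid Python: {exc.msg}") from exc
    if len(module.body) != 1 or not isinstance(
        module.body[0], (ast.FunctionDef, ast.AsyncFunctionDef)
    ):
        raise ValueError("expected exactly one function definition")
    func = module.body[0]
    return {
        "name": func.name,
        "params": [arg.arg for arg in func.args.args],
        "returns": ast.unparse(func.returns) if func.returns else None,
    }


info = parse_signature(
    "def authenticate_user(email: str, password: str) -> dict: ..."
)
print(info)
# {'name': 'authenticate_user', 'params': ['email', 'password'], 'returns': 'dict'}
```

Passing the old-style text “implement user authentication” to the same helper raises `ValueError`, since it is not Python at all.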

The Unexpected Effect: Planner Starts Debugging Itself

When you force an LLM to commit to specific details, you discover something interesting: it starts catching its own gaps.

With the old vague format, the Planner could produce a plausible-sounding task list without ever confronting the hard questions. Does the API use a database session per request or share one? Is the JWT secret loaded at startup or fetched per-request? What’s the error response format?

The contract forces the Planner to answer those questions — or at least write something specific enough that contradictions become visible. When the contract says one thing in the description and another in the function signature, it’s a signal that the Planner hasn’t thought the task through fully. That signal was never visible before.

This is the part that surprised me most. We didn’t add a new validation step or a separate reviewer agent. We just changed the output format, and the Planner started self-correcting because the contradictions were no longer buried in prose.

Implementation + Test: An Inseparable Pair

One specific rule from the new contract: every [implement] task must be followed by a paired [test] task. Infrastructure tasks — README, requirements.txt, config files — are exempt. But logic code always gets a test partner.

This wasn’t an obvious rule, and one related question is still being debated: whether tests should live inside the agent system repo or inside the target project repo. But the pairing rule itself has held up well in practice.

The reason it works: when the Planner writes a test task alongside an implementation task, it has to think about what “done” actually means. The acceptance criteria have to be testable. The edge cases have to be enumerated. Writing the test task forces clarity on the implementation task.
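The pairing rule is simple enough to check mechanically. The sketch below is one possible reading of it, with two assumptions that are not in the article: “followed by” is taken to mean the very next task in the list, and infrastructure exemptions are marked with a hypothetical `infrastructure` flag.

```python
from typing import NamedTuple


class Task(NamedTuple):
    title: str
    # Hypothetical flag: the article does not say how exemptions are marked.
    infrastructure: bool = False


def unpaired_implement_tasks(tasks: list[Task]) -> list[str]:
    """Return titles of [implement] tasks not immediately followed by a [test] task."""
    missing = []
    for index, task in enumerate(tasks):
        if not task.title.startswith("[implement]") or task.infrastructure:
            continue
        following = tasks[index + 1] if index + 1 < len(tasks) else None
        if following is None or not following.title.startswith("[test]"):
            missing.append(task.title)
    return missing


plan = [
    Task("[implement] requirements.txt", infrastructure=True),
    Task("[implement] User authentication"),
    Task("[test] User authentication"),
    Task("[implement] Token refresh"),  # illustrative task with no test partner
]
print(unpaired_implement_tasks(plan))
# ['[implement] Token refresh']
```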

Quality Gates: 8 of 11 Before the Pipeline Runs

The new system has a quality checklist with 11 items, and the pipeline won’t proceed until at least 8 are verified. The checklist includes an architecture diagram, a data flow diagram, a state machine diagram, at least 5 mapped edge cases, at least 3 identified risks, and the new per-task items: every task has a file_path, a function_signature, acceptance_criteria, and a test_strategy.

The interesting design choice here is that the Architect agent reviews against this checklist, not against the code itself. The question isn’t “is the code correct?” — it’s “has the Planner thought through this enough that correctness is testable?”
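The 8-of-11 threshold can be sketched as a simple count. The function name `gate_passes` and the dictionary keys are hypothetical. The keys assume that each per-task requirement counts as its own checklist item; the article does not name the remaining two items, so they appear here as placeholders.

```python
REQUIRED_PASSES = 8
CHECKLIST_SIZE = 11


def gate_passes(checklist: dict[str, bool]) -> bool:
    """Return True when at least 8 of the 11 checklist items are verified."""
    if len(checklist) != CHECKLIST_SIZE:
        raise ValueError(f"expected {CHECKLIST_SIZE} items, got {len(checklist)}")
    return sum(checklist.values()) >= REQUIRED_PASSES


checklist = {
    "architecture_diagram": True,
    "data_flow_diagram": True,
    "state_machine_diagram": False,
    "edge_cases_mapped_min_5": True,
    "risks_identified_min_3": True,
    "all_tasks_have_file_path": True,
    "all_tasks_have_function_signature": True,
    "all_tasks_have_acceptance_criteria": True,
    "all_tasks_have_test_strategy": True,
    "unlisted_item_10": False,  # placeholder: not named in the article
    "unlisted_item_11": False,  # placeholder: not named in the article
}
print(gate_passes(checklist))  # True: 8 of the 11 items are verified
```

Note that a count-based gate treats all items as interchangeable. Whether any item is mandatory regardless of the count is not stated in the article.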

What This Means for Agent Design

Most agent prompt engineering focuses on what the agent should do. Give it better instructions, better examples, better system prompts. That’s not wrong — but it’s incomplete.

The Planner overhaul is about something different: forcing the agent to make its implicit knowledge explicit before passing work downstream. The constraint isn’t “be better.” It’s “commit to specifics.”

This is relevant beyond the aco-system. Any time you have agents passing work to other agents, or agents passing work to humans, the quality of the handoff largely determines the quality of the result. A vague task description produces a vague result. A strict contract is a forcing function for precision.

The aco-system is still evolving. There are open questions: where tests belong, how the pipeline should handle timeouts on MiniMax’s API, and whether OpenRouter should replace MiniMax for production. But the Planner overhaul changed something fundamental: the pipeline now produces tasks that the Developer can execute without second-guessing the instructions.

That’s the real goal: agents that hand off work as if they were handing it to a colleague who knows the domain, but not this specific task. The contract is what makes that possible.
