Rewriting the Planner: How a 9-Field Task Contract Changed Our AI Agents

The ACO system runs a five-agent pipeline: PM → Planner → Architect → Developer → QA. Each link in the chain passes output to the next, so the quality of the Planner’s output directly determines how much guesswork the Developer has to do.

For weeks, the Developer was complaining. Not literally, but it kept returning implementations that didn’t match what we wanted: wrong file paths, missing test coverage, functions that didn’t do what the acceptance criteria said. We diagnosed it as a prompt problem: the Planner prompt was too open-ended.

The Problem: Vague tasks create downstream cascade failures

When the Planner output a task like “Implement currency conversion logic,” the Developer had to infer:

Which file to create
What the function signature should look like
What edge cases to handle
How to test it
What counted as “done”

This meant every task required a back-and-forth or a revision cycle. The Developer wasn’t wrong—it was just being asked to make product decisions that should have been made upstream. The Planner was the right place for those decisions.

The Solution: A strict 9-field task contract

We rewrote the planner.md prompt and the _create_tasks() method in agents/planner.py to enforce that every task MUST carry these nine fields:

title — action-oriented, prefixed [implement] or [test]
description — full implementation approach
file_path — exact file to create or modify
function_signature — exact signature
dependencies — task IDs this task depends on
acceptance_criteria — numbered, each mapping to one test case
test_strategy — specific framework and mock approach
technical_notes — library choices, patterns, constraints
estimate_hours — 1–16h range
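To make the contract concrete, here is a minimal sketch of how the nine fields could be modeled and checked in Python. The names TaskSpec and validate_task are hypothetical and are not taken from agents/planner.py. The rule that dependencies may be empty is also an assumption.

```python
from dataclasses import dataclass, fields

# Title prefixes named in the contract.
VALID_PREFIXES = ("[implement]", "[test]")


@dataclass
class TaskSpec:
    """Hypothetical model of the nine-field task contract."""
    title: str
    description: str
    file_path: str
    function_signature: str
    dependencies: list[str]
    acceptance_criteria: list[str]
    test_strategy: str
    technical_notes: str
    estimate_hours: float


def validate_task(task: TaskSpec) -> list[str]:
    """Return a list of contract violations; an empty list means the task is valid."""
    problems = []
    for f in fields(task):
        # Assumption: a task with no upstream dependencies is allowed,
        # so only the other eight fields must be non-empty.
        if f.name != "dependencies" and not getattr(task, f.name):
            problems.append(f"{f.name} is empty")
    if not task.title.startswith(VALID_PREFIXES):
        problems.append("title must start with [implement] or [test]")
    if not 1 <= task.estimate_hours <= 16:
        problems.append("estimate_hours must be between 1 and 16")
    return problems
```

Under this sketch, a task titled “Implement currency conversion logic” with no file_path would come back with two violations instead of reaching the Developer.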

The acceptance_criteria field was the key insight. Each criterion must map 1:1 to a test case. This means the QA agent can verify completion by running the test suite and checking coverage, rather than by reading prose.
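One way to check the 1:1 mapping mechanically is sketched below. It assumes a naming convention in which the test for criterion N is a function named test_ac<N>_...; that convention and the helper unmatched_criteria are illustrative assumptions, not something the pipeline is known to use.

```python
import re

# Assumed convention: the test for criterion 3 is named "test_ac3_<something>".
_TEST_NAME = re.compile(r"test_ac(\d+)_")


def unmatched_criteria(criteria: list[str], test_names: list[str]) -> list[int]:
    """Return the 1-based numbers of criteria that do not have exactly one test."""
    counts: dict[int, int] = {}
    for name in test_names:
        match = _TEST_NAME.match(name)
        if match:
            number = int(match.group(1))
            counts[number] = counts.get(number, 0) + 1
    # A criterion is satisfied only when exactly one test claims it.
    return [n for n in range(1, len(criteria) + 1) if counts.get(n, 0) != 1]
```

An empty result means every criterion has exactly one test; any number in the result points at a criterion that is either untested or tested more than once.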

Implementation + Test Pairing: The rule that changed everything

The second big change was the pairing rule: every [implement] task must be followed by a paired [test] task. Infrastructure tasks—README updates, requirements.txt—don’t need pairs, but any logic-touching task gets a test partner automatically.

This sounds simple, but it changed the rhythm of the pipeline. Previously, tests were an afterthought, added only if there was time. Now they’re first-class citizens from the start: the Planner creates each implementation task and its test task together, and the Developer’s prompt receives both at the same time.
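The pairing rule can be checked over the ordered task list with a few lines of Python. This is a sketch under assumptions: PlannedTask and its infrastructure flag are hypothetical, and “followed by” is read as “immediately followed by”.

```python
from dataclasses import dataclass


@dataclass
class PlannedTask:
    title: str
    # Hypothetical flag marking README/requirements.txt-style tasks,
    # which are exempt from the pairing rule.
    infrastructure: bool = False


def unpaired_implement_tasks(tasks: list[PlannedTask]) -> list[str]:
    """Return titles of [implement] tasks not immediately followed by a [test] task."""
    missing = []
    for index, task in enumerate(tasks):
        if task.infrastructure or not task.title.startswith("[implement]"):
            continue
        following = tasks[index + 1] if index + 1 < len(tasks) else None
        if following is None or not following.title.startswith("[test]"):
            missing.append(task.title)
    return missing
```

An empty list means every logic-touching task has its test partner directly after it.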

Quality Checklist: 11 items, 8 must pass

To prevent the Planner from generating compliant-but-useless tasks, we added a quality checklist embedded in the Planner’s system prompt. For a task list to be considered valid, at least 8 of 11 checklist items must be verified:

Architecture diagram ✅
Data flow diagram ✅
State machine diagram ✅
≥5 edge cases mapped ✅
≥3 risks identified ✅
All tasks have file_path ✅
All tasks have function_signature ✅
All tasks have acceptance_criteria ✅
All tasks have test_strategy ✅
Paired impl+test tasks ✅
Critical path identified ✅

If fewer than 8 items pass, the prompt tells the Planner to go back and revise the task list.
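The 8-of-11 threshold is simple arithmetic, and a sketch of it might look like the following. In the system described here the checklist lives in the Planner’s prompt; the item keys and the checklist_passes helper below are hypothetical and only illustrate the rule.

```python
# Hypothetical keys for the eleven checklist items listed above.
CHECKLIST_ITEMS = (
    "architecture_diagram",
    "data_flow_diagram",
    "state_machine_diagram",
    "edge_cases_mapped",       # at least 5
    "risks_identified",        # at least 3
    "all_have_file_path",
    "all_have_function_signature",
    "all_have_acceptance_criteria",
    "all_have_test_strategy",
    "paired_impl_test_tasks",
    "critical_path_identified",
)
MIN_PASSING = 8


def checklist_passes(results: dict[str, bool]) -> bool:
    """Return True when at least MIN_PASSING of the known items are marked True."""
    # Items missing from the results are treated as not verified.
    passed = sum(1 for item in CHECKLIST_ITEMS if results.get(item, False))
    return passed >= MIN_PASSING
```

Note that under this rule a task list could pass with the three diagram items missing, as long as the remaining eight items hold.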

Results

After the rewrite, the Planner generated its first batch of fully specified tasks for the currency converter story. The resulting currency.py and test_currency.py were committed on the first pass, with no revisions and no follow-up questions.

The remaining gap is that tests still need a home. We haven’t resolved whether tests belong in aco-system’s tests/ directory or in the target project repo (like aco-test). Aniket’s PR #355 on feature/minimax tackles this, but it’s still open. Once that’s settled, the pipeline will be truly end-to-end, from story to shipped, tested code.

This was the kind of prompt engineering that feels like infrastructure work—you’re not building a feature, you’re building the thing that decides how all the other things get built.
