Large language models are surprisingly optimistic reviewers.
Ask an LLM to review an implementation plan and it will often approve things that are objectively wrong:
The problem is simple: the model is reasoning from its training data and the conversation context, not from your actual repository.
I wanted something different.
I wanted a reviewer whose default assumption is that the plan is wrong, and whose job is to prove it.
So I built **agent-plan-review-loop**, an open-source multi-agent orchestration system that repeatedly challenges implementation plans until they survive adversarial review.
Most AI review workflows look like this:
1. Author generates a plan
2. Reviewer checks the plan
3. Reviewer approves the plan
The weakness is that both agents often share the same context and reasoning chain, so the reviewer inherits the author's blind spots.
My approach intentionally breaks that connection.
Every artifact is stored as a markdown file inside the repository:
Each agent runs as a completely fresh process using Claude Code CLI.
The reviewer has no access to the author's reasoning.
It only sees:
This forces the reviewer to evaluate the plan on its own merits rather than continuing the author's thought process.
In practice, this catches a surprising number of mistakes.
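To make the isolation concrete, here is a minimal sketch of how a reviewer could be launched with nothing but files from disk as context. The artifact names (`task.md`, `plan.md`) and both helper functions are hypothetical, and the command line assumes the Claude Code CLI is on `PATH` and accepts print mode (`-p`) and `--model`; the project's real invocation may differ.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical artifact names; the real project may use different files.
REVIEWER_INPUTS = ("task.md", "plan.md")


def build_reviewer_prompt(artifact_dir: Path,
                          names: Sequence[str] = REVIEWER_INPUTS) -> str:
    """Assemble the reviewer's whole context from files on disk, nothing else."""
    sections = []
    for name in names:
        text = (artifact_dir / name).read_text(encoding="utf-8")
        sections.append(f"## {name}\n\n{text.strip()}")
    return "\n\n".join(sections)


def run_fresh_agent(prompt: str, model: str,
                    run: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> str:
    """Start a brand-new CLI process so no conversation state carries over."""
    # Assumption: `claude -p <prompt> --model <model>` runs one non-interactive turn.
    result = run(["claude", "-p", prompt, "--model", model],
                 capture_output=True, text=True, check=True)
    return result.stdout
```

Because the prompt is rebuilt from files on every call, the only way information reaches the reviewer is by being written into an artifact first. The `run` parameter exists only so the function can be exercised without the CLI installed.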
The system runs an Author → Reviewer loop until the reviewer approves or the task's iteration limit is reached.
Task → Classifier → Author → Reviewer

- CHANGES_REQUESTED: Author revises the plan and it returns to the Reviewer
- APPROVED: Coder implements
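The control flow above can be sketched in a few lines. This is an illustration under assumptions, not the project's actual code: `review_loop`, `LoopResult`, and the callable signatures for `author` and `reviewer` are hypothetical, and in the real system each callable would launch a fresh agent process.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

APPROVED = "APPROVED"
CHANGES_REQUESTED = "CHANGES_REQUESTED"


@dataclass
class LoopResult:
    plan: str
    approved: bool
    iterations: int


def review_loop(task: str,
                author: Callable[[str, Optional[str], Optional[str]], str],
                reviewer: Callable[[str, str], Tuple[str, str]],
                max_iterations: int) -> LoopResult:
    """Alternate author and reviewer until approval or the iteration cap."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    plan: Optional[str] = None
    feedback: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        # First pass: plan and feedback are None, so the author drafts from scratch.
        plan = author(task, plan, feedback)
        verdict, feedback = reviewer(task, plan)
        if verdict == APPROVED:
            return LoopResult(plan, True, iteration)
    # Cap reached without approval: return the last plan, flagged as unapproved.
    return LoopResult(plan or "", False, max_iterations)
```

Returning an unapproved result instead of raising leaves the decision about what to do with a plan that never passed review to the caller.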
The reviewer is intentionally adversarial.
Its primary instruction is:
> You are a SKEPTICAL senior REVIEWER. Find why this plan will FAIL. Do not praise it. Default to CHANGES_REQUESTED; approve only if genuinely sound.
Instead of asking "what's good about this plan?", the reviewer asks:
The result is far more useful feedback than generic AI approval.
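The "default to CHANGES_REQUESTED" rule can also be enforced when reading the reviewer's output. The sketch below assumes the reviewer is asked to end with a line such as `VERDICT: APPROVED`; that line format and the `parse_verdict` helper are hypothetical, not taken from the project.

```python
import re

# Hypothetical output convention: one line of the form "VERDICT: <value>".
VERDICT_RE = re.compile(
    r"^[ \t]*VERDICT:[ \t]*(APPROVED|CHANGES_REQUESTED)[ \t]*$",
    re.MULTILINE,
)


def parse_verdict(review_text: str) -> str:
    """Fail closed: anything other than one clear approval means changes."""
    verdicts = VERDICT_RE.findall(review_text)
    # Approve only when the review contains exactly one verdict line
    # and that line says APPROVED.
    if verdicts == ["APPROVED"]:
        return "APPROVED"
    return "CHANGES_REQUESTED"
```

A missing, malformed, or contradictory verdict is treated as a rejection, so a vague or rambling review can never approve a plan by accident.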
One challenge with agent systems is cost.
Running the most expensive model for every task quickly becomes impractical.
To solve that, I added a lightweight classification step using Haiku.
Each task is categorized before planning begins:
| Tier | Task Type | Author | Reviewer | Max Iterations |
| ---- | ----------------- | ------ | -------- | -------------- |
| T0 | Text, Config, CSS | Sonnet | Opus | 3 |
| T1 | Small Feature | Sonnet | Opus | 3 |
| T2 | Complex Refactor | Opus | Sonnet | 6 |
This allows the system to reserve expensive reasoning for genuinely difficult work.
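One way to encode the table above is a small lookup keyed by tier. The values mirror the table; the `TierConfig` type, the lowercase model aliases, and the fallback for unrecognized classifier output are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    author_model: str
    reviewer_model: str
    max_iterations: int


# Values copied from the tier table; model names written as lowercase aliases.
TIERS = {
    "T0": TierConfig("sonnet", "opus", 3),
    "T1": TierConfig("sonnet", "opus", 3),
    "T2": TierConfig("opus", "sonnet", 6),
}


def config_for(classifier_output: str) -> TierConfig:
    """Map the classifier's label to models and an iteration cap."""
    tier = classifier_output.strip().upper()
    # Assumption: an unrecognized label falls back to the most thorough tier.
    return TIERS.get(tier, TIERS["T2"])
```

Falling back to the most thorough tier trades some cost for safety when the classifier's answer cannot be parsed; falling back to the cheapest tier would be an equally valid choice for a cost-sensitive setup.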
Most r