Origin note · May 2026
Autocomplete was the last good feedback loop.
Tab meant yes. Typing through it meant no. A binary signal, on every keystroke, from the one person who knew. Then we traded the loop for autonomy — one generation at a time.
Tab was the harness.
The first coding agents that worked were not the ones that did the most. They were the ones with the tightest harness. You typed, the model predicted the next edit, and your fingers answered — tab or keep typing. Instant feedback, every keystroke, from the one person who knew whether the prediction was right.
The model looked smart because the harness around it was good. The harness was a human reacting, all day, for free.
- 1 autocomplete
- 2 chat
- 3 task loop
- 4 background
- ax: signal reinjected
ax rejoins evidence so the loop can close again without you in it full-time.
Scope grew. Feedback collapsed.
Chat agents came next. You stopped pressing tab and started describing: build me this. The agent planned, executed, showed you the result, and you reacted in natural language. Still a loop, just slower and fuzzier. The original contract was simple: you chatted, you watched, you reacted. The intelligence was literally a human reading output.
Then a whole orchestration layer grew on top of the agent. Plan a task, break it down, iterate against checks, advance when the checks pass. The names changed quarterly. The shape did not.
Each of these is another layer between the human and the output. They make the agent more capable per session. They make the per-session signal sparser. The trade is the same at every layer: more autonomy in, less reaction out.
Not every layer subtracts signal. Some push it in.
The interview-style skills (Matt Peacock’s grillme is the sharpest example) turn the agent into a debugger of your own intent. Before any code runs, the model grills you for scope, terminology, decision-tree branches, the ADR you should have already written. The output is a tighter spec and a durable artifact, not a faster turn. I use it. It works.
The orchestration is the same shape; the direction of signal flow is opposite. Pulling human signal into the front of the loop is good. Closing the back of the loop — what happened after the agent ran, what should change next time — is the part still missing.
Across the four generations, the overall trend held. Capability went up every time. Per-session human feedback got sparser every time. The harness was traded for autonomy, one generation at a time.
The feedback loop was not a nice-to-have. It was the thing that made the prior generation good. Pull the human out and you do not just lose supervision; you lose the signal that taught the behavior. Replace it with a self-improvement loop that has no grounding and the agent will happily reflect itself into nonsense.
Where did the signal go?
It did not go anywhere. It stopped being captured. The signal is sitting in four places, all of them already on your laptop. The missing piece is something that joins them up and reflects on them.
The example that convinced me.
It was small and stupid, which is exactly why it convinced me.
I do not like agents working on main. I want a clean main and a worktree per task. So I did the obvious thing and wrote it into my CLAUDE.md and AGENTS.md: always branch, never touch main, keep the root clean.
It failed constantly. Under a full context window the agent simply lost that line. I would catch it three sessions later, working on main again, and spend the next ten minutes moving the work off. Same correction, over and over, scattered across weeks of chats.
The fix was not a firmer rule. I had already tried the firmer rule. The fix was to move the rule down the stack.
When I ingested the transcripts and ran a retro across them, the pattern was obvious in aggregate in a way it never was session to session: this rule does not survive context pressure. So the answer was to stop asking nicely and add a hook at the tool layer that blocks writes on main unless I explicitly allow it.
After that, main stays clean. Not because the agent got more disciplined, but because I stopped relying on its discipline.
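A minimal sketch of the decision such a hook could make, assuming the agent's hook mechanism hands the script the name of the pending tool and lets it deny the call. The tool names in `WRITE_TOOLS`, the protected branch set, and the `allow_main` override are hypothetical placeholders, not any specific agent's API.

```python
import subprocess

# Hypothetical names: adjust to whatever your agent's hook payload uses.
WRITE_TOOLS = {"Write", "Edit", "MultiEdit"}
PROTECTED_BRANCHES = {"main", "master"}


def current_branch(repo_dir: str = ".") -> str:
    """Return the checked-out branch name, or "" if git cannot tell us."""
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else ""


def should_block(tool_name: str, branch: str, allow_main: bool = False) -> bool:
    """Block write tools on a protected branch unless explicitly allowed."""
    if allow_main:
        return False
    return tool_name in WRITE_TOOLS and branch in PROTECTED_BRANCHES
```

A hook script would call `should_block(tool_name, current_branch())` and exit with whatever status the agent treats as a denial. The point is that the check runs outside the context window, so it cannot be forgotten by turn forty.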
- Lost under context pressure. Read once, forgotten by turn forty.
- Followed when the agent remembers to invoke it. Better. Not deterministic.
- Deterministic. Cannot be skipped. The outcome becomes binary: touched main or did not.
git checkout main · git commit on main
Governance
Enforced at runtime, not by prompt.
Agents are actors in your system. They need the same controls as human contributors: identity, permissions, audit trails. Treating them as autocomplete with extra steps is how you ship the wrong kind of autonomy.
Governance enforced by a system prompt ("please do not delete files", "always work on a worktree") is a suggestion. Governance enforced at the execution layer (deny lists, scoped credentials, deterministic command blocking) is actual governance. Without it, security teams veto autonomous agents entirely. And they are right to.
This is the push-down-the-stack move from the previous section. Prose drifts. The hook does not. Enforcement is the only signal that survives context pressure, scale, and the agent's own confidence that it knows better.
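Deterministic command blocking can be as plain as a deny list checked before a shell command runs. The sketch below is one hedged way to do it; the entries in `DENY_PREFIXES` are illustrative assumptions, and the matching is deliberately naive.

```python
import shlex

# Hypothetical deny list: each entry is a token prefix that is never allowed.
DENY_PREFIXES = [
    ("git", "checkout", "main"),
    ("git", "push", "--force"),
    ("rm", "-rf"),
]


def is_denied(command: str, deny_prefixes=DENY_PREFIXES) -> bool:
    """Return True if the command starts with any denied token prefix."""
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:
        # Unbalanced quotes: refuse rather than guess what would run.
        return True
    return any(tokens[: len(prefix)] == prefix for prefix in deny_prefixes)
```

Prefix matching does not see through `&&` chains or `sh -c` wrappers, so a real guard would need to split compound commands first. The shape is what matters: the answer is a boolean computed outside the model, not a sentence the model may or may not recall.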
Retro is only the first step.
This is why reflection on its own is not enough, and why ax is not just a journal of retros. A retro is a hypothesis. Left alone, hypotheses drift.
After each session, the agent leaves a small structured note: what was tried, what worked, what failed, what should change next time. Across a week, those notes accumulate. Then a bigger self-reflection pass runs over the retros and the graph: find repeated friction, propose harness changes, estimate what they would save, and ask which experiments to start.
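The structured note and the "find repeated friction" pass could be sketched like this. The field names mirror the four questions above but are otherwise assumptions, not ax's actual schema, and exact string matching stands in for whatever normalization a real pass would need.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RetroNote:
    """One session's structured note. Field names are illustrative."""
    session_id: str
    tried: list[str] = field(default_factory=list)
    worked: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)
    change_next_time: list[str] = field(default_factory=list)


def repeated_friction(notes: list[RetroNote], min_sessions: int = 2) -> list[tuple[str, int]]:
    """Return failures seen in at least `min_sessions` distinct sessions."""
    counts = Counter()
    for note in notes:
        # set() so one noisy session cannot inflate a friction's count.
        counts.update(set(note.failed))
    return [(item, n) for item, n in counts.most_common() if n >= min_sessions]
```

Counting by distinct session is the design choice that matters here: a failure that shows up once in each of three sessions is a pattern, while the same failure logged three times in one session is not.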
The user still decides. The graph decides what is worth asking about. Every accepted fix becomes an experiment with checkpoints at t+7, t+30, and t+90.
Otherwise you are improving on vibes, which is the precise failure mode ax is trying to avoid.
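The checkpoint schedule is simple date arithmetic. A minimal sketch, assuming an experiment is just a name plus the date the fix was accepted; the `Experiment` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

CHECKPOINT_DAYS = (7, 30, 90)  # t+7, t+30, t+90 from the text


@dataclass
class Experiment:
    name: str
    accepted_on: date

    def checkpoints(self) -> list[date]:
        """Dates on which the fix should be re-evaluated."""
        return [self.accepted_on + timedelta(days=d) for d in CHECKPOINT_DAYS]

    def due(self, today: date) -> list[date]:
        """Checkpoints whose date has arrived as of `today`."""
        return [c for c in self.checkpoints() if c <= today]
```

For a fix accepted on 1 May 2026, the checkpoints fall on 8 May, 31 May, and 30 July. Each one is a prompt to ask whether the friction actually went away.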
Why coding first.
I am starting with coding agents on purpose, and not because this is the only place the loop applies. Coding is where the ground truth is already close to the work. Tests pass or they do not. The thing merged or got reverted. The user accepted the pull request or filed a bug. The repository already contains much of the truth.
For a marketing agent you would have to plumb in analytics. For a sales agent you need CRM outcomes. For a research agent you need source quality and downstream use. Each domain has its own evidence. Coding already has the harness bolted on, so coding is where you build the reflection loop first, prove the shape, and carry it to messier domains after you trust it.
What ax is.
The stack in 2026 has compute, tools, logs, and a pile of memory bolt-ons. It still does not have a reflection step. I know this because I was the reflection step.
For months I was the one noticing the same friction across sessions, deciding what to change, and checking weeks later whether it helped. ax is me automating the loop I was already closing by hand.
It ingests Claude Code and Codex transcripts, tool calls, skills, hooks, corrections, and local git history into a typed graph on your laptop. It asks for session retros while context is still warm. It lets bigger retros surface repeated friction. It turns proposed fixes into experiments and asks for verdicts later.
The goal is not to build a vague memory product. The goal is to build the agent experience layer: the local system that measures what the agent did, reflects on it, proposes improvements, and checks whether those improvements actually helped.
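To make "typed graph" concrete, here is a toy version of the idea: every node has a kind, every edge has a kind, and queries filter on those kinds. The node and edge kinds below are invented for illustration and should not be read as ax's real data model.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    SESSION = "session"
    TOOL_CALL = "tool_call"
    CORRECTION = "correction"
    COMMIT = "commit"


class EdgeKind(Enum):
    CONTAINS = "contains"  # session -> tool_call / correction
    PRODUCED = "produced"  # session -> commit


@dataclass
class Graph:
    nodes: dict[str, NodeKind] = field(default_factory=dict)
    edges: list[tuple[str, EdgeKind, str]] = field(default_factory=list)

    def add_node(self, node_id: str, kind: NodeKind) -> None:
        self.nodes[node_id] = kind

    def add_edge(self, src: str, kind: EdgeKind, dst: str) -> None:
        # Refuse edges that point at nodes the graph does not know about.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown node in edge {src} -> {dst}")
        self.edges.append((src, kind, dst))

    def neighbors(self, src: str, kind: NodeKind) -> list[str]:
        """Ids of nodes of the given kind reachable from `src` in one hop."""
        return [d for s, _, d in self.edges if s == src and self.nodes[d] is kind]
```

With this shape, "which corrections happened in session s1" is `graph.neighbors("s1", NodeKind.CORRECTION)`, and the same question can be asked across every session to surface repeated friction.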
If you want the argument instead of the story, read the manifesto. If you want to try it, it is on GitHub, MIT licensed, and runs on your laptop. Then tell me where the shape is wrong.