OpenClaw is on its sixth version. Not because I planned six versions. Because each one taught me something the previous one got wrong. This is the story of those iterations and the hard-won insight they produced: agents don't improve by making the model smarter. They improve by building better infrastructure around the model.
Six Iterations
v1: Single Agent
The first attempt was the obvious one. One agent, one prompt, one model. Give it a task, let it figure it out. It worked for simple tasks. For anything complex, it was brittle. One misunderstood instruction and the entire task derailed. No recovery, no decomposition, no fallback. The model was doing everything: planning, executing, verifying, all in one context window. It was too much.
v2: Staged Execution
Split the work into stages. Planning stage, execution stage, output stage. Each stage had its own prompt and context. Better, but the stages didn't communicate well. The execution stage would drift from the plan. The output stage would produce something the planning stage never intended. The problem wasn't the stages. It was the handoffs between them.
v3: Verification Gates
Added verification after each stage. Before moving to execution, verify the plan is coherent. Before accepting output, verify it matches the spec. This was the first version that reliably completed complex tasks. Verification caught drift early. But it was slow. Every task went through multiple verification cycles, and the verifier sometimes rejected perfectly good work because the verification prompts were too strict.
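The shape of a gate is simple to sketch. A minimal version, assuming hypothetical plan/execute/emit callables and verifier functions backed by model prompts (none of these names are OpenClaw's real API):

def run_gated(task, plan, execute, emit, verify_plan, verify_output):
    # Gate 1: don't start executing until the plan is judged coherent.
    p = plan(task)
    if not verify_plan(task, p):
        raise ValueError("plan rejected: incoherent or off-spec")
    # Gate 2: don't emit output until it's checked against the spec.
    out = execute(p)
    if not verify_output(task, out):
        raise ValueError("output rejected: does not match spec")
    return emit(out)

Every extra gate is another model call, which is exactly where the slowness came from.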
v4: Parallel Execution
Realized that many subtasks are independent. If you're building three components, they can run in parallel. Added a dependency graph to the planning stage. Independent tasks fan out, dependent tasks wait. Throughput improved dramatically. But debugging got harder. When a parallel task failed, tracing the cause back through concurrent execution logs was painful.
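The scheduling idea fits in a few lines. Here is a simplified wave-based sketch, assuming each subtask is a hypothetical run(name, payload, inputs) call; a production scheduler would stream tasks as their dependencies finish rather than in rounds:

from concurrent.futures import ThreadPoolExecutor

def run_graph(tasks, deps, run):
    """tasks: {name: payload}; deps: {name: set of prerequisite names}."""
    done = {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tasks):
            # Fan out every task whose prerequisites have all completed.
            ready = [n for n in tasks
                     if n not in done and deps.get(n, set()) <= set(done)]
            if not ready:
                raise ValueError("cycle in dependency graph")
            futures = {
                n: pool.submit(run, n, tasks[n],
                               {d: done[d] for d in deps.get(n, ())})
                for n in ready
            }
            # Dependent tasks wait here until the whole wave finishes.
            for n, f in futures.items():
                done[n] = f.result()
    return done

Independent components fan out within a wave; anything downstream waits. That's the throughput win, and the interleaved logs from a wave are the debugging pain.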
v5: Prompt Engineering
Took a step back and focused on prompt clarity. Every stage got explicit contracts: what it receives, what it must produce, what constitutes success and failure. Clearer prompts reduced the need for aggressive verification. Failure rates dropped. But the system still forgot everything between sessions. Every run started from scratch. No learning.
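A contract can be as small as a frozen record plus a predicate checked at each handoff. A sketch with made-up field names, not OpenClaw's actual schema:

from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class StageContract:
    name: str
    receives: str                   # what the stage is given
    produces: str                   # what it must hand off
    accepts: Callable[[Any], bool]  # does the output satisfy the contract?

def handoff(contract: StageContract, output: Any) -> Any:
    # Failures are caught at the boundary, so they stay localized to one stage.
    if not contract.accepts(output):
        raise ValueError(f"stage '{contract.name}' violated its contract")
    return output

plan_contract = StageContract(
    name="plan",
    receives="a task description in plain language",
    produces="an ordered list of subtask dicts with 'depends_on' keys",
    accepts=lambda plan: bool(plan) and all("depends_on" in s for s in plan),
)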
v6: Pipeline-First with ACT-R Memory
The current version. The pipeline architecture from v4-v5, but with ACT-R cognitive memory underneath. The system remembers what worked in previous runs. Productions accumulate utility scores. Declarative memory holds past decisions, patterns, and outcomes. When the system encounters a similar task, it doesn't start from zero. It starts from experience.
$ openclaw --version-history
OpenClaw Version History
========================
v1.0 [2025-08] Single agent, direct execution
Result: brittle, no recovery
v2.0 [2025-10] Staged pipeline (plan/execute/output)
Result: better structure, poor handoffs
v3.0 [2025-12] Verification gates between stages
Result: reliable but slow
v4.0 [2026-01] Parallel execution with dependency graphs
Result: fast but hard to debug
v5.0 [2026-02] Explicit contracts, clearer prompts
Result: lower failure rate, still stateless
v6.0 [2026-03] Pipeline-first + ACT-R memory
Result: persistent learning, adaptive behavior
Current: v6.0 | Memory: ACT-R | Chunks: 14,208 | Productions: 847
The Key Insight
After six iterations, the pattern is clear: the model is not the bottleneck. GPT-4, Claude, Llama — they're all capable enough for most agent tasks. What determines whether an agent system works isn't the model's raw intelligence. It's the infrastructure around the model.
- Structure — How tasks are decomposed and sequenced matters more than how smart the model is at each step.
- Verification — An agent that checks its own work outperforms a smarter agent that doesn't. Verification loops catch errors before they compound.
- Context persistence — A system that remembers its past decisions makes better future decisions. This is what ACT-R provides. Not better reasoning in the moment, but accumulated context that makes each moment start from a better place.
- Clear contracts — When each pipeline stage knows exactly what it receives and what it must produce, failures are localized and recoverable.
Why Verification Loops Matter
The biggest improvement from v2 to v3 wasn't adding more capability. It was adding verification. A verification loop does three things:
- Catches drift before it compounds. An error in stage 1 that propagates to stage 5 is almost impossible to fix. An error caught at stage 1 is trivial.
- Creates a natural retry mechanism. If verification fails, the stage re-executes with the failure information as additional context. The model learns from its own mistake within the same run (sketched in code after this list).
- Provides observable checkpoints. When something goes wrong, you can look at verification results to find exactly where the system diverged from the plan.
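The retry mechanism from the second point is worth making concrete. A minimal sketch, assuming a hypothetical stage(task, context) callable and a verify() that returns a pass/fail flag plus a reason:

def run_with_retries(stage, verify, task, max_attempts=3):
    context = []  # verifier feedback accumulated within this run
    for attempt in range(1, max_attempts + 1):
        output = stage(task, context)
        ok, reason = verify(task, output)
        if ok:
            return output
        # The failure reason becomes extra context for the next attempt,
        # so the model can correct its own mistake within the same run.
        context.append(f"attempt {attempt} rejected: {reason}")
    raise RuntimeError(f"verification failed after {max_attempts} attempts")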
Context Persistence Over Model Capability
This is the counterintuitive finding: upgrading the model often matters less than improving context management. A mediocre model with good context — relevant memories, past decisions, learned patterns — frequently outperforms a frontier model with no context.
This is exactly what ACT-R memory enables. The activation dynamics ensure that relevant context surfaces naturally. Spreading activation means related experiences reinforce each other during retrieval. Utility learning in the production system means behavioral patterns that worked before get preference.
The agent doesn't need to be told what worked last time. It remembers. Not because someone wrote a prompt that says "remember X." Because the cognitive architecture naturally retains and surfaces useful experience.
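"Naturally retains and surfaces" has a precise form in ACT-R. Base-level activation, B_i = ln(sum over past uses of t^-d), rewards chunks used often and recently. A minimal sketch; the decay rate d=0.5 is the textbook default, not necessarily OpenClaw's tuning:

import math

def base_level_activation(use_times, now, d=0.5):
    """B_i = ln(sum over past uses of (now - t)^-d)."""
    lags = [now - t for t in use_times if now > t]
    return math.log(sum(lag ** -d for lag in lags)) if lags else float("-inf")

def activation(use_times, now, spreading=0.0):
    # Total activation: base level plus spreading from the current context.
    # Retrieval surfaces the chunk with the highest total activation.
    return base_level_activation(use_times, now) + spreading

# A chunk reinforced across three runs outranks one touched once, long ago:
recent = activation(use_times=[1.0, 5.0, 9.0], now=10.0)
stale = activation(use_times=[2.0], now=10.0)
assert recent > stale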
What Self-Improvement Actually Looks Like
True self-improvement in an agent system isn't the model rewriting its own weights. It's the accumulation of structured experience that makes future runs more effective. In OpenClaw v6:
- Production rules that led to successful task completion gain utility. They're more likely to fire in similar situations (the update rule is sketched after this list).
- Declarative memory chunks from successful runs stay activated. They surface as relevant context for future runs.
- Failed approaches decay in activation and utility. The system naturally moves away from strategies that didn't work.
- New patterns emerge from the interaction of memory and production rules. The system develops preferences and approaches that weren't explicitly programmed.
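The first and third points are both instances of ACT-R's utility learning rule, U <- U + alpha * (R - U), which nudges a production's utility toward the rewards it actually earns. A sketch with illustrative numbers:

def update_utility(utility, reward, alpha=0.2):
    # Successes pull utility up; failures (low or zero reward) pull it down.
    return utility + alpha * (reward - utility)

u_good, u_bad = 0.0, 0.0
for _ in range(5):
    u_good = update_utility(u_good, reward=10.0)  # strategy keeps working
    u_bad = update_utility(u_bad, reward=0.0)     # strategy keeps failing
# u_good climbs toward 10 while u_bad stays at 0; at conflict resolution,
# the higher-utility production is the one more likely to fire.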
It's not artificial general intelligence. It's a system that gets measurably better at its job over time, through the same mechanisms that make human expertise possible: practice, memory, and learned judgment.
Related
- ACT-R Cognitive Architecture for AI Agents — the memory system that enables learning
- Aegis Falls Architecture — the full system overview
- OpenClaw — the platform these iterations built