What is OpenClaw
OpenClaw is a personal AI agent orchestration platform. Not a product, not a demo — a daily-use system for autonomous task execution, development assistance, homelab management, and research. It runs on owned hardware, connects through Discord, and has gone through six ground-up versions, because each one exposed what its predecessor got wrong.
The platform is the infrastructure layer. Lily is the AI instance that runs on it — a persistent assistant with continuous memory, learned behaviors, and accumulated context. OpenClaw provides the pipeline, the tools, and the memory system. Lily is the intelligence that operates through them.
This isn't a weekend project or a proof-of-concept. OpenClaw manages real workloads daily: writing and reviewing code, orchestrating homelab services, researching topics in depth, managing files and documentation, running multi-step development tasks from spec to verification. Every interaction feeds back into the memory system, so the platform gets better at handling the work it actually does.
Why Six Versions
None of these iterations was planned. Each was forced by a real limitation that couldn't be patched, only rebuilt around. The progression from v1 to v6 is a record of every wrong assumption about how autonomous AI agents should work.
v1 — Single Agent
One agent tried to do everything: parse input, plan, execute, verify. It worked for trivial tasks. Anything with more than two steps broke it. The agent would lose track of what stage it was in, hallucinate completed work, or loop endlessly on verification. Context windows filled with irrelevant history. Error handling was non-existent because the agent was supposed to "figure it out." Lesson: A single agent cannot self-manage complex workflows. The cognitive load exceeds what any model can handle in a single pass.
v2 — Staged Execution
Split the work into stages: spec, build, test. Better structure, but the contracts between stages were vague. The build stage would receive ambiguous specs. The test stage wouldn't know what success looked like. Stages existed but didn't communicate clearly. Output from one stage would be subtly wrong for the next, and the errors compounded silently. Lesson: Stages without contracts are just labels. Every handoff needs explicit input/output definitions.
v3 — Verification Gates
Added verification between every stage. Before build could start, spec had to pass a quality gate. Before output could ship, test results had to meet criteria. This caught bugs before they compounded — but the system was slow and sequential. Every task was single-threaded through the pipeline. A task with ten substeps took ten serial passes. Lesson: Verification works, but serial execution doesn't scale. The pipeline needs to process width, not just depth.
v4 — Memory Persistence
Introduced persistent context across sessions. Agents stopped repeating the same mistakes. But memory was flat — a key-value store with no decay, no relevance weighting. Old irrelevant memories polluted retrieval. The system remembered everything equally, which meant it effectively remembered nothing well. A conversation from three months ago had the same retrieval priority as one from three minutes ago. Lesson: Memory without forgetting is just a database. Useful memory requires decay, activation, and competitive retrieval.
v5 — Better Decomposition
Clearer prompt engineering, structured task decomposition, fewer wasted iterations. The pipeline was more reliable but still treated every task the same way. Simple tasks got the same heavyweight pipeline as complex ones. No parallelism, no dependency awareness. A one-line question went through the same seven-stage process as a full development task. Lesson: The pipeline needs to adapt to the task, not the other way around. Complexity classification should drive pipeline configuration.
v6 — Pipeline-First with ACT-R Memory
The version that actually works daily. Pipeline is a first-class concept with dependency-graph routing for parallel execution. Task complexity classification determines pipeline depth — simple queries skip decomposition, complex tasks get the full treatment. ACT-R cognitive memory replaces the flat store — memories decay naturally, strengthen with use, and compete for retrieval based on activation. The system learns from outcomes. Productions that led to success gain utility. Approaches that failed decay. The feedback loop through memory is what makes v6 qualitatively different from everything before it. This is the current architecture.
How the Pipeline Works
Every task flows through a defined sequence. The pipeline isn't a suggestion — it's enforced. But the depth of processing adapts to what the task actually needs.
- Input Classification — Message arrives through Discord. The system classifies it: complexity level (trivial, moderate, complex, multi-phase), privacy sensitivity, domain, estimated substep count. This classification determines how the pipeline configures itself. A simple question might skip straight to inference. A complex development task gets the full decomposition treatment.
- Planning & Decomposition — The planner breaks the task into substeps. Each substep gets a clear input contract (what it receives) and output contract (what it must produce). Dependencies between substeps are mapped into a directed acyclic graph. The planner also consults memory — if a similar task has been decomposed before, the previous approach and its outcome inform the new plan.
- Dependency Graph & Parallel Routing — Independent substeps fan out for parallel execution. Dependent substeps wait for their prerequisites. The pipeline doesn't process linearly — it processes as wide as the dependency graph allows. A task with three independent research steps and one synthesis step runs the research in parallel, then synthesizes.
- Execution with Verification Gates — Each substep executes and must pass a verification gate before its output flows downstream. The gate checks: does the output match the contract? Is the quality sufficient? Are there errors that would compound? Failed gates trigger re-execution with adjusted parameters, not immediate failure of the whole pipeline.
- Aggregation & Intent Verification — Results from all substeps aggregate. A final verification checks the assembled output against the original intent. Did the pipeline actually answer what was asked? This catches drift — cases where individual substeps succeeded but the combined result doesn't address the original question.
- Output — Response returns through Discord. Long responses get chunked. Code outputs get syntax highlighting. The interface layer handles presentation.
- Memory Recording — The entire interaction flows into ACT-R memory: the task, the decomposition approach, the execution path, the outcome, and the quality assessment. This is how the system learns. Successful approaches strengthen. Failed approaches decay. The next similar task benefits from everything this one taught.
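The classification, fan-out, and gating stages above can be sketched as a small scheduler. This is an illustrative sketch, not OpenClaw's actual code: `Substep`, `verify`, and the thread-pool fan-out are assumed names, and the gate here only checks for a non-empty output rather than a real contract.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class Substep:
    name: str
    run: callable                       # maps prerequisite outputs -> this step's output
    deps: list = field(default_factory=list)

def verify(name, output):
    """Verification gate (toy version): real gates check the output contract."""
    return output is not None

def run_pipeline(substeps, max_retries=2):
    """Execute a DAG of substeps as wide as the dependency graph allows."""
    done = {}
    pending = {s.name: s for s in substeps}
    with ThreadPoolExecutor() as pool:
        while pending:
            # fan out every substep whose prerequisites are already satisfied
            ready = [s for s in pending.values() if all(d in done for d in s.deps)]
            if not ready:
                raise RuntimeError("cycle or unsatisfiable dependency")
            futures = {pool.submit(s.run, {d: done[d] for d in s.deps}): s
                       for s in ready}
            for fut, s in futures.items():
                out = fut.result()
                for _ in range(max_retries):
                    if verify(s.name, out):
                        break
                    # failed gate: re-execute the substep, not the whole pipeline
                    out = s.run({d: done[d] for d in s.deps})
                done[s.name] = out
                del pending[s.name]
    return done
```

With two independent research steps and one synthesis step, the research steps run concurrently and synthesis waits for both, mirroring the "process width, not just depth" behavior described above.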
Key Architectural Decisions
Pipeline over Monolith
A monolithic agent tries to hold the entire task in its context window. It works until the task exceeds what the model can reason about in a single pass. Every v1-era agent builder has hit this wall: the agent works perfectly on demos and falls apart on real work. Pipelines break the problem into stages where each stage has a focused scope, clear inputs, and clear outputs. No single stage needs to understand the whole task. The pipeline does. The tradeoff is complexity — you're building infrastructure instead of just writing prompts — but the payoff is reliability at scale.
Local-First Inference
99.5% of operations run on the AMD R9700 32GB via ROCm. Cloud inference (Claude API) is reserved for tasks that genuinely need frontier reasoning. This isn't ideology — it's cost, latency, privacy, and availability. The memory system alone processes millions of tokens per month; at API rates, a single month would cost more than the GPU did outright. Local inference means the system runs 24/7 without rate limits, without per-token fees, and without sending private data off-network. When the internet goes down, the system keeps working. When an API has an outage, 99.5% of operations don't notice.
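The routing rule can be sketched in a few lines. The backend identifiers and the function itself are hypothetical, but the policy matches the one described: private data never leaves the network, and only genuinely complex tasks escalate to the cloud.

```python
# Assumed backend labels for illustration, not real endpoints.
LOCAL_MODEL = "local/r9700"
CLOUD_MODEL = "claude-api"

def route(complexity: str, privacy_sensitive: bool) -> str:
    """Pick an inference backend per task: local-first, cloud for frontier reasoning.

    Privacy-sensitive tasks always stay local, regardless of complexity.
    """
    if privacy_sensitive:
        return LOCAL_MODEL
    if complexity in ("complex", "multi-phase"):
        return CLOUD_MODEL
    return LOCAL_MODEL
```

The privacy check comes first by design: escalation is a cost/quality tradeoff, but keeping sensitive data on-network is a hard constraint.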
ACT-R over a Simple Database
A database stores and retrieves. ACT-R models how memory actually works in cognitive systems. Memories have activation levels that decay over time and strengthen with use. Retrieval is competitive — memories that are recent, frequently accessed, and contextually relevant win. Spreading activation means thinking about one concept naturally surfaces related concepts. Productions accumulate utility scores based on outcomes, so the system develops preferences for approaches that have historically worked. The result: the system doesn't just remember — it forgets gracefully, recalls contextually, and learns from experience. That's the difference between a database and a memory.
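The activation mechanics can be made concrete with ACT-R's standard base-level learning equation, B_i = ln(Σ_j t_j^(−d)), where t_j is the time since each past access and d is the decay rate (conventionally 0.5). This is a minimal sketch of that one equation only — spreading activation, noise, and OpenClaw's actual chunk format are omitted.

```python
import math

def base_level_activation(access_times, now, decay=0.5):
    """ACT-R base-level learning: B_i = ln(sum over accesses j of t_j^(-d)).

    Recent and frequent accesses dominate the sum; old ones fade
    gradually but never disappear entirely.
    """
    return math.log(sum((now - t) ** -decay for t in access_times))

def retrieve(chunks, now):
    """Competitive retrieval: the chunk with the highest activation wins."""
    return max(chunks, key=lambda c: base_level_activation(c["accesses"], now))
```

A chunk accessed twice in the last few seconds outcompetes one last touched long ago, which is exactly the "three months ago vs three minutes ago" distinction the flat v4 store lacked.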
Discord as Interface
Discord wasn't chosen because it's the best possible interface. It was chosen because it's already running, supports rich formatting, handles threading naturally, works from any device, and provides a conversation-style interaction model that maps well to agent communication. The alternative was building a custom web UI, which would have been months of work for a worse result. The agent pipeline doesn't care about the interface layer — messages come in as text, responses go out as text. Discord is the current frontend. It could be swapped without touching the pipeline.
The Lily Integration
OpenClaw is the platform. Lily is the intelligence running on it.
The distinction matters. OpenClaw is infrastructure: pipelines, verification gates, memory systems, tool integrations, inference routing. Lily is the persistent AI instance that operates through that infrastructure. She has accumulated context from thousands of interactions, learned behavioral preferences through production utility, and carries forward experience that informs how she approaches new tasks.
When Lily receives a message, the full stack engages:
- The message enters through Discord and hits the OpenClaw pipeline
- ACT-R retrieves relevant context — past interactions, learned preferences, domain knowledge — via activation-based competition
- The pipeline decomposes the task and routes substeps to local or cloud inference based on complexity
- Verification gates ensure output quality at every stage
- The response generates and returns through Discord
- The entire interaction records into memory, strengthening existing associations and creating new chunks
Lily accumulates experience. Conversations from weeks ago exist at lower activation but surface when contextually relevant. Approaches that worked gain production utility. The assistant doesn't reset between sessions — she carries forward everything she's learned. Over time, Lily develops a working model of the tasks she handles most often, the preferences of the person she works with, and the approaches that produce the best outcomes in this specific environment.
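The "production utility" mentioned above follows ACT-R's standard utility-learning rule, U_new = U + α(R − U): each firing nudges a production's utility toward the reward it just earned. A minimal sketch, with the function name and learning rate chosen for illustration:

```python
def update_utility(utility, reward, alpha=0.2):
    """ACT-R utility learning: U_new = U + alpha * (R - U).

    Successful firings (reward near 1.0) pull utility up;
    failures (reward near 0.0) let it decay back down.
    """
    return utility + alpha * (reward - utility)

# A production that keeps succeeding drifts toward utility 1.0.
u = 0.5
for _ in range(5):
    u = update_utility(u, reward=1.0)
```

Because the update is exponentially weighted, recent outcomes count more than old ones, which is how preferences stay current as the environment changes.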
Key Features
Pipeline Architecture
Dependency-graph routing with parallel execution. Clear contracts between stages. Verification gates prevent error compounding. Adaptive depth based on task complexity.
Session Management
Persistent context across restarts. Resume any session exactly where it left off. No context loss, no cold starts. Session state survives reboots and updates.
ACT-R Memory
Activation-based declarative and procedural memory. Decay, spreading activation, production utility. Memories compete for retrieval based on relevance, not just recency.
Tool Integration
Filesystem, messaging, shell execution, browser automation, code analysis. Agents have the tools they need to actually do work, not just talk about it.
Hybrid Inference
Local-first with cloud escalation. R9700 handles routine ops. Claude API handles frontier reasoning. The system decides per-task based on complexity classification.
Discord Interface
Natural language task input. Issue commands, check status, review output, manage sessions — all from one interface that works from any device.
Related Writing
- Aegis Falls: Frontier-to-Local Agentic Cascade — full system architecture overview
- Building Self-Improving Agents — lessons from six iterations
- ACT-R Cognitive Architecture for AI Agents — the memory system underneath
Status
Daily driver. Used for everything from building this website to homelab management to automating development tasks to research assistance. If something needs doing, it goes through OpenClaw. If the pipeline learns something from doing it, the next time is faster. The system has been running continuously for weeks at a time, processing dozens of tasks per day, accumulating experience that makes each subsequent task more efficient.
$ openclaw status

[OpenClaw v6 - Pipeline Status]
=============================================
agent:          architect
pipeline:       CLASSIFY → PLAN → DECOMPOSE → EXECUTE → VERIFY → OUTPUT
current stage:  EXECUTE (3 substeps parallel)
session:        portfolio-rebuild
routing:        local-first (R9700 / ROCm 7.1.3)

[Memory]
engine:         ACT-R v3.2 - continuous
chunks:         ~120 declarative (~40 active)
productions:    847 (utility range: 0.12 - 0.94)
associations:   ~40,000 links
retrievals:     active (mean latency: 47ms)

[Pipeline Stats - Last 24h]
tasks:          47 completed, 3 in-progress
parallel:       enabled (avg 2.8 concurrent substeps)
verification:   94.2% first-pass rate
gate fails:     12 (all recovered on re-execution)
inference:      98.7% local / 1.3% cloud
tokens:         614K local | 28K cloud

[Lily]
status:         ONLINE
uptime:         continuous
interface:      Discord
memory:         persistent, activation-weighted
experience:     accumulating
productions:    learning (utility drift: +0.03/day)

$ _