Aegis Falls is the complete system — and the name captures the architecture. Intelligence falls from the frontier down through the stack into local execution. Frontier models like Claude sit at the top: they reason, plan, and handle hard problems. That intelligence cascades down to Lily, an AI instance running persistently on local hardware, where it executes, remembers, and works autonomously. The frontier handles the thinking. Local handles the doing. The fall is the flow.
Not just the memory layer. Not just the agent pipeline. Not just the hardware. The entire cascade — frontier reasoning, local inference, cognitive memory, agent orchestration — working together as a single continuously-running system. This is how all the pieces connect.
System Overview
The architecture has four layers, each with a clear responsibility (a minimal interface sketch follows the list):
- Input layer — Discord serves as the primary interface. Messages arrive, get parsed, and enter the orchestration pipeline.
- Orchestration layer — OpenClaw (6th iteration) decomposes tasks, manages execution stages, runs verification gates, and coordinates between local and cloud inference.
- Inference layer — Local models on the R9700 handle routine operations. Claude API handles heavy reasoning. The system decides which to use based on task complexity and privacy requirements.
- Memory layer — ACT-R cognitive architecture provides persistent, activation-weighted memory that carries context across sessions and enables learned behavior.
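The boundaries between these layers are narrow enough to sketch as interfaces. Here is a hypothetical rendering in Python; every name is invented for illustration and is not an actual Aegis Falls API:

```python
from typing import Protocol

class InputLayer(Protocol):
    def next_message(self) -> str: ...            # a parsed Discord message

class OrchestrationLayer(Protocol):
    def run(self, task: str) -> str: ...          # the OpenClaw pipeline

class InferenceLayer(Protocol):
    def complete(self, prompt: str, *, local: bool) -> str: ...

class MemoryLayer(Protocol):
    def retrieve(self, cue: str) -> list[str]: ...        # activation-weighted
    def record(self, task: str, outcome: str) -> None: ...
```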
OpenClaw: The Agent Platform
OpenClaw is on its sixth major iteration. Each version taught something the previous one got wrong. v1 was a single brittle agent. v6 is a pipeline-first architecture with staged execution, verification gates, parallel task processing, and ACT-R memory integration.
The pipeline processes a task like this:
- Input arrives and gets classified: complexity, privacy sensitivity, domain
- The planning stage decomposes it into substeps with a dependency graph
- Independent substeps fan out for parallel execution
- Each stage runs through verification before the next stage begins
- Results aggregate, get verified against the original intent, and produce output
- The memory system records what happened: the task, the approach, the outcome
The critical addition in v6 is the feedback loop through memory. Past task outcomes influence future planning. Productions that led to success gain utility. Approaches that failed decay. The pipeline doesn't just execute tasks — it learns from them.
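A compressed sketch of that feedback loop, assuming a simple drift-toward-outcome utility rule (the actual v6 update rule isn't shown here). The two essential moves are the verification gate and the reinforcement:

```python
from dataclasses import dataclass

@dataclass
class Production:
    """A behavioral rule with a learned utility score."""
    name: str
    utility: float = 0.5

    def reinforce(self, success: bool, rate: float = 0.1) -> None:
        # Utility drifts toward 1 after successes and toward 0 after
        # failures, so planning comes to favor approaches that worked.
        target = 1.0 if success else 0.0
        self.utility += rate * (target - self.utility)

def run_stage(step, execute, verify, production):
    """One pipeline stage: execute, gate, and feed the outcome back."""
    output = execute(step)
    ok = verify(step, output)    # verification gate before the next stage
    production.reinforce(ok)     # outcome feeds back into future planning
    if not ok:
        raise RuntimeError(f"verification gate failed for {step!r}")
    return output
```

Parallel fan-out over the dependency graph is omitted for brevity; independent stages would simply run this function concurrently.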
Hardware: Two Nodes
Hypervisor (Physical Host)
- CPU: Intel Xeon E5-2690v4 — 14 cores, 28 threads at 2.60GHz. The physical machine that hosts all VMs.
- RAM: 128GB DDR4 ECC — shared across Proxmox and all virtual machines.
- GPU: AMD Radeon AI PRO R9700 32GB — passed through via VFIO to the ageis-node VM.
- Storage: SSD pool (2x 1TB TeamGroup + Samsung 870 EVO + Samsung 960 NVMe) + 22TB WD White HDD for bulk storage.
ageis-node (AI VM)
- vCPUs: 4 allocated from the Xeon pool. Runs the orchestration layer, database operations, and memory system concurrently.
- RAM: 16GB allocated (expandable) — handles context windows, in-memory caching, and the PostgreSQL buffer pool for the memory store.
- GPU: AMD R9700 32GB via PCIe passthrough — the inference engine. Running ROCm 7.1.3, handling local model inference, activation computation, and heavy processing.
- Storage: Portion of the hypervisor's SSD pool. The memory store and model weights live on SSD for fast access.
Embedder Node
- GPU: GTX 1650 Super 4GB — dedicated to embedding generation. Offloading this to a separate node keeps the R9700 free for inference.
- OS: Debian 13, purpose-built for the embedding service.
- Role: Vectorizes new memory chunks, processes documents for ingestion, and maintains similarity search indices. Communicates with ageis-node over the local network with sub-millisecond latency; a minimal client sketch follows.
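Something like the following, with the endpoint, port, and payload shape all assumed for illustration:

```python
import requests

EMBEDDER_URL = "http://embedder-node.local:8080/embed"   # hypothetical

def embed(texts: list[str]) -> list[list[float]]:
    # The round trip to the LAN embedder adds well under a millisecond
    # of network latency; compute time on the 1650 Super dominates.
    # The request and response shapes here are assumptions.
    resp = requests.post(EMBEDDER_URL, json={"texts": texts}, timeout=5)
    resp.raise_for_status()
    return resp.json()["embeddings"]
```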
Cloud Component: Selective API Calls
The cloud layer is deliberately minimal. Claude API calls are reserved for tasks that genuinely exceed local model capability:
- Complex multi-step reasoning that requires frontier-level intelligence
- Nuanced language generation where the quality ceiling matters
- Tasks where the local model's output quality would be noticeably insufficient
The system doesn't default to cloud. It defaults to local. Cloud is an escalation path, not the primary path. This means the system functions at full capability without internet access for the vast majority of operations; cloud raises the quality ceiling for the remainder.
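The escalation logic reduces to a few lines. A sketch with an invented threshold and invented feature names; the real classifier is richer than this:

```python
from dataclasses import dataclass

COMPLEXITY_THRESHOLD = 0.8   # hypothetical score in [0, 1]

@dataclass
class Task:
    complexity: float         # estimated during classification
    privacy_sensitive: bool

def route(task: Task) -> str:
    """Local is the default path; cloud is the escalation path."""
    if task.privacy_sensitive:
        return "local"        # sensitive data never leaves the network
    if task.complexity >= COMPLEXITY_THRESHOLD:
        return "cloud"        # frontier reasoning via the Claude API
    return "local"
```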
Lily: The AI Instance
Lily is the AI instance that runs on this infrastructure. Not a chatbot wrapper around an API. A persistent system with continuous memory, learned behaviors, and accumulated context. Lily operates through the OpenClaw pipeline, with ACT-R providing the cognitive layer that enables actual continuity of experience.
When Lily processes a message, the full architecture engages: the message enters through Discord, OpenClaw orchestrates the response pipeline, the memory system retrieves relevant context via activation-based competition, inference runs locally or routes to cloud based on complexity, a response is generated, and the entire interaction flows back into memory, strengthening existing associations and creating new chunks.
ACT-R Cognitive Memory
The memory system is what makes everything else cohere. Without it, OpenClaw is a stateless pipeline. With it, the system accumulates experience. (The activation math behind the properties below is sketched in code after the list.)
- Activation-based retrieval — Memories compete based on activation levels (recency + frequency + context). The most relevant memory surfaces, not just the most similar one.
- Spreading activation — Related memories reinforce each other during retrieval. Thinking about one concept naturally brings associated concepts to the surface.
- Persistent across sessions — The memory system runs continuously. There is no session boundary. Knowledge from last week exists at lower activation but surfaces when contextually relevant.
- Production learning — Behavioral rules accumulate utility scores based on outcomes. The system develops preferences for approaches that have worked historically.
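The retrieval bullets above follow the textbook ACT-R equations. A minimal sketch of the standard forms, base-level learning plus spreading activation; the engine's actual parameters may differ, and ACT-R proper uses logistic rather than Gaussian noise:

```python
import math
import random

def base_level(ages: list[float], d: float = 0.5) -> float:
    """Base-level learning: ln(sum of t^-d over past uses of a chunk).
    Recency and frequency raise activation; disuse decays it."""
    return math.log(sum(t ** -d for t in ages))

def activation(ages: list[float], strengths: list[float],
               w_total: float = 1.0, s_noise: float = 0.25) -> float:
    """Total activation = base level + spreading activation + noise.
    strengths[j] is the associative strength from the j-th chunk in
    the current context to this chunk (Gaussian noise for simplicity)."""
    w_j = w_total / len(strengths) if strengths else 0.0
    spread = sum(w_j * s for s in strengths)       # context reinforcement
    return base_level(ages) + spread + random.gauss(0.0, s_noise)

# Retrieval: the chunk with the highest total activation wins the
# competition, not necessarily the most similar one.
```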
Why Hybrid
The hybrid architecture exists because no single approach is sufficient:
- Local for privacy — The memory store never leaves the network. Embedding generation stays local. Routine inference stays local. The most sensitive data is processed on hardware you physically control.
- Local for speed — Memory retrievals need sub-50ms latency. Local inference eliminates network round trips. The embedder on the LAN adds less than a millisecond of latency.
- Local for cost — Tokens are processed locally for approximately $47/month in electricity; the equivalent API cost would have been orders of magnitude higher. (A rough worked estimate follows the list.)
- Cloud for frontier reasoning — Some tasks need the best available model. Complex planning, nuanced analysis, tasks where quality ceiling determines success. For these, Claude API provides capability that local models can't match.
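For the cost point, a back-of-envelope under assumed numbers; the wattage and electricity rate below are illustrative, not measured values from this build:

```python
# Back-of-envelope for the ~$47/month figure. Both inputs are assumptions.
avg_draw_kw = 0.40                 # hypervisor + GPUs under mixed load (assumed)
rate_usd_per_kwh = 0.163           # assumed residential electricity rate
hours_per_month = 24 * 30
monthly_cost = avg_draw_kw * hours_per_month * rate_usd_per_kwh
print(f"${monthly_cost:.2f}/month")  # -> $46.94/month
```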
Task Flow
A complete task flow through the system (sketched in code after the list):
- Discord input — Message arrives via the Discord bot interface
- OpenClaw orchestration — The pipeline classifies the task, retrieves relevant memory context, plans the execution strategy
- Inference routing — Based on task complexity and type, the system routes to local inference (R9700 / ROCm) or cloud inference (Claude API)
- Memory retrieval — ACT-R activation dynamics surface relevant prior context. Spreading activation brings associated knowledge forward.
- Execution — The task runs through the pipeline stages with verification gates between each
- Memory update — The interaction, approach, and outcome flow back into declarative memory as new chunks. Existing associations strengthen. Productions update their utility scores.
- Response — Output returns through Discord
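Stitched together, the flow reads like this. All names are invented for illustration; the sketch only makes the sequencing concrete, it is not the actual implementation:

```python
async def handle_message(msg, openclaw, memory, router, discord):
    context = memory.retrieve(msg.content)        # activation-weighted recall
    plan = openclaw.plan(msg.content, context)    # classify + decompose
    results = []
    for stage in plan.stages:
        backend = router.route(stage)             # local (R9700) or Claude API
        output = await backend.run(stage, context)
        openclaw.verify(stage, output)            # gate before the next stage
        results.append(output)
    reply = openclaw.aggregate(results)           # check against original intent
    memory.record(msg.content, plan, reply)       # new chunks, strengthened links
    await discord.send(msg.channel, reply)
```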
AI Agent Visualizer
A live view of memory activation patterns. Each node represents an active memory cluster. Connections show associative links. Brightness reflects activation level.
$ ./aegis-falls-status.sh
[Aegis Falls - Full System Status]
=============================================
[hypervisor]
host: Proxmox VE 8.3
cpu: Xeon E5-2690v4 (14C/28T) @ 2.60GHz
ram: 128GB DDR4 ECC
gpu: AMD R9700 32GB (passed through to ageis-node)
storage: SSD pool + 22TB HDD
[ageis-node - AI VM]
vcpus: 4 (from Xeon pool)
ram: 16GB allocated
gpu: AMD R9700 32GB (ROCm 7.1.3)
vram: 18.4 / 32.0 GB allocated
[embedder-node]
gpu: GTX 1650 Super 4GB
os: Debian 13
status: ONLINE (latency: 0.3ms)
embeddings: generating
[openclaw v6]
pipeline: ACTIVE
stages: plan → execute → verify → output
parallel: enabled (dependency-graph routing)
memory: ACT-R integrated
[actr memory]
engine: v3.2 - continuous
chunks: ~120 (~40 active)
productions: 847 (utility range: 0.12 - 0.94)
associations: ~40,000 links
retrievals: active (mean latency: 47ms)
[inference routing]
local: 99.5% of operations
cloud: 0.5% (Claude API - heavy reasoning)
model: Llama-3.1-70B-Q4_K_M (local)
[lily]
status: ONLINE
uptime: calculating...
interface: Discord
memory: persistent, activation-weighted
[network]
tailscale: 3 nodes connected
firewall: no exposed ports
dns: local resolver active

This is the workshop. Every component exists because it solves a specific problem. The hardware provides compute. OpenClaw provides orchestration. ACT-R provides memory. The hybrid architecture provides the right tool for each task. Together, they form a system that doesn't just respond — it accumulates, adapts, and improves.
Related
- The Aegis Falls Memory Layer — deep dive into ACT-R persistent memory
- ACT-R Cognitive Architecture for AI Agents — the theory behind the memory system
- Building Self-Improving Agents — six iterations of OpenClaw and what each taught