The Aegis Falls AI system has a memory problem that every long-running agent eventually hits: how do you make an AI that actually remembers? Not "retrieval-augmented generation" where you stuff context into a prompt. Real memory. Persistent, activation-weighted, context-sensitive memory that strengthens with use and fades when irrelevant.

This is a workshop log of how the memory layer works and why it's built the way it is.

Persistent Memory via ACT-R

The memory system implements the ACT-R cognitive architecture. Every piece of stored knowledge is a chunk with an activation level. Activation isn't a static score assigned at creation. It's a living value that changes continuously based on two forces:

  • Recency and frequency — Each time a memory is accessed, its base-level activation increases. Access it often and recently, and it stays highly activated. Leave it untouched and its activation decays along a power law. The memory doesn't disappear, but it becomes harder to retrieve. Just like yours.
  • Spreading activation — Memories don't exist in isolation. They're linked by associative connections. When one memory is retrieved, it sends activation energy to related memories through these links. Think about a project, and the tools, decisions, and problems associated with it all become slightly more accessible. This is how context shapes retrieval without explicit filtering.

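The recency-and-frequency force is ACT-R's base-level learning equation: B_i = ln(Σ_j t_j^-d), where t_j is the time since the j-th access and d is the decay rate. A minimal sketch of that computation (the function name and values are illustrative, not the system's actual code):

```python
import math

def base_level_activation(access_times, now, d=0.5):
    """ACT-R base-level learning: B_i = ln(sum_j t_j^-d).
    Each past access contributes t^-d, so frequent and recent accesses
    raise activation, while long gaps let it sink toward the floor."""
    lags = [now - t for t in access_times if now > t]
    if not lags:
        return float("-inf")  # never accessed: effectively unretrievable
    return math.log(sum(lag ** -d for lag in lags))

# A chunk touched three times recently stays hotter than one touched
# once, long ago -- neither disappears, but retrieval odds diverge.
recent = base_level_activation([100, 110, 118], now=120)
stale = base_level_activation([10], now=120)
assert recent > stale
```

Note the shape of the curve: each individual trace decays as a power law in time, and the log of their sum is what the retrieval step sees.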
The result is a memory system that behaves organically. Important, frequently-used knowledge stays sharp. Rarely-needed information sinks to the bottom but can be revived if context pulls it back up. The system self-curates without manual intervention.

Activation Dynamics in Practice

When the system processes a new interaction, the retrieval pipeline does the following:

  • The current context is encoded and placed in working memory buffers
  • Buffer contents send spreading activation to associated chunks in declarative memory
  • Each candidate chunk's activation is computed: base-level + spreading + noise
  • Chunks compete. The highest-activation chunk that exceeds the retrieval threshold wins
  • The retrieved chunk enters the retrieval buffer and influences subsequent reasoning

This means the same query can return different results depending on what else is in working memory. Context matters. A question asked while discussing infrastructure pulls up different memories than the same question asked during a creative writing session. That's not a bug. That's the point.

The Hardware: Local Inference

The memory layer runs on local hardware. The primary inference node carries an AMD Radeon Pro R9700 with 32GB VRAM, running ROCm on a dedicated Proxmox VM. This handles embedding generation, activation computation, and local model inference for fast, private operations.

The choice to run locally isn't ideological purity. It's practical:

  • Latency — Memory retrievals need to be fast. Sub-second. Routing every retrieval through a cloud API adds network latency that compounds in a pipeline with multiple retrieval steps.
  • Privacy — The memory store is the most sensitive component of the system. It accumulates everything. Keeping it on-premises means the data never leaves the network.
  • Cost — The memory system runs continuously. It's not session-based. Hundreds of retrievals per hour, 24/7. Cloud API costs for that volume would be significant.

The Hybrid Approach

Local handles the fast, frequent, private operations: embeddings, memory retrieval, activation computation, lightweight inference. But some tasks need frontier-level reasoning. Complex multi-step analysis. Nuanced language generation. Tasks where the quality ceiling of a local model isn't enough.

For those, the system makes selective calls to the Claude API. This isn't a fallback. It's an intentional architectural decision. The local system handles the vast majority of operations. The cloud handles the fraction that genuinely need frontier capability. The memory layer is what connects them: context retrieved locally feeds into cloud reasoning, and the results flow back into local memory.

Continuous Operation

The memory system doesn't start and stop with sessions. It runs continuously on the aegis-node infrastructure. Activation levels decay in real time. New associations form as information arrives. The production system's utility scores update based on outcomes.

This is fundamentally different from session-based AI. There is no "new conversation" that wipes the slate. The system has a continuous thread of experience, weighted by relevance and recency. Information from last week is still there, just at lower activation. Trigger the right context and it surfaces.
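The outcome-driven utility updates follow ACT-R's difference-learning rule, U ← U + α(R − U): reward above the current estimate pulls utility up, reward below pulls it down. A minimal sketch, with an assumed learning rate:

```python
def update_utility(utility, reward, alpha=0.2):
    """ACT-R utility learning: U <- U + alpha * (R - U).
    Productions that keep paying off drift upward; a failure
    pulls the estimate back toward the observed reward."""
    return utility + alpha * (reward - utility)

u = 0.0
for r in [1, 1, 1, 0]:  # three successes, then one failure
    u = update_utility(u, r)
# u climbs to 0.488 over the successes, then dips to 0.3904
```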

$ python3 memory_status.py

[Aegis Falls Memory Layer]
  engine:          ACT-R v3.2
  status:          ONLINE - continuous
  uptime:          calculating...

[Declarative Memory]
  total chunks:    ~120
  active (>0.5):   ~40
  dormant (<0.1):  decaying naturally
  mean activation: 1.24
  decay rate:      0.031/hr

[Retrieval Stats]
  queries:         active
  mean latency:    47ms
  cache hit rate:  0.73
  threshold:       0.35

[Spreading Activation]
  active sources:  4 buffers
  associations:    ~40,000 links
  mean fan:        3.4

[Hardware]
  inference:       AMD R9700 32GB (ROCm 7.1.3)
  embedder:        embedder-node (GTX 1650 Super)
  database:        PostgreSQL 16 (local)

This is a system that accumulates experience. Every interaction either reinforces existing knowledge or introduces new chunks. Productions that lead to good outcomes strengthen. Those that don't, decay. The memory layer isn't a feature bolted onto an AI agent. It's the foundation that makes persistent, adaptive behavior possible.