This isn't a shopping list. It's a document about why decisions were made, not what to buy. Every component and configuration choice exists because a specific problem demanded it. The homelab evolved through years of trial, failure, and iteration — and the current architecture reflects lessons that only come from running systems 24/7 and fixing them when they break at inconvenient times.
Your development environment shapes what you can build. If you want to run AI agent systems locally, your infrastructure needs to support GPU isolation, persistent services, fast storage, and the kind of reliability that lets you trust a system to run unattended for weeks. Generic homelab guides don't address these requirements. This one does.
Why Proxmox over Bare Metal
The first decision was virtualization strategy. The choice shaped everything that followed.
Bare metal gives you maximum performance and zero isolation. When your AI agent pipeline crashes — and it will crash during development — it takes down your development environment, your databases, your monitoring, everything. One bad experiment with a model loader and you're rebuilding from scratch. For a system that runs experimental code continuously, that's not acceptable. I learned this the hard way before Proxmox, when a runaway process filled the root partition and took down every service on the machine.
Docker-only gives good application isolation but poor hardware isolation. GPU passthrough in Docker is possible but fragile — the NVIDIA container runtime works adequately, but AMD/ROCm in containers is a different story. Container networking gets complicated fast when you need inter-service communication with low latency. And when you need a full VM for something (like testing a different kernel or running a network appliance), containers can't help.
Proxmox VE pairs KVM-based virtualization with LXC containers. Full hardware isolation where it matters (GPU workloads, untrusted code), lightweight containers where it doesn't (web services, monitoring, DNS). A web UI for management. ZFS integration for storage. Snapshots and backups built in.
Proxmox won because AI development needs both: full VM isolation for GPU workloads and lightweight containers for services. The overhead is minimal — KVM on modern hardware adds negligible latency to compute workloads. And the flexibility is worth it: I can snapshot a working state before running something experimental, and roll back in seconds if it goes wrong. That capability alone has saved hours of rebuilding.
The killer feature for AI work is snapshots. Before updating ROCm, snapshot. Before changing model loading parameters, snapshot. Before trying a new kernel that might break IOMMU groups, snapshot. If it works, delete the snapshot. If it doesn't, roll back. The ability to experiment without risk changes how aggressively you can iterate.
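On Proxmox that whole workflow is three commands with the stock `qm` tool; the VM ID and snapshot name below are placeholders:

```bash
# Snapshot VM 100 before a risky change (ID and name are illustrative)
qm snapshot 100 pre-rocm-update

# If the change breaks something, roll back to the saved state
qm rollback 100 pre-rocm-update

# If everything works, drop the snapshot
qm delsnapshot 100 pre-rocm-update
```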
The GPU Passthrough Journey
GPU passthrough is the most technically demanding part of this setup, and the part where most homelab AI builds fail. Getting it wrong means either no GPU acceleration or shared, degraded GPU access. Getting it right means the VM has bare-metal GPU performance.
IOMMU Groups
IOMMU (Input-Output Memory Management Unit) groups define which PCI devices can be isolated from the host. The GPU needs to be in its own IOMMU group, or at least in a group where every device can be passed through together. This is a hardware and BIOS constraint, not a software one. If your motherboard puts the GPU in an IOMMU group with other devices you can't pass through, you're stuck.
The Xeon E5-2690v4 platform has clean IOMMU grouping. The R9700 lands in its own group. Both the video and audio functions of the GPU pass through together. This wasn't luck — I checked IOMMU group layouts before buying the hardware. If you're building for passthrough, check this first. Everything else depends on it.
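Checking takes a few lines of shell on any running Linux host with IOMMU enabled:

```bash
#!/bin/bash
# Print every IOMMU group and the PCI devices it contains.
# The GPU's video and audio functions should share a group with
# nothing else you can't pass through.
shopt -s nullglob
for group in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${group##*/}:"
    for device in "$group"/devices/*; do
        echo -e "\t$(lspci -nns "${device##*/}")"
    done
done
```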
VFIO-PCI Binding
The host kernel must not claim the GPU. If the host loads a GPU driver (amdgpu, radeon), the device is bound and can't be passed through. The solution is vfio-pci: a stub driver that claims the device at boot and holds it for passthrough.
Configuration is in the kernel command line and modprobe configs. The GPU's PCI IDs get added to vfio-pci's device list. At boot, vfio-pci claims the GPU before any display driver can. The host never sees it as a display device. The VM gets exclusive access.
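On a GRUB-booted Intel host the pieces look roughly like this; a sketch, with `1002:xxxx`/`1002:yyyy` as placeholders for the GPU's actual video and audio function IDs from `lspci -nn`:

```bash
# /etc/default/grub -- enable IOMMU on an Intel platform
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

# /etc/modprobe.d/vfio.conf -- claim both GPU functions at boot
# (replace the placeholder IDs with your GPU's real ones)
options vfio-pci ids=1002:xxxx,1002:yyyy
# Make sure vfio-pci wins the race against the display driver
softdep amdgpu pre: vfio-pci

# Apply the changes and reboot
update-grub && update-initramfs -u -k all
```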
This means the host has no display output from that GPU. For a headless server, that's fine. For a workstation that also serves as a desktop, it's a tradeoff. In this setup, the host is headless — managed entirely through the Proxmox web UI and SSH. The GPU exists solely for compute.
ROCm Under KVM
ROCm 7.1.3 runs inside the VM with full device access. The VM sees the R9700 as if it were a bare-metal install. Ollama loads models, PyTorch runs inference, and the ACT-R memory system does activation computation — all at native performance.
The ROCm setup inside the VM is standard: install the kernel module, the runtime libraries, and the user-space tools. Because the passthrough gives true hardware access, there's no virtualization-specific configuration needed on the ROCm side. It just works as if the GPU were physically installed in the VM.
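Verification from inside the VM is the same as on bare metal. A quick sanity check (note that ROCm builds of PyTorch expose the GPU through the `torch.cuda` API surface):

```bash
# Confirm ROCm sees the passed-through GPU
rocm-smi --showproductname

# ROCm PyTorch reuses the CUDA API, so this should print True
python3 -c "import torch; print(torch.cuda.is_available())"
```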
Why Two Nodes
The system has two compute nodes, and the split is deliberate.
Hypervisor — Physical Host
- CPU: Intel Xeon E5-2690v4 — 14 cores, 28 threads at 2.60GHz. The physical machine hosting all VMs and containers.
- RAM: 128GB DDR4 ECC — shared across Proxmox and all virtual machines.
- GPU: AMD Radeon AI PRO R9700 32GB — passed through via VFIO to the ageis-node VM.
- Storage: SSD pool (2x 1TB TeamGroup + Samsung 870 EVO + Samsung 960 NVMe) + 22TB WD White HDD.
- Role: Runs Proxmox with KVM VMs and LXC containers. This is the physical foundation.
ageis-node — AI VM
- vCPUs: 4 allocated from the Xeon pool.
- RAM: 16GB allocated (expandable).
- GPU: AMD R9700 32GB via PCIe passthrough — the inference engine. Handles local model inference, activation computation, and heavy processing.
- Role: Runs the OpenClaw pipeline, ACT-R memory system, Ollama, and PostgreSQL. This is where the AI work happens.
Embedder Node
- GPU: GTX 1650 Super 4GB — dedicated to embedding generation and TTS.
- OS: Debian 13, purpose-built for the embedding service.
- Role: Vectorizes new memory chunks, processes documents for ingestion, maintains similarity search indices, handles text-to-speech generation.
The split exists to keep VRAM free on ageis-node. The R9700's 32GB is loaded with inference models — a 70B quantized model occupies most of that capacity. If embeddings ran on the same GPU, either the inference model gets smaller (reducing quality) or embedding generation competes for VRAM (reducing reliability). A separate node with a dedicated smaller GPU handles embeddings without touching the inference budget.
The GTX 1650 Super is plenty for embedding workloads. Embedding models are small. TTS models are small. The 4GB of VRAM handles both comfortably. The embedder node communicates with ageis-node over the LAN with sub-millisecond latency. From the pipeline's perspective, embeddings just appear — the network hop is invisible.
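As an illustration of how small that contract is, a request might look like the following; the hostname, port, route, and payload shape here are assumptions, not the actual embedding-server API:

```bash
# Hypothetical call to the embedder node's HTTP endpoint
curl -s http://embedder.lan:8080/embed \
    -H 'Content-Type: application/json' \
    -d '{"text": "memory chunk to vectorize"}'
```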
Storage Architecture
Storage is split by access pattern, not by data type:
- SSD pool — Active workloads. Model weights currently in use. PostgreSQL databases including the ACT-R memory store. Running project files. Anything where I/O latency matters. The memory system lives here because retrieval speed directly affects agent response time — a slow disk means slow memory retrieval means slow agent responses. At 342 retrievals per hour, every millisecond of I/O latency compounds.
- HDD pool (22TB) — Archives, datasets, model checkpoints, backups, media. High capacity, acceptable latency. Training data that gets loaded once per run. Old project snapshots. Proxmox backup targets. Model versions you might want to roll back to.
The separation of concerns is strict. The SSD pool never fills up with old data because archives tier down to HDD. The HDD pool provides cheap capacity for everything that doesn't need speed. Active models move to SSD when loaded, back to HDD when archived. The pipeline doesn't care where storage is — mount points abstract the underlying hardware.
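A sketch of what that looks like in practice; the mount points and dataset names below are illustrative, not the actual layout:

```bash
# Assumed layout:
#   /ssd-pool/models    model weights currently in use
#   /ssd-pool/pgdata    PostgreSQL, including the ACT-R memory store
#   /hdd-pool/archive   checkpoints, datasets, backups

# Tiering down is just a move between mount points;
# the pipeline only ever sees the paths, never the hardware
mv /ssd-pool/models/old-70b.gguf /hdd-pool/archive/models/
```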
Network Design
The network is deliberately simple. Complexity in networking creates failure modes that are hard to debug at 2am when your agent pipeline stops responding.
LAN
Standard gigabit between nodes. The main compute node and the embedder communicate over the local network. Low latency, high bandwidth, no internet dependency. The embedding service is just an HTTP endpoint on the LAN. The pipeline calls it like any other local service.
Tailscale
Mesh VPN for remote access. No port forwarding, no exposed services, no dynamic DNS. Tailscale creates a WireGuard tunnel that just works. I can SSH into any node from anywhere — phone, laptop, another network — without punching holes in the firewall. The security model is simple: nothing is exposed to the internet. All remote access goes through Tailscale's authenticated tunnel.
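Setup is correspondingly small. The standard Tailscale CLI flow, with a placeholder username:

```bash
# Join the tailnet and verify connectivity
sudo tailscale up
tailscale status              # lists connected nodes and their tailnet IPs

# With MagicDNS enabled, node hostnames resolve over the tunnel
ssh user@ageis-node
```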
What's Not Here
No Kubernetes. No service mesh. No container orchestration platform. The system has two nodes and a handful of services. The right tool for that scale is SSH and systemd, not a cluster management framework designed for hundreds of nodes. Adding Kubernetes to a two-node homelab is resume-driven development. Systemd unit files and a few shell scripts manage everything reliably.
Lessons from 24/7 Operation
Running systems continuously teaches things that weekend projects don't. Here's what actually matters when hardware runs unattended for weeks.
Thermal Management
The Xeon under sustained load generates serious heat. The R9700 under inference load adds more. In a server closet or a warm room, thermals become the primary failure mode. Fans wear out. Thermal paste dries. Ambient temperature creep during summer pushes components closer to throttling thresholds. Monitoring thermal trends over weeks matters more than peak temperatures during benchmarks. I've added temperature alerting to the monitoring stack because a gradual thermal increase is the earliest warning sign that something needs attention.
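A minimal version of that alerting, assuming lm-sensors is installed; the threshold and the alert channel are placeholders:

```bash
#!/bin/bash
# Cron-friendly thermal check -- 80C and the logger call are illustrative
temp=$(sensors -u 2>/dev/null | awk '/temp1_input/ {print int($2); exit}')
if [ -n "$temp" ] && [ "$temp" -gt 80 ]; then
    logger -p user.warning "thermal: CPU package at ${temp}C"
fi
```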
What Actually Breaks
In order of frequency: drives, fans, and network connections. Drives give SMART warnings before they die — if you're monitoring. Fans get louder before they stop — if you're listening. Network connections drop when a cable gets bumped or a switch reboots — and the pipeline fails in confusing ways because "network timeout" looks different from every service's perspective. The most important investment isn't better hardware. It's monitoring and alerting that tells you something is degrading before it fails.
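The "if you're monitoring" part can be as small as a cron job. A minimal sketch using smartmontools (run as root; device globs may need adjusting for your drive layout):

```bash
#!/bin/bash
# Quick SMART health pass over SATA and NVMe drives
for dev in /dev/sd? /dev/nvme?n1; do
    echo "== $dev"
    smartctl -H "$dev" | grep -i 'overall-health'
done
```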
Power and Restarts
Power outages happen. UPS buys time but not immunity. Every service needs to start cleanly after an unexpected power loss. The inference VM needs to reload models. The memory system needs to verify store integrity. The pipeline needs to resume or discard in-progress tasks. Getting this right means every service has startup checks, and the boot order matters: storage first, then databases, then inference, then the pipeline. Systemd dependencies handle the ordering. Testing it means pulling the plug intentionally: uncomfortable, but necessary.
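A sketch of what that ordering looks like as a systemd unit; every name here is illustrative, not the actual unit from this setup:

```ini
# /etc/systemd/system/agent-pipeline.service (names are placeholders)
[Unit]
Description=Agent pipeline
# Don't start until the network, database, and inference server are up
After=network-online.target postgresql.service ollama.service
Requires=postgresql.service ollama.service
Wants=network-online.target

[Service]
# Startup check: verify store integrity, discard stale in-progress tasks
ExecStartPre=/usr/local/bin/pipeline-preflight.sh
ExecStart=/usr/local/bin/run-pipeline
Restart=on-failure

[Install]
WantedBy=multi-user.target
```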
Backup Strategy
The memory store is the most valuable data in the system. It represents months of accumulated experience. Losing it means starting from zero. Proxmox's built-in backup runs nightly to the HDD pool. The PostgreSQL database gets WAL-archived continuously. The backup is tested by restoring periodically — a backup you haven't tested is just a file that might contain your data.
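The WAL archiving itself is the stock PostgreSQL mechanism. In postgresql.conf, with the archive path being an assumption:

```bash
# postgresql.conf -- continuous WAL archiving to the HDD pool
archive_mode = on
# Refuse to overwrite an existing segment, then copy the new one
archive_command = 'test ! -f /hdd-pool/wal/%f && cp %p /hdd-pool/wal/%f'
```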
```
$ ./infrastructure-status.sh

[hypervisor]
  host: Proxmox VE 8.3
  cpu: Intel Xeon E5-2690v4 (14C/28T) @ 2.60GHz
  ram: 128GB DDR4 ECC
  gpu: AMD R9700 32GB (passed through to ageis-node)
  storage: SSD pool + 22TB HDD
  thermals: cpu 62C / gpu 71C (sustained load)
  network: 1Gbps LAN + Tailscale mesh

[ageis-node - AI VM]
  vcpus: 4 (from Xeon pool)
  ram: 16GB allocated
  rocm: 7.1.3
  vram: 26.4 / 32.0 GB allocated
  model: Llama-3.1-70B-Q4_K_M (loaded)
  services: ollama, actr-engine, openclaw-pipeline

[embedder-node]
  os: Debian 13
  gpu: GTX 1650 Super 4GB
  services: embedding-server, vector-indexer, tts-engine
  status: ONLINE (latency: 0.3ms)

[storage health]
  ssd pool: healthy (SMART ok, 14% wear)
  hdd pool: healthy (SMART ok)
  backups: nightly to hdd (last: 6h ago, 247GB)

[network]
  tailscale: 3 nodes connected
  dns: local resolver active
  firewall: no exposed ports
```
The homelab isn't a showpiece. It's a workshop. Every component is here because it solves a specific problem in the AI development workflow. Nothing more, nothing less. The architecture will keep evolving as requirements change — but the principles stay the same: own the hardware, understand the stack, build for reliability, and treat every failure as a lesson about what to monitor next.
Related
- Why Local AI Matters — the philosophy behind local-first infrastructure
- Aegis Falls Architecture — the AI system running on this infrastructure
- Homelab Infrastructure — current hardware specs and status