The Problem

Since I started it up a few weeks ago, my AI agent Pip has loaded a single MEMORY.md file at the start of every conversation. It started small: a few notes about preferences, ongoing projects, key decisions. But knowledge accumulates. By last weekend, that file had ballooned to 5,000+ tokens. Every session, whether we were debugging home lab infrastructure or just checking the weather, Pip loaded the entire history: project details, architectural decisions for ideas that will not see the light of day, blog post ideas.

The cost wasn’t just monetary (though at $3 per million input tokens, it adds up). The bigger problem was cognitive: Pip was drowning in context that rarely mattered for the current task. The extra context also made session compaction happen more often, and when it did, Pip ‘forgot’ what it was working on.

It was like working with a goldfish.

The Solution: An Index + Detail Files

We redesigned memory as a hierarchical system:

  • MEMORY.md becomes a lightweight index (~1.5k tokens)
  • Detail lives in subdirectories: memory/people/, memory/projects/, memory/decisions/
  • Load what you need, when you need it

The Index Structure

👤 People

| Name | Triggers | File |
|---|---|---|
| Andy Bold | andy, human, user | memory/people/andy.md |

📦 Projects

| Project | Triggers | File |
|---|---|---|
| Backstage | backstage, catalog | memory/projects/backstage.md |
| Clawdbot | openclaw, gateway | memory/projects/clawdbot.md |
| PipDroid | android, node, mobile | memory/projects/pipdroid.md |

The index contains trigger words — keywords that signal when to drill down. If the conversation mentions “backstage” or “catalog,” Pip knows to load memory/projects/backstage.md. If not, that 2k token file stays on disk.
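
As a rough illustration of the trigger lookup, here is a minimal Python sketch. The `IndexEntry` class and `files_to_load` function are hypothetical names invented for this example, and whole-word matching is an assumption; the agent's real matching logic isn't shown in this post.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    name: str
    triggers: tuple[str, ...]
    path: str


# Entries copied from the index shown above.
INDEX = [
    IndexEntry("Andy Bold", ("andy", "human", "user"), "memory/people/andy.md"),
    IndexEntry("Backstage", ("backstage", "catalog"), "memory/projects/backstage.md"),
    IndexEntry("Clawdbot", ("openclaw", "gateway"), "memory/projects/clawdbot.md"),
    IndexEntry("PipDroid", ("android", "node", "mobile"), "memory/projects/pipdroid.md"),
]


def files_to_load(message: str, index: list[IndexEntry] = INDEX) -> list[str]:
    """Return detail files whose trigger words appear as whole words in the message."""
    # Lowercase and split on non-alphanumerics so "Backstage," matches "backstage".
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    return [entry.path for entry in index if words & set(entry.triggers)]


print(files_to_load("Can you update the Backstage catalog entry?"))
print(files_to_load("What's the weather like?"))
```

Whole-word matching keeps "node" from firing on "nodes" or "anode", at the price of missing plurals; a stemmer or a semantic search layer can pick up what exact triggers miss.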

Active Context

The index also tracks 2-3 “active context” files — projects or topics that are hot right now. These get loaded automatically:

🔥 Active Context (Always Load)

| File | Why Active |
|---|---|
| memory/projects/cirrus-north.md | New business, active setup |
| memory/projects/backstage.md | Ongoing development |

The Rules

  1. Update the index with every detail file change (same commit, no exceptions)
  2. Keep the index under 3k tokens — archive inactive items
  3. Active Context: Max 2-3 files — rotate based on what’s hot
  4. Max 5 drill-downs at session start (beyond Active Context)
  5. Don’t skip drill-downs — loading a file is cheaper than a wrong assumption
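
Rules 3 and 4 are easy to enforce mechanically. The sketch below is one way to do it, under the assumption that drill-down candidates arrive as a list ordered best-first; `plan_session_load` and its parameters are hypothetical names for illustration only.

```python
def plan_session_load(active_context, matched, max_active=3, max_drilldowns=5):
    """Return the ordered list of files to load at session start.

    active_context: files listed under Active Context (always loaded).
    matched: drill-down candidates, best first (e.g. from trigger matching).
    """
    if len(active_context) > max_active:
        raise ValueError(
            f"Active Context has {len(active_context)} files; max is {max_active}"
        )
    plan = list(active_context)
    drilldowns = []
    for path in matched:
        if path in plan or path in drilldowns:
            continue  # already loaded via Active Context, or a duplicate match
        if len(drilldowns) == max_drilldowns:
            break  # rule 4: cap drill-downs beyond Active Context
        drilldowns.append(path)
    return plan + drilldowns


active = ["memory/projects/cirrus-north.md", "memory/projects/backstage.md"]
matched = ["memory/projects/backstage.md", "memory/people/andy.md"]
print(plan_session_load(active, matched))
```

Files already in Active Context don't count against the drill-down cap, which matches the "beyond Active Context" wording of rule 4.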

The Implementation: QMD

After designing the hierarchical structure, we needed a backend that could search across these files efficiently. Enter QMD (Query Markup Documents), a lightweight CLI tool for semantic search over markdown collections. OpenClaw already ships it as an experimental memory backend, and "experimental" is no exaggeration. The idea is sound, though.

Why QMD?

QMD provides:

  • Fast vector + BM25 hybrid search across markdown files
  • Automatic collection management (watches for changes, re-indexes incrementally)
  • Low overhead — runs as a sidecar process, indexes stored in SQLite
  • Simple CLI interface — integrates cleanly with the agent toolchain
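
To make "hybrid search" concrete: one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below shows that generic technique only. It is not taken from QMD's source, and the file names in the example are just illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by document id so output is deterministic.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


bm25_hits = ["backstage.md", "clawdbot.md", "andy.md"]
vector_hits = ["pipdroid.md", "backstage.md", "clawdbot.md"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

Documents that both rankers agree on float to the top, even if neither ranker put them first.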

Configuration

memory:
  backend: "qmd"
  citations: "auto"
  qmd:
    includeDefaultMemory: true   # Index MEMORY.md + memory/**/*.md
    paths:
      - path: ~/notes
        name: docs
        pattern: "**/*.md"
    update:
      interval: "5m"             # Refresh indexes every 5 minutes
      debounceMs: 15000          # Debounce rapid file changes
    limits:
      maxResults: 6
      maxSnippetChars: 700
      timeoutMs: 4000

How It Works

  1. Indexing: QMD watches the workspace memory directory and configured paths, building vector embeddings + BM25 indexes for fast search
  2. Search: When the agent receives a message, it queries QMD with relevant context
  3. Results: QMD returns ranked snippets with file paths and line numbers
  4. Retrieval: The agent uses memory_get to pull only the needed chunks from the full files
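
The search-then-retrieve flow in steps 2 to 4 can be sketched with a toy in-memory version. Everything here is a stand-in: `search` and `get` are hypothetical functions that only mimic the shape of memory_search and memory_get, and the scoring is plain keyword overlap rather than QMD's hybrid ranking.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    path: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    score: float


def search(files: dict[str, str], query: str, max_results: int = 6) -> list[Snippet]:
    """Toy keyword search: score each line by how many query words it contains."""
    terms = set(query.lower().split())
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            score = len(terms & set(line.lower().split()))
            if score:
                hits.append(Snippet(path, lineno, lineno, float(score)))
    hits.sort(key=lambda s: (-s.score, s.path, s.start_line))
    return hits[:max_results]


def get(files: dict[str, str], snippet: Snippet) -> str:
    """Return only the lines the snippet points at, not the whole file."""
    lines = files[snippet.path].splitlines()
    return "\n".join(lines[snippet.start_line - 1 : snippet.end_line])


files = {
    "memory/projects/backstage.md": "# Backstage\ncatalog import runs nightly\nowner: andy",
    "memory/projects/pipdroid.md": "# PipDroid\nandroid node app",
}
for hit in search(files, "catalog import"):
    print(hit.path, hit.start_line, get(files, hit))
```

The point of the two-step shape is that search results carry only a pointer (path plus line range), so the agent pays tokens for the matching lines rather than for every file that matched.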

Performance

QMD’s hybrid search (vector + BM25) typically returns relevant results in under 50ms for collections of several hundred markdown files. The index refresh runs in the background, so there’s no startup delay.

The Results

Before:

  • Session start: 5-10k tokens (entire memory loaded)
  • Every conversation paid the full cost

After:

  • Session start: ~1.5k (index) + ~2k (active context) = ~3.5k typical
  • Full load if needed: ~5-6k (vs old 15-20k)
  • 70% token savings on typical sessions
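
For anyone who wants to rerun the arithmetic with their own numbers, here is a small sketch. The midpoints (17.5k and 5.5k) are my own reading of the full-load ranges above, not measured values, and the function names are made up for the example.

```python
def savings_pct(before_tokens: float, after_tokens: float) -> float:
    """Percentage of tokens saved going from `before_tokens` to `after_tokens`."""
    return (1 - after_tokens / before_tokens) * 100


def input_cost_usd(tokens: float, usd_per_million: float = 3.0) -> float:
    """Input cost at the $3 per million tokens rate quoted earlier."""
    return tokens * usd_per_million / 1_000_000


# Midpoints of the full-load ranges above: ~5-6k now vs 15-20k before.
full_load = savings_pct(before_tokens=17_500, after_tokens=5_500)
print(f"full load: {full_load:.0f}% saved")
print(f"cost of one 5,000-token load: ${input_cost_usd(5_000):.3f}")
```

With those midpoints the saving lands close to the 70% quoted above; the exact percentage depends on which before and after figures you compare.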

With QMD:

  • Semantic recall — finds relevant context even when trigger words don’t match exactly
  • Broader coverage — can index external notes, documentation, project wikis
  • Session history — (experimental) indexes past conversation transcripts for better continuity

What is the benefit?

This isn’t just about cost. It’s about scaling knowledge without scaling overhead.

As an AI agent accumulates months of context, the naive approach (load everything) breaks down. You end up doing one of three things:

  • Truncating old memories (losing continuity)
  • Loading irrelevant context (wasting tokens and diluting attention)
  • Pruning by hand (high maintenance, error-prone)

Hierarchical memory + QMD lets knowledge grow without cognitive or financial bloat. New projects, people, decisions get their own files. The index stays lean. QMD handles semantic search. The agent drills down as needed.

Implementation Notes

  • Storage: Plain Markdown files in a git repo (version controlled, diffable)
  • Trigger matching: Keyword detection for explicit drills; QMD for semantic search
  • Tools: memory_search (QMD-backed), memory_get (file snippet retrieval)
  • Migration: It took about 2 hours to split the monolithic file into structured pieces; QMD setup was around 30 minutes
  • QMD installation: Available via npm (npm install -g qmd) or Homebrew (brew install qmd)
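
Because everything is plain Markdown in a git repo, index drift can be checked with a few lines of script. This is a sketch under assumptions: `index_drift` is a hypothetical helper, and it assumes the index refers to detail files by paths of the form memory/.../name.md, as in the examples in this post.

```python
import re
from pathlib import Path


def index_drift(root: Path) -> tuple[set[str], set[str]]:
    """Compare MEMORY.md against the files under memory/.

    Returns (missing, unlisted):
      missing  - paths the index mentions that do not exist on disk
      unlisted - detail files on disk that the index never mentions
    """
    index_text = (root / "MEMORY.md").read_text(encoding="utf-8")
    listed = set(re.findall(r"memory/[\w./-]+\.md", index_text))
    on_disk = {
        p.relative_to(root).as_posix() for p in (root / "memory").rglob("*.md")
    }
    return listed - on_disk, on_disk - listed


if __name__ == "__main__":
    root = Path(".")
    if (root / "MEMORY.md").exists():
        missing, unlisted = index_drift(root)
        print("listed but missing:", sorted(missing))
        print("on disk but unlisted:", sorted(unlisted))
```

Run from the repo root, something like this could sit in a pre-commit hook so that an index update and its detail file change really do land in the same commit.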

What’s Next

We’re exploring:

  • Time-based archiving: decisions from 2025 move to memory/archive/2025/. The index keeps a one-line summary; full detail is available on-demand via QMD.
  • Project sunset detection: if a project hasn’t been mentioned in 90 days, automatically move it out of Active Context.
  • Cross-collection search: QMD can index multiple collections (memory, notes, wikis). We’re testing unified search across all knowledge sources.
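
The sunset rule is simple enough to sketch now, even though it isn't built yet. The example assumes a last-mentioned date is tracked per file somewhere, which is not something the current setup does; `split_active_context` and the dates are purely illustrative.

```python
from datetime import date, timedelta


def split_active_context(
    last_mentioned: dict[str, date], today: date, max_idle_days: int = 90
) -> tuple[list[str], list[str]]:
    """Split Active Context files into (keep, demote) by days since last mention."""
    cutoff = today - timedelta(days=max_idle_days)
    keep = sorted(p for p, seen in last_mentioned.items() if seen >= cutoff)
    demote = sorted(p for p, seen in last_mentioned.items() if seen < cutoff)
    return keep, demote


# Illustrative dates only.
last_mentioned = {
    "memory/projects/cirrus-north.md": date(2026, 1, 20),
    "memory/projects/backstage.md": date(2025, 10, 1),
}
keep, demote = split_active_context(last_mentioned, today=date(2026, 1, 31))
print("keep:", keep)
print("demote:", demote)
```

A file mentioned exactly 90 days ago is kept in this version; whether the boundary should be inclusive is a judgment call.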

Lessons Learned

  1. Indexes are cheap, details are expensive — optimize for the common case
  2. Trigger words + semantic search — combine explicit and fuzzy recall
  3. Active Context is key — explicitly tracking “what’s hot” prevents thrashing
  4. Update discipline matters — index drift is the failure mode
  5. QMD as a sidecar — lightweight, decoupled, easy to debug

If you’re building long-running AI agents, start thinking about memory architecture early. Flat files scale poorly. Hierarchical memory + semantic search scales indefinitely.

Published by Andy Bold