Hierarchical Memory: How I Saved 70% on AI Agent Context

The Problem

From the day I started it up a few weeks ago, my AI agent Pip loaded a single MEMORY.md file at the start of every conversation. The file started small: a few notes about preferences, ongoing projects, key decisions. But knowledge accumulates. By last weekend, it had ballooned to 5,000+ tokens. Every session, whether we were debugging home lab infrastructure or just checking the weather, Pip loaded the entire history: project details, architectural decisions for ideas that will never see the light of day, blog post ideas.

The cost wasn’t just monetary (though at $3 per million input tokens, it adds up). The bigger problem was cognitive: Pip was drowning in context that rarely mattered for the current task. The bloat also made session compaction kick in more often, and each time it did, Pip ‘forgot’ what it was working on.

It was like working with a goldfish.

The Solution: An Index + Detail Files

We redesigned memory as a hierarchical system:

MEMORY.md becomes a lightweight index (~1.5k tokens)
Detail lives in subdirectories: memory/people/, memory/projects/, memory/decisions/
Load what you need, when you need it

The Index Structure

👤 People

| Name | Triggers | File |
| --- | --- | --- |
| Andy Bold | andy, human, user | `memory/people/andy.md` |

📦 Projects

| Project | Triggers | File |
| --- | --- | --- |
| Backstage | backstage, catalog | `memory/projects/backstage.md` |
| Clawdbot | openclaw, gateway | `memory/projects/clawdbot.md` |
| PipDroid | android, node, mobile | `memory/projects/pipdroid.md` |

The index contains trigger words — keywords that signal when to drill down. If the conversation mentions “backstage” or “catalog,” Pip knows to load memory/projects/backstage.md. If not, that 2k token file stays on disk.
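
Here is a minimal sketch of that trigger lookup, not Pip's actual code. The `INDEX` dictionary and the `files_to_load` function are hypothetical names, and the sketch assumes the index tables have already been parsed into a mapping from file path to trigger words.

```python
import re

# Hypothetical in-memory form of the index: file path -> trigger words.
# The entries mirror the example tables above.
INDEX = {
    "memory/people/andy.md": ["andy", "human", "user"],
    "memory/projects/backstage.md": ["backstage", "catalog"],
    "memory/projects/clawdbot.md": ["openclaw", "gateway"],
    "memory/projects/pipdroid.md": ["android", "node", "mobile"],
}


def files_to_load(message: str, index: dict[str, list[str]] = INDEX) -> list[str]:
    """Return the detail files whose triggers appear as whole words in the message."""
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    return [path for path, triggers in index.items() if words & set(triggers)]


if __name__ == "__main__":
    # Mentions "backstage" and "catalog", so only the Backstage file is selected.
    print(files_to_load("Can you update the Backstage catalog entry?"))
```

Whole-word matching keeps the lookup cheap and predictable, at the price of missing variants such as plurals. That gap is where semantic search earns its keep.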

Active Context

The index also tracks 2-3 “active context” files — projects or topics that are hot right now. These get loaded automatically:

🔥 Active Context (Always Load)

| File | Why Active |
| --- | --- |
| `memory/projects/cirrus-north.md` | New business, active setup |
| `memory/projects/backstage.md` | Ongoing development |

The Rules

Update the index with every detail file change (same commit, no exceptions)
Keep the index under 3k tokens — archive inactive items
Active Context: Max 2-3 files — rotate based on what’s hot
Max 5 drill-downs at session start (beyond Active Context)
Don’t skip drill-downs — loading a file is cheaper than a wrong assumption
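
The limits in these rules are easy to enforce mechanically. The sketch below is one way to do it, under stated assumptions: the function names are hypothetical, and the token estimate uses a rough 4-characters-per-token heuristic rather than a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming ~4 characters per token (a heuristic, not a tokenizer)."""
    return len(text) // 4


def index_within_budget(index_text: str, limit: int = 3000) -> bool:
    """Check the 'keep the index under 3k tokens' rule against the estimate."""
    return estimate_tokens(index_text) <= limit


def plan_session_load(active: list[str], matched: list[str],
                      max_active: int = 3, max_drilldowns: int = 5) -> list[str]:
    """Active Context files first, then at most max_drilldowns trigger matches."""
    plan = list(active[:max_active])
    # Drop duplicates and anything already covered by Active Context.
    extras = [path for path in dict.fromkeys(matched) if path not in plan]
    return plan + extras[:max_drilldowns]


if __name__ == "__main__":
    active = ["memory/projects/cirrus-north.md", "memory/projects/backstage.md"]
    matched = ["memory/projects/backstage.md", "memory/people/andy.md"]
    print(plan_session_load(active, matched))
```

A check like `index_within_budget` could run whenever the index changes, so the archive rule gets applied before the file creeps back up.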

The Implementation: QMD

After designing the hierarchical structure, we needed a backend that could efficiently search across these files. Enter QMD (Quick Markdown Documents), a lightweight CLI tool for semantic search over markdown collections. OpenClaw already offers it as a memory backend, and that support is very much experimental. The idea is sound, though.

Why QMD?

QMD provides:

Fast vector + BM25 hybrid search across markdown files
Automatic collection management (watches for changes, re-indexes incrementally)
Low overhead — runs as a sidecar process, indexes stored in SQLite
Simple CLI interface — integrates cleanly with the agent toolchain
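
To make "vector + BM25 hybrid search" concrete, here is a toy sketch of the general technique. It is not QMD's code: it scores documents with the standard BM25 formula, ranks them separately by cosine similarity over hand-written stand-in vectors, and merges the two rankings with reciprocal rank fusion. The fusion method and every name in the block are my assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the standard BM25 formula."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n) or 1.0
    query_terms = set(tokenize(query))
    doc_freq = {term: sum(1 for t in tokenized if term in t) for term in query_terms}
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ranking(scores: list[float]) -> list[int]:
    """Document indices ordered from best score to worst."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge rankings: each list contributes 1 / (k + rank) per document."""
    fused: dict[int, float] = defaultdict(float)
    for order in rankings:
        for position, doc_id in enumerate(order):
            fused[doc_id] += 1.0 / (k + position + 1)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


def hybrid_rank(query: str, query_vec: list[float],
                docs: list[str], doc_vecs: list[list[float]]) -> list[int]:
    keyword = ranking(bm25_scores(query, docs))
    semantic = ranking([cosine(query_vec, v) for v in doc_vecs])
    return reciprocal_rank_fusion([keyword, semantic])


if __name__ == "__main__":
    docs = ["backstage catalog plugin setup",
            "android node pairing notes",
            "gateway config for openclaw"]
    # Hand-written stand-ins for real embeddings.
    doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.2, 0.1, 0.9]]
    print(hybrid_rank("backstage catalog", [0.8, 0.2, 0.1], docs, doc_vecs))
```

The appeal of fusing ranks rather than raw scores is that BM25 and cosine values live on different scales, so no tuning is needed to combine them.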

Configuration

memory:
  backend: "qmd"
  citations: "auto"
  qmd:
    includeDefaultMemory: true  # Index MEMORY.md + memory/**/*.md
    paths:
      - path: ~/notes
        name: docs
        pattern: "**/*.md"
    update:
      interval: "5m"           # Refresh indexes every 5 minutes
      debounceMs: 15000        # Debounce rapid file changes
    limits:
      maxResults: 6
      maxSnippetChars: 700
      timeoutMs: 4000
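
For readers unfamiliar with debouncing, the sketch below shows what a setting like debounceMs usually means: a burst of file changes triggers one re-index after things go quiet, not one per change. This is a generic illustration with a hypothetical `Debouncer` class, not the actual OpenClaw or QMD implementation.

```python
from typing import Optional


class Debouncer:
    """Collapse bursts of change events into one re-index after a quiet period."""

    def __init__(self, quiet_ms: int = 15000):
        self.quiet_ms = quiet_ms
        self.last_change_ms: Optional[int] = None

    def record_change(self, now_ms: int) -> None:
        """Note a file change; each new change restarts the quiet period."""
        self.last_change_ms = now_ms

    def should_reindex(self, now_ms: int) -> bool:
        """True exactly once, after quiet_ms has passed since the latest change."""
        if self.last_change_ms is None:
            return False
        if now_ms - self.last_change_ms >= self.quiet_ms:
            self.last_change_ms = None  # Pending work is now handled.
            return True
        return False


if __name__ == "__main__":
    debouncer = Debouncer(quiet_ms=15000)
    debouncer.record_change(0)
    debouncer.record_change(10_000)           # A second save restarts the timer.
    print(debouncer.should_reindex(20_000))   # Only 10s of quiet so far.
    print(debouncer.should_reindex(25_000))   # 15s of quiet: time to re-index.
```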


How It Works

Indexing: QMD watches the workspace memory directory and configured paths, building vector embeddings + BM25 indexes for fast search
Search: When the agent receives a message, it queries QMD with relevant context
Results: QMD returns ranked snippets with file paths and line numbers
Retrieval: The agent uses memory_get to pull only the needed chunks from the full files
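
The retrieval step is the part that keeps token use down: a search hit carries a file path and line numbers, so only that slice needs to be read. Here is a rough sketch of the idea. The `read_snippet` function and its parameters are hypothetical, since the real memory_get signature isn't shown here.

```python
from pathlib import Path


def read_snippet(path: str, start_line: int, num_lines: int) -> str:
    """Return num_lines lines of a file, starting at the 1-indexed start_line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    start = max(start_line - 1, 0)
    return "\n".join(lines[start:start + num_lines])
```

Given a hit at line 40 of a detail file, a call like `read_snippet("memory/projects/backstage.md", 40, 10)` would pull ten lines into context and leave the rest of the file on disk.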

Performance

QMD’s hybrid search (vector + BM25) typically returns relevant results in under 50ms for collections of several hundred markdown files. The index refresh runs in the background, so there’s no startup delay.

The Results

Before:

Session start: 5-10k tokens (entire memory loaded)
Every conversation paid the full cost

After:

Session start: ~1.5k (index) + ~2k (active context) = ~3.5k typical
Full load if needed: ~5-6k (vs old 15-20k)
70% token savings on typical sessions
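
The percentage depends on which baseline you compare against, so it helps to spell the arithmetic out. The helper below is a hypothetical one-liner, and the inputs are only the figures quoted above.

```python
def savings(before_tokens: float, after_tokens: float) -> float:
    """Fraction of tokens saved going from before_tokens to after_tokens."""
    return 1 - after_tokens / before_tokens


if __name__ == "__main__":
    print(f"{savings(10_000, 3_500):.0%}")  # 65%: top of the old 5-10k range vs ~3.5k
    print(f"{savings(5_000, 3_500):.0%}")   # 30%: bottom of the old range vs ~3.5k
    print(f"{savings(20_000, 6_000):.0%}")  # 70%: full load, old 20k vs new 6k
```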

With QMD:

Semantic recall — finds relevant context even when trigger words don’t match exactly
Broader coverage — can index external notes, documentation, project wikis
Session history — (experimental) indexes past conversation transcripts for better continuity

What is the benefit?

This isn’t just about cost. It’s about scaling knowledge without scaling overhead.

As an AI agent accumulates months of context, the naive approach (load everything) breaks down. You end up doing one of three things:

Truncating old memories (losing continuity)
Loading irrelevant context (wasting tokens and diluting attention)
Pruning by hand (high maintenance, error-prone)

Hierarchical memory + QMD lets knowledge grow without cognitive or financial bloat. New projects, people, decisions get their own files. The index stays lean. QMD handles semantic search. The agent drills down as needed.

Implementation Notes

Storage: Plain Markdown files in a git repo (version controlled, diffable)
Trigger matching: Keyword detection for explicit drills; QMD for semantic search
Tools: memory_search (QMD-backed), memory_get (file snippet retrieval)
Migration: It took about 2 hours to split the monolithic file into structured pieces; QMD setup was around 30 minutes
QMD installation: Available via npm (npm install -g qmd) or Homebrew (brew install qmd)

What’s Next

We’re exploring:

Time-based archiving: decisions from 2025 move to memory/archive/2025/. The index keeps a one-line summary; full detail is available on-demand via QMD.
Project sunset detection: if a project hasn’t been mentioned in 90 days, automatically move it out of Active Context.
Cross-collection search: QMD can index multiple collections (memory, notes, wikis). We’re testing unified search across all knowledge sources.
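
Sunset detection is the simplest of these to prototype. The sketch below assumes something already records the date each project was last mentioned. That tracking, the `stale_projects` name, and the example dates are all hypothetical.

```python
from datetime import date, timedelta


def stale_projects(last_mentioned: dict[str, date], today: date,
                   max_idle_days: int = 90) -> list[str]:
    """Return files whose last mention is more than max_idle_days before today."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(path for path, seen in last_mentioned.items() if seen < cutoff)


if __name__ == "__main__":
    # Made-up dates for illustration only.
    last_mentioned = {
        "memory/projects/cirrus-north.md": date(2026, 2, 20),
        "memory/projects/pipdroid.md": date(2025, 11, 1),
    }
    print(stale_projects(last_mentioned, today=date(2026, 3, 1)))
```

Anything the function returns is a candidate for rotating out of Active Context. Whether to do that automatically or just flag it is a separate decision.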

Lessons Learned:

Indexes are cheap, details are expensive — optimize for the common case
Trigger words + semantic search — combine explicit and fuzzy recall
Active Context is key — explicitly tracking “what’s hot” prevents thrashing
Update discipline matters — index drift is the failure mode
QMD as a sidecar — lightweight, decoupled, easy to debug
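
Since index drift is the failure mode, it is worth checking for mechanically. Here is a minimal sketch, assuming MEMORY.md sits at the repo root next to the memory/ directory and refers to detail files by relative path. The `index_drift` function is hypothetical and could run as a pre-commit check.

```python
import re
from pathlib import Path


def index_drift(root: str) -> tuple[set[str], set[str]]:
    """Return (detail files missing from the index, index entries missing on disk)."""
    root_path = Path(root)
    index_text = (root_path / "MEMORY.md").read_text(encoding="utf-8")
    on_disk = {
        path.relative_to(root_path).as_posix()
        for path in (root_path / "memory").rglob("*.md")
    }
    referenced = set(re.findall(r"memory/[\w./-]+\.md", index_text))
    return on_disk - referenced, referenced - on_disk


if __name__ == "__main__":
    unindexed, dangling = index_drift(".")
    for path in sorted(unindexed):
        print(f"not in index: {path}")
    for path in sorted(dangling):
        print(f"missing on disk: {path}")
```

An archive directory would need an exclusion if archived files are deliberately left out of the index.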

If you’re building long-running AI agents, start thinking about memory architecture early. Flat files scale poorly. Hierarchical memory plus semantic search scales much further.

Tools Used:

QMD: https://github.com/ryanatkn/qmd (or npm install -g qmd)
OpenClaw: https://docs.openclaw.ai (agent framework with built-in QMD integration)
Storage: Private Git repo with markdown files

Old Man Shouts at Cloud
