The Problem
Ever since I started it up a few weeks ago, my AI agent Pip has loaded a single MEMORY.md file at the start of every conversation. It started small: a few notes about preferences, ongoing projects, key decisions. But knowledge accumulates. By last weekend, that file had ballooned to 5,000+ tokens. Every session, whether we were debugging home lab infrastructure or just checking the weather, Pip loaded the entire history: project details, architectural decisions for ideas that will never see the light of day, blog post ideas.
The cost wasn’t just monetary (though at $3 per million input tokens, it adds up). The bigger problem was cognitive: Pip was drowning in context that rarely mattered for the current task. The bloated context also triggered session compaction more often, and each time it did, Pip ‘forgot’ what it was working on.
It was like working with a goldfish.
The Solution: An Index + Detail Files
We redesigned memory as a hierarchical system:
- MEMORY.md becomes a lightweight index (~1.5k tokens)
- Detail lives in subdirectories: memory/people/, memory/projects/, memory/decisions/
- Load what you need, when you need it
The Index Structure
👤 People
| Name | Triggers | File |
|---|---|---|
| Andy Bold | andy, human, user | memory/people/andy.md |
📦 Projects
| Project | Triggers | File |
|---|---|---|
| Backstage | backstage, catalog | memory/projects/backstage.md |
| Clawdbot | openclaw, gateway | memory/projects/clawdbot.md |
| PipDroid | android, node, mobile | memory/projects/pipdroid.md |
The index contains trigger words — keywords that signal when to drill down. If the conversation mentions “backstage” or “catalog,” Pip knows to load memory/projects/backstage.md. If not, that 2k token file stays on disk.
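As a rough illustration of the trigger lookup, here is a minimal Python sketch. It assumes the Projects table has already been parsed into a list of rows; the `INDEX` structure and the `files_to_load` function are hypothetical names for this example, not part of Pip or OpenClaw.

```python
import re

# Hypothetical in-memory form of the Projects table from the index.
INDEX = [
    {"triggers": ["backstage", "catalog"], "file": "memory/projects/backstage.md"},
    {"triggers": ["openclaw", "gateway"], "file": "memory/projects/clawdbot.md"},
    {"triggers": ["android", "node", "mobile"], "file": "memory/projects/pipdroid.md"},
]


def files_to_load(message: str, index=INDEX) -> list[str]:
    """Return detail files whose trigger words appear in the message."""
    # Lowercase and split into whole words so "Backstage" matches "backstage".
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    return [row["file"] for row in index if words & set(row["triggers"])]


print(files_to_load("Can you update the Backstage catalog entry?"))
# ['memory/projects/backstage.md']
print(files_to_load("What's the weather like?"))
# []
```

Matching on whole words keeps the lookup cheap and predictable, at the cost of missing variants such as plurals. That gap is the kind of thing semantic search is meant to cover.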
Active Context
The index also tracks 2-3 “active context” files — projects or topics that are hot right now. These get loaded automatically:
🔥 Active Context (Always Load)
| File | Why Active |
|---|---|
| memory/projects/cirrus-north.md | New business, active setup |
| memory/projects/backstage.md | Ongoing development |
The Rules
- Update the index with every detail file change (same commit, no exceptions)
- Keep the index under 3k tokens — archive inactive items
- Active Context: Max 2-3 files — rotate based on what’s hot
- Max 5 drill-downs at session start (beyond Active Context)
- Don’t skip drill-downs — loading a file is cheaper than a wrong assumption
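The two loading limits in these rules can be sketched in a few lines of Python. This is an illustration under assumptions: `plan_session_load` and its arguments are hypothetical, and the list of triggered files is taken as given.

```python
MAX_ACTIVE = 3       # "Active Context: Max 2-3 files"
MAX_DRILL_DOWNS = 5  # "Max 5 drill-downs at session start"


def plan_session_load(active: list[str], triggered: list[str]) -> list[str]:
    """Return the files to load at session start, in order, without duplicates."""
    if len(active) > MAX_ACTIVE:
        raise ValueError(
            f"Active Context has {len(active)} files; rotate down to {MAX_ACTIVE}"
        )
    plan = list(dict.fromkeys(active))  # de-duplicate while keeping order
    # Files already in Active Context do not count against the drill-down cap.
    drill_downs = [f for f in dict.fromkeys(triggered) if f not in plan]
    return plan + drill_downs[:MAX_DRILL_DOWNS]


active = ["memory/projects/cirrus-north.md", "memory/projects/backstage.md"]
triggered = ["memory/projects/backstage.md", "memory/people/andy.md"]
print(plan_session_load(active, triggered))
# ['memory/projects/cirrus-north.md', 'memory/projects/backstage.md', 'memory/people/andy.md']
```

Raising an error when Active Context grows past the limit, rather than silently truncating it, forces the rotation decision to be made explicitly.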
The Implementation: QMD
After designing the hierarchical structure, we needed a backend that could efficiently search across these files. Enter QMD (Quick Markdown Documents), a lightweight CLI tool for semantic search over markdown collections. It is already available as a backend for OpenClaw, but that integration is still very experimental. The idea is sound, though.
Why QMD?
QMD provides:
- Fast vector + BM25 hybrid search across markdown files
- Automatic collection management (watches for changes, re-indexes incrementally)
- Low overhead — runs as a sidecar process, indexes stored in SQLite
- Simple CLI interface — integrates cleanly with the agent toolchain
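To make "hybrid search" concrete: a common way to merge a keyword ranking and a vector ranking is reciprocal rank fusion (RRF). The sketch below shows the general technique only. Treating it as QMD's internal method would be an assumption, and the file names and the constant `k = 60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); top ranks contribute most.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["backstage.md", "clawdbot.md", "andy.md"]        # keyword ranking
vector_hits = ["pipdroid.md", "backstage.md", "clawdbot.md"]  # embedding ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['backstage.md', 'clawdbot.md', 'pipdroid.md', 'andy.md']
```

Documents that appear in both rankings rise to the top, which is the behaviour you want when exact keywords and semantic similarity agree.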
Configuration
memory:
  backend: "qmd"
  citations: "auto"
  qmd:
    includeDefaultMemory: true  # Index MEMORY.md + memory/**/*.md
    paths:
      - path: ~/notes
        name: docs
        pattern: "**/*.md"
    update:
      interval: "5m"      # Refresh indexes every 5 minutes
      debounceMs: 15000   # Debounce rapid file changes
    limits:
      maxResults: 6
      maxSnippetChars: 700
      timeoutMs: 4000
How It Works
- Indexing: QMD watches the workspace memory directory and configured paths, building vector embeddings + BM25 indexes for fast search
- Search: When the agent receives a message, it queries QMD with relevant context
- Results: QMD returns ranked snippets with file paths and line numbers
- Retrieval: The agent uses memory_get to pull only the needed chunks from the full files
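The retrieval step amounts to reading a line range out of a file and capping its size. Here is a minimal sketch of that idea; `get_snippet` is a hypothetical stand-in written for this post, not the actual memory_get implementation, and the 700-character cap mirrors limits.maxSnippetChars from the Configuration section.

```python
from pathlib import Path
import tempfile

MAX_SNIPPET_CHARS = 700  # mirrors limits.maxSnippetChars


def get_snippet(path: Path, start_line: int, end_line: int,
                max_chars: int = MAX_SNIPPET_CHARS) -> str:
    """Return lines start_line..end_line (1-indexed, inclusive), capped at max_chars."""
    lines = path.read_text(encoding="utf-8").splitlines()
    snippet = "\n".join(lines[start_line - 1:end_line])
    return snippet[:max_chars]


# Demo with a throwaway file standing in for memory/projects/backstage.md.
with tempfile.TemporaryDirectory() as tmp:
    detail = Path(tmp) / "backstage.md"
    detail.write_text(
        "# Backstage\nOwner: Andy\nStatus: ongoing development\nNotes: ...\n",
        encoding="utf-8",
    )
    print(get_snippet(detail, 2, 3))
```

Because search results carry file paths and line numbers, the agent only ever pays for the lines it asks for, not the whole detail file.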
Performance
QMD’s hybrid search (vector + BM25) typically returns relevant results in under 50ms for collections of several hundred markdown files. The index refresh runs in the background, so there’s no startup delay.
The Results
Before:
- Session start: 5-10k tokens (entire memory loaded)
- Every conversation paid the full cost
After:
- Session start: ~1.5k (index) + ~2k (active context) = ~3.5k typical
- Full load if needed: ~5-6k (vs old 15-20k)
- 70% token savings on typical sessions
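For anyone who wants to rerun this arithmetic with their own numbers, here is a small sketch. The $3 per million rate is the one quoted earlier; the midpoints used for the full-load comparison (17,500 and 5,500 tokens) are my assumption, taken from the middle of the ranges above.

```python
USD_PER_MILLION_INPUT_TOKENS = 3.0  # rate quoted earlier in the post


def session_cost_usd(tokens: int, rate: float = USD_PER_MILLION_INPUT_TOKENS) -> float:
    """Cost of sending `tokens` input tokens once."""
    return tokens * rate / 1_000_000


def savings_pct(before: int, after: int) -> float:
    """Percentage reduction going from `before` tokens to `after` tokens."""
    return 100.0 * (before - after) / before


typical = 1_500 + 2_000  # index + active context
print(typical)                               # 3500
print(f"${session_cost_usd(typical):.4f}")   # $0.0105
print(f"{savings_pct(17_500, 5_500):.0f}%")  # 69%
```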
With QMD:
- Semantic recall — finds relevant context even when trigger words don’t match exactly
- Broader coverage — can index external notes, documentation, project wikis
- Session history — (experimental) indexes past conversation transcripts for better continuity
What is the benefit?
This isn’t just about cost. It’s about scaling knowledge without scaling overhead.
As an AI agent accumulates months of context, the naive approach (load everything) breaks down. You end up doing one of three things:
- Truncating old memories (losing continuity)
- Loading irrelevant context (wasting tokens and diluting attention)
- Pruning manually (high maintenance, error-prone)
Hierarchical memory + QMD lets knowledge grow without cognitive or financial bloat. New projects, people, decisions get their own files. The index stays lean. QMD handles semantic search. The agent drills down as needed.
Implementation Notes
- Storage: Plain Markdown files in a git repo (version controlled, diffable)
- Trigger matching: Keyword detection for explicit drills; QMD for semantic search
- Tools: memory_search (QMD-backed), memory_get (file snippet retrieval)
- Migration: It took about 2 hours to split the monolithic file into structured pieces; QMD setup was around 30 minutes
- QMD installation: Available via npm (npm install -g qmd) or Homebrew (brew install qmd)
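Because everything is plain Markdown in a git repo, index drift can be checked mechanically, for example from a pre-commit hook. The following is a sketch under assumptions: `index_drift` and `detail_files` are hypothetical helpers, and the regular expression assumes detail files are always written as memory/….md paths in the index.

```python
import re
from pathlib import Path


def index_drift(index_text: str, files_on_disk: set[str]) -> tuple[set[str], set[str]]:
    """Return (listed in the index but missing on disk, on disk but not listed)."""
    listed = set(re.findall(r"memory/[\w./-]+\.md", index_text))
    return listed - files_on_disk, files_on_disk - listed


def detail_files(root: Path) -> set[str]:
    """Collect memory/**/*.md paths relative to the workspace root."""
    return {p.relative_to(root).as_posix() for p in (root / "memory").rglob("*.md")}


index_text = """
| Backstage | backstage, catalog | memory/projects/backstage.md |
| Clawdbot | openclaw, gateway | memory/projects/clawdbot.md |
"""
on_disk = {"memory/projects/backstage.md", "memory/projects/pipdroid.md"}
missing, unlisted = index_drift(index_text, on_disk)
print(sorted(missing))   # ['memory/projects/clawdbot.md']
print(sorted(unlisted))  # ['memory/projects/pipdroid.md']
```

In a real workspace you would pass `detail_files(Path("."))` as the second argument and fail the commit if either set is non-empty.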
What’s Next
We’re exploring:
- Time-based archiving: decisions from 2025 move to memory/archive/2025/. The index keeps a one-line summary; full detail is available on demand via QMD.
- Project sunset detection: if a project hasn’t been mentioned in 90 days, automatically move it out of Active Context.
- Cross-collection search: QMD can index multiple collections (memory, notes, wikis). We’re testing unified search across all knowledge sources.
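The first two ideas are simple enough to sketch already. This is exploratory code under assumptions: where the "last mentioned" dates come from is an open question, so the sketch takes them as a plain dictionary, and the function names, dates, and the use-sqlite.md file are all hypothetical.

```python
from datetime import date, timedelta

SUNSET_AFTER = timedelta(days=90)


def split_active(last_mentioned: dict[str, date], today: date) -> tuple[list[str], list[str]]:
    """Split files into (still active, due for sunset) by last-mention date."""
    keep, sunset = [], []
    for path, seen in last_mentioned.items():
        (sunset if today - seen >= SUNSET_AFTER else keep).append(path)
    return keep, sunset


def archive_path(path: str, year: int) -> str:
    """Map a detail file to memory/archive/<year>/<same file name>."""
    return f"memory/archive/{year}/{path.rsplit('/', 1)[-1]}"


# Illustrative dates only.
last_mentioned = {
    "memory/projects/backstage.md": date(2026, 1, 10),
    "memory/projects/pipdroid.md": date(2025, 9, 1),
}
keep, sunset = split_active(last_mentioned, today=date(2026, 1, 20))
print(keep)    # ['memory/projects/backstage.md']
print(sunset)  # ['memory/projects/pipdroid.md']
print(archive_path("memory/decisions/use-sqlite.md", 2025))
# memory/archive/2025/use-sqlite.md
```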
Lessons Learned:
- Indexes are cheap, details are expensive — optimize for the common case
- Trigger words + semantic search — combine explicit and fuzzy recall
- Active Context is key — explicitly tracking “what’s hot” prevents thrashing
- Update discipline matters — index drift is the failure mode
- QMD as a sidecar — lightweight, decoupled, easy to debug
If you’re building long-running AI agents, start thinking about memory architecture early. Flat files scale poorly. Hierarchical memory + semantic search scales indefinitely.
Tools Used:
- QMD: https://github.com/ryanatkn/qmd (or npm install -g qmd)
- OpenClaw: https://docs.openclaw.ai (agent framework with built-in QMD integration)
- Storage: Private Git repo with markdown files
