How to Build AI Agent Memory with Claude Code Plugins
You build AI agent memory by capturing session artifacts (diffs, decisions, errors) into a structured store, compressing them into context-window-friendly summaries, and reinjecting the relevant slice at the start of each new session. The core loop is: observe the session, extract key facts, compress into a retrieval-friendly format, and inject via system prompt or tool output when the agent starts fresh. This works with Claude Code's CLAUDE.md convention, MCP tool servers, or a standalone plugin architecture that you wire into any agent SDK.
Architecture Overview
The system has three components: a MemoryStore that persists structured observations, a Compressor that distills raw memories into token-efficient summaries, and an Injector that selects and formats relevant context for the next session. You use SQLite for local persistence and a straightforward embedding-based retrieval layer. This approach beats flat-file memory (like appending to CLAUDE.md) because it supports semantic search and time-decay ranking rather than forcing the agent to read everything linearly.
```python
import sqlite3
import hashlib
from datetime import datetime

class MemoryStore:
    def __init__(self, db_path="agent_memory.db"):
        self.conn = sqlite3.connect(db_path)
        self._init_schema()

    def _init_schema(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS memories (
                id TEXT PRIMARY KEY,
                session_id TEXT,
                category TEXT,
                content TEXT,
                embedding BLOB,
                created_at TEXT,
                access_count INTEGER DEFAULT 0
            )
        """)
        self.conn.commit()
```
Capturing Session Observations
During a coding session, you intercept key events: file changes, error resolutions, architectural decisions, and user corrections. Each observation gets a category tag so the compressor and injector can filter effectively. The record_memory method below extends MemoryStore; its content hash deduplicates repeated observations across sessions, which matters because the agent constantly rediscovers the same project conventions.
```python
    # Method of MemoryStore, continuing the class above.
    def record_memory(self, session_id, category, content, embedding=None):
        # Deduplicate by content hash.
        content_hash = hashlib.sha256(content.encode()).hexdigest()[:16]
        memory_id = f"{category}_{content_hash}"
        self.conn.execute("""
            INSERT INTO memories (id, session_id, category, content, embedding, created_at)
            VALUES (?, ?, ?, ?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                access_count = access_count + 1,
                session_id = excluded.session_id
        """, (
            memory_id, session_id, category,
            content, embedding, datetime.utcnow().isoformat()
        ))
        self.conn.commit()
        return memory_id
```
Compressing Memories for Context Windows
Raw observations are verbose. A single debugging session might produce 50 memories totaling 15,000 tokens. A full compressor would use a two-pass strategy: first cluster related memories, then summarize each cluster into a single directive (typically with an LLM call). The minimal version shown here settles for grouping by category, merging duplicates, and truncating to a strict token budget, so you never blow your context window. The max_tokens_per_category parameter controls the tradeoff between memory richness and the context left for the actual task.
```python
from collections import defaultdict

class MemoryCompressor:
    def __init__(self, max_tokens_per_category=500):
        self.budget = max_tokens_per_category

    def compress(self, memories):
        # Group memory contents by category.
        groups = defaultdict(list)
        for mem in memories:
            groups[mem["category"]].append(mem["content"])
        compressed = {}
        for category, items in groups.items():
            # Deduplicate while preserving insertion order.
            unique_items = list(dict.fromkeys(items))
            # Truncate to budget using a rough 4-chars-per-token estimate.
            joined = "\n- ".join(unique_items)
            char_limit = self.budget * 4
            compressed[category] = joined[:char_limit]
        return compressed
```
Retrieval With Semantic Search and Time Decay
Not all memories are equally relevant. A decision about database schema matters more than a typo fix from three weeks ago. The retriever scores memories using cosine similarity against the current task description, then applies an exponential time-decay factor. Memories that get accessed frequently (high access_count) receive a boost, which naturally surfaces project conventions that the agent keeps needing. If you don't have an embedding model available locally, fall back to keyword overlap using TF-IDF. That approach works surprisingly well for code-related queries.
```python
import math
from datetime import datetime

def score_memory(memory_row, query_embedding=None, now=None):
    now = now or datetime.utcnow()
    created = datetime.fromisoformat(memory_row["created_at"])
    # Age in days.
    age_days = (now - created).total_seconds() / 86400
    # Exponential decay with a 30-day half-life (ln 2 ≈ 0.693).
    recency_score = math.exp(-0.693 * age_days / 30)
    # Frequency boost, capped at 2x.
    freq_boost = min(1 + memory_row["access_count"] * 0.1, 2.0)
    # Placeholder: replace 1.0 with cosine similarity against
    # query_embedding when an embedding model is available.
    similarity = 1.0
    return similarity * recency_score * freq_boost
```
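When no embedding model is available, the keyword-overlap fallback mentioned above can stand in for the similarity term. This is a simple Jaccard-overlap sketch rather than full TF-IDF, but it captures the idea: score by shared identifier-like tokens between the task description and the memory content.

```python
import re

def keyword_similarity(query: str, content: str) -> float:
    """Jaccard overlap of identifier-like tokens; a rough TF-IDF stand-in."""
    def tokens(s):
        return set(re.findall(r"[a-z_][a-z0-9_]*", s.lower()))
    q, c = tokens(query), tokens(content)
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)
```

You would multiply this into score_memory in place of the 1.0 similarity placeholder.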
Injecting Context Into Claude Code Sessions
Claude Code reads project-level context from CLAUDE.md at the repository root. The most straightforward injection strategy writes compressed memories into this file before each session. For more sophisticated setups, you can run an MCP (Model Context Protocol) tool server that Claude Code calls on demand, exposing recall_memory and store_memory as tools. The MCP approach works better for large projects because the agent pulls only what it needs rather than front-loading everything.
```python
import os

class ContextInjector:
    def __init__(self, store, compressor, project_root="."):
        self.store = store
        self.compressor = compressor
        self.claude_md_path = os.path.join(project_root, "CLAUDE.md")

    def inject_to_claude_md(self, task_description=""):
        # Fetch the most frequently accessed memories; swap in
        # score_memory-based ranking once embeddings are wired up.
        rows = self.store.conn.execute(
            "SELECT * FROM memories ORDER BY access_count DESC LIMIT 50"
        ).fetchall()
        columns = ["id", "session_id", "category", "content",
                   "embedding", "created_at", "access_count"]
        memories = [dict(zip(columns, r)) for r in rows]
        compressed = self.compressor.compress(memories)
        # Build the memory section for CLAUDE.md.
        section = "\n## Agent Memory (auto-generated)\n"
        for cat, content in compressed.items():
            section += f"\n### {cat}\n- {content}\n"
        self._append_or_replace(section)

    def _append_or_replace(self, section):
        marker_start = "<!-- AGENT_MEMORY_START -->"
        marker_end = "<!-- AGENT_MEMORY_END -->"
        existing = ""
        if os.path.exists(self.claude_md_path):
            with open(self.claude_md_path, "r") as f:
                existing = f.read()
        # Replace the existing memory block between markers, or append one.
        if marker_start in existing:
            before = existing.split(marker_start)[0]
            after = existing.split(marker_end)[1] if marker_end in existing else ""
            updated = f"{before}{marker_start}\n{section}\n{marker_end}{after}"
        else:
            updated = f"{existing}\n{marker_start}\n{section}\n{marker_end}\n"
        with open(self.claude_md_path, "w") as f:
            f.write(updated)
```
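For reference, the marker-delimited block that ends up in CLAUDE.md looks roughly like this (category names and bullet contents are illustrative, not prescribed):

```markdown
<!-- AGENT_MEMORY_START -->

## Agent Memory (auto-generated)

### conventions
- Use snake_case for module names.

### decisions
- Chose SQLite for persistence.

<!-- AGENT_MEMORY_END -->
```

Keeping everything between the markers machine-owned means hand-written CLAUDE.md content above and below survives every regeneration.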
MCP Tool Server Alternative
For dynamic retrieval during a session, expose the memory system as an MCP tool server. You can configure Claude Code to connect to local MCP servers through its settings. This approach lets the agent decide when to recall or store memories rather than front-loading everything. The tradeoff: it costs extra tool-call round trips, but keeps the system prompt lean. You configure it in ~/.claude/claude_code_config.json under the mcpServers key.
```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("agent-memory")
store = MemoryStore()

@mcp.tool()
def recall_memory(query: str, category: str = "", limit: int = 10) -> str:
    """Retrieve relevant memories from past sessions."""
    sql = "SELECT category, content FROM memories"
    params = []
    if category:
        sql += " WHERE category = ?"
        params.append(category)
    sql += " ORDER BY access_count DESC LIMIT ?"
    params.append(limit)
    rows = store.conn.execute(sql, params).fetchall()
    return "\n".join(f"[{r[0]}] {r[1]}" for r in rows)

@mcp.tool()
def store_memory(category: str, content: str) -> str:
    """Store a memory for future sessions."""
    mid = store.record_memory("current", category, content)
    return f"Stored memory {mid}"

if __name__ == "__main__":
    mcp.run()
```
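Assuming the server above is saved as memory_server.py (filename hypothetical), the mcpServers entry looks roughly like the following; exact keys can shift between Claude Code versions, so check the current docs:

```json
{
  "mcpServers": {
    "agent-memory": {
      "command": "python",
      "args": ["/path/to/memory_server.py"]
    }
  }
}
```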
Gotchas and Production Considerations
Context window budgeting is the number one failure mode. If your compressed memories consume 8,000 tokens of a 128k window, that seems fine, but those 8,000 tokens compound with file contents, tool outputs, and conversation history. Always set a hard ceiling and measure it. A safe default is 5% of the model's context window for injected memory.
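A minimal sketch of that hard ceiling, reusing the same rough 4-chars-per-token estimate as the compressor (the constants are illustrative defaults, not Claude Code settings):

```python
# 5% of a 128k-token window reserved for injected memory.
CONTEXT_WINDOW_TOKENS = 128_000
MEMORY_BUDGET_TOKENS = int(CONTEXT_WINDOW_TOKENS * 0.05)

def enforce_budget(injected_text: str) -> str:
    # Rough estimate: 4 characters per token.
    char_limit = MEMORY_BUDGET_TOKENS * 4
    if len(injected_text) > char_limit:
        # Truncate rather than crowd out the actual task.
        return injected_text[:char_limit]
    return injected_text
```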
Memory poisoning is real. If the agent stores an incorrect assumption (like "this project uses PostgreSQL" when it actually uses MySQL), that error propagates into every future session. Add a confidence field and decay low-confidence memories aggressively. Better yet, let the user flag incorrect memories through a forget_memory tool.
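A sketch of that forget_memory capability, written as a plain function over the memories table so it runs standalone; on the MCP server above you would register it with @mcp.tool() just like the others:

```python
import sqlite3

def forget_memory(conn: sqlite3.Connection, memory_id: str) -> str:
    """Delete a memory the user has flagged as incorrect."""
    cur = conn.execute("DELETE FROM memories WHERE id = ?", (memory_id,))
    conn.commit()
    return "Deleted" if cur.rowcount else f"No memory with id {memory_id}"
```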
The CLAUDE.md approach vs. the MCP approach. Use CLAUDE.md injection for small-to-medium projects where the total memory fits in a few hundred tokens. Switch to MCP when your memory store grows past 100 entries or when different tasks in the same project need different memory slices. The two approaches aren't mutually exclusive. Put stable project conventions in CLAUDE.md and dynamic session context in MCP.
SQLite concurrency. If multiple agent processes share the same memory database (common in CI pipelines or parallel agent runs), enable WAL mode with PRAGMA journal_mode=WAL to avoid write locks. For team-wide shared memory, swap SQLite for PostgreSQL with pgvector for native embedding search.
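The WAL setup is two pragmas at connection time; busy_timeout is an extra safeguard (not mentioned above) that makes writers wait briefly instead of failing immediately on contention:

```python
import sqlite3

conn = sqlite3.connect("agent_memory.db")
# WAL lets readers proceed while a writer holds the lock.
conn.execute("PRAGMA journal_mode=WAL")
# Wait up to 5 seconds for a lock instead of raising immediately.
conn.execute("PRAGMA busy_timeout=5000")
```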