
Building SemanticStudio Series

An 8-part deep dive into building a production-ready multi-agent chat platform

SemanticStudio · Memory Systems · Context Engineering · RAG

Memory as Infrastructure: The Complete 4-Tier System

A deep dive into SemanticStudio's 4-tier memory architecture—working context, session memory, long-term memory, and the Context Graph. Progressive compression meets knowledge bridging.

January 26, 2026 · 8 min read

TL;DR

  • 4-tier memory: working context, session memory, long-term memory, Context Graph
  • Rolling window compression for unlimited conversation length
  • Context Graph bridges conversation history to domain knowledge entities

Context windows aren't infinite. Tokens aren't free. Memory management matters.

I've written about context engineering as a discipline—the art of managing what goes into the context window. SemanticStudio implements those principles in a production 4-tier memory system with rolling window compression and a knowledge graph bridge.

The Memory Problem

Every AI conversation faces the same challenge:

  1. Context windows have limits: Even a 128K-token window fills up fast
  2. Tokens cost money: More context = higher costs
  3. Attention dilutes: More content = less focus on what matters
  4. History accumulates: Long conversations overflow capacity

The naive solution is "stuff everything in." It doesn't work at scale.

The sophisticated solution is tiered memory—different storage layers with different retrieval strategies, plus a Context Graph that bridges your conversation history to domain knowledge.

The 4-Tier Architecture

SemanticStudio implements four memory tiers, inspired by MemGPT and the memory hierarchy of traditional computer architecture:

4-Tier Memory System

Interactive diagram: the four tiers with each tier's stored content, retrieval method, and TTL, and arrows showing facts promoting upward between tiers. Example facts by tier: "Last question was about revenue" (working context), "Currently analyzing Q4 data" (session memory), "User's role is Finance Manager" (long-term memory), "Discussed Acme Corp (customer)" and "Queried Harris County (region)" (Context Graph).

Key insight: The memory extractor identifies facts worth keeping and promotes them between tiers automatically. You never lose important context.

Tier 1: Working Context

What it is: The active, in-context memory for every request.

What's included:

  • Last 3 conversation exchanges
  • Current session summary
  • Active system instructions
  • Current query and intent

Retrieval: Always included—no retrieval needed.

Cost: Low but always paid. Every request includes Tier 1.

TTL: Current request only.
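
To make Tier 1 concrete, here's a minimal sketch of working-context assembly. It's pure concatenation with no retrieval call involved; the names (`Exchange`, `build_working_context`) are illustrative, not SemanticStudio's actual API.

```python
from dataclasses import dataclass

@dataclass
class Exchange:
    user: str
    assistant: str

def build_working_context(system_prompt: str, session_summary: str,
                          history: list[Exchange], query: str) -> list[dict]:
    """Assemble Tier 1: deterministic, always included, no retrieval."""
    messages = [{"role": "system", "content": system_prompt}]
    if session_summary:
        messages.append({"role": "system",
                         "content": f"Session summary: {session_summary}"})
    for ex in history[-3:]:                     # last 3 exchanges only
        messages.append({"role": "user", "content": ex.user})
        messages.append({"role": "assistant", "content": ex.assistant})
    messages.append({"role": "user", "content": query})
    return messages
```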

SemanticStudio mode configuration showing memory tier toggles

Tier 2: Session Memory

What it is: Facts and context from the current session, retrieved on demand.

What's included:

  • Relevant past turns from current session
  • Session-extracted facts
  • Recent file references
  • Conversation patterns

Retrieval: Vector similarity search against session content.

Cost: Medium—only retrieved when relevant.

TTL: Session duration (cleared when session ends).
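
Tier 2 retrieval is a standard embedding similarity search scoped to the session. Here's a rough sketch assuming an `embed` callable that maps text to a vector; a real deployment would back this with a vector store, but the shape is the same.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class SessionMemory:
    """Tier 2: facts scoped to one session, retrieved on demand."""
    def __init__(self, embed):
        self.embed = embed                       # text -> np.ndarray
        self.facts: list[tuple[str, np.ndarray]] = []

    def add(self, fact: str) -> None:
        self.facts.append((fact, self.embed(fact)))

    def retrieve(self, query: str, k: int = 5,
                 min_sim: float = 0.35) -> list[str]:  # cutoff is illustrative
        q = self.embed(query)
        scored = sorted(((cosine(q, v), f) for f, v in self.facts),
                        reverse=True)
        return [f for sim, f in scored[:k] if sim >= min_sim]

    def clear(self) -> None:                     # TTL: session duration
        self.facts = []
```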

Tier 3: Long-term Memory

What it is: User profile facts that persist across sessions.

What's included:

  • User profile facts
  • Cross-session knowledge
  • Explicitly saved memories
  • Learned preferences

Retrieval: Selective retrieval based on query relevance.

Cost: Variable—only high-value facts retrieved.

TTL: Indefinite, controlled by user.
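
Tier 3 retrieval looks much like Tier 2; the differences are persistence and user control. A hypothetical record shape, assuming the extractor attaches a confidence score and source:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LongTermFact:
    user_id: str
    text: str             # e.g. "User is Finance Manager"
    confidence: float     # extractor's importance score
    source: str           # "extracted" or "explicit_save"
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    # No expiry field: retention is user-controlled, so a fact
    # persists until the user deletes it.
```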

Tier 4: Context Graph

What it is: A knowledge bridge that links your conversation context to domain entities in the knowledge graph.

What's tracked:

  • Entities you've discussed, queried, or mentioned
  • Links between your memories and business data
  • Reference context (the snippet where the entity appeared)

Reference Types:

| Type | When Created |
|------|--------------|
| discussed | Entity appears in assistant response |
| queried | Entity appears in user question |
| mentioned | Entity linked from saved facts |
| interested_in | Repeated queries about same entity |

Query Capabilities:

  • "What did I discuss about Customer X?"
  • "Show me entities I've queried recently"
  • "Which topics have I focused on?"

Privacy: Each user's context references are isolated—your links to entities are private to you.
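
Under the hood, the Context Graph is a set of per-user reference edges into the shared knowledge graph. A sketch of one edge using the four reference types from the table above (field and function names are my assumptions, not the actual schema):

```python
from dataclasses import dataclass
from typing import Literal

RefType = Literal["discussed", "queried", "mentioned", "interested_in"]

@dataclass
class ContextReference:
    user_id: str      # references are private to this user
    entity_id: str    # node in the shared knowledge graph
    ref_type: RefType
    snippet: str      # reference context where the entity appeared

def discussed_about(refs: list[ContextReference],
                    user_id: str, entity_id: str) -> list[str]:
    """Answer 'What did I discuss about X?' from stored references."""
    return [r.snippet for r in refs
            if r.user_id == user_id
            and r.entity_id == entity_id
            and r.ref_type == "discussed"]
```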

Context Graph Bridge

Interactive diagram: your memory ↔ domain knowledge connections. Session facts ("Focus: Q4 analysis", "Region: Texas") and saved user memories ("Finance Manager role", "Prefers tables over charts") link to knowledge-graph entities (Acme Corp, customer; Harris County, region; Widget Pro, product; Sarah Chen, employee) through discussed, queried, mentioned, and interested_in references.

Key insight: The Context Graph links your conversation history to domain entities. Ask "What did I discuss about X?" and get answers grounded in your actual conversations—not just what's in the knowledge base.

The Memory Extractor

The magic happens in the memory extractor—a specialized model that identifies facts worth keeping.

How It Works

After each exchange:

  1. Fact Identification: Scan the conversation for extractable facts
  2. Confidence Scoring: Rate each fact's importance
  3. Tier Assignment: Determine which tier should store it
  4. Deduplication: Avoid storing redundant facts
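
In outline, the post-turn pass might look like the sketch below. The `extract_facts` and `contains_similar` calls are placeholders for the extractor model and a similarity check, not real APIs:

```python
def extract_memories(exchange, session_mem, longterm_mem, extractor):
    """Post-turn pass: identify, score, route, and dedupe facts."""
    # 1. Fact identification (placeholder call to the extractor model)
    for fact in extractor.extract_facts(exchange):
        # 2. Confidence scoring: drop low-value facts
        if fact.confidence < 0.5:          # threshold is illustrative
            continue
        # 3. Tier assignment: cross-session value goes to long-term
        target = longterm_mem if fact.cross_session else session_mem
        # 4. Deduplication: skip near-duplicates already stored
        if not target.contains_similar(fact.text):   # placeholder check
            target.add(fact.text)
```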

What Gets Extracted

The extractor looks for:

  • User preferences: "I prefer concise answers"
  • Domain context: "I work in the finance department"
  • Session-specific: "We're analyzing Q4 data"
  • Explicit saves: "Remember that the deadline is March 15"

Promotion Between Tiers

Facts can be promoted:

Working Context → Session Memory (within session)
Session Memory → Long-term Memory (cross-session value)

High-confidence facts that appear repeatedly get promoted automatically.
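
The promotion rule can be as simple as counting repeated high-confidence observations. A hedged sketch; the thresholds here are invented for illustration:

```python
from collections import Counter

seen = Counter()  # fact text -> times observed for this user

def maybe_promote(fact_text: str, confidence: float,
                  session_mem, longterm_mem,
                  min_confidence: float = 0.8,    # invented threshold
                  min_repeats: int = 3) -> None:  # invented threshold
    """Promote a session fact to long-term memory once it has been
    observed repeatedly with high confidence."""
    seen[fact_text] += 1
    if confidence >= min_confidence and seen[fact_text] >= min_repeats:
        longterm_mem.add(fact_text)      # now has cross-session value
        session_mem.remove(fact_text)    # avoid retrieving it twice
```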

Memory Configuration

Users control their memory through settings:

SemanticStudio settings showing memory preferences

Per-Tier Controls

  • Working Context: Always enabled (required for coherent conversation)
  • Session Memory: Toggle on/off
  • Long-term Memory: Toggle on/off
  • Saved Memories: View and delete specific memories

Per-Mode Memory

Each mode configures which tiers it uses:

| Mode | Memory Tiers | Context Graph |
|------|--------------|---------------|
| Quick | Tier 1 only | No |
| Think | Tiers 1–2 | Yes |
| Deep | All tiers | Yes |
| Research | All tiers | Yes |

Quick mode skips session, long-term, and graph retrieval for speed.

Rolling Window Compression

SemanticStudio uses progressive compression to handle unlimited conversation length. As conversations grow, older messages compress to save tokens while preserving meaning.

Interactive demo: watch messages compress as the conversation grows. Older turns shift from full (recent, kept verbatim) to compressed (~50% of original tokens) to archived (~10%), holding total usage under the mode's token budget.
How it works: When messages exceed the threshold (20), older turns are compressed in batches of 6. If compression doesn't save 50%+ tokens, content is archived into the session summary.

Compression States

Messages transition through three states:

| State | Content | Token Impact |
|-------|---------|--------------|
| FULL | Complete message verbatim | 100% |
| COMPRESSED | LLM-generated summary (100–200 words) | ~50% |
| ARCHIVED | Merged into session summary only | ~10% |

Compression Triggers

  1. Message count: When full messages exceed 20, compression triggers
  2. Token budget: When total tokens exceed mode allocation
  3. Batch size: 6 messages (3 turns) compressed together
  4. Quality gate: If compression doesn't save 50%+ tokens, content is archived
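
Putting the four triggers together, one compression pass might look like this sketch. The `summarize` and `count_tokens` callables are stand-ins, and the rolling window keeps already-processed messages at the front:

```python
FULL_THRESHOLD = 20   # trigger: full messages exceed this count
BATCH_SIZE = 6        # 6 messages = 3 turns compressed together
MIN_SAVINGS = 0.5     # quality gate: compression must save 50%+

def roll_window(messages, summarize, count_tokens, session_summary):
    """One compression pass over the oldest full messages."""
    head = [m for m in messages if m["state"] != "FULL"]  # already processed
    full = [m for m in messages if m["state"] == "FULL"]
    if len(full) <= FULL_THRESHOLD:
        return messages                       # under the threshold
    batch, rest = full[:BATCH_SIZE], full[BATCH_SIZE:]
    before = sum(count_tokens(m["text"]) for m in batch)
    summary = summarize(batch)                # 100-200 word LLM summary
    if count_tokens(summary) <= before * (1 - MIN_SAVINGS):
        return head + [{"state": "COMPRESSED", "text": summary}] + rest
    # Quality gate failed: archive into the session summary instead
    # (a real system would condense further, toward ~10% of the original)
    session_summary.append(summary)
    return head + rest
```

Calling this in a loop until the message list stops changing keeps the window under budget without ever hard-truncating history.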

Token Budget Management

The system manages token budgets automatically, with different allocations per mode:

Budget Allocation by Mode

| Component | Quick | Default | Deep |
|-----------|-------|---------|------|
| Total Budget | 4,000 | 12,000 | 24,000 |
| Full Messages | 2,500 | 6,000 | 10,000 |
| Compressed | 500 | 3,000 | 8,000 |
| Session Summary | 500 | 1,500 | 3,000 |
| Reserved Buffer | 500 | 1,500 | 3,000 |
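
These allocations are easy to express as per-mode configuration. A sketch with the numbers from the table above:

```python
TOKEN_BUDGETS = {
    "quick":   {"full": 2_500,  "compressed": 500,   "summary": 500,   "reserve": 500},
    "default": {"full": 6_000,  "compressed": 3_000, "summary": 1_500, "reserve": 1_500},
    "deep":    {"full": 10_000, "compressed": 8_000, "summary": 3_000, "reserve": 3_000},
}

def total_budget(mode: str) -> int:
    """Total per-mode allocation (4K / 12K / 24K from the table)."""
    return sum(TOKEN_BUDGETS[mode].values())

assert total_budget("quick") == 4_000 and total_budget("deep") == 24_000
```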

Dynamic Adjustment

When budgets are exceeded:

  1. Compression: Older messages get compressed in batches
  2. Archival: Compressed content merges into session summary
  3. Prioritization: Higher-relevance content kept

The system never fails due to context overflow—it compresses gracefully.

How It All Works Together

Every turn flows through all 4 memory tiers—loading context before the response, then extracting and linking after. Here's the complete lifecycle:

Turn Lifecycle Flow

Interactive demo: a sample query ("What's the churn risk for Texas customers?") moves through six steps across the Working, Session, Long-term, and Graph tiers. Step 1 loads working context: the last 3 conversation turns (6 messages), the current session summary, and active system instructions. In the demo, Turn 5 asked about Q4 revenue, Turn 6 discussed the Texas region, and Turn 7 is the current query.
Key insight: because context loads before the response and extraction runs after it, you never lose important context while keeping token usage efficient.
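
Tying it together, a single turn might be orchestrated like this. Everything hangs off a hypothetical `ctx` object and reuses the sketches above; none of this is SemanticStudio's actual code:

```python
def handle_turn(ctx, query: str) -> str:
    """One conversation turn through all four memory tiers."""
    # --- Before the response: load and retrieve ---
    messages = build_working_context(              # Tier 1: always included
        ctx.system_prompt, ctx.session_summary, ctx.history, query)
    facts = ctx.session_mem.retrieve(query)        # Tier 2: on demand
    facts += ctx.longterm_mem.retrieve(query)      # Tier 3: selective
    refs = ctx.graph.references_for(query)         # Tier 4 (hypothetical call)

    answer = ctx.llm.respond(messages, facts, refs)

    # --- After the response: extract, link, compress ---
    extract_memories((query, answer), ctx.session_mem,
                     ctx.longterm_mem, ctx.extractor)
    ctx.graph.link_entities(query, answer)         # record discussed/queried refs
    ctx.messages = roll_window(ctx.messages, ctx.summarize,
                               ctx.count_tokens, ctx.summary_parts)
    return answer
```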

Memory in Action

Let's trace how memory works in a real conversation:

Turn 1: User Query

User: "What was our Q4 revenue?"

Working Context:
- System prompt
- User query

Retrieval:
- Session memory: (empty - new session)
- Long-term memory: "User is Finance Manager"
- Documents: Q4 financial reports

Response: "Q4 revenue was $12.4M, up 15% from Q3..."

Turn 2: Follow-up

User: "How does that compare to our forecast?"

Working Context:
- System prompt
- Previous exchange (Q4 revenue)
- Current query

Memory Extraction:
- Fact: "Q4 revenue was $12.4M" → Session memory

Retrieval:
- Session memory: "Q4 revenue was $12.4M"
- Documents: Q4 forecast data

Response: "The $12.4M exceeded forecast by 8%..."

Turn 3: Related Question

User: "Remember, my bonus depends on hitting targets."

Working Context:
- System prompt
- Last 3 exchanges
- Current query

Memory Extraction:
- Fact: "User's bonus depends on hitting targets" → Long-term memory
  (explicitly requested save, cross-session value)

Response: "Understood. I'll keep that context in mind..."

Later Session

User: "How are we tracking against targets?"

Long-term Memory Retrieved:
- "User is Finance Manager"
- "User's bonus depends on hitting targets"

The system remembers context from previous sessions.

Privacy Considerations

Memory raises privacy questions. SemanticStudio addresses them:

User Control

  • Users can disable long-term memory entirely
  • Users can view and delete any saved memory
  • Users can clear session memory at any time

Data Isolation

  • Each user's memory is isolated
  • No cross-user data leakage
  • Admins cannot access user memories

Retention Policies

  • Session memory clears automatically
  • Long-term memory persists until deleted
  • No automatic expiration (user controls retention)

Best Practices

From building and running SemanticStudio:

When to Enable Long-term Memory

Enable for:

  • Users who want personalized experiences
  • Workflows with recurring context
  • Preference-sensitive interactions

Disable for:

  • Privacy-sensitive environments
  • One-time queries
  • Testing and development

Optimizing Memory Extraction

The extractor works best when:

  • Conversations are focused (one topic per session)
  • Users are explicit about preferences
  • Questions build on previous context

Token Budget Tuning

Adjust budgets based on:

  • Average query complexity
  • Document length in your corpus
  • Response length requirements

Connection to Context Engineering

This implementation reflects principles from Context Engineering & Attention:

  1. Position matters: Important content in Tier 1 stays visible
  2. Relevance filtering: Only relevant facts retrieved
  3. Budget discipline: Never exceed capacity
  4. Strategic compression: Summarize rather than truncate

Memory tiers are context engineering in production.

What's Next

Memory determines what context is available. But for relationship-based queries, vector similarity isn't enough—you need to understand how entities connect.

Next up: Part 6 — GraphRAG-lite, where we explore knowledge graphs and relationship-aware retrieval.