brianletort.ai
Context Engineering · Context Compilation · RAG · MCP · Architecture · Enterprise AI

What Context Engineering Actually Means

RAG, MCP, memory systems, fine-tuning, prompt caching, AGENTS.md, knowledge graphs — everyone has a piece of the context puzzle. Nobody has the whole picture. Here's what's missing and why it matters.

April 13, 2026 · 12 min read

TL;DR

  • Eight techniques — prompt engineering, RAG, memory, MCP, AGENTS.md, knowledge graphs, fine-tuning, prompt caching — each solve part of the context problem. None solve the whole thing.
  • Each technique assumes the others are handled. RAG assumes the data is there. MCP assumes context is assembled. Memory assumes retrieval works. Nobody owns end-to-end compilation.
  • What's missing: a standard for compiled context packs, a portable intermediate representation, governance across techniques, and metrics for compilation quality.
  • Context engineering is the practitioner's discipline. Context compilation is the systems layer that makes it rigorous at scale.

"Context engineering" has become the term everyone reaches for when they mean "getting the right information to the model." But ask five people what it means and you'll get six answers.

For prompt engineers, it means structuring inputs. For RAG practitioners, it means retrieval pipelines. For tool builders, it means MCP. For coding agents, it means AGENTS.md and cursor rules. For knowledge graph teams, it means entity-aware retrieval.

None of these are wrong. But none of them are complete. Each solves one piece of the puzzle and assumes someone else handled the rest. That gap — the space between "we have all the techniques" and "they work together end-to-end" — is where enterprise AI systems break in production.

This post maps the landscape, identifies what falls between the cracks, and explains why the answer isn't another technique — it's a systems layer.

The Landscape: Eight Approaches, Eight Partial Solutions

Let me walk through the major approaches to getting context into AI systems. For each one: what it genuinely solves, and where it stops.

The Context Engineering Landscape

Eight techniques, eight partial solutions: each covers one layer of the problem and misses the rest.

Prompt Engineering

The original context technique. You structure the model's input: system prompts, few-shot examples, instructions, constraints. It works — for simple tasks with stable data.

It breaks when the data changes (you're hardcoding context), when you need multiple sources (manual assembly doesn't scale), and when different users or runtimes need different views of the same information. Prompt engineering is a runtime concern masquerading as a context solution.

RAG

Retrieval-Augmented Generation correctly identifies that models need external evidence to answer knowledge-intensive questions. You chunk documents, embed them, store them in a vector database, and retrieve the top-K chunks at query time.

RAG is necessary. But retrieval is not compilation. Finding relevant chunks and dumping them into a prompt leaves out deduplication, budget management, policy filtering, source verification, and provenance tracking. Research suggests that 80% of RAG failures originate in the ingestion and chunking layer, not the LLM itself. And RAG has no concept of governance — no access controls, no sensitivity levels, no redaction at retrieval time.
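To make the contrast concrete, here is a minimal sketch of the classic assembly step (the `Chunk` type and the sample data are hypothetical): retrieval finds relevant text, and every safeguard the paragraph above lists simply never runs.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str   # provenance exists at ingest time...
    score: float  # ...but only the text survives assembly

def naive_rag_context(hits: list[Chunk]) -> str:
    """Classic RAG assembly: rank hits by similarity and dump them in.

    Note what never happens here: no deduplication, no token budget,
    no policy or sensitivity check, no provenance in the output.
    """
    ranked = sorted(hits, key=lambda c: -c.score)
    return "\n\n".join(c.text for c in ranked)

hits = [
    Chunk("Q3 revenue was $4.2M.", "finance/q3-report.pdf", 0.91),
    Chunk("Q3 revenue was $4.2M.", "email/cfo-update.eml", 0.89),      # duplicate rides along
    Chunk("Reorg plan draft (restricted).", "hr/reorg-draft.docx", 0.85),  # no ACL applied
]
print(naive_rag_context(hits))
```

The duplicate fact is sent twice and the restricted draft goes straight into the prompt; nothing in the pipeline is positioned to object.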

Memory Systems

Systems like Zep, Mem0, and Letta improve what the model can access across sessions. They persist facts, preferences, conversation history, and entity relationships. This is valuable — without memory, every interaction starts from zero.

But memory systems are storage and recall layers. They determine what can be accessed, not what should be included right now. They don't provide compile-time control: no budget allocation, no policy filtering, no decision about which version of a fact is current. And persistent memory without governance creates new attack surfaces — if anyone can write to memory via an LLM-mediated path, memory poisoning becomes a real vulnerability.

MCP (Model Context Protocol)

MCP standardizes how models interact with tools — a significant infrastructure contribution. It enables composable agent-tool ecosystems and replaces ad-hoc API wrappers with a standard protocol.

But MCP doesn't address context compilation. It assumes the model already has the right context and just needs access to tools. In practice, MCP's JSON-RPC tool descriptions consume 550–1,400 tokens each — one deployment reported 72% of its 200K context window consumed by tool descriptions alone. MCP has no built-in authentication, no tenant isolation, and no trust infrastructure. It standardizes interaction surfaces, not governed working sets.
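The arithmetic behind that deployment report is worth multiplying out; this sketch only restates the figures quoted above (illustrative, not a benchmark).

```python
# Back-of-envelope check on the numbers above.
context_window = 200_000                       # tokens
desc_cost_low, desc_cost_high = 550, 1_400     # tokens per tool description, reported range

tokens_for_tools = int(context_window * 0.72)  # 72% of the window: 144,000 tokens
min_tools = tokens_for_tools // desc_cost_high # if every description is heavy
max_tools = tokens_for_tools // desc_cost_low  # if every description is light
print(tokens_for_tools, min_tools, max_tools)  # 144000 102 261
```

In other words, somewhere between roughly 100 and 260 registered tools is enough to consume nearly three quarters of a 200K window before the model has seen a single piece of evidence.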

A separate pattern worth noting: agents that construct context through iterative tool use — ReAct, function calling, chain-of-thought with tools. This is fundamentally different from pre-compiled packs; the agent decides what to gather as it reasons. Context compilation relates to this pattern by supplying the agent's starting context: the governed working set it begins with before using tools to gather additional information. Compilation gives the agent a head start; tool use extends it.
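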

AGENTS.md / Rules / Skills

For AI coding agents, persistent instruction files (AGENTS.md, .cursor/rules, Skills) give agents project-specific conventions, specialized workflows, and institutional knowledge. Well-structured rules can dramatically improve agent behavior.

But these are static instruction files, not compiled context. They tell the agent how to behave — not what evidence to reason with. They don't handle multi-source data assembly, governance, dynamic budget management, or cross-agent context sharing. They operate at the runtime surface, not the compilation layer.

Knowledge Graphs

Knowledge graphs add relational structure that vector search alone can't capture. They enable queries like "all open action items for people I met with last week" — traversals that semantic similarity search can't answer.

But graphs are retrieval substrates, not compilation layers. Finding that "Alice manages Bob who committed to deliver X by Friday" is retrieval. Deciding that this fact should be in the context pack — with provenance, under budget, filtered by the caller's clearance level — is compilation. Graphs make retrieval richer. They don't make it governed.
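The distinction is easy to see in code. The traversal below runs over a toy adjacency structure (all data hypothetical) and answers the "Alice manages Bob" example from above; note that nothing in the traversal itself decides whether its results belong in a context pack.

```python
# A toy knowledge graph: entities -> relation -> targets.
graph = {
    "Alice": {"manages": ["Bob"]},
    "Bob":   {"committed": ["deliver X by Friday"]},
}

def commitments_of_reports(manager: str) -> list[str]:
    """Traverse manages -> committed: a multi-hop query that pure
    vector similarity can't express. This is retrieval, not compilation:
    no provenance, no budget, no clearance check on the results."""
    facts = []
    for report in graph.get(manager, {}).get("manages", []):
        for item in graph.get(report, {}).get("committed", []):
            facts.append(f"{report}: {item}")
    return facts

print(commitments_of_reports("Alice"))  # ['Bob: deliver X by Friday']
```

Whether that fact enters the working set, under what budget, and for which caller is a separate, downstream decision.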

Fine-Tuning and Distillation

Fine-tuning is the compile-time alternative to runtime context. Instead of retrieving knowledge at inference time, you bake it into the model's weights through continued pre-training, LoRA adapters, or distillation. For stable, high-frequency knowledge — domain terminology, regulatory frameworks, organizational structure — this eliminates retrieval latency entirely.

But fine-tuning can't handle dynamic data. Retraining is slow, expensive, and can't keep up with daily changes in email, meetings, and project status. More fundamentally, fine-tuned knowledge has no governance layer: you can't serve different users with different permission levels from the same fine-tuned model, you can't trace an answer's provenance to a source document, and you can't redact specific content without retraining. For anything that changes — and in enterprise environments, most things change — runtime context compilation is necessary.

Prompt Caching

Prompt caching — Anthropic's cached completions, OpenAI's KV-cache reuse — can reduce costs by up to 90% for repeated context and significantly improve time-to-first-token. It's a meaningful infrastructure optimization.

But caching reduces cost, not complexity. The model still attends to the full cached context — attention dilution remains. Cached context has no governance layer, no provenance tracking, and can go stale silently. And caching doesn't solve the compilation problem: it makes whatever you cached cheaper to use, but it doesn't make the cached content better-selected, better-governed, or better-suited to the task. Caching makes bad context cheaper. Compilation makes context good.

The Pattern: Nobody Owns the End-to-End

Every technique in the landscape solves one layer and assumes the others are handled:

  • RAG assumes the data is there and well-chunked
  • MCP assumes the model already has the right context
  • Memory systems assume retrieval will find the right things at the right time
  • Knowledge graphs assume someone will compile their traversal results into a governed working set
  • AGENTS.md assumes the evidence the agent needs is already available
  • Fine-tuning assumes the knowledge is stable enough to bake into weights
  • Prompt caching assumes the cached context is still correct and well-composed
  • Prompt engineering assumes everything upstream worked

Nobody owns the end-to-end compilation.

What's Actually Missing

The gap isn't another technique. It's a systems layer. Specifically:

  1. No standard for what a "compiled context pack" should contain. Every system assembles context differently. There's no shared vocabulary for what a governed working set looks like.

  2. No portable intermediate representation between retrieval and runtime. Retrieval produces candidates. Runtimes consume formatted prompts. The object in between — the governed, structured, provenance-preserving working set — doesn't have a standard representation.

  3. No governance layer that works across techniques. Policy enforcement, sensitivity levels, domain ACLs, and redaction should be applied consistently regardless of whether the context came from RAG, a knowledge graph, a memory system, or an MCP tool.

  4. No metrics for compilation quality beyond answer accuracy. Standard benchmarks ask "was the answer correct?" Nobody asks "was the context pack well-compiled?" — grounded, governed, provenance-preserving, budget-efficient.

  5. No separation between "what to include" and "how to render it." Selection and governance decisions should happen once. Formatting for chat vs. agent vs. voice vs. API should happen separately. Today these are entangled.
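Because no standard exists for this object today, the following is only an illustrative shape (every field name is an assumption, not a spec) for a pack that settles selection and governance once, before any rendering happens.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    text: str
    source_uri: str    # provenance: where the fact came from
    as_of: str         # currency: when it was last known true
    sensitivity: str   # governance: e.g. "public", "internal", "restricted"
    tokens: int        # budget accounting

@dataclass
class ContextPack:
    """Illustrative intermediate representation: selection and governance
    are settled here; rendering for chat, agent, voice, or API happens
    in a later lowering step and never revisits these decisions."""
    task: str
    budget_tokens: int
    evidence: list = field(default_factory=list)
    redactions: list = field(default_factory=list)  # audit trail of exclusions

    def tokens_spent(self) -> int:
        return sum(e.tokens for e in self.evidence)

pack = ContextPack(task="meeting prep", budget_tokens=5_000)
pack.evidence.append(Evidence("Renewal due March 3.", "crm://acme/renewal",
                              "2026-04-10", "internal", 12))
print(pack.tokens_spent())  # 12
```

The point is not these particular fields but that budget, provenance, sensitivity, and an exclusion record travel with the pack as first-class data rather than living implicitly in some pipeline's code.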

The Composable Experience Problem

Here's what makes these gaps urgent: the number of experiences consuming AI context is growing faster than the number of teams that can build bespoke retrieval chains for each one.

Today, a meeting prep feature has its own RAG pipeline. The chat interface has a different one. The executive dashboard queries a third. The agent loop retrieves context its own way. Each pipeline makes its own retrieval decisions, applies (or doesn't apply) its own policy filters, manages (or doesn't manage) its own token budget. Add a new data source — say, Jira or Salesforce — and you have to wire it into every pipeline separately.

This is the composable experience problem: every new experience multiplies the retrieval and governance work, because there's no shared compilation layer.

A context pack service changes this equation:

  • One compilation, many experiences. A meeting prep context pack compiled once can be lowered into a chat message, a pre-meeting brief email, a voice assistant summary, a dashboard widget, and an agent's working memory — without re-running retrieval or policy checks.
  • Swap models without re-engineering context. The compiled IR is model-agnostic. Move from GPT-4o to Claude to Gemini to a local model — the context pack doesn't change. Only the lowering step adapts.
  • Scale data sources once. Adding a new data connector (CRM, ticketing system, Slack) enriches every experience automatically, because all experiences consume compiled packs from the same service.
  • Consistent governance everywhere. Policy decisions made in IR space — who can see what, what sensitivity level applies, what gets redacted — apply to every downstream experience. No accidental scope leakage because a different UI path had different filtering logic.
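A minimal sketch of that fan-out, with the pack reduced to a plain dict of already-governed facts (all names and data hypothetical): each surface gets its own rendering, but retrieval and policy checks are not re-run per surface.

```python
# One compiled pack, three lowerings.
pack = {
    "task": "meeting prep: ACME renewal",
    "facts": ["Renewal is due March 3.", "Pricing approval is still pending."],
}

def lower_to_chat(pack: dict) -> str:
    return "Context:\n" + "\n".join(f"- {fact}" for fact in pack["facts"])

def lower_to_voice(pack: dict) -> str:
    return "Quick brief: " + " ".join(pack["facts"])

def lower_to_agent(pack: dict) -> dict:
    return {"goal": pack["task"], "working_memory": list(pack["facts"])}

print(lower_to_chat(pack))
```

Adding a fourth surface means adding a fourth `lower_to_*` function, not a fourth retrieval-and-governance pipeline.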

This isn't a theoretical benefit. It's an architectural choice that determines whether your AI platform scales linearly with experiences or collapses under the weight of N pipelines for N use cases.

The Missing Layer: Context Compilation

This is the argument behind Context Compilation Theory: the landscape doesn't need another technique. It needs a systems layer between retrieval and reasoning.

The paper reduces the argument to three lines:

Memory determines what a system can know.

Retrieval identifies candidate evidence.

Context compilation determines what the system actually thinks with.

Retrieve vs. Compile

Traditional RAG finds documents. A Context Compiler builds governed, budgeted context packs.

Traditional RAG: Query → Vector Search → Top-K Docs → Dump into Prompt (~50K tokens, unfiltered)

Context Compiler: Query + Intent → Multi-Channel Retrieval → Budget Allocation → Dedup + Policy Filter → Scope Enforcement → Compiled Pack (~5K tokens, governed)

A context compiler doesn't replace RAG, memory, or MCP. It sits above them. It takes candidates from any retrieval channel — semantic search, lexical search, graph traversal, entity lookup, memory recall — and compiles them into a governed working set under budget, latency, and policy constraints.
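A compiler pass of this kind can be sketched in a few lines. This is an illustrative toy, not MemoryOS's implementation: candidates from any channel pass through dedup, a clearance check, and a hard token budget, with provenance traveling alongside the evidence.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    source: str        # provenance, kept through compilation
    channel: str       # "semantic" | "lexical" | "graph" | "memory"
    score: float
    sensitivity: str   # "public" | "internal" | "restricted"
    tokens: int

def compile_pack(candidates: list[Candidate],
                 clearance: str,
                 budget_tokens: int) -> list[Candidate]:
    """Toy compile pass: dedup, policy-filter, then pack under budget."""
    allowed = {"public":     {"public"},
               "internal":   {"public", "internal"},
               "restricted": {"public", "internal", "restricted"}}[clearance]
    seen, pack, spent = set(), [], 0
    for c in sorted(candidates, key=lambda c: -c.score):
        if c.text in seen:                    # dedup across channels
            continue
        if c.sensitivity not in allowed:      # policy filter at compile time
            continue
        if spent + c.tokens > budget_tokens:  # hard token budget
            continue
        seen.add(c.text)
        pack.append(c)
        spent += c.tokens
    return pack
```

Swap in real channels, a real policy engine, and a smarter packing strategy and the shape stays the same: the compiler is the one place where inclusion decisions are made, regardless of where the candidates came from.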

Where Each Technique Fits

The six-layer architecture makes the relationship explicit:

The Six-Layer Context Architecture

The stable asset is not a prompt or a UI — it is the governed context layer (3-4) that survives source, model, and interface changes.

  • L1: Source Context
  • L2: Memory & Knowledge Substrate
  • L3: Context Compiler (stable core)
  • L4: Context IR (stable core)
  • L5: Lowering
  • L6: Experience Runtimes

The architectural insight: Layers 3-4 (Compiler + IR) form the governed context layer that decouples what the system thinks with from how sources are stored (below) and how outputs are rendered (above). Change the model, the UI, or the data source — the compiled working set stays stable.

Each technique in the landscape maps to a specific layer:

Technique | Layer | Role
Fine-tuning, distillation | Pre-runtime (model weights) | Static knowledge embedding
Knowledge graphs, memory systems | L2: Substrate | Storage and indexing
RAG, retrieval pipelines | L2-L3: Substrate + Compiler | Candidate identification
Context compilation | L3-L4: Compiler + IR | Governed working set assembly
MCP, AGENTS.md, prompt engineering | L5-L6: Lowering + Runtime | Tool access and rendering
Prompt caching | Infrastructure (below L6) | Cost optimization for repeated context

The gap in the current landscape is layers 3-4: the compiler and the intermediate representation. That's the layer that turns retrieval candidates into a governed, portable, provenance-preserving working set — before lowering into any specific runtime.

Measuring Compilation Quality

If you compile context at enterprise scale, you need metrics for compilation quality — not just answer accuracy. Standard benchmarks ask "was the answer correct?" They don't ask "was the context pack well-compiled?" We propose eight metrics covering evidence density, governance enforcement, provenance integrity, adversarial safety, and compilation efficiency. Five are measured on a live system. All five meet or exceed their targets.

Compilation as Optimization

Context compilation isn't just "better retrieval." It's a constrained optimization problem where relevance is only one of several competing objectives:

Context Compilation as Optimization

Compilation quality is never only about relevance — cost, latency, policy risk, and provenance loss all matter.

C* = argmax_C [ U(C, T) − λ₁·Cost(C) − λ₂·Latency(C) − λ₃·PolicyRisk(C) − λ₄·ProvenanceLoss(C) ]

where:

  • U(C, T): downstream task utility
  • Cost(C): token + compute cost
  • Latency(C): end-to-end response time
  • PolicyRisk(C): scope and access violations
  • ProvenanceLoss(C): source lineage lost in transformation
A good context pack must balance downstream utility against token cost, latency, policy risk, and provenance loss. The two-stage formulation separates what to include (IR-space optimization) from how to render it (runtime lowering). This separation means governance decisions are made once and the same working set can be lowered into chat, agent, voice, or API without re-running policy checks.
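With toy weights and stand-in component measures (all of the constants and the latency model below are assumptions, not measurements), the objective can be scored directly. The point of the sketch is only that relevance alone does not decide which pack wins.

```python
# Scoring the compilation objective for a concrete pack.
def pack_objective(pack: list[dict], lam=(0.001, 0.01, 5.0, 1.0)) -> float:
    utility = sum(e["relevance"] for e in pack)                        # U(C, T)
    cost = sum(e["tokens"] for e in pack)                              # Cost(C)
    latency = 0.002 * cost                                             # Latency(C), toy model
    policy_risk = sum(e["sensitivity"] == "restricted" for e in pack)  # PolicyRisk(C)
    provenance_loss = sum(not e.get("source") for e in pack)           # ProvenanceLoss(C)
    l1, l2, l3, l4 = lam
    return utility - l1 * cost - l2 * latency - l3 * policy_risk - l4 * provenance_loss

lean = [{"relevance": 0.9, "tokens": 500, "sensitivity": "internal", "source": "crm://acme"}]
bloated = lean + [{"relevance": 0.1, "tokens": 5000, "sensitivity": "internal", "source": "wiki"}]
print(pack_objective(lean) > pack_objective(bloated))  # True: extra tokens outweigh extra utility
```

The bloated pack has strictly more relevant evidence, yet scores lower: under these weights, the marginal utility of the second chunk never pays for its token cost and latency.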

The Practitioner's Takeaway

Context engineering is the discipline — the practice of deliberately structuring what goes into a context window and how it's organized. It's real, it's important, and every technique in the landscape contributes to it.

Context compilation is the systems layer that makes context engineering rigorous at scale. You can practice context engineering without it — and for simple use cases, you should. But past a certain scale — multiple data sources, policy constraints, multiple output runtimes, changing models — ad hoc context assembly breaks down.

Even with 10M-token context windows, you still need to decide what goes in, who's allowed to see it, where it came from, and whether it's current. Those are compilation decisions, not window size decisions. Bigger windows raise the ceiling. Compilation determines the floor.

The data will grow substantially. Models will change rapidly. We'll have chat interfaces, voice assistants, agent loops, API consumers, and experiences we haven't built yet — all needing governed context from the same evidence base. The question isn't whether you need a compilation layer. It's how long you can get by without one.


The full theory is formalized in Toward a Theory of Context Compilation for Human-AI Systems, available as a preprint. The measurement framework is in the Context Compilation blog series. The reference implementation is MemoryOS.