TL;DR
- The dominant AI memory benchmarks (LongMemEval, LoCoMo, RULER) only measure retrieval — the easiest part of the problem
- No existing benchmark tests governance, adversarial safety, provenance, or compilation efficiency
- The gap isn't a feature gap — it's a category gap. The industry evaluates retrieval systems when it needs context compilers
- We propose eight metrics and a theory of context compilation to fill the blind spots — with an open-source reference implementation
Your AI memory system just passed LongMemEval with flying colors. It retrieved the right answer from 500 curated questions across multiple sessions. The benchmark says everything is fine.
Meanwhile, it leaked confidential financial data to an external-facing agent. It accepted a memory poisoning attack that permanently biased future recommendations. And 40% of the tokens in its "context" were boilerplate filler that traced to nothing.
The benchmarks didn't catch any of this. They weren't designed to.
To be clear: these are well-designed benchmarks. They measure what they set out to measure — retrieval quality, conversational memory, context length robustness. The problem isn't that they're wrong. It's that they've become the de facto standard for evaluating a category that has outgrown them. When the industry uses retrieval benchmarks to assess context operating systems, the mismatch creates a false sense of coverage.
This series tells the story of how measuring that gap led to a theory — and an architecture — for what's missing. A formal academic treatment is available as a preprint; this blog covers the background, the challenge, and the journey.
The Convergence Trap
The AI memory and context engineering space has converged on a handful of benchmarks that everyone cites and nobody questions:
- LongMemEval tests five long-term memory abilities across 500 curated questions.
- LoCoMo evaluates multi-session conversational memory.
- RULER reveals how model performance degrades as context length increases.
These are valuable benchmarks. They should be run. We run them on MemoryOS.
But they share a blind spot so fundamental that it renders them insufficient for evaluating what enterprises actually need: they only measure retrieval and answering in isolation.
This is like grading a bank on how fast it processes deposits while ignoring whether anyone can walk out with someone else's money.
Let me be specific about what these benchmarks don't test:
- Can the system prove where its context came from? (Provenance)
- Does the system enforce access controls at retrieval time? (Governance)
- Can an attacker poison the system's persistent memory? (Adversarial safety)
- Is the compiled output actually grounded in evidence? (Compilation quality)
- Does compilation outperform just dumping everything into the window? (Economic efficiency)
Five questions. Zero coverage from the benchmarks everyone runs.
The Deeper Diagnosis: A Category Gap
The deeper problem is not a feature gap. It is a category gap.
The benchmarks were designed for retrieval systems — systems that find documents in response to queries. But the systems that enterprises need are not retrieval systems. They are context compilers.
A retrieval system finds documents. A context compiler assembles governed working sets.
Retrieve vs. Compile
Traditional RAG finds documents. A Context Compiler builds governed, budgeted context packs.
Traditional RAG: A retrieval pipeline takes a query, runs vector search, pulls top-K documents, and dumps them into a prompt. The output is a ranked list of chunks. The model answers with whatever was retrieved, with no budget management, no deduplication, no policy filtering, and no source verification.
Context Compiler: A context compiler does something fundamentally different. It takes a query with intent classification, runs multi-channel retrieval (semantic, lexical, graph, entity), allocates a token budget, deduplicates overlapping content, applies policy filters based on caller clearance and content sensitivity, and produces a compiled context pack: a budgeted, governed, traceable unit of context.
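To make the compile step concrete, here is a minimal sketch of the pipeline described above. Names like `compile_pack`, the clearance lattice, and the whitespace token count are illustrative assumptions, not the MemoryOS API.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source_id: str
    sensitivity: str  # e.g. "public", "internal", "restricted"
    score: float      # retrieval relevance

@dataclass
class ContextPack:
    chunks: list = field(default_factory=list)
    provenance: list = field(default_factory=list)

def compile_pack(candidates, caller_clearance, token_budget,
                 tokens=lambda c: len(c.text.split())):
    """Budget, dedupe, and policy-filter retrieved chunks into a pack."""
    allowed = {"public": {"public"},
               "internal": {"public", "internal"},
               "restricted": {"public", "internal", "restricted"}}
    pack, seen, used = ContextPack(), set(), 0
    for c in sorted(candidates, key=lambda c: c.score, reverse=True):
        if c.sensitivity not in allowed[caller_clearance]:
            continue                      # governance: enforce clearance
        if c.text in seen:
            continue                      # dedupe overlapping content
        if used + tokens(c) > token_budget:
            continue                      # respect the token budget
        pack.chunks.append(c)
        pack.provenance.append(c.source_id)  # keep the trace
        seen.add(c.text)
        used += tokens(c)
    return pack
```

The point of the sketch is the shape of the output: not a ranked list, but a budgeted, deduplicated, policy-filtered pack with a provenance trail attached.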
This distinction is the core thesis of our recent paper:
Memory determines what a system can know.
Retrieval identifies candidate evidence.
Context compilation determines what the system actually thinks with.
Recall@K answers: "Did we find the relevant documents?" But a system can score 100% Recall@K and still produce a context pack that is 40% padding, violates access policies, and contains poisoned entries. The benchmarks measure the retrieval pipeline. No benchmark measures the compilation pipeline.
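The divergence between the two measurements is easy to show with toy definitions (the exact metric formulations in the paper may differ; these are simplified for illustration):

```python
def recall_at_k(retrieved_ids, relevant_ids):
    """Fraction of the relevant documents present in the retrieved set."""
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

def evidence_density(pack_sentences, is_grounded):
    """Fraction of sentences in a compiled pack that trace to a source."""
    return sum(is_grounded(s) for s in pack_sentences) / len(pack_sentences)
```

A pack can score a perfect `recall_at_k` while 40% of its sentences fail `evidence_density`: the retrieval pipeline did its job, and the compilation pipeline still produced padding.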
The Four Blind Spots
Let me frame the gaps as questions that every enterprise deploying AI memory systems should be asking — and that no existing benchmark helps answer.
Blind Spot 1: Governance
The question: When I retrieve context, does the system enforce who can see what?
Enterprises have classification levels. Internal documents shouldn't appear in external-facing agent responses. Restricted financial data shouldn't surface for interns. Patient records shouldn't cross department boundaries.
No existing benchmark tests whether retrieval respects access controls. Not one. The assumption is that retrieval is a neutral, policy-free operation. For a research system, that's fine. For an enterprise system, it's a liability.
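One way such a test could look: probe the retrieval path with queries issued at different clearance levels and check that forbidden items never surface. This is a hedged sketch of the idea behind Scope Enforcement Accuracy, not the published test suite; the case format and `retrieve` signature are assumptions.

```python
def scope_enforcement_accuracy(probe_cases, retrieve):
    """Fraction of probe queries whose results contain no forbidden items.

    Each probe case is (caller_clearance, query, forbidden_ids);
    `retrieve(query, clearance)` is the system under test.
    """
    passed = sum(
        1 for clearance, query, forbidden in probe_cases
        if not set(retrieve(query, clearance)) & set(forbidden)
    )
    return passed / len(probe_cases)
```

A system that treats retrieval as policy-free scores 0% here the moment a restricted document matches a low-clearance query.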
Blind Spot 2: Provenance
The question: Can every claim in the output be traced to a real source?
"Hallucinated citations" are a known LLM failure mode — the model generates a plausible-looking reference that doesn't exist. In a context operating system, every provenance claim should be verifiable: this sentence came from this email, this meeting, this document.
No benchmark tests whether the system's "I got this from document X" claims are true.
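The check itself is mechanically simple, which makes its absence from existing benchmarks all the more striking. A minimal sketch (the claim and store representations are illustrative):

```python
def citation_resolution_rate(provenance_claims, source_store):
    """Fraction of the system's "this came from X" claims where X
    actually resolves to a stored source."""
    resolved = sum(1 for claim in provenance_claims if claim in source_store)
    return resolved / len(provenance_claims)
```

Every claim either resolves to a real stored artifact or it doesn't; a hallucinated citation is a claim whose source ID points at nothing.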
Blind Spot 3: Adversarial Safety
The question: Can an attacker permanently compromise the system by injecting malicious content into its memory?
Persistent memory introduces an attack surface that stateless RAG does not have. If your system has a "remember X" API path — or any LLM-mediated write pathway — an attacker can potentially embed instructions that bias all future responses.
Standard prompt injection benchmarks (OWASP LLM Top 10) focus on response manipulation. They don't test write-path safety — whether an attacker can make the system permanently store something it shouldn't.
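A write-path defense has to screen content before it is persisted, not after it is retrieved. The sketch below is deliberately naive — a pattern filter only, with patterns invented for illustration; a production defense would layer classifier-based screening and quarantine on top.

```python
import re

# Illustrative payload signatures; real attacks are far more varied.
INJECTION_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"always (recommend|prefer)",
    r"system prompt",
]

def screen_memory_write(text):
    """Reject memory writes that contain instruction-like payloads.

    Returns True if the write is allowed to persist, False if blocked.
    """
    lowered = text.lower()
    return not any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

The architectural point stands regardless of the filter's sophistication: a system with no screening on its write path treats "remember X" as trusted input, and that trust is exactly what a poisoning attack exploits.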
Blind Spot 4: Compilation Efficiency
The question: Does your context compiler actually outperform just using a bigger window?
"Why not just use a 1M token context window?" is the most common objection to context engineering. RULER shows performance degrades with length, but it doesn't quantify the economic argument for compilation. Does a compiled 5K-token pack achieve the same task success as dumping 50K tokens of raw context? If so, that's a 10x efficiency gain.
No benchmark answers this question.
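One plausible formulation of the efficiency comparison — success per token for the compiled pack versus the raw dump. The paper's exact Context Compiler Efficiency definition may differ; this is the shape of the argument:

```python
def compiler_efficiency(success_compiled, tokens_compiled,
                        success_raw, tokens_raw):
    """Ratio of task success per token: compiled pack vs. raw dump.

    A value above 1.0 means compilation delivers more task success
    per token spent than dumping raw context into the window.
    """
    return (success_compiled / tokens_compiled) / (success_raw / tokens_raw)
```

Equal task success at a tenth of the tokens yields a ratio of 10.0 — the "10x efficiency gain" from the example above.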
Measuring What's Missing
These blind spots aren't theoretical. They are measurable. We defined eight metrics that evaluate what standard benchmarks miss — and measured five of them on a live MemoryOS instance processing 72,290 documents, 428,957 vector chunks, 47,372 structured entities, and 192,314 knowledge graph relationships from 7 data sources.
| # | Metric | What it measures | Result | Target |
|---|---|---|---|---|
| 1 | Evidence Density | Grounding of compiled output | 100% | ≥ 85% |
| 2 | Pack Relevance Score | Packing precision under budget | Eval ready | ≥ 70% |
| 3 | Contradiction Detection F1 | Temporal consistency awareness | Eval ready | ≥ 0.70 |
| 4 | Scope Enforcement Accuracy | Policy correctness at retrieval | 100% | ≥ 99.9% |
| 5 | Permission Leakage Rate | Post-retrieval safety | 0% | < 0.1% |
| 6 | Poisoning Susceptibility Rate | Write-path attack resistance | 0% (49/49) | < 5% |
| 7 | Citation Resolution Rate | Provenance verifiability | 100% | ≥ 95% |
| 8 | Context Compiler Efficiency | Economic case for compilation | Eval ready | > 1.2x |
All five measured metrics meet or exceed their targets. But the results column doesn't tell the full story. Citation Resolution Rate started at 48.6% — a number we published openly. That transparency created a feedback loop: each version addressed a specific measurement finding, progressing from 48.6% (v1) to 91.4% (v2) to 100% (v3). The metrics didn't just measure the system — they improved it.
Disclosure: I designed these metrics and built the system under evaluation. To mitigate that bias, all metric definitions, test suites, and measurement code are published under Apache-2.0. I invite independent reproduction and welcome adversarial contributions to the safety test suite.
Benchmark Coverage: What Gets Measured?
Existing benchmarks cover retrieval. Context OS metrics cover everything else.
The radar tells the story. Existing benchmarks light up two or three axes — retrieval, memory recall, context length stress. The Context OS metrics fill the rest: grounding, governance, safety, provenance, efficiency.
The five axes that no existing benchmark covers are not edge cases. They are the five things that determine whether an AI memory system is ready for production enterprise deployment or only impressive in demos.
Coverage Matrix: Benchmarks vs. Enterprise Requirements
Existing benchmarks cluster in one column.
| Metric / Benchmark | Retrieval | Grounding | Governance | Safety | Provenance | Efficiency |
|---|---|---|---|---|---|---|
| LongMemEval | ✓ | | | | | |
| LoCoMo | ✓ | | | | | |
| RULER | ✓ | | | | | |
| Evidence Density | | ✓ | | | | |
| Scope Enforcement | | | ✓ | | | |
| Permission Leakage | | | | ✓ | | |
| Poisoning Rate | | | | ✓ | | |
| Citation Resolution | | | | | ✓ | |
| Compiler Efficiency | | | | | | ✓ |
The coverage matrix makes the gap impossible to miss. Existing benchmarks cluster in the "Retrieval" column. The Context OS metrics spread across the remaining five requirement categories.
What Comes Next
But measuring the gaps told us something bigger than any individual metric result. It told us there is a missing systems layer between retrieval and reasoning — a layer that existing architectures don't make explicit.
This is a three-part series:
Part 2: The Missing Layer — How measuring gaps revealed a missing systems layer. Context Compilation Theory, Context IR, the six-layer architecture, and the optimization formulation. The architecture between access and reasoning.
Part 3: The Evidence — The measured results that validate the architecture. The CRR journey from 48.6% to 100%. CompileBench: the benchmark that doesn't exist yet. And the open standard proposal.
The metrics are Apache-2.0 licensed. The theory is formalized in a preprint. The test suite is published. If your AI memory vendor claims "enterprise-ready" — ask them which of these eight metrics they can run.
The reference implementation is MemoryOS. The eight metrics are defined in evaluation/tools/novel_metrics.py. The context compilation theory is formalized in Toward a Theory of Context Compilation for Human-AI Systems.