brianletort.ai
All Posts
Context EngineeringMemoryOSOpen SourceCompileBenchContext Compilation

The Evidence

Eight metrics measured on a live system. The CRR journey from 48.6% to 100%. CompileBench: the benchmark that evaluates compilation decisions. And the open standard proposal.

April 11, 20268 min read

TL;DR

  • All five measured metrics meet or exceed targets — ED 100%, SEA 100%, PLR 0%, PSR 0% (49/49), CRR 100%
  • Citation Resolution Rate went from 48.6% to 100% across three versions — the metrics created the feedback loop that drove the improvement
  • CompileBench proposes how to evaluate compilation decisions directly: what was selected, filtered, compressed, or excluded by policy
  • Two papers form a coherent research stack: metrics (what to measure) + theory (how to architect) + CompileBench (how to evaluate)

In Part 1, I argued that existing benchmarks only measure retrieval — the easiest part. In Part 2, I introduced Context Compilation Theory: the missing systems layer between access and reasoning.

This final part brings the evidence. Measured metrics that validate the architecture. The CRR journey that demonstrates metric-guided engineering. CompileBench as the evaluation agenda. And a proposal for open standards.

Measured on a live MemoryOS instance

0

Documents

0

Vector Chunks

0

Structured Entities

0

Data Sources

The Metrics That Validate the Theory

The eight metrics defined in the Context OS Metrics specification measure what standard benchmarks miss. Five are measured on a live system. Three require labeled evaluation datasets and are ready to run.

#MetricWhat it measuresResultTarget
1Evidence DensityGrounding of compiled output100%≥ 85%
2Pack Relevance ScorePacking precision under budgetEval ready≥ 70%
3Contradiction Detection F1Temporal consistency awarenessEval ready≥ 0.70
4Scope Enforcement AccuracyPolicy correctness at retrieval100%≥ 99.9%
5Permission Leakage RatePost-retrieval safety0%< 0.1%
6Poisoning Susceptibility RateWrite-path attack resistance0% (49/49)< 5%
7Citation Resolution RateProvenance verifiability100%≥ 95%
8Context Compiler EfficiencyEconomic case for compilationEval ready> 1.2x

Evidence Density: 100%

Every token in every context pack traces to a source document via the source_refs provenance chain. This includes entity items — action items, decisions, commitments — which carry their source document path through the MemoryObject base class.

Evidence Density by Intent

Every pack token traces to a source document via lineage —100%

25%50%75%100%Target ≥85%Executive Update100%Follow-Up100%Weekly Review100%General Query100%Meeting Prep100%Project Status100%
Source-traced tokens
Derived context (entity summaries, graph output)

An important distinction: ED measures structural traceability — whether a source path exists and resolves — not whether the linked source is semantically the best match. ED is necessary but not sufficient; CRR complements it by verifying that citations resolve to real artifacts.

Scope Enforcement and Permission Leakage

100% SEA. Tested across 32 individual scope decisions — 4 principal types evaluating 8 crafted candidates at 4 sensitivity levels across 2 domain types. The policy engine is deterministic: the sensitivity matrix is fixed, and every decision is logged.

Sensitivity Matrix: OutputTransformPolicy

Maps (caller clearance × content sensitivity) → output transform. Click any cell to see what happens.

Sensitivity ↓ / Principal →OwnerInternalExternal
Public
Internal
Confidential
Restricted
exact
abstracted
masked
denied

0% PLR. When the OutputTransformPolicy returns DENIED, the restricted content produces an empty string. It never enters the pack, never reaches the model, and can't leak because it was never there. We believe pre-retrieval filtering is the strongest approach for this threat model. Alternative approaches — post-generation filtering, differential privacy — address the same concern differently and may be appropriate in other contexts.

Poisoning Resistance: 49/49 Blocked

Adversarial Attack Resistance

49 test cases across 5 attack families — structured ingestion pipelines block all injection attempts

0/49

Blocked

Write Injection

15 test cases

Scope Escalation

12 test cases

Persistence Abuse

8 test cases

Provenance Forgery

8 test cases

Cross-Tenant Leakage

6 test cases

Why 0% susceptibility: MemoryOS does not have an open "remember X" API path. Memory writes go through structured ingestion pipelines with schema validation, not through LLM-mediated commands.

The architectural reason is straightforward: MemoryOS does not have an open "remember X" API path. Memory writes go through structured ingestion pipelines with schema validation. This is not a guardrail — it's an architecture. A guardrail can be bypassed with a clever enough prompt. An architecture that doesn't have the vulnerable pathway can't be bypassed because the pathway doesn't exist.

A note on these results: perfect scores should invite scrutiny of the test suite, not just confidence in the system. Our test coverage — 32 scope decisions, 49 adversarial cases, 185 citation checks — is meaningful but not exhaustive. We expect these numbers to face downward pressure as test suites grow. Publishing the test suite openly is how we invite that pressure.

The CRR Journey: 48.6% to 100%

Citation Resolution Rate is the metric that best demonstrates why publishing honest measurements matters.

Citation Resolution Rate: Metric-Guided Engineering

Publishing 48.6% created the feedback loop that drove it to 100%

100%

CRR (current)

0%25%50%75%100%Target48.6%v1Channel-Level91.4%v2Item-Level100%v3Full Provenance
48.6%

Measured at the retrieval channel level

91.4%

Shifted to per-item citation verification

100%

Entity items inherit source_refs from MemoryObject

The feedback loop: Publishing the honest 48.6% identified the structural gap. Each version addressed a specific measurement finding. This is exactly the engineering feedback loop these metrics are designed to create.

v1 (48.6%): CRR was measured at the retrieval channel level. Channels returning 0 candidates counted as unresolvable — valid provenance, but the metric exposed that our measurement approach was too coarse.

v2 (91.4%): CRR shifted to per-item verification. Most items resolved, but entity items lacked document_path. The metric identified exactly which items had the gap.

v3 (100%): Entity items now inherit source_refs[0].path from their parent MemoryObject. Every item resolves to a real artifact. Verified across 185 items from 6 context pack assemblies.

Publishing the honest 48.6% created the feedback loop. Each measurement identified a specific structural gap. Each version fixed it. This is exactly the engineering cycle these metrics are designed to enable.

What We Haven't Measured Yet

Five of eight metrics are measured. Three are not, and intellectual honesty requires saying so clearly:

  • Pack Relevance Score (PRS) and Contradiction Detection F1 (CDF1) require labeled evaluation datasets that don't yet exist for personal work data at this scale.
  • Context Compiler Efficiency (CCE) requires paired evaluation runs comparing compiled packs against full-context baselines.

Context Compiler Efficiency

Illustrative example — not measured data. Shows the theoretical compiler argument.

Token VolumeTask Success50K tokensFull Context5K tokensCompiled10xfewer tokens72%74%same successCCE = (74/72) / (5K/50K)= 10.3x efficiency

The chart above is an illustrative example, not measured data. CCE requires paired evaluation runs. The harness is built; the evaluation is in progress.

All measured results come from a single MemoryOS instance processing one user's enterprise work data. We don't claim these specific numbers generalize across deployments. What generalizes is the metric framework and measurement methodology, which any system can apply to its own data.

Scalability

Scalability: 1K to 100M Documents

Stays flat as data volume grows — the compiler's evidence ratio is architecturally stable

92%94%96%98%100%1K10KCurrent72K100K1M10M100MDocument count

The 72K data point is measured. Projections to larger scales are based on algorithmic complexity analysis — O(log N) for ANN search, O(1) for policy lookup — and assume no architectural bottlenecks at higher volumes. These are architectural properties, not yet validated at 100M-document scale.

CompileBench: The Benchmark That Doesn't Exist Yet

Standard benchmarks ask if the answer was correct. CompileBench asks what happened during compilation. What was selected? What was omitted? What was summarized? What remained verbatim? Was provenance preserved? Was policy respected? Did the pack stay inside budget and freshness constraints?

CompileBench: Evaluating Compilation, Not Just Answers

Standard benchmarks ask if the answer was correct. CompileBench asks what happened during compilation.

Four Task Families

Long-Horizon Conversational

Multi-session conversations where earlier context must survive compilation across sessions

Project Cross-Session

Project-oriented tasks requiring compilation from multiple sources, people, and time ranges

Memory-and-Action Coupled

Tasks where the compiled context directly drives agent actions, not just answers

Governance-Sensitive Enterprise

Tasks where compilation must respect sensitivity levels, domain boundaries, and redaction policies

Compilation-Oriented Metrics

Task successContext relevanceFaithfulnessProvenance retentionPolicy complianceToken footprintLatencyStalenessPersonalization

The evaluative shift: Instead of asking only whether the final answer was correct, CompileBench exposes the compilation decisions themselves — what was selected, filtered, compressed, preserved verbatim, or excluded by policy. This is a benchmark specification, not a claim that the category already has a finished universal benchmark.

CompileBench is an evaluation agenda and benchmark specification, not a claim that the category already has a finished universal benchmark. The value is that it makes the evaluation target explicit enough for systems to expose and compare compilation behavior.

CompileBench is defined as code in the MemoryOS repository: evaluation/compilebench/ includes task families, baselines, and metrics.

The Research Stack

This series covers two published papers that form a coherent program:

The Research Stack

Three layers that form a coherent program: what to measure, how to architect, and how to evaluate.

CompileBench

Benchmark specification for context compilation

How should we evaluate compilation decisions across tasks and runtimes?

Context Compilation Theory
Read the paper

Architecture, Context IR, and the optimization formulation

What is the missing systems layer between retrieval and reasoning?

Context OS Metrics
Read the specification

Eight metrics for governed context systems

What properties should a Context Operating System measure?

The metrics paper defines what matters for governed context systems. The context compilation paper defines how to architect the layer that makes those properties legible. CompileBench sketches the benchmark overlay needed to test compilation quality directly.

Together, they argue that the future of AI systems depends not just on better models or more memory, but on a durable way to turn heterogeneous evidence into a governed working set that can survive changing models and changing interfaces.

The Open Standard Proposal

The eight metrics are Apache-2.0 licensed. The reference implementations are in evaluation/tools/novel_metrics.py. The safety test suite is published at evaluation/datasets/safety_suite_v1.jsonl.

We propose them as open standards for the Context OS category:

  1. Implement the PackItem, ScopeDecision, InjectionResult, and Citation data structures for your system's output format.
  2. Call the metric functions with your data.
  3. Report the results alongside standard IR metrics (Recall@K, MRR, nDCG).
  4. If you publish results, include system configuration, data scale, and test suite version for reproducibility.

The Eight Metrics — Open Standard Proposal

Apache-2.0 licensed · Reference implementations included

MetricCodeResultTargetStatus
Evidence DensityED100%≥ 85%Measured
Pack Relevance ScorePRS≥ 70%Eval Ready
Contradiction Detection F1CDF1≥ 0.70Eval Ready
Scope Enforcement AccuracySEA100%≥ 99.9%Measured
Permission Leakage RatePLR0%< 0.1%Measured
Poisoning Susceptibility RatePSR0% (49/49)< 5%Measured
Citation Resolution RateCRR100%≥ 95%Measured
Context Compiler EfficiencyCCE> 1.2xEval Ready

The AI memory space needs shared standards, not marketing benchmarks. When every vendor picks their own metric and optimizes for it, comparisons are impossible. When everyone runs the same eight metrics, the numbers speak for themselves.

All five measured metrics meet or exceed their targets. Three more are eval-harness-ready. The CRR progression from 48.6% to 100% demonstrates the core value: these metrics don't just measure systems — they create feedback loops that improve them.

If your AI memory vendor claims "enterprise-ready," ask them which of these eight they can run.


The full metrics specification is at github.com/Brianletort/MemoryOS. The context compilation theory is formalized in Toward a Theory of Context Compilation for Human-AI Systems. All code, metrics, and test suites are Apache-2.0 licensed.

Context Compilation

Part 3 of 3