TL;DR
- Citation Resolution Rate went from 48.6% to 100% across three versions — the metrics created the feedback loop that drove the improvement
- Pack Relevance Score and Contradiction Detection F1 are eval-ready: the metrics are defined, the harness is built, labeled datasets are the remaining dependency
- Context Compiler Efficiency is the economic argument: if 5K compiled tokens match 50K raw at equal task success, that's 10x efficiency
- All five measured metrics now meet or exceed targets — these eight metrics are Apache-2.0 licensed as a proposed open standard
In Part 1, I argued that existing benchmarks only measure retrieval — the easiest part. In Part 2, I showed the four governance and safety metrics that no competitor publishes.
This final part covers the remaining four metrics: provenance (Citation Resolution Rate), signal quality (Pack Relevance Score), temporal awareness (Contradiction Detection F1), and economics (Context Compiler Efficiency). Then I'll make the case for adopting all eight as open standards.
Let me start with the metric that best demonstrates why these measurements matter.
Citation Resolution: From 48.6% to 100%
Citation Resolution Rate is now 100%. But it started at 48.6%. The journey between those numbers is the strongest argument for why publishing honest metrics matters.
CRR measures the fraction of citations in a context pack that resolve to a real artifact in the system:
CRR = valid_citations / total_citations
A "valid citation" is a provenance entry whose referenced artifact exists in the store. This catches "hallucinated citations" — the failure mode where a system claims "This came from document X" but document X doesn't exist.
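As a minimal sketch (the reference implementation lives in evaluation/tools/novel_metrics.py; the set-based store and citation shape here are illustrative), CRR reduces to a membership check per citation:

```python
def citation_resolution_rate(citations, artifact_store):
    """CRR = valid_citations / total_citations.

    A citation is valid when its referenced artifact actually exists
    in the store; this is what catches hallucinated citations.
    """
    if not citations:
        return 1.0  # convention: an empty pack has nothing to hallucinate
    valid = sum(1 for c in citations if c["artifact_id"] in artifact_store)
    return valid / len(citations)

# Example: 2 of 3 citations resolve, so CRR is about 0.667
store = {"doc-1", "doc-2"}
cites = [{"artifact_id": "doc-1"},
         {"artifact_id": "doc-2"},
         {"artifact_id": "doc-9"}]  # doc-9 does not exist: hallucinated
print(round(citation_resolution_rate(cites, store), 3))  # 0.667
```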
Chart: Citation Resolution Rate across versions. Publishing 48.6% created the feedback loop that drove CRR to its current 100%: v1 measured at the retrieval channel level, v2 shifted to per-item citation verification, v3 entity items inherit source_refs from MemoryObject.
The progression tells the story of metric-guided engineering:
v1 (48.6%): CRR was measured at the retrieval channel level. Each pack recorded 5 provenance entries — one per channel (semantic, lexical, graph, entity, insights). Channels returning 0 candidates counted as unresolvable. This was valid provenance (documenting what was tried), but the metric exposed that our measurement approach was too coarse.
v2 (91.4%): CRR shifted to per-item verification — does each item in the pack trace to a real source document? Most items resolved, but entity items (action items, decisions, commitments) lacked document_path. The metric identified exactly which items had the gap.
v3 (100%): Structural fix. Entity items now inherit source_refs[0].path from their parent MemoryObject, flowing source lineage through to the compiled pack. Every item — including entities — resolves to a real artifact in the ledger. Verified across 185 items from 6 context pack assemblies.
The principle: Publishing the honest 48.6% created the feedback loop. Each measurement identified a specific structural gap. Each version fixed it. This is exactly the engineering cycle these metrics are designed to enable. Systems that don't measure can't improve with this precision.
Pack Relevance Score: Beyond Recall@K
Recall@K answers: "Is the relevant document somewhere in the top K results?"
Pack Relevance Score asks a harder question: "Of the tokens the agent will actually read, what fraction came from relevant sources?"
PRS = tokens_from_relevant_artifacts / total_pack_tokens
The difference is material at enterprise scale. A system with perfect Recall@K might include 10 relevant chunks alongside 40 distractors. The model reads all 50. The 40 distractors waste attention, increase latency, inflate token costs, and — as RULER demonstrates — actively degrade response quality.
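A minimal sketch of the computation, assuming each pack item carries an artifact ID and a token count (the reference implementation is in evaluation/tools/novel_metrics.py):

```python
def pack_relevance_score(pack_items, relevant_ids):
    """PRS = tokens_from_relevant_artifacts / total_pack_tokens.

    pack_items: list of (artifact_id, token_count) pairs.
    relevant_ids: the labeled required_evidence_artifact_ids for the task.
    """
    total = sum(tokens for _, tokens in pack_items)
    if total == 0:
        return 0.0
    relevant = sum(tokens for aid, tokens in pack_items if aid in relevant_ids)
    return relevant / total

# The scenario above: 10 relevant chunks next to 40 distractors,
# 100 tokens each. Recall@K is perfect, but PRS is only 0.2.
pack = [(f"rel-{i}", 100) for i in range(10)] + \
       [(f"noise-{i}", 100) for i in range(40)]
print(pack_relevance_score(pack, {f"rel-{i}" for i in range(10)}))  # 0.2
```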
PRS requires labeled evaluation datasets with required_evidence_artifact_ids per task. The evaluation harness is built (evaluation/runners/run_workmembench.py). The labeled dataset is the remaining dependency.
Target: ≥ 70%. A pack where 70%+ of tokens are relevant means the agent reads mostly signal, not noise.
Why this matters at enterprise scale: At 30M tokens per user per day (our projection from The Token Economy), a 30% irrelevance rate means 9M wasted tokens per user per day. At scale, PRS directly impacts your AI operating budget.
Contradiction Detection: Memory That Knows Time
Knowledge evolves. Budgets change. Decisions get reversed. Policies are updated. People change roles.
A system that treats all retrieved evidence as equally current will produce context packs that contain stale or contradictory information. "The Q3 budget is $5M" and "The Q3 budget was revised to $3.2M" can both exist in the knowledge base. A Context OS must know which one is current.
Contradiction Detection F1 measures the system's ability to detect temporal contradictions — cases where "Decision A was superseded by Decision B":
CDF1 = 2 * precision * recall / (precision + recall)
Where true positives are detected pairs that match gold-standard (superseded_id, superseding_id) pairs.
MemoryOS tracks supersession via the superseded_by field in the object store. When a new decision is ingested that reverses or updates a prior decision, the prior entry is flagged with a superseded_by reference. The context compiler can then prioritize the current version.
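Treating detections and gold labels as sets of (superseded_id, superseding_id) pairs, the metric is a standard F1; a sketch (the IDs here are illustrative):

```python
def contradiction_detection_f1(detected, gold):
    """CDF1 over (superseded_id, superseding_id) pairs.

    detected: set of pairs the system flagged (e.g. via superseded_by).
    gold: set of annotated gold-standard pairs.
    """
    if not detected or not gold:
        return 0.0
    tp = len(detected & gold)           # true positives: exact pair matches
    precision = tp / len(detected)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("dec-1", "dec-7"), ("bud-q3", "bud-q3-rev"), ("pol-2", "pol-3")}
detected = {("dec-1", "dec-7"), ("bud-q3", "bud-q3-rev"), ("dec-4", "dec-5")}
# precision = 2/3, recall = 2/3, so F1 is about 0.667
print(round(contradiction_detection_f1(detected, gold), 3))  # 0.667
```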
Target: ≥ 0.70. Detecting 70% of known contradictions with reasonable precision prevents most stale-information failures.
Status: Requires labeled datasets with annotated contradiction pairs. The system has the plumbing (superseded_by field). The evaluation harness has the metric. The labeled test data is the gap.
This is a metric that becomes increasingly critical with time. A system that's been running for a week has few contradictions. A system that's been running for a year — accumulating meeting decisions, budget revisions, policy updates — has thousands. Without contradiction detection, the context pack becomes a historical archive that doesn't distinguish between what was true and what is true.
The Compiler Efficiency Argument
"Why not just use a bigger context window?"
It's the most common objection to context engineering, and it deserves a quantitative answer.
Context Compiler Efficiency measures the economic ROI of compilation:
CCE = (success_rate_compiled / success_rate_full) / (tokens_compiled / tokens_full)
If a compiled 5K-token pack achieves the same task success as 50K tokens of full context, the compiler is 10x more token-efficient. CCE > 1.0 means compilation wins. CCE > 1.2 means compilation wins enough to justify the engineering investment.
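The formula is trivial to compute once the paired runs exist; a sketch, using the illustrative numbers above:

```python
def context_compiler_efficiency(success_compiled, success_full,
                                tokens_compiled, tokens_full):
    """CCE = (success_compiled / success_full) / (tokens_compiled / tokens_full)."""
    quality_ratio = success_compiled / success_full  # did quality hold?
    token_ratio = tokens_compiled / tokens_full      # at what token cost?
    return quality_ratio / token_ratio

# Equal task success (0.82 is an illustrative rate) at a tenth of
# the tokens yields CCE = 10.0: compilation wins decisively.
print(context_compiler_efficiency(0.82, 0.82, 5_000, 50_000))  # 10.0
```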
Context Compiler Efficiency
Illustrative example — not measured data. Shows the theoretical compiler argument.
The chart above is an illustrative example, not measured data. CCE requires paired evaluation runs (compiled vs. full-context) on the same task set. The harness is built; the paired evaluation is in progress. We'll report real CCE numbers when the evaluation is complete.
The theoretical case is strong:
- RULER demonstrates the attention problem. Model performance degrades as context length increases. The "lost in the middle" phenomenon means tokens in the middle of a 50K window get less attention than tokens at the beginning and end. A compiled 5K pack doesn't have a "middle" — every token is in the attention-dense zone.
- Token economics compound. At enterprise scale, the difference between 5K and 50K tokens per query is the difference between a manageable AI budget and one that triggers cost-optimization projects. A 10x reduction in token volume at equal task success is a 10x reduction in inference cost.
- Latency scales with tokens. Time-to-first-token is influenced by prompt size. Smaller, more focused context means faster responses. For real-time use cases (meeting prep, live chat augmentation), this matters.
Status: Requires paired evaluation runs — compiled vs. full-context — on the same task set. The harness supports both modes; the paired runs are in progress.
What We Haven't Measured Yet
Five of eight metrics are measured. Three are not, and intellectual honesty requires saying so clearly:
- Pack Relevance Score (PRS) and Contradiction Detection F1 (CDF1) require labeled evaluation datasets that don't yet exist for personal work data at this scale. The metric definitions are formalized, the evaluation harness is wired, and the gap is the labeled test data.
- Context Compiler Efficiency (CCE) — the metric behind this post's title — requires paired evaluation runs comparing compiled packs against full-context baselines. The title "The Compiler Wins" reflects the theoretical case and the architectural design intent, not a measured result. If CCE comes back below 1.2x, we'll publish that too.
All measured results come from a single MemoryOS instance processing one user's enterprise work data. We don't claim these specific numbers generalize across deployments. What generalizes is the metric framework and measurement methodology, which any system can apply to its own data.
Enterprise Scale: Why This Matters at 100M Documents
A reasonable question: do these metrics hold up at enterprise scale?
Scalability: 1K to 100M Documents
Evidence Density stays flat as data volume grows — the compiler's evidence ratio is architecturally stable
The 72K data point in the chart below is measured. Projections to larger scales are based on algorithmic complexity analysis and assume no architectural bottlenecks emerge at higher volumes — which remains to be validated at scale.
- Evidence Density stays flat. ED is a ratio of traced vs. untraced tokens in the compiled pack. Compilation is independent of index size — it depends on the quality of the compilation step, not the volume of data behind it. Whether you have 10K or 100M documents, the compiler's evidence density ratio remains architecturally stable.
- Policy Enforcement is O(1). The sensitivity matrix is a constant-time lookup: (caller_clearance, object_sensitivity) → output_transform. No iteration over the data set. No scanning of access control lists. Adding more documents does not slow down policy enforcement.
- Retrieval Latency grows logarithmically. Vector index search (ANN) scales with O(log N), not O(N). Going from 72K to 100M documents grows retrieval latency from ~35ms to a projected ~130ms — roughly 4x, not a 1,400x increase.
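The constant-time policy lookup can be sketched in a few lines. The clearance levels, sensitivity labels, and transform names below are illustrative, not MemoryOS's actual matrix; the point is the access pattern, one dict lookup per decision:

```python
# Hypothetical sensitivity matrix: (caller_clearance, object_sensitivity)
# maps to an output_transform. O(1) per decision, regardless of how many
# documents sit behind the index.
SENSITIVITY_MATRIX = {
    ("standard", "public"):       "pass_through",
    ("standard", "internal"):     "pass_through",
    ("standard", "confidential"): "redact",
    ("standard", "restricted"):   "deny",
    ("elevated", "confidential"): "pass_through",
    ("elevated", "restricted"):   "summarize_only",
}

def output_transform(caller_clearance: str, object_sensitivity: str) -> str:
    # Unknown combinations fail closed: deny by default.
    return SENSITIVITY_MATRIX.get((caller_clearance, object_sensitivity), "deny")

print(output_transform("standard", "confidential"))  # redact
print(output_transform("standard", "restricted"))    # deny
```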
These are architectural properties, not yet validated at 100M-document scale. But the complexity analysis suggests that the metrics framework remains viable as data volumes grow — the metrics don't become harder to compute, and the properties they measure don't degrade structurally with scale.
The Open Standard Proposal
These eight metrics are Apache-2.0 licensed. The reference implementations are in evaluation/tools/novel_metrics.py. The safety test suite is published at evaluation/datasets/safety_suite_v1.jsonl.
We propose them as open standards for the Context OS category. Here's how to adopt them:
- Implement the PackItem, ScopeDecision, InjectionResult, and Citation data structures from novel_metrics.py for your system's output format.
- Call the metric functions with your data.
- Report the results alongside standard IR metrics (Recall@K, MRR, nDCG).
- If you publish results, include system configuration, data scale, and test suite version for reproducibility.
The Eight Metrics — Open Standard Proposal
Apache-2.0 licensed · Reference implementations included
| Metric | Code | Result | Target | Status |
|---|---|---|---|---|
| Evidence Density | ED | 100% | ≥ 85% | Measured |
| Pack Relevance Score | PRS | — | ≥ 70% | Eval Ready |
| Contradiction Detection F1 | CDF1 | — | ≥ 0.70 | Eval Ready |
| Scope Enforcement Accuracy | SEA | 100% | ≥ 99.9% | Measured |
| Permission Leakage Rate | PLR | 0% | < 0.1% | Measured |
| Poisoning Susceptibility Rate | PSR | 0% (49/49) | < 5% | Measured |
| Citation Resolution Rate | CRR | 100% | ≥ 95% | Measured |
| Context Compiler Efficiency | CCE | — | > 1.2x | Eval Ready |
The AI memory space needs shared standards, not marketing benchmarks. When every vendor picks their own metric and optimizes for it, comparisons are impossible. When everyone runs the same eight metrics, the numbers speak for themselves.
All five measured metrics meet or exceed their targets. Three more are eval-harness-ready. The CRR progression from 48.6% to 100% demonstrates the core value: these metrics don't just measure systems — they create feedback loops that improve them.
If your AI memory vendor claims "enterprise-ready," ask them which of these eight they can run.
The full metrics specification is at github.com/Brianletort/MemoryOS. The reference implementation, evaluation harness, and safety test suite are Apache-2.0 licensed. PRs extending the metrics, adding test cases, or proposing new metrics are welcome.