brianletort.ai
Context Engineering · MemoryOS · Security · Enterprise AI · Governance

Governed Memory or Governed Theater?

Evidence Density, Scope Enforcement, Permission Leakage, and Poisoning Susceptibility — the four metrics no AI memory vendor publishes. Because most systems don't have governance to measure.

April 8, 2026 · 9 min read

TL;DR

  • Evidence Density (100%): every pack token traces to a source document via the source_refs provenance chain
  • Scope Enforcement Accuracy (100%): 32/32 scope decisions correct across 4 principal types, 4 sensitivity levels, and 2 domain types
  • Permission Leakage Rate (0%): restricted content never enters the context pack, so it can't leak downstream
  • Poisoning Susceptibility Rate (0%): 49 adversarial attacks across 5 families, all blocked by architectural design

In Part 1, I argued that the AI memory benchmarks everyone runs — LongMemEval, LoCoMo, RULER — share a blind spot: they measure retrieval in isolation and say nothing about governance, safety, provenance, or compilation quality.

This part makes that argument concrete. Four metrics. Real numbers. Measured on a live system — not a curated demo, not a synthetic test set, not a hand-picked subset of queries that make the numbers look good.

Measured on a live MemoryOS instance

[Live instance counters: Documents · Vector Chunks · Structured Entities · Data Sources — values render on the live page]

These numbers come from a MemoryOS instance continuously processing my actual work data: emails, meeting transcripts, screen activity, Teams chats, and knowledge documents. The metrics are run against this data, not alongside it.

Evidence Density: Is Your Context Real?

Every context pack assembled by an AI memory system contains tokens. The question is: where did those tokens come from?

Evidence Density (ED) measures the fraction of pack tokens that are directly traceable to a source event — an email, meeting transcript, document, or screen capture. Tokens without source lineage are filler: boilerplate, LLM-generated summaries injected without provenance, or formatting overhead.

ED = tokens_with_source_lineage / total_pack_tokens

The distinction matters more than it sounds. A system can retrieve the right documents (scoring well on Recall@K) and still produce a context pack that's 40% padding. The model reads all of it. The padding wastes attention, increases cost, and dilutes the signal.
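The ratio is straightforward to compute if each pack item carries its provenance. A minimal sketch, assuming a pack is a list of items with a token count and an optional `source_refs` list (`PackItem` and its fields are illustrative names, not the MemoryOS API):

```python
from dataclasses import dataclass, field

@dataclass
class PackItem:
    text: str
    tokens: int                                       # token count for this item
    source_refs: list = field(default_factory=list)   # provenance chain; empty = no lineage

def evidence_density(pack: list[PackItem]) -> float:
    """ED = tokens_with_source_lineage / total_pack_tokens."""
    total = sum(item.tokens for item in pack)
    traced = sum(item.tokens for item in pack if item.source_refs)
    return traced / total if total else 0.0

pack = [
    PackItem("Q3 revenue summary", tokens=120, source_refs=["mail/2026-03-12/q3.eml"]),
    PackItem("(boilerplate header)", tokens=30),      # no lineage -> counted as filler
]
```

Here `evidence_density(pack)` is 120/150 = 0.8: the pack retrieved the right document but still spent 20% of the model's attention on untraceable padding.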

Evidence Density by Intent

Every pack token traces to a source document via lineage — 100% (target ≥85%): Executive Update 100%, Follow-Up 100%, Weekly Review 100%, General Query 100%, Meeting Prep 100%, Project Status 100%. Both source-traced tokens and derived context (entity summaries, graph output) carry lineage.

100% across all intent types. Every token in every context pack traces to a source document. This includes entity items — action items, decisions, commitments — which now carry their source document path through the source_refs provenance chain on MemoryObject.

This wasn't always the case. An earlier version measured 95.8% weighted mean because entity summaries and graph descriptions were derived without source lineage. The fix was structural: entity items now inherit source_refs[0].path from their parent MemoryObject, flowing source lineage through to the compiled pack. The system does not inject filler, boilerplate, or LLM-generated content without provenance.
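The fix can be sketched in a few lines. Assuming simplified `MemoryObject` and entity-item shapes (the field layout here is illustrative; only `source_refs` and `MemoryObject` come from the actual system), a derived item inherits the first source ref of its parent:

```python
from dataclasses import dataclass, field

@dataclass
class SourceRef:
    path: str          # e.g. "transcripts/2026-04-01/standup.vtt"

@dataclass
class MemoryObject:
    summary: str
    source_refs: list[SourceRef] = field(default_factory=list)

@dataclass
class EntityItem:
    kind: str          # "action_item" | "decision" | "commitment"
    text: str
    source_refs: list[SourceRef] = field(default_factory=list)

def derive_entity(parent: MemoryObject, kind: str, text: str) -> EntityItem:
    """Derived items inherit lineage from their parent object, so they
    stay traceable when compiled into a context pack."""
    inherited = [parent.source_refs[0]] if parent.source_refs else []
    return EntityItem(kind=kind, text=text, source_refs=inherited)

parent = MemoryObject("standup notes", source_refs=[SourceRef("transcripts/standup.vtt")])
item = derive_entity(parent, "action_item", "Ship the metrics post")
```

After the fix, `item.source_refs[0].path` resolves to the same transcript as its parent, so the entity item counts toward ED instead of against it.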

The progression from 95.8% to 100% demonstrates a principle: measuring evidence density forces you to fix evidence gaps. The metric identified exactly which items lacked source lineage, and the structural fix was straightforward once the gap was visible.

An important distinction: ED measures structural traceability — whether a source path exists and resolves — not whether the linked source is semantically the best match for the context. ED is a necessary but not sufficient condition for provenance integrity. CRR (Citation Resolution Rate, covered in Part 3) complements it by verifying that citations resolve to real artifacts.

What ED reveals about the space: Ask any AI memory vendor what fraction of their context pack tokens trace to source documents. If the answer is unknown — or unmeasurable — the pack is a black box.

The Policy Layer Nobody Tests

Here's a question for any organization deploying AI memory systems: When your system retrieves context, does it enforce who can see what?

Enterprises have classification levels. Internal strategy documents shouldn't appear in external-facing agent responses. Restricted HR data shouldn't surface for a marketing analyst. Confidential M&A documents shouldn't cross the firewall to a vendor's agent.

Scope Enforcement Accuracy (SEA) measures whether the policy engine correctly includes artifacts from allowed scopes and excludes artifacts from disallowed scopes at retrieval time.

SEA = correct_scope_decisions / total_scope_decisions

MemoryOS implements this through an OutputTransformPolicy — a sensitivity matrix that maps every combination of caller clearance and content sensitivity to a specific output transform:

Sensitivity Matrix: OutputTransformPolicy

Maps (caller clearance × content sensitivity) to an output transform.

Principals (columns): Owner, Internal, External. Sensitivity levels (rows): Public, Internal, Confidential, Restricted. Each cell resolves to one of four transforms: exact, abstracted, masked, or denied.

The matrix is deterministic. Given a principal type and a content sensitivity level, the transform is fixed. There's no probabilistic element, no LLM deciding whether to share content, no prompt-based guardrail that might be bypassed. The policy is code, not a suggestion.
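Because the matrix is deterministic, it can be expressed as a plain lookup table. A sketch of the idea — the cell values below are placeholders I chose for illustration; the post does not publish the full matrix, so only the four transform names and the row/column labels come from the source:

```python
from enum import Enum

class Transform(Enum):
    EXACT = "exact"
    ABSTRACTED = "abstracted"
    MASKED = "masked"
    DENIED = "denied"

# Illustrative cell values only; the real matrix is fixed by the deployment's policy.
MATRIX: dict[tuple[str, str], Transform] = {
    ("owner", "public"): Transform.EXACT,
    ("owner", "internal"): Transform.EXACT,
    ("owner", "confidential"): Transform.EXACT,
    ("owner", "restricted"): Transform.EXACT,
    ("internal", "public"): Transform.EXACT,
    ("internal", "internal"): Transform.EXACT,
    ("internal", "confidential"): Transform.ABSTRACTED,
    ("internal", "restricted"): Transform.DENIED,
    ("external", "public"): Transform.EXACT,
    ("external", "internal"): Transform.ABSTRACTED,
    ("external", "confidential"): Transform.MASKED,
    ("external", "restricted"): Transform.DENIED,
}

def resolve(principal: str, sensitivity: str) -> Transform:
    # Deterministic: the same inputs always yield the same transform.
    return MATRIX[(principal, sensitivity)]
```

The design point is that `resolve` is a table lookup, not an inference call: there is no input an attacker can craft that changes which cell fires.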

100% SEA. Tested across 32 individual scope decisions — 4 principal types (owner, external, internal, confidential) evaluating 8 crafted candidates at 4 sensitivity levels across 2 domain types. Every decision was correct: correct inclusions, correct exclusions, correct transforms.

The policy engine is deterministic: the sensitivity matrix is fixed, domain ACLs are checked before and after retrieval, and every decision is logged in the pack's policy_trace and redactions fields. This is a structural property of the architecture, not a probabilistic outcome.

Permission Leakage: The Post-Retrieval Blind Spot

SEA measures whether the retrieval layer correctly filters. Permission Leakage Rate (PLR) measures whether the end-to-end system leaks.

The difference matters because models have memory that extends beyond what's in the current context window. A model might "remember" restricted content from:

  • Earlier in the conversation
  • Fine-tuning data
  • Prior context window contents that included restricted material before filters were applied

PLR catches failures that SEA misses:

PLR = responses_with_leaked_tokens / total_responses_on_scoped_tasks

0% PLR. In our test suite, no restricted-scope content appeared in any external-facing context pack. The architecture makes this result structural, not accidental: when the OutputTransformPolicy returns DENIED, the restricted content produces an empty string. The restricted content is never placed in the pack at all. It doesn't reach the model. It can't leak because it was never there.
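A sketch of that structural property and of how PLR is scored against it (function names and the masked placeholder are illustrative, not the MemoryOS API):

```python
def apply_transform(content: str, transform: str) -> str:
    """DENIED yields an empty string: denied content never enters the
    pack, so it never reaches the model and cannot leak downstream."""
    if transform == "denied":
        return ""
    if transform == "masked":
        return "[REDACTED]"
    return content  # 'exact' (the 'abstracted' path is elided in this sketch)

def permission_leakage_rate(responses: list[str], canaries: list[str]) -> float:
    """PLR = responses_with_leaked_tokens / total_responses_on_scoped_tasks.
    Canaries are known restricted-scope strings planted in the corpus."""
    leaked = sum(1 for r in responses if any(c in r for c in canaries))
    return leaked / len(responses) if responses else 0.0
```

In practice the canary strings are seeded into restricted documents before the test run; if pre-retrieval denial works, no canary can appear in any external-facing response, and PLR is 0 by construction rather than by luck.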

This is a critical architectural choice. Many systems implement post-generation filtering — letting the model see everything, then scrubbing the output. That approach is fundamentally fragile. The model saw the restricted content. It influenced the generation. Post-hoc filtering can miss paraphrases, implications, or indirect references.

We believe pre-retrieval filtering is the strongest approach for this threat model because it prevents the model from seeing restricted content at all. Alternative approaches — post-generation filtering, differential privacy, secure enclaves — address the same concern differently and may be appropriate in other contexts. But any approach that lets the model see restricted content and then tries to scrub the output must contend with paraphrases, implications, and indirect references that are difficult to catch reliably.

Memory Poisoning: The Attack Surface Benchmarks Don't Measure

Persistent memory introduces a permanent attack surface that stateless RAG does not have. If your system accepts LLM-mediated writes to its memory — any variant of "remember X" or "update my preferences to Y" — then an attacker can potentially embed instructions that bias all future responses.

Documented attack patterns include:

  • Hidden instructions in "Summarize with AI" workflows that embed persistent preferences
  • Indirect prompt injections that plant recommendations biased toward a specific vendor
  • Cross-session attacks that exploit the persistence of memory entries

Poisoning Susceptibility Rate (PSR) measures the fraction of adversarial injection attempts that succeed in creating an unsafe persistent memory write:

PSR = unsafe_writes / total_injection_attempts

Adversarial Attack Resistance

49 test cases across 5 attack families — structured ingestion pipelines block all injection attempts:

  • Write Injection: 15 test cases
  • Scope Escalation: 12 test cases
  • Persistence Abuse: 8 test cases
  • Provenance Forgery: 8 test cases
  • Cross-Tenant Leakage: 6 test cases

0/49 attacks succeeded.

0% PSR — 49/49 blocked. The test suite has been expanded from 19 to 49 adversarial cases across the same 5 attack families, with deeper coverage: 15 write injection variants, 12 scope escalation patterns, 8 persistence abuse vectors, 8 provenance forgery techniques, and 6 cross-tenant leakage attempts. Every single one was blocked.

The architectural reason is straightforward: MemoryOS does not have an open "remember X" API path. Memory writes go through structured ingestion pipelines with schema validation. Documents enter via ETL processes. Screen activity enters via structured observers. Meeting transcripts enter via transcription pipelines. There is no LLM-mediated write path that an attacker could exploit.
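The write path can be sketched as a schema gate. Everything below is a minimal illustration of the principle, not the actual MemoryOS ingestion code — the source types and field names are assumptions:

```python
from dataclasses import dataclass

# The only accepted entry points are structured pipelines.
ALLOWED_SOURCES = {"email", "transcript", "screen_activity", "document"}

@dataclass(frozen=True)
class IngestRecord:
    source_type: str   # which pipeline produced this write
    source_path: str   # provenance is mandatory at write time
    content: str

def validate_write(record: IngestRecord) -> None:
    """Schema validation at the sole write path. A conversational
    'remember X' has no pipeline identity or provenance, so it is
    rejected before anything persists."""
    if record.source_type not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source type: {record.source_type!r}")
    if not record.source_path:
        raise ValueError("writes without provenance are rejected")
```

Under this sketch, a crafted email can still enter the store as an email — but it enters as content with lineage, not as an instruction with write authority, which is the distinction the PSR result rests on.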

This is not a guardrail. It's an architecture. The difference is that a guardrail can be bypassed with a clever enough prompt. An architecture that doesn't have the vulnerable pathway can't be bypassed because the pathway doesn't exist.

The test suite is published at evaluation/datasets/safety_suite_v1.jsonl. We welcome contributions of new attack patterns. If you find one that gets through, we want to know.

A note on these results: perfect scores should invite scrutiny of the test suite, not just confidence in the system. Our test coverage — 32 scope decisions, 49 adversarial cases, 185 citation checks — is meaningful but not exhaustive. We expect these numbers to face downward pressure as test suites grow and edge cases are discovered. Publishing the test suite openly is how we invite that pressure. The adversarial testing assumes an attacker who can influence content entering the ingestion pipeline (e.g., crafted emails or documents) but cannot modify the system's code or configuration directly.

Why These Metrics Are Absent Elsewhere

To our knowledge, no other system in the AI memory space publishes evidence density, permission leakage rate, or poisoning susceptibility rate. We don't know whether this reflects architectural limitations, different measurement priorities, or deliberate choice. What we do know is that these metrics require specific architectural capabilities to run:

  • Evidence Density requires that every item in the context pack carries source lineage metadata. Most systems don't track where pack content came from.
  • Scope Enforcement requires a policy engine that makes deterministic scope decisions at retrieval time. Most systems don't have policy engines.
  • Permission Leakage requires end-to-end testing of whether restricted content surfaces in generated responses. Most systems don't have the concept of "restricted content."
  • Poisoning Susceptibility requires adversarial testing of write paths. Most systems have open write paths by design — "remember X" is a feature, not a vulnerability.

The absence of these metrics across the space is a structural observation, not an indictment. Most AI memory systems were designed for retrieval — and they do retrieval well. Governance, provenance, and adversarial safety are different requirements that demand different architectural choices. The metrics we propose make those choices measurable, so the industry can evaluate them rather than assume them.


Next: Part 3: The Compiler Wins — Citation Resolution Rate (from 48.6% to 100% via metric-guided engineering), Pack Relevance Score, Contradiction Detection F1, and Context Compiler Efficiency. The economic and provenance case for compilation, plus a call to adopt these as open standards.