TL;DR
- Vector stores and RAG pipelines are table stakes—real agent intelligence requires a continuous, multi-modal data substrate
- Intelligent systems need five categories of data: episodic, semantic, relational, temporal, and contextual
- Most 'agent demos' work because the demo data is curated; production agents starve because nobody built the data layer
- The competitive moat in AI isn't the model—it's the richness of the data feeding it

Everyone's building agents. Almost nobody is building what agents need to be useful.
Open any AI newsletter from the last six months and you'll see the same pattern: new agent framework, new agent runtime, new agent protocol. The execution layer is getting extraordinary attention. MCP. A2A. Tool-use benchmarks. Reasoning loops. Multi-agent orchestration.
But here's what I keep running into when I look at production deployments: the agent is smart enough. The data underneath it isn't.
If you've read my AI-Native Computer series, you'll remember I mapped the new AI stack to the old computer model: LLMs as CPU, tokens as bytes, context window as RAM, and data + tools as disk. That framing resonated. But I left "disk" underspecified—Knowledge + Tools, I called it. A convenient hand-wave.
This series corrects that. Because the disk layer—what I'm calling the data substrate—is where the real bottleneck lives. And it's where almost nobody is investing.
The RAG Plateau
Let's start with what most teams actually build.
You have documents. You chunk them. You embed them. You store them in a vector database. When a user asks a question, you retrieve the top-k chunks, stuff them into context, and let the model synthesize an answer.
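The loop above can be sketched in a few lines. This is a toy illustration, not the author's implementation: embeddings are stood in for by small vectors, and a real system would use an embedding model and a vector store rather than an in-memory list.

```python
# Minimal sketch of the naive RAG retrieval step: score every chunk
# against the query by cosine similarity and keep the top-k.
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_vec, chunks, k=3):
    # chunks: list of (chunk_text, chunk_vector) pairs.
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:k]]
```

The retrieved chunks are then concatenated into the prompt and handed to the model, exactly as the paragraph describes.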
This is RAG. It works. It works well enough to demo. It works well enough to get funding. And then it hits a wall.
The wall isn't retrieval quality—though that matters. The wall is data richness. Static RAG over a document corpus gives you an agent that can answer questions about what you wrote down. It cannot tell you:
- What happened in yesterday's meetings
- Who you've been talking to most this quarter
- Which commitments from last week are overdue
- What changed in the project since you last looked
- What you should prepare for before tomorrow's 9 AM
Those aren't retrieval problems. Those are substrate problems. The data simply doesn't exist in a form the agent can consume.
I call this the RAG Plateau: the point where adding more documents to your vector store stops making your agent meaningfully smarter. You can tune your chunking strategy, upgrade your embedding model, add reranking—and you'll get marginal gains. But the ceiling is set by the richness of what you're indexing, not how well you're searching it.
To break through the plateau, you need a different kind of data layer entirely.
The Data Taxonomy for Intelligent Systems
After building MemoryOS—a system that continuously collects, indexes, and serves data to AI agents from my own work life—I've landed on a taxonomy of what intelligent systems actually need. Not what's nice to have. What's required for an agent to move from "impressive demo" to "useful system."
Five categories:
[Diagram: Five Data Types, One Substrate — each node maps to example data sources]
Episodic Data: What Happened
This is the event stream of your world. Meetings you attended. Emails you sent and received. Messages exchanged. Screen activity. Documents you touched. Audio you heard.
Episodic data is the hardest to collect and the most valuable to have. It's what gives an agent a sense of narrative—not just facts, but sequence, cause, and context.
Most organizations have fragments of this scattered across Outlook, Teams, Slack, Zoom recordings, and calendar systems. Almost none of it is unified, indexed, or queryable by an agent.
Semantic Data: What Things Mean
This is the traditional RAG territory: documents, knowledge bases, wikis, policies, reports. The things you've written down or that someone else wrote down for you.
Semantic data is the most common starting point. It's also the least differentiating. Every team that does RAG starts here. The question isn't whether you have it—it's whether you have everything else alongside it.
Relational Data: Who Connects to What
People, projects, teams, organizations—and the connections between them. Who reports to whom. Who interacts with whom. Which projects share stakeholders. Which decisions involved which people.
This is knowledge graph territory. Entity resolution. Relationship extraction. Interaction frequency. Without relational data, your agent can answer "what did the email say?" but not "who are the key stakeholders on this initiative, and when did I last talk to each of them?"
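That stakeholder question can be answered with even a primitive edge list, once the relational data exists. A hypothetical sketch, where the names, the relation types, and the edge shape are all illustrative rather than any real schema:

```python
# Relational data as a list of (subject, relation, object, timestamp)
# edges. Real systems would use a graph store and entity resolution.
from datetime import date

edges = [
    ("alice", "stakeholder_of", "project-x", None),
    ("bob",   "stakeholder_of", "project-x", None),
    ("me",    "talked_to",      "alice",     date(2026, 1, 20)),
    ("me",    "talked_to",      "alice",     date(2026, 2, 3)),
    ("me",    "talked_to",      "bob",       date(2025, 12, 11)),
]

def stakeholders_last_contact(project):
    # "Who are the key stakeholders, and when did I last talk to each?"
    people = {s for s, rel, o, _ in edges
              if rel == "stakeholder_of" and o == project}
    last = {}
    for s, rel, o, when in edges:
        if rel == "talked_to" and o in people:
            last[o] = max(last.get(o, when), when)
    return {p: last.get(p) for p in people}
```

The point is not the data structure; it is that without these edges collected somewhere, no amount of document search can answer the question.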
Temporal Data: When Things Happened and How They Change
Calendars. Timelines. Version history. Trend lines. The temporal dimension is what turns a snapshot into a story.
Temporal awareness lets an agent say "this metric has been declining for three weeks" instead of just "the current value is X." It's what enables proactive intelligence—which we'll explore in Part 3 of this series.
Contextual Data: What Matters Right Now
This is the synthesis layer. Rolling summaries. Priority rankings. Hot/warm/cold tiering. The curated, compressed view of "what's relevant to this person at this moment."
Contextual data is generated, not collected. It's the output of running continuous intelligence over the other four categories. And it's what makes the difference between an agent that dumps everything it knows and one that surfaces what actually matters.
Why This Taxonomy Matters
Here's the uncomfortable truth: most production AI systems today have decent semantic data and almost nothing else.
They can search your documents. They can't tell you what happened yesterday. They can't map your relationships. They can't detect temporal patterns. They can't prioritize what's urgent versus what's routine.
That's not a model problem. That's a data substrate problem.
And it's why the demos always look better than the deployments. Demo data is curated—clean, complete, well-structured. Production data is messy, fragmented, and spread across fifteen systems that don't talk to each other.
Building the Substrate: A Case Study
I've been building a system called MemoryOS to test these ideas in practice. It's a three-layer architecture: collection, memory, and interface.
[Diagram: MemoryOS Pipeline — continuous collection → structured memory → contextual interfaces. Highlights: auto-generated context, sub-millisecond search, 5-minute cycles]
The Collection Layer
Six extractors run continuously, each targeting a different data source:
- Screen activity — OCR + audio transcription via Screenpipe, capturing what I see and hear throughout the day
- Email — macOS Mail.app via AppleScript, zero-config extraction
- Calendar — macOS Calendar.app, meeting schedules and attendees
- Microsoft Graph — Cloud email and calendar for M365 environments
- OneDrive — Document conversion (docx, pptx, pdf, xlsx) via Pandoc
- Teams chat — Extracted from screen OCR of Teams conversations
Each extractor is idempotent—runs every 5 minutes, processes only what's new since the last run, and writes structured Markdown into an Obsidian vault. The vault becomes the single source of truth: a file system organized by date and type that any tool can read.
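The idempotency pattern is simple: keep a high-water mark and process only what is newer. A minimal sketch, assuming a JSON state file and items shaped as `(timestamp, title, body)` — both assumptions for illustration, not MemoryOS internals:

```python
# An idempotent extractor run: remembers the timestamp of the last
# item it processed, skips anything at or before it, and writes each
# new item as a Markdown file into the vault directory.
import json
from pathlib import Path

def run_extractor(items, state_path, vault_dir):
    state = Path(state_path)
    last_seen = (json.loads(state.read_text())["last_seen"]
                 if state.exists() else 0.0)
    written = 0
    for ts, title, body in sorted(items):
        if ts <= last_seen:
            continue                      # already handled in a prior run
        out = Path(vault_dir) / f"{int(ts)}-{title}.md"
        out.write_text(f"# {title}\n\n{body}\n")
        last_seen = ts
        written += 1
    state.write_text(json.dumps({"last_seen": last_seen}))
    return written
```

Running the same extractor twice over the same input writes nothing the second time, which is what makes a 5-minute cron cadence safe.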
The key design choice: Markdown as the universal format. Not a proprietary database. Not a vendor-locked API. Plain text files that are human-readable, version-controllable, and consumable by any AI system.
The Memory Layer
Raw collection isn't enough. You need search, ranking, and synthesis.
The memory layer is a SQLite FTS5 index with hybrid search: full-text keyword search combined with vector embeddings (sentence-transformers, 384-dim, running locally). Results are ranked using Reciprocal Rank Fusion—combining FTS5 relevance, embedding similarity, and temporal tier boosting.
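Reciprocal Rank Fusion itself is a few lines. The sketch below shows one common formulation; the constant k=60 and applying the tier boost as a multiplier on the fused score are my assumptions, not necessarily how MemoryOS combines them:

```python
# Fuse multiple ranked result lists (e.g. FTS5 keyword results and
# embedding-similarity results) with Reciprocal Rank Fusion, then
# scale each document's score by an optional per-document boost.
def rrf(ranked_lists, k=60, boosts=None):
    # ranked_lists: list of lists of doc ids, best first.
    boosts = boosts or {}
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    for doc in scores:
        scores[doc] *= boosts.get(doc, 1.0)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears high in both lists outranks one that tops only a single list, which is exactly the behavior you want from hybrid search.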
The temporal tiering is simple but effective:
- Hot (0–7 days): 2x search boost — current work, today's context
- Warm (7–90 days): 1x boost — recent history, still relevant
- Cold (90+ days): 0.5x boost — archive, searchable but deprioritized
Documents naturally age from hot to warm to cold. No manual curation required. The system reflects the reality that yesterday's meeting matters more than last quarter's—but last quarter's is still there when you need it.
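The aging policy above reduces to a pure function of document age, which is why no manual curation is needed — the boost falls out of the timestamp:

```python
# Hot/warm/cold boost from document age in days, matching the
# tiers described above (boundary days fall into the older tier).
def tier_boost(age_days):
    if age_days < 7:
        return 2.0    # hot: current work, today's context
    if age_days < 90:
        return 1.0    # warm: recent history, still relevant
    return 0.5        # cold: archive, searchable but deprioritized
```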
The Context Layer
On top of search sits the context generator: auto-produced summaries that agents can consume directly.
- today.md — today's meetings, emails, activity, Teams chats
- this_week.md — rolling 7-day summary by source type
- recent_emails.md — last 50 emails with previews
- upcoming.md — next 7 days of calendar events
These files are regenerated every 5 minutes but only written to disk when content changes. They pull from the index (fast) rather than re-scanning the vault. Any agent—Cursor, a custom chatbot, an autonomous skill—can read these files and immediately have situational awareness.
This is contextual data in practice: continuously generated, always current, and designed to answer the question "what should I know right now?" without requiring the agent to figure that out from raw data.
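The "regenerate every 5 minutes, write only on change" step is a standard content-hash check. A minimal sketch under that assumption:

```python
# Write the regenerated summary only when its content actually
# differs from what's on disk, so file watchers and sync tools
# aren't triggered by no-op rewrites every cycle.
import hashlib
from pathlib import Path

def write_if_changed(path, content):
    new_hash = hashlib.sha256(content.encode()).hexdigest()
    p = Path(path)
    if p.exists():
        old_hash = hashlib.sha256(p.read_bytes()).hexdigest()
        if old_hash == new_hash:
            return False          # unchanged: skip the write
    p.write_text(content)
    return True
```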
The Data Gravity Problem
Here's what I've learned from running this system for months: data gravity is real, and it compounds.
The more data you collect, the more useful every query becomes. The more useful queries become, the more you invest in collection. The more you invest in collection, the richer the substrate gets. It's a flywheel—but only if you start it.
Most organizations never start it because the data layer is invisible work. Nobody gets promoted for building an extractor pipeline. The reward comes later, when agents built on top of that pipeline do things that seem magical—but are really just well-fed.
The organizations that will win the AI era aren't the ones with the best models. They're the ones with the richest, most continuous, most queryable data substrates. The model is commodity infrastructure. The data is the moat.
What This Means for What Comes Next
If the data layer is the foundation, what sits on top of it?
That's where agent runtimes come in. In 2026, a new class of software is emerging—not agent frameworks you import into your code, but agent operating systems that manage the full lifecycle of autonomous agents: memory, security, communication, tool execution, and governance.
ZeroClaw. OpenFang. OpenClaw. Three different architectures, three different philosophies, one shared insight: agents need an operating system, not just a library.
That's Part 2.