
Modes of the LLM OS

A 6-part series on what actually happens when you prompt frontier AI — and how to run your own

LLM OS · Enterprise AI · Neocloud · CoreWeave · B200 · Hosted GPU · Kimi · DeepSeek · AI Governance · Token Economics

Running Your Own LLM OS: The Enterprise Build

Your CEO asks whether you can build your own. The answer is yes. Here is what that actually means — four modes, four stacks from Frontier API to an 8x B200 chassis on your own silicon, the near-frontier OSS shift that changed the calculus, and the control spectrum that cost analysis keeps missing.

May 25, 2026 · 19 min read

TL;DR

  • Four enterprise stacks cover the entire spectrum of 2026 options: Frontier API, Managed Cloud (Bedrock/Azure/Vertex/OpenRouter), Neocloud (CoreWeave, Lambda, Crusoe, Nebius), and On-Prem (an 8x B200 chassis you own — pair or rack for scale)
  • The 2026 shift most enterprises have not absorbed: near-frontier open-weight models (Kimi K2.6, DeepSeek V4, Qwen 3, Llama 4, GLM 5) closed the quality gap to 5-15%, which opens Neocloud and On-Prem as real options
  • TCO curves cross predictably — Frontier wins below 20M tokens/day, Neocloud crosses Managed Cloud around 80M/day, On-Prem crosses Neocloud around 250M/day once you fill the capacity
  • You do not pick a stack. You pick a model, which constrains your stack — and cost is only one of the decisions. Control, data sovereignty, weights portability, and physical access are the others
  • The reusable enterprise LLM OS has six layers: control plane, context compiler, token ledger, skill registry, MCP gateway, eval harness. These are your IP regardless of which stack you build on

Every CEO has now asked the question: "can we build our own?"

The answer is yes. The harder question is what "our own" means. Running an LLM OS end-to-end is not buying a model. It is assembling a stack — endpoint, orchestration, retrieval, cowork runtime, governance, identity — against the four modes we have spent the last five posts dissecting. And in 2026 the stack goes all the way from renting an API call to owning a rack of B200s.

This is the final part of the series. It is the one your CFO will read.

The four modes, mapped onto the enterprise stack

Before you can decide what to build, you have to decide what to build for. Each of the four modes puts different demands on the stack.

  • Chat Mode needs a model endpoint. That is it. If all you ever shipped was Chat, every one of the stacks below would work for you.
  • Agent Mode needs the endpoint plus an orchestration layer — something that runs the think-act-observe loop, holds a tool registry, manages state, and enforces stopping criteria. LangGraph, CrewAI, Vercel AI SDK, Claude Agent SDK, or something custom.
  • Deep Research Mode needs the endpoint plus a multi-agent pattern plus retrieval infrastructure. GraphRAG, Vectara, Weaviate, or a custom pipeline. Plus an index over your corpus. Plus a citation store.
  • Cowork Mode needs all of the above plus a cowork runtime (Claude Code, Cursor, Codex — or a homegrown one) plus a skills library plus an MCP gateway plus the identity and authorization layer that makes any of this safe.

Put those four together and you have a stack. The question is which version of the stack — and, before that, which model class you are actually going to run on it.
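The dependency chain above can be written down as a capability map. A minimal sketch in Python; the mode and layer names are illustrative labels for this post, not any product's schema:

```python
# Hypothetical capability map: which stack layers each mode needs.
LAYERS_BY_MODE = {
    "chat": {"model_endpoint"},
    "agent": {"model_endpoint", "orchestration"},
    "deep_research": {"model_endpoint", "orchestration", "retrieval", "citation_store"},
    "cowork": {"model_endpoint", "orchestration", "retrieval", "citation_store",
               "cowork_runtime", "skills_library", "mcp_gateway", "identity_authz"},
}

def layers_needed(modes):
    """Union of stack layers required to support a set of modes."""
    required = set()
    for mode in modes:
        required |= LAYERS_BY_MODE[mode]
    return required

# Shipping only Chat? Any stack with an endpoint works.
print(sorted(layers_needed({"chat"})))  # ['model_endpoint']
```

The useful property is the union: add Cowork to a Chat-and-Agent portfolio and the required-layer set jumps from two entries to eight, which is the whole cost story of this post in one data structure.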

The model dimension

The most important shift in the last twelve months is the one most enterprises have not costed into their strategy yet. The near-frontier open-weight model class has caught up to within 5–15% of the frontier providers on most enterprise tasks. Up through 2024, choosing open weights meant accepting a 30–50% gap on anything hard. That gap is largely gone.

The names to know, roughly in order of appearance:

  • Kimi K2.6 (Moonshot AI) — a strong generalist with a permissive license, competitive on long-context and multi-step reasoning.
  • DeepSeek V4 (and the reasoning-focused R-series variants) — the cost-efficiency leader, exceptional on code and math benchmarks at a fraction of the frontier price per token.
  • Qwen 3 (Alibaba) — broad family covering dense, MoE, and multimodal variants, particularly strong in Mandarin and technical domains.
  • Llama 4 (Meta) — the most operationally mature OSS family, deepest ecosystem support, longest runway of fine-tuning tooling.
  • GLM 5 (Zhipu) — strong reasoning and agent-tool-use variant that punches above its parameter count.
  • Mistral's open releases — smaller, highly optimized variants that run well on constrained hardware and excel at European language coverage.

I name these not to pick a winner — that shifts quarterly — but to anchor the category. Near-frontier OSS in 2026 means an 800 billion to 2 trillion parameter model (dense or MoE), with a permissive-enough license that an enterprise can deploy it without negotiating a custom agreement, running on hardware the enterprise can rent or own. Pick whichever one is winning the month you stand up the workload; switch when the benchmarks shift.

What "near-frontier" buys you:

  • Deployment portability. The weights run wherever you put them. Managed Cloud, Neocloud, On-Prem — any of the three.
  • Licensing clarity. Most of the permissive-license variants can be deployed without the per-seat or per-token negotiations that frontier providers require for high-volume use.
  • Cost floor. Near-frontier OSS on a stack you run lands 40–70% below the equivalent frontier API call at high utilization.

What it costs you:

  • A real quality gap on the hardest multi-step reasoning, tool-use chains, and long-horizon planning tasks. Not 30%, but enough to matter for the top 10–20% of your enterprise workloads.
  • Weights management: versioning, eval, safety tuning, fine-tune lifecycle. The operational overhead that frontier providers hide inside their API.
  • Eval discipline: you now own the question "is this model good enough for this workflow?" — and you have to answer it with numbers, not vibes.

The key insight this series rests on for the enterprise build:

You do not pick a stack. You pick a model, which constrains your stack. Frontier-model workloads run only on Frontier API or, for a subset of models, Managed Cloud. Near-frontier OSS runs on Managed Cloud, Neocloud, or On-Prem. That relationship is what makes the stack decision tractable — you are not choosing from sixteen combinations, you are choosing from the subset that your model selection already narrowed down.
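That narrowing can be made mechanical. A sketch of the constraint table (the class and stack names are this post's labels, not a standard):

```python
# Hypothetical constraint table: the model class you pick narrows the stacks.
STACKS_BY_MODEL_CLASS = {
    "frontier": ["frontier_api", "managed_cloud"],  # managed cloud: subset of models
    "near_frontier_oss": ["managed_cloud", "neocloud", "on_prem"],
}

def candidate_stacks(model_class, needs_weights_portability=False):
    """Stacks still on the table after the model choice and one control need."""
    stacks = list(STACKS_BY_MODEL_CLASS[model_class])
    if needs_weights_portability:
        # Frontier API and Managed Cloud give you no weights to take with you.
        stacks = [s for s in stacks if s in ("neocloud", "on_prem")]
    return stacks

print(candidate_stacks("near_frontier_oss", needs_weights_portability=True))
# ['neocloud', 'on_prem']
```

Each control requirement you add (covered below in the control spectrum) is another filter on this list, which is why the decision is tractable: the filters do the work before cost analysis starts.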

The four enterprise stacks

[Interactive comparison: Running Your Own LLM OS: Four Enterprise Stacks. Frontier API to owned B200s. Toggle each stack to see which layers you own, which you rent, and where your control floor lands; the per-stack figures and crossing points are repeated in the sections below.]

1. Frontier API

The thinnest stack. You call OpenAI, Anthropic, Google directly. Your code talks HTTPS to their endpoints. You own almost nothing and rent almost everything. The governance floor is the provider's terms of service plus your own logging.

This is the right first move for most teams. It is fastest to stand up (days, not weeks), cheapest to explore (no committed capacity), and unblocked by procurement. It is also the most exposed: your data leaves your environment, your control plane is somebody else's black box, and your economics track the provider's price list.

Control level: 1 of 5.

Typical TCO at three scales:

  • Small (10M tokens/day): ~$30K per month
  • Medium (100M tokens/day): ~$300K per month
  • Large (1B tokens/day): ~$3M per month

At large scale, Frontier API TCO is brutal. At small scale, it is unbeatable.

2. Managed Cloud

AWS Bedrock, Azure AI, Google Vertex, Vercel AI Gateway, OpenRouter. Rent the hosting, choose the model. Frontier and OSS on the same plane. Data stays inside your cloud account or a VPC-isolated environment. Fine-tunes supported; reserved capacity (Azure PTUs, Bedrock provisioned throughput, Vertex reserved) available when you commit to volume.

This is where the AI gateway pattern lives — one gateway in front of every provider with unified observability, cost tracking, fallbacks, and model routing. It is also the stack that collapses the old Bedrock-versus-Azure-versus-Vertex three-way split into one architectural pattern: you pick a cloud vendor the way you pick a CRM, and the rest is procurement.

Control level: 2 of 5. You control model choice, data boundary, and (sometimes) fine-tune weights. You do not control the hosting, the inference stack, or the underlying hardware.

  • Small: ~$45K per month
  • Medium: ~$260K per month
  • Large: ~$2M per month

Managed Cloud is the right move for regulated enterprises that need data boundaries, hybrid model strategies, or unified observability across providers. Time to stand up: weeks. This is where most of the Fortune 500 operationally lives in 2026.

3. Neocloud

CoreWeave, Lambda, Crusoe, Nebius, Together, Fireworks, Replicate. You rent raw GPU capacity — H100, H200, or B200 — by the hour or by reservation. You run your own serving stack on top: vLLM, SGLang, TensorRT-LLM, or something you built. Hardware is rented; everything above the silicon is yours.

This is what I called "self-hosted vLLM" in earlier drafts of this series. It is more honest to call it Neocloud, because you are still renting — you are just renting hardware instead of inference. The difference matters: on Managed Cloud, the inference runtime is a black box. On Neocloud, it is your code.

Hosted GPU rates as of 2026: H100 runs $2–3 per hour, H200 runs $3–4, B200 and GB200 systems land $6–10 per GPU-equivalent depending on commitment. Sovereign options — in-country GPUs with jurisdiction guarantees — run 20–40% more but are table stakes for certain regulated workloads.
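At those rates, the monthly bill for a reserved node is quick arithmetic. A sketch; the $8 figure is just the midpoint of the band quoted above:

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_rental_usd(gpus, rate_per_gpu_hour, sovereign_premium=0.0):
    """Reserved-capacity monthly cost at a flat hourly rate."""
    return gpus * rate_per_gpu_hour * HOURS_PER_MONTH * (1.0 + sovereign_premium)

# 8x B200 at $8/GPU-hour, the middle of the $6-10 band
print(f"${monthly_rental_usd(8, 8.0):,.0f}/mo")       # $46,720/mo
# Same node on a sovereign offering at a 30% premium
print(f"${monthly_rental_usd(8, 8.0, 0.30):,.0f}/mo")
```

That ~$47K/month for rented Blackwell capacity is the number to hold in your head when the On-Prem section below quotes $30-45K all-in for an owned node: the gap between the two is the rent-versus-own decision in miniature.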

Control level: 4 of 5. You own the software stack top to bottom. The hardware is not yours, but it is dedicated (you are not sharing a GPU with someone else's workload during your committed hours), and the weights live in your account.

  • Small: ~$80K per month (minimum committed GPU reservation hurts you at this scale)
  • Medium: ~$250K per month
  • Large: ~$1.1M per month

Neocloud is where the world's biggest AI workloads have been moving since 2024. It is the stack that the serious AI-first companies run on — if your organization has a dedicated inference-optimization team, this is probably where you belong.

4. On-Prem (owned B200s)

The full stack on silicon you own. An 8x B200 HGX chassis is the standard unit — enough compute to serve near-frontier OSS models at enterprise throughput for a single workload or a small portfolio. Organizations needing more capacity pair nodes together (16 GPUs across an NVSwitch fabric), run multiple chassis in parallel, or step up to a full GB200 NVL72 rack (72 Blackwell GPUs in one NVLink domain) for serious enterprise AI platforms. Host it either in a colocation facility (the realistic path for most enterprises — Digital Realty is the category example) or in a sovereign-compliant data center you operate.

The numbers to internalize for an 8x B200 HGX node:

  • Capex: ~$400–500K (server + GPUs + NVLink fabric + networking + high-speed storage)
  • Power: ~10 kW sustained draw, often much higher on bursts
  • Cooling: 1.3–1.5x power overhead for air-cooled colo; lower for liquid
  • Colo space: ~$1–3K per month for rack space + power
  • All-in monthly: ~$30–45K amortized over 3–5 years, including a partial FTE for ops

Pair a second 8x B200 node (16 GPUs total across an NVSwitch fabric) and capex lands around $900K–1.2M. Step up to a full GB200 NVL72 rack and you are in the $3–4M capex range per rack. These numbers are indicative for 2026 pricing; your actual TCO depends on utilization, power rates, and how much of the ops work you staff internally.
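The all-in monthly figure is worth sanity-checking yourself. A sketch under stated assumptions: straight-line amortization, $0.12/kWh power, a 1.4x air-cooling overhead, and a partial ops FTE at $15K/month (all illustrative inputs, not quotes):

```python
def onprem_monthly_usd(capex, amort_years, colo_per_month, power_kw,
                       usd_per_kwh=0.12, cooling_overhead=1.4,
                       ops_cost_per_month=15_000):
    """All-in monthly cost: straight-line capex amortization + colo rack fee
    + power (with cooling overhead) + a partial FTE for ops."""
    amortized = capex / (amort_years * 12)
    power = power_kw * cooling_overhead * 730 * usd_per_kwh
    return amortized + colo_per_month + power + ops_cost_per_month

# 8x B200 node: $450K capex over 3 years, $2K/mo colo, ~10 kW sustained draw
print(f"${onprem_monthly_usd(450_000, 3, 2_000, 10):,.0f}/mo")
```

With those inputs the node lands around $31K/month, inside the $30–45K band above; stretch the amortization to 5 years and it drops toward the bottom of the band, which is why the amortization window you pick quietly dominates the TCO comparison.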

Control level: 5 of 5. Weights, hardware, network, power, physical access, export controls — all yours. This is the only stack that answers "yes" to every sovereignty and physical-access question.

Typical TCO at three scales, all-in (capex amortized + opex + labor):

  • Small (10M tokens/day): ~$170K per month (small cluster, poorly utilized)
  • Medium (100M tokens/day): ~$210K per month (multiple 8x B200 nodes, ~50% utilization)
  • Large (1B tokens/day): ~$800K per month (multi-rack, 70% utilization)

The TCO curve on On-Prem is the shallowest of the four stacks — the capex is fixed, so cost per token drops fastest as you fill the cluster. At 1B tokens/day with good utilization, On-Prem is roughly 75% cheaper than Frontier API for the same workload.

On-Prem is the right stack when one of three things is true: your utilization is high enough to amortize the capex, your data sovereignty or regulatory requirements forbid any of the rented options, or your organization has enough workloads that a shared GPU platform becomes a shared internal service. If none of those are true for you, do not go here yet.

Where the TCO curves cross

The TCO curves across these four stacks have predictable crossing points.

At small scale (under ~20M tokens per day), Frontier API wins on every dimension. Do not overthink it.

At medium scale (around 80M tokens per day), Managed Cloud and Neocloud cross. If your workloads are standardizing and you have the skills to run your own serving stack, Neocloud starts paying for itself here.

At larger scale (around 250M tokens per day, utilization-dependent), On-Prem crosses Neocloud. This is the threshold where owning the hardware begins to beat renting it — but only if you can fill the capacity.

At 1B tokens per day and above, On-Prem is the cheapest option by a wide margin (roughly 75% less than Frontier API), followed by Neocloud, then Managed Cloud, then Frontier. The gap between the top and the bottom of the stack is a full 3–4x on cost.
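You can reproduce the crossover logic from the per-stack figures in the sections above. A rough sketch using log-linear interpolation between the three point estimates; real curves are lumpier, especially where committed capacity and minimum reservations kick in:

```python
import math

# Monthly TCO points from this post: (tokens/day in millions, USD/month)
TCO_POINTS = {
    "frontier_api":  [(10, 30_000), (100, 300_000), (1000, 3_000_000)],
    "managed_cloud": [(10, 45_000), (100, 260_000), (1000, 2_000_000)],
    "neocloud":      [(10, 80_000), (100, 250_000), (1000, 1_100_000)],
    "on_prem":       [(10, 170_000), (100, 210_000), (1000, 800_000)],
}

def monthly_cost(stack, mtok_per_day):
    """Log-linear interpolation between the stack's three point estimates."""
    pts = TCO_POINTS[stack]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= mtok_per_day <= x1:
            t = math.log(mtok_per_day / x0) / math.log(x1 / x0)
            return y0 * (y1 / y0) ** t
    raise ValueError("outside the 10M-1B tokens/day band")

def cheapest(mtok_per_day):
    return min(TCO_POINTS, key=lambda s: monthly_cost(s, mtok_per_day))

print(cheapest(15))    # frontier_api
print(cheapest(1000))  # on_prem
```

Swap in your negotiated rates and actual utilization before trusting any crossover point; the shape of the answer is robust, the exact thresholds are not.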

The middle band — 20M to 500M tokens per day — is where most enterprises sit, and where the decision is least clean. This is also the band where the answer usually should be multiple stacks, not one. Chat traffic on Frontier. Agent orchestration on Managed Cloud with OSS models. Sensitive workloads on On-Prem. One stack per mode is not a failure of strategy; it is a mature recognition that each mode has different economics.

The control spectrum

Cost is one decision. Control is a different decision. The two overlap, but they are not the same, and the things that distinguish them are the things that cost analysis tends to miss:

  • Data sovereignty. Regulated workloads — healthcare PHI, federal ITAR-adjacent, EU regulated data, financial-services material non-public information — force specific hosting. That is not a cost decision; it is a permission decision. Some workloads simply cannot run on Frontier API at any price. Neocloud and On-Prem are the only answers for the hardest cases.
  • Weights portability. Frontier API and Managed Cloud give you no weights to take with you. If your fine-tunes are your moat, you need a stack that lets the weights live with you — which means Neocloud (where weights live in your storage) or On-Prem (where they live on your hardware).
  • Physical access and export risk. For certain threat models, the question is not "who logs in" but "who has physical access to the silicon." Only On-Prem gets you a full answer. For most enterprises, this does not matter. For some — national-security-adjacent, critical infrastructure, classified data — it is everything.

The practical implication: the right answer for most enterprises is not a single stack. It is the right stack per mode per workload under the right governance floor. A sophisticated 2026 enterprise AI portfolio runs exploratory Chat on Frontier, production Agent work on Managed Cloud, high-volume inference on Neocloud, and sovereign or regulated workloads on On-Prem — on the same control plane, with a single token ledger, accounted and governed together.

The reusable enterprise LLM OS

If you strip the stack-specific details away, every serious enterprise LLM OS has the same six reusable layers. These are the layers that are your IP regardless of which stack you build on.

  1. The control plane. The surface where modes, models, and policies are registered. Every inference call in the enterprise flows through this. It is where routing decisions happen, where rate limits live, where provider fallback logic runs. I wrote about this pattern specifically in Designing the AI Control Plane.
  2. The context compiler. The layer that assembles a Context IR from the user's request, retrieves, de-duplicates, filters, and budgets. The same layer that separates "retrieval systems" from "context compilers" in the Context Compilation Theory papers.
  3. The token ledger. Every call logged with who, what mode, how many tokens of each type, at what rate, against which budget. The token scorecard pattern made operational. Without this, you cannot manage the economics.
  4. The skill registry. Where your organizational behaviors — skills, custom instructions, rule sets — live. Named, versioned, reviewed, governed. Shared across cowork products.
  5. The MCP gateway. The single controlled boundary between cowork runtimes and internal systems. Scoped credentials, rate limits, logged calls, auditable requests. Every enterprise MCP integration goes through this.
  6. The eval harness. Ongoing evaluation — correctness, latency, cost, safety — against every mode the enterprise runs. Tests that run on every release, dashboards that CTOs look at on Mondays.
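Of the six, the token ledger is the one to stand up first, because the economics of everything else reports into it. A minimal sketch in Python; the field names are illustrative, not a standard schema:

```python
import time
from dataclasses import dataclass, field, asdict

@dataclass
class LedgerEntry:
    """One inference call: who, which mode, token counts, rates, budget line."""
    principal: str            # user or service identity
    mode: str                 # chat | agent | deep_research | cowork
    model: str
    input_tokens: int
    output_tokens: int
    usd_per_m_input: float    # price per million input tokens
    usd_per_m_output: float   # price per million output tokens
    budget_line: str
    ts: float = field(default_factory=time.time)

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens * self.usd_per_m_input
                + self.output_tokens * self.usd_per_m_output) / 1_000_000

def spend_by(ledger, key):
    """Roll up cost by any field, e.g. 'mode' or 'budget_line'."""
    totals = {}
    for entry in ledger:
        k = asdict(entry)[key]
        totals[k] = totals.get(k, 0.0) + entry.cost_usd
    return totals

ledger = [LedgerEntry("svc-research", "deep_research", "kimi-k2.6",
                      1_000_000, 100_000, 3.0, 15.0, "r-and-d")]
print(spend_by(ledger, "mode"))  # {'deep_research': 4.5}
```

The append-only record per call is the whole trick: once it exists, per-mode budgeting, chargeback, and the TCO curves above all become queries instead of estimates.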

These six layers are yours. They are the parts of the LLM OS that are genuinely enterprise IP, independent of whichever model, cloud, or silicon you use. If your 2026 AI investment is going into these layers, your strategy will survive the next three model releases and the next two hardware generations. If it is going into whichever particular model is winning this quarter, it will not.

Build-vs-buy, mode by mode

Here is the honest matrix. This is not what vendors will tell you; it is what the numbers tell you once you have tried all four stacks.

Chat Mode

  • Small or exploratory: Frontier API. Always. Every time.
  • Regulated or high-volume: Managed Cloud with model choice across frontier and near-frontier OSS.
  • At scale with predictable load: Neocloud running Kimi K2.6 / DeepSeek V4 / Llama 4 / Qwen 3, or On-Prem if you have the utilization and the sovereignty requirement.

Agent Mode

  • Orchestration: almost always build. LangGraph, Vercel AI SDK, Claude Agent SDK, or custom. The orchestration layer is your IP. It encodes how your enterprise uses agents.
  • Run ledger: always build. No vendor offers this at the shape you need.
  • Model: depends on cost sensitivity and how much tool-use quality matters. Frontier reasoning models still lead in tool-use reliability in 2026. Near-frontier OSS is viable for well-scoped agents, especially when you control the tool registry tightly.

Deep Research Mode

  • The product: buy if your use case is generic. OpenAI, Gemini, Claude, Perplexity are mature products and hard to beat on pure research quality.
  • Scoped to your data: build. Your KB-aware deep research is worth owning — it is where the audit trail and data sovereignty live.

Cowork Mode

  • The coworker product: buy. Cursor, Claude Code, Codex, ChatGPT Projects, Operator are sophisticated products that are accelerating much faster than most enterprises can build.
  • The skill library: build. Skills are the packaging of your institutional knowledge — named, versioned, reviewed behaviors. The library is yours.
  • SaaS API connectors: buy. Most 2026 enterprise SaaS ships MCP servers or equivalent connectors — Salesforce, ServiceNow, GitHub, Slack, Jira, Confluence, Workday, Snowflake. Use what the vendor ships, scope it, govern it.
  • Custom API connectors: build. The MCP servers that front your internal apps — order management, billing, claims, inventory, whatever your domain is — are your IP. Your skills hit these through the gateway.
  • The MCP gateway: build. The unified boundary between cowork runtimes and everything above — SaaS connectors, custom connectors, all of it. Scoped credentials, rate limits, logged calls, audit trail.
  • The governance layer: build. See the six reusable layers above.
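The gateway in that list is small enough to prototype in an afternoon. A sketch of the three behaviors named above (scoped credentials, rate limits, logged calls); the principal and tool names are invented for illustration:

```python
import time

class MCPGateway:
    """One controlled boundary between cowork runtimes and internal systems.
    Sketch only: scope checks, a crude per-principal rate limit, an audit log."""

    def __init__(self, scopes_by_principal, max_calls_per_minute=60):
        self.scopes = scopes_by_principal   # principal -> set of allowed tools
        self.max_rpm = max_calls_per_minute
        self.audit_log = []                 # (ts, principal, tool, outcome)
        self._recent = {}                   # principal -> recent call timestamps

    def call(self, principal, tool, handler, **kwargs):
        now = time.time()
        window = [t for t in self._recent.get(principal, []) if now - t < 60]
        if tool not in self.scopes.get(principal, set()):
            self.audit_log.append((now, principal, tool, "denied:scope"))
            raise PermissionError(f"{principal} is not scoped for {tool}")
        if len(window) >= self.max_rpm:
            self.audit_log.append((now, principal, tool, "denied:rate"))
            raise RuntimeError(f"{principal} exceeded {self.max_rpm} calls/min")
        self._recent[principal] = window + [now]
        result = handler(**kwargs)
        self.audit_log.append((now, principal, tool, "ok"))
        return result

gw = MCPGateway({"cowork-runtime-1": {"jira.search"}})
gw.call("cowork-runtime-1", "jira.search", lambda **kw: "3 issues", query="open bugs")
```

A production gateway adds credential scoping per tool, request/response schemas, and durable logging, but the shape stays the same: one choke point, every call either allowed and logged or denied and logged.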

The architectural point worth emphasizing: skills define what your coworker does. SaaS and custom API connectors define how it reaches your systems. MCP is the protocol between them. The gateway is the one place you govern all three. Miss any of those layers and the cowork stack either cannot do the work (no connectors), cannot share the work (no skills), or cannot be governed (no gateway).

The pattern: buy the consumer-facing products and the SaaS connectors; build the skills, the custom connectors, the gateway, and the control plane. This is the opposite of what most enterprises default to, and it is the reason most enterprise AI programs are over-invested in UX and under-invested in infrastructure.

The governance close

I have spent most of the last decade thinking about what makes enterprise systems durable. In every previous technology wave — mainframe to client-server, on-prem to cloud, SQL to NoSQL, monolith to microservices — the pattern has held: the thing that survives is the governance layer, not the execution layer.

LLM OS is no exception. The models will change. The products will change. The providers will consolidate and fragment and consolidate again. The B200 will be eclipsed by the B300 and whatever follows. The stack you pick in 2026 will not be the stack you run in 2028.

What will survive is the control plane, the context compiler, the token ledger, the skill registry, the MCP gateway, the eval harness, the cybernetic software delivery lifecycle wrapped around it, and the people who understand how the four modes compose into operational work. Those are the durable assets.

The model was never the investment. The operating system is.

What to do on Monday

Four final moves that compound across the entire series.

  1. Pick your stack per mode, not globally. The enterprise that runs Chat on Frontier, Agent on Managed Cloud with OSS, Deep Research scoped to a governed KB, and Cowork behind a controlled MCP gateway is not confused. It is mature. The one-stack-to-rule-them-all instinct is usually a procurement preference, not an architecture decision.
  2. Invest in the six reusable layers first, not the model. If your 2026 AI budget has more dollars going into model subscriptions than into control plane, context compiler, token ledger, skill registry, MCP gateway, and eval harness, you are funding the wrong half of the stack. Flip the ratio.
  3. Separate your cost decision from your control decision. Name the workloads that need each. Regulated workloads, weights portability, and physical access are not line items on a cloud bill — they are permissions, and they choose the stack before cost does. Your control floor for healthcare PHI or sovereign-data workloads is not Frontier API, no matter what the TCO says.
  4. Publish a mode-by-mode governance standard for your organization. One page. Which modes are allowed where. Which credentials. Which audit. Which approvals. Make it boring, make it specific, and make it live. That document is what separates AI you can defend from AI you cannot.

Closing the series

Six parts, four modes of execution (Chat, Agent, Deep Research, Cowork) plus the enterprise build that composes them, and one consistent argument: the LLM is an operating system, and the enterprise work is mapping the modes onto a stack you can actually operate.

You cannot govern "AI" at the enterprise. You govern the modes.

You cannot budget "AI." You budget the modes.

You cannot architect for "AI." You architect the composition of modes against your data, your people, your control floor, and your outcomes.

Thank you for reading. The rest of the work is now in your hands and your organization's hands — where it has always belonged.

Operate. Publish. Teach.

Modes of the LLM OS

Part 6 of 6