AI Architecture · Enterprise AI · Agents · Production Systems

Stochastic Core, Deterministic Shell: The Enterprise Agent Pattern That Holds Up

A lot of agent talk still sounds like old SaaS talk. In production, the pattern that works is simple: the core is stochastic, the shell is deterministic. You don't trust the agent—you bound it.

December 31, 2025 · 12 min read

A lot of "agent" talk still sounds like old SaaS talk. People ask about the UI, where users click, and which model to pick. Those questions matter sometimes. But they aren't the center of gravity in production.

In the real world, agents aren't apps you ship. They're workers embedded in your business capability graph. They show up when something happens—a ticket escalates, a payment fails, a contract renews—take constrained actions, leave an audit trail, and either resolve the work or hand it off.

If you design agents as apps, you'll still ship something. It just won't scale the way you want. The enterprise pattern that holds up is simpler: the core is stochastic (LLMs reason probabilistically), and the shell is deterministic (policies, contracts, workflows, logs, fallbacks).

You don't "trust" the agent. You bound it.

Agents Are Capabilities, Not Applications

The fastest way to stall an agent initiative is to treat the agent like a product you ship instead of a capability you operate. Applications are destinations—users navigate to them, click around, and leave. Capabilities are verbs—things the business can do repeatedly with predictable outcomes.

Think about the difference. When someone says "we're building an agent app," they usually mean a chat interface with some tools attached. When someone says "we're building agent capabilities," they mean reliable, measurable functions wired into how the business actually operates: reconciling invoices, triaging support tickets, routing security incidents, verifying vendor compliance, drafting renewal quotes, detecting and responding to failed payments.

An "agent app" can be a great demo. A capability is a system.

That distinction changes everything about the design. The interface is often an event stream, not a chat box. Success is throughput, accuracy, time-to-resolution, and risk—not novelty. The customer is the process owner, not the person who likes the UI. The deployable unit is a controlled workflow with permissions, not "an agent."

This isn't anti-application. Most agent capabilities still need front ends, dashboards, and integrations. The point is that UI is rarely the core of the system. The core is the capability and its control plane. If you've read my earlier piece on the AI-native computer, this connects directly: agents are the new "applications" in that computer, and they're composed of capabilities that interact with your knowledge layer and tools.

The Stochastic Core and Deterministic Shell

LLMs are not deterministic, and they never will be in the way enterprises mean it. Even when outputs look consistent, failure modes show up under load, ambiguity, and edge cases. So the question isn't "How do we make the model trustworthy?" It's "What do we wrap around the model so the system is trustworthy?"

The LLM is genuinely great at certain things: interpreting ambiguous inputs, classifying intent, extracting entities, proposing plans, selecting tools, and summarizing outcomes. These are the tasks where probabilistic reasoning adds real value—where you need flexibility and judgment rather than rigid rules.

But the LLM should not be the thing that decides permissions, executes high-risk actions directly, defines what "done" means, logs its own behavior, or decides whether it behaved safely. That's the shell's job.

A reliable shell usually includes four layers: policy and identity (who can do what, on which data, under what conditions), tool contracts (typed inputs/outputs with clear side-effect boundaries), observability (traces, tool calls, decisions, errors, recovery paths), and fallbacks (safe degradation, approvals, escalation, rollback). If the only safety mechanism is "prompting harder," you're relying on the least reliable part of the system.

A Production Blueprint: The Agent Control Loop

Most production agents follow the same control loop, and it's worth internalizing because it applies regardless of which framework or orchestration tool you use.

It starts with a trigger—an event arrives via webhook, queue, or schedule. Then comes context building—gathering only what matters, not "everything." This is where my earlier writing on context engineering comes into play: the context window is the new RAM, and what you put in it determines what the model can reason about.

Next, the LLM plans—proposing steps and selecting tools. This is the stochastic part. A policy check then validates the proposed actions against rules and identity. If they pass, the agent acts, calling tools through deterministic APIs. An observe step records decisions, tool calls, outputs, and errors. A verify step runs checks—validators, invariants, or secondary models. Finally, a decide step completes, retries, escalates, or requests approval.

The key insight is that this loop stays deterministic even if the planner is not. The LLM sits inside one step. Everything else enforces structure.
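Stripped down, the loop is ordinary code. Here's a minimal sketch in Python, where every helper (new_trace_id, build_context, llm_plan, policy_allows, tool_registry, log_action, verify, escalate, complete) is a hypothetical stand-in for your own services; only the planning call is stochastic:

def run_agent(event: dict) -> dict:
    trace_id = new_trace_id()                    # observability starts at the trigger
    context = build_context(event)               # deterministic: only what matters, not "everything"
    plan = llm_plan(event, context)              # stochastic: the model proposes steps and tools

    for action in plan.actions:
        if not policy_allows(action, context):   # deterministic: rules and identity
            return escalate(trace_id, event, plan, reason="policy_denied")
        result = tool_registry.execute(action)   # act: typed, bounded tool call
        log_action(trace_id, action, result)     # observe: the shell writes the logs

    if not verify(plan, context):                # verify: validators, invariants, secondary checks
        return escalate(trace_id, event, plan, reason="verification_failed")
    return complete(trace_id, plan)              # decide: complete, retry, escalate, or approve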

The Human-in-the-Loop Spectrum

The most important shift with AI agents isn't autonomy. It's how work gets decomposed.

When you stop treating agents like apps and start treating them like executable business capabilities, a few things change: work becomes composable, outcomes become observable, governance moves to runtime, and AI stops being a black box.

In practice, this isn't about "fully autonomous" versus "human-only" systems. It's a spectrum. Some capabilities run automatically. Some require human-in-the-loop decisions. Some intentionally pause, escalate, or collect input before acting.

That's why agentic systems that actually work look less like one smart agent and more like a capability graph: distinct functions, explicit inputs and outputs, clear authority boundaries, and control surfaces that decide when to act—and when not to.

The spectrum runs from fully automatic (low-risk, high-volume, reversible operations that execute without human involvement) through act-and-notify (autonomous action with async human awareness), pause-and-collect (gathering additional context mid-flow), recommend-and-approve (human signs off before execution), all the way to human-driven (AI assists, human leads).

The real question isn't "Can an agent do this?" It's "Should this capability act automatically, or should a human be in the loop?" That question gets answered differently for every capability—and it should. The answer depends on risk, reversibility, confidence, and the value of human judgment in that specific context.

Across active R&D and rapid iteration, this model has proven far more durable than treating agents as applications. The capability graph approach lets you dial each function to the right level of autonomy independently, rather than making a single global decision about how much to "trust the AI."
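One way to make that dial concrete is to declare the autonomy level per capability as configuration the shell enforces, rather than something baked into prompts. A minimal sketch, with hypothetical capability names:

from enum import Enum

class AutonomyLevel(Enum):
    FULLY_AUTOMATIC = "fully_automatic"              # low-risk, high-volume, reversible
    ACT_AND_NOTIFY = "act_and_notify"                # autonomous action, async human awareness
    PAUSE_AND_COLLECT = "pause_and_collect"          # gather additional context mid-flow
    RECOMMEND_AND_APPROVE = "recommend_and_approve"  # human signs off before execution
    HUMAN_DRIVEN = "human_driven"                    # AI assists, human leads

# Hypothetical per-capability settings; each function gets its own dial.
CAPABILITY_AUTONOMY = {
    "classify_support_ticket": AutonomyLevel.FULLY_AUTOMATIC,
    "update_ticket_status": AutonomyLevel.ACT_AND_NOTIFY,
    "issue_refund": AutonomyLevel.RECOMMEND_AND_APPROVE,
    "draft_renewal_quote": AutonomyLevel.HUMAN_DRIVEN,
}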

The Four Layers of the Deterministic Shell

Policy and Identity

This is where many projects fail quietly. They connect an agent to internal systems and call it innovation. In production, every action is an authorization decision.

The patterns that hold up in practice: separate "recommend" tools from "act" tools so you can grant read access broadly while controlling writes tightly. Enforce least privilege per capability, not per agent—a single agent might need different permissions for different operations. Use short-lived scoped credentials per run, not long-lived service accounts. Require approvals for high-risk actions like refunds, deletions, and escalations.
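Here's a sketch of what that authorization decision can look like in the shell, before any tool executes. The contract and run_identity objects are hypothetical; the shape matters more than the names:

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Decision:
    allowed: bool
    needs_approval: bool = False
    reason: str | None = None

def authorize(tool_name: str, contract, run_identity) -> Decision:
    # Read/recommend tools can be granted broadly; writes are controlled tightly.
    if contract.side_effects != "read" and tool_name not in run_identity.allowed_write_tools:
        return Decision(allowed=False, reason="tool not granted to this capability")

    # Short-lived, scoped credentials per run, not long-lived service accounts.
    if run_identity.expires_at < datetime.now(timezone.utc):
        return Decision(allowed=False, reason="run credentials expired")

    # High-risk actions (refunds, deletions, escalations) route to a human gate.
    if contract.requires_approval or contract.risk_tier >= 3:
        return Decision(allowed=True, needs_approval=True)

    return Decision(allowed=True)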

Tool Contracts (Typed, Testable, Boring)

Tools are where agents become real. Tools are also where agents can break things. A tool should be a contract, not a suggestion.

In TypeScript, treat tools like typed endpoints with explicit schemas:

import { z } from 'zod';

interface RetryConfig {
  maxRetries: number;
  backoff: 'none' | 'linear' | 'exponential';
}

interface ToolContract {
  name: string;
  description: string;
  inputSchema: z.ZodType;
  outputSchema: z.ZodType;
  sideEffects: 'read' | 'write' | 'delete';
  riskTier: 0 | 1 | 2 | 3 | 4;
  requiresApproval: boolean;
  timeout: number; // milliseconds
  retryPolicy: RetryConfig;
}

// Example: A bounded action tool
const updateTicketStatus: ToolContract = {
  name: 'update_ticket_status',
  description: 'Update the status of a support ticket',
  inputSchema: z.object({
    ticketId: z.string(),
    status: z.enum(['open', 'in_progress', 'resolved', 'closed']),
    resolution: z.string().optional(),
  }),
  outputSchema: z.object({
    success: z.boolean(),
    previousStatus: z.string(),
    newStatus: z.string(),
  }),
  sideEffects: 'write',
  riskTier: 2,
  requiresApproval: false,
  timeout: 5000,
  retryPolicy: { maxRetries: 2, backoff: 'exponential' },
};

The same principle applies in Python. Here's how you might structure it with Pydantic and LangChain:

from pydantic import BaseModel, Field
from typing import Literal
from langchain.tools import StructuredTool

class TicketUpdateInput(BaseModel):
    ticket_id: str = Field(description="The ticket ID to update")
    status: Literal["open", "in_progress", "resolved", "closed"]
    resolution: str | None = Field(default=None)

class TicketUpdateOutput(BaseModel):
    success: bool
    previous_status: str
    new_status: str

def update_ticket_status(ticket_id: str, status: str, resolution: str | None = None) -> dict:
    # Policy check happens BEFORE this function is called.
    # The shell validates; this just executes.
    # get_ticket / set_ticket_status are your own service-layer calls.
    previous = get_ticket(ticket_id).status
    set_ticket_status(ticket_id, status, resolution)
    # Return shape mirrors TicketUpdateOutput
    return {"success": True, "previous_status": previous, "new_status": status}

update_ticket_tool = StructuredTool.from_function(
    func=update_ticket_status,
    name="update_ticket_status",
    description="Update the status of a support ticket",
    args_schema=TicketUpdateInput,
    return_direct=False,
)

A "tool" that accepts free-form text and returns free-form text isn't a tool. It's an accident waiting to happen.

Observability

Enterprises adopt agents when they're operable. The minimum requirements: trace ID per run, structured logs of every tool call (with redaction rules for sensitive data), latency and error rates, retry counts, human approval events, and outcome metrics like time-to-resolution and escalation rate.

One rule that helps: the shell writes the logs—not the model. Don't ask the LLM to describe what it did. Record what actually happened at the infrastructure level.
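A sketch of the kind of structured record the shell might emit for each tool call; the field names and redaction rule are illustrative:

import json
import time

SENSITIVE_KEYS = {"email", "phone", "card_number"}

def redact(payload: dict) -> dict:
    # Apply your own redaction rules for sensitive data before anything is persisted.
    return {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v) for k, v in payload.items()}

def log_tool_call(trace_id: str, tool: str, inputs: dict, output: dict, error: str | None = None) -> None:
    # Written by the shell at the infrastructure level, never narrated by the model.
    record = {
        "trace_id": trace_id,
        "event": "tool_call",
        "tool": tool,
        "inputs": redact(inputs),
        "output": redact(output),
        "error": error,
        "ts": time.time(),
    }
    print(json.dumps(record))  # in practice: ship to your log pipeline or event store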

Fallbacks and Recovery

Every production system fails. The difference is whether it fails safely. The patterns we use: timeouts per step and per run, circuit breakers that stop calling flaky dependencies, safe mode behavior (read-only mode when something goes wrong), human-in-the-loop gates for risky actions, rollback tools where possible, and escalation paths that assign to a queue with full context.
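As one illustration, here's a minimal circuit breaker the shell can wrap around a flaky dependency. It's a sketch, not a production implementation:

import time

class CircuitBreaker:
    """Stop calling a flaky dependency after repeated failures; retry after a cool-down."""

    def __init__(self, max_failures: int = 3, reset_after: float = 60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: degrade to safe (read-only) mode or escalate")
            self.opened_at, self.failures = None, 0  # cool-down elapsed, try again
        try:
            result = fn(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise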

If your agent doesn't have a failure story, it doesn't have a production story.

Risk Tiers: The Only "Trust Model" That Works

Stop asking "Can we trust the agent?" Start asking "What risk tier is this action?"

A simple tiering model that works in practice: Tier 0 (Observe) covers summarizing, classifying, extracting, and drafting—no side effects. Tier 1 (Recommend) proposes actions with evidence but doesn't execute. Tier 2 (Act—Bounded) handles low-risk actions with strict validation. Tier 3 (Act—High Risk) requires approval, dual control, or rollback capability. Tier 4 (Autonomous) is rare—only appropriate when failure is cheap and reversible.
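In code, the tier can live on the tool contract (the riskTier field above) and map to the controls the shell enforces. A sketch with illustrative control flags:

# Illustrative mapping from risk tier to the controls the shell enforces.
TIER_CONTROLS = {
    0: {"executes": False, "requires_approval": False, "requires_rollback": False},  # Observe
    1: {"executes": False, "requires_approval": False, "requires_rollback": False},  # Recommend
    2: {"executes": True,  "requires_approval": False, "requires_rollback": True},   # Act, bounded
    3: {"executes": True,  "requires_approval": True,  "requires_rollback": True},   # Act, high risk
    4: {"executes": True,  "requires_approval": False, "requires_rollback": True},   # Autonomous (rare)
}

def enforce_tier(contract) -> None:
    controls = TIER_CONTROLS[contract.risk_tier]
    if contract.side_effects != "read" and not controls["executes"]:
        raise PermissionError(f"tier {contract.risk_tier} tools may not execute side effects")
    if controls["requires_approval"] and not contract.requires_approval:
        raise ValueError(f"tier {contract.risk_tier} tools must require approval")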

Most enterprises should live in Tier 0–2 for a while. That's not slow. That's sane.

Why "Model Choice" Is Usually the Wrong Obsession

Model choice matters, but it's not your architecture. If the agent is uncontrolled, the best model will still do the wrong thing faster. If the agent is controlled, multiple models can work—and you can swap them as the market changes.

The durable asset is the deterministic shell: policies, tool contracts, instrumentation, governance, runbooks. That's what survives model churn. This connects back to what I wrote about building RAG at scale—the retrieval infrastructure and data quality matter more than which embedding model you pick. The same principle applies to agents: the control plane outlasts any particular model choice.

How to Start: Build One Capability, Not One Agent

If you want to ship something real in 2–4 weeks, don't start with a "general agent." Start with one capability that has a clear definition of "done," a measurable KPI (cycle time, error rate, throughput), bounded actions, accessible data, and a safe fallback path.

Build the shell first. Define the tools. Define the policy. Define the observability. Define the fallback. Then add the stochastic core.

In n8n or similar orchestration tools, that often looks like: a webhook trigger, 2–5 deterministic steps, one LLM step for interpretation/planning, one approval gate, tool calls via HTTP nodes to your services, and logging to your event store.

In a Python codebase with something like LangGraph, the structure is similar:

import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, END

# llm, fetch_relevant_documents, parse_actions, policy_allows, tool_registry,
# log_action, and assign_to_queue are your own models, clients, and services.

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    context: dict
    proposed_actions: list
    policy_approved: bool
    results: list
    should_escalate: bool

def build_context(state: AgentState) -> AgentState:
    """Deterministic: gather only the relevant data"""
    context = fetch_relevant_documents(state["messages"][-1])
    return {"context": context}

def plan_actions(state: AgentState) -> AgentState:
    """Stochastic: LLM proposes what to do"""
    proposed = llm.invoke([
        ("system", "You are a support agent. Propose actions based on the provided context."),
        *state["messages"],
        ("user", f"Context: {state['context']}"),
    ])
    return {"proposed_actions": parse_actions(proposed)}

def check_policy(state: AgentState) -> AgentState:
    """Deterministic: validate against rules and the run's identity"""
    for action in state["proposed_actions"]:
        if not policy_allows(action, state["context"]):
            return {"policy_approved": False, "should_escalate": True}
    return {"policy_approved": True}

def execute_actions(state: AgentState) -> AgentState:
    """Deterministic: run approved actions through typed tools"""
    results = []
    for action in state["proposed_actions"]:
        result = tool_registry.execute(action)
        log_action(action, result)  # Shell writes the logs
        results.append(result)
    return {"results": results}

def escalate_to_human(state: AgentState) -> AgentState:
    """Deterministic: hand off to a human queue with full context"""
    assign_to_queue(state)
    return {"should_escalate": True}

def should_continue(state: AgentState) -> str:
    if state.get("should_escalate") or not state.get("policy_approved"):
        return "escalate"
    return "execute"

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("context", build_context)
workflow.add_node("plan", plan_actions)
workflow.add_node("policy", check_policy)
workflow.add_node("execute", execute_actions)
workflow.add_node("escalate", escalate_to_human)

workflow.set_entry_point("context")
workflow.add_edge("context", "plan")
workflow.add_edge("plan", "policy")
workflow.add_conditional_edges("policy", should_continue, {"execute": "execute", "escalate": "escalate"})
workflow.add_edge("execute", END)
workflow.add_edge("escalate", END)

app = workflow.compile()

The pattern is the same regardless of the implementation: deterministic steps wrap the stochastic core. The loop is controlled even when the planner isn't.

A Challenge to the Default Narrative

Here's my prediction. The winners won't be the teams with the cleverest prompts. They'll be the teams with the most boring, well-operated control planes.

Agents won't replace applications overnight. But they will change what applications look like. What survives is the capability layer: verbs, contracts, policies, and control surfaces wired into how work actually happens. If you've followed my AI-native computer series, this is the natural next step—once you understand that AI is the new computer, you start asking what reliable "programs" look like on that computer. This is it.

If you're building agent apps, you're early. If you're building agent capabilities with deterministic shells, you're on time.

The organizations that get this right will compound their advantage. Every capability they ship becomes a building block for the next. Every tool contract they define becomes reusable infrastructure. Every policy they encode becomes institutional knowledge. The shell isn't overhead—it's the asset that makes everything else possible.