The Model Pulse
Issue archive.
A weekly read on the software side of the AI stack: model lineage, architecture, benchmarks, and vendor signals.
Each issue runs to about six minutes and is anchored to the LLM Evolutionary Tree, which the brief annotates each week. The cross-stack flywheel (capital, hardware, networking) lives in The AI Stack Weekly.
Published issues.
- Issue 07/Week 23 of 2026/
No new closed frontier shipped; the open/local agent substrate widened underneath it.
W23 did not produce the expected Gemini 3.5 Pro GA or a fresh Anthropic/OpenAI frontier release. That absence is the story: Claude Opus 4.8 remains the public closed-frontier leader for coding and agentic work, while Google kept Pro in the June watch window and the model layer's actual shipping activity moved down-stack. JetBrains released Mellum2, an Apache-2.0 12B/2.5B-active MoE designed for low-latency routing, RAG, summarization, validation, and sub-agent calls; NVIDIA released Cosmos 3 as an open physical-AI omni-model with Nano 16B and Super 64B variants; H Company released Holo3.1 with local computer-use sizes and quantized checkpoints. The procurement implication is sharper than another leaderboard reshuffle: production agent systems are becoming portfolios of models. Keep Opus/GPT/Gemini-class models for high-risk reasoning and codebase-scale orchestration, but push cheap, private, repeated sub-agent work into specialized open/local models. The tree delta therefore adds efficient-agent and physical-AI nodes rather than another general chatbot crown.
- Issue 06/Week 22 of 2026/
A quiet release week, a loud leaderboard: Anthropic retook the frontier with Opus 4.8.
W22 was the inverse of W21: almost nothing shipped, but the one release that did land reset the top of the board. Claude Opus 4.8 (May 28) retook #1 on the Artificial Analysis Intelligence Index at 61.4, edging GPT-5.5's 60.2 and opening a >10-point SWE-Bench Pro lead (69.2%) over the rest of the generally-available field. It is Anthropic's first public #1 since GPT-5.5 launched in April. Crucially, the win came with no list-price change ($5/$25 per Mtok held flat) while fast mode dropped ~3x and the cacheable-prompt minimum fell to 1,024 tokens, so buyers got capability and unit-economics gains in one 41-day cycle. The week's loudest 'delta' was a non-event: Gemini 3.5 Pro did not ship and slipped to its June GA window, leaving Opus 4.8 unchallenged at the closed frontier. Open weights were genuinely empty in-window (the only adjacent drop was NVIDIA's tri-mode Nemotron-Labs-Diffusion on May 23, just before the window opened), while DeepSeek made its 75% V4-Pro price cut permanent, removing the May 31 cost-cliff buyers had modeled. The read for procurement: re-baseline coding-agent evals on Opus 4.8 now, but pin effort levels before you do, because effort control and parallel-subagent orchestration mean a single SKU now spans a wide cost/quality curve.
- Issue 05/Week 21 of 2026/
Six new tree rows. The loudest model-layer week of Q2 — Gemini 3.5 + Omni, Qwen3.7-Max in the global top 5, three open-weights drops that reset what 'open' means.
W21 (May 18-23, 2026) was the loudest model-layer week of Q2. Google's I/O turned out larger than expected — the long-anticipated 'Gemini 3.2 Flash' shipped as Gemini 3.5 Flash, paired with native multimodal video generation in Omni Flash and the Antigravity 2.0 standalone agent IDE; Alibaba's same-week Cloud Summit pushed Qwen3.7-Max into the global AA Intelligence Index top 5 (a first for a Chinese model at 56.6, ahead of Gemini 3.5 Flash); three open-weights drops — Cohere Command A+ (first Apache 2.0 frontier-adjacent MoE), Microsoft Fara1.5 (browser agents that out-score OpenAI Operator and Gemini Computer Use), Tencent Hy-MT2 (translation MoE that runs offline on a phone) — reset what 'open' means at the frontier-adjacent tier. The structural shift underneath all six launches is the same: every flagship was engineered for sustained agentic loops first and chat second, with 1M-token context now table stakes and speed-per-task on the Pareto frontier replacing peak-intelligence as the procurement metric. Gemini 3.5 Flash hits Terminal-Bench 76.2% / MCP Atlas 83.6% at ~280 output tokens per second — outperforming prior-gen Gemini 3.1 Pro on agentic benchmarks at 4x throughput. Qwen3.7-Max demonstrated a 35-hour autonomous run with 1,158 tool calls. The unit of competition shifted from raw intelligence to runtime monetization and ecosystem lock-in. NVIDIA's Q1 FY27 print (May 20: $81.6B revenue, Vera Rubin announced for Q3 ship, Vera CPU already in OpenAI / Anthropic / Oracle / xAI hands, $145B supply commitments) confirmed the compute supply chain that makes all this possible is still ramping, not peaking. Huang named Anthropic alongside the hyperscalers as a Blackwell deployment customer — the compute-side demand matches the equity bid behind Anthropic's $30B / $900B round closing imminently per Bloomberg (May 22). The specialist canopy widened structurally. 
Cohere Command A+ (218B / 25B active MoE, Apache 2.0, runs on 2x H100s) closed the open-vs-closed gap on enterprise RAG and tool-use with τ²-Bench Telecom jumping 37% to 85% generation-over-generation; AA-Omniscience Non-Hallucination #1 at 86%. Microsoft Fara1.5-27B (open-weight, Qwen3.5 fine-tune) scored 72% Online-Mind2Web — beating OpenAI Operator (58.3%) and Gemini Computer Use (57.3%) — collapsing the per-task cost for browser-agent fleets. Tencent Hy-MT2 at 30B-A3B runs offline on a phone after 1.25-bit quantization. Notable absences: no DeepSeek R3, no Kimi K3, no Llama 5, no Grok 4.4 / 4.5, and Anthropic's I/O-week response was infrastructure (self-hosted Managed Agent sandboxes, MCP tunnels) rather than a new model. For the May 25 - June 13 window, the load-bearing catalysts are Anthropic's round close, the Samsung-union ratification vote May 27-28, Microsoft Build (June 2-3), Computex Taipei + NVIDIA GTC Taipei (June 1-5), and any frontier-text counter-launch from Anthropic / OpenAI.
- Issue 04/Week 20 of 2026/
Frontier text took a breath; the specialist canopy widened — video, on-device multimodal, open world-models — and agent-runtime monetization became the story.
Frontier text models took a breath this week — no new GPT-class or Claude refresh between W19 and Google I/O (May 19-20), and LMArena top-of-board moved less than 1 Elo. But the specialist canopy widened sharply, with three model rows joining the tree across video / embodied, on-device multimodal, and open-source world modeling. Perceptron Mk1 collapsed video and embodied reasoning costs by 80-90% on May 12, pricing frontier-grade physical-AI perception under Gemini Flash Lite ($0.15 / $1.50 per Mtok, 85.1 EmbSpatialBench, 72.4 RefSpatialBench) and forcing a re-cost of any 2H roadmap that ships robotics, surveillance, or screen-watching agents. NVIDIA's open-weights SANA-WM (May 15, Apache 2.0) put a minute-scale 720p world model on a single RTX 5090 in 34 seconds — robotics, AV simulation, and synthetic-data teams can now in-house what they previously rented from closed video APIs. OpenBMB's MiniCPM-V 4.6 1.3B (May 11) reset the low end of multimodal SLMs at 262K context with native iOS / Android / HarmonyOS deployment, compressing on-device feature timelines from quarters to weeks. On the platform layer, the unit of competition shifted from model intelligence to agent-runtime monetization. Anthropic's Code with Claude conference (May 6-12) made Managed Agents — Dreaming (between-session memory consolidation), Multiagent Orchestration, Outcomes (rubric-graded), and signed Webhooks — a hosted service; Claude Platform on AWS hit GA May 11; xAI shipped Grok Build CLI on May 14 with a SuperGrok Heavy tier at $300/mo; OpenAI's realtime voice trio (GPT-Realtime-2 / Translate / Whisper, added W19) went broadly available May 11 with 128K context and adjustable reasoning effort. Architects building bespoke agent stacks should justify build vs buy explicitly in 2H plans. The W18 Pentagon vendor-disqualification thread compounded. 
UK AISI published paired evaluations May 13 confirming autonomous offensive-cyber capability is now doubling every 4.7 months (aligned with METR's 4.2-month SWE figure), and Anthropic emailed Max-20x subscribers a June 15 policy that moves Claude Code third-party agents off subscription rate limits, a 12x-175x effective price increase per workload. Capability gating, federal procurement, and pricing structure are now durable vendor-risk dimensions; bench scores alone no longer settle a procurement decision. For the May 19-26 window, the load-bearing catalysts are Google I/O 2026 (Gemini 3.2 Flash / Pro confirmation, fabric session), NVIDIA Q1 FY27 (May 20), Microsoft Build (May 19-22), the Samsung HBM4 walkout starting May 21, and any Anthropic / OpenAI counter-launch following I/O.
- Issue 03/Week 19 of 2026/
Frontier text went quiet. Voice crossed into chain-of-thought. The week's biggest capability move was a compute deal.
W19 was the week the model layer split. Frontier text shifted into a maintenance cadence — no new flagship landed, the AA Intelligence Index ceiling held at GPT-5.5 (xhigh) = 60, and OpenAI's notable release was a tuned variant: GPT-5.5 Instant became the new ChatGPT default with 52.5% fewer hallucinations on high-stakes prompts. Google's Gemini 3.2 Flash surfaced in the iOS app and AI Studio without an announcement two weeks before I/O — leak-grade only. The center of gravity moved sideways into voice and into compute supply. OpenAI's Realtime trio (GPT-Realtime-2 + Translate + Whisper) shipped on May 7 with GPT-5-class reasoning baked into the audio stack — Big Bench Audio jumped 15.2 points, Audio MultiChallenge instruction-following 13.8 points. Inworld shipped Realtime TTS-2 on the same day with conversational turn-awareness and emotional steering. 'Voice model' is now a frontier tier with its own benchmarks (Big Bench Audio, Audio MultiChallenge, Speech Arena Elo), not a checkbox feature. The most consequential capability move was a compute deal. Anthropic + SpaceX (May 6) — 300+ MW at Colossus 1 in Memphis, ~220K NVIDIA GPUs within the month — paired with same-morning announcements that doubled Claude Code 5-hour rate limits, removed peak-hours throttling, and raised Opus API limits. The W19 read: the binding constraint on frontier deployment has flipped from model capability to power-and-GPU supply. The open frontier hardened on efficiency, not absolute capability. Zyphra ZAYA1-8B (May 6) hits AIME 2025 91.9% with 700M active parameters and Markovian RSA test-time compute — and is the first frontier-class MoE trained end-to-end on AMD MI300X + Pensando + ROCm via IBM Cloud. Allen AI EMO (May 8) shows 12.5% of experts can carry near-full performance via emergent semantic specialization. 
Three independent labs in two weeks have shipped reasoning-credible MoE models with ≤3B active parameters — architects pricing 70B-class inference should re-validate against the sub-billion-active band before committing 2027 capacity. For the May 12-22 window, the load-bearing catalysts are NVIDIA Q1 FY27 (May 20), Google I/O 2026 (May 19-20), Microsoft Build (May 19-22), the CoreWeave 10-Q (May 11-21), and the Anthropic round close.
- Issue 02/Week 18 of 2026/
Six new tree rows, two distribution earthquakes, and the open-weights ceiling closing to 6 points behind GPT-5.5.
W18 is the model layer's first true cadence test under the new weekly Pulse: six new tree rows across five vendors, two distribution earthquakes (Microsoft / OpenAI exclusivity formally ends, GPT-5.5 + Codex + Bedrock Managed Agents land natively on AWS), and the first independent UK AISI sabotage evaluation showing Opus 4.7 actually got safer than 4.6 while Mythos Preview showed a 65% reasoning-output discrepancy. The procurement read shifted in two ways: closed-frontier intelligence is now decoupled from cloud residency for the first time (Bedrock-native GPT-5.5 closes the Azure-only gap), and the open-weights ceiling closed to 6 points behind GPT-5.5 on the AA Intelligence Index — a year ago that gap was 13. The architecture signal: hybrid attention plus 11x-30x sparse-MoE ratios are now the default at both the open frontier (MiMo, DeepSeek, Kimi) and the specialist edge (OpenAI Privacy Filter, Laguna XS.2, Nemotron Nano Omni), so KV-cache and active-parameter footprint are diverging from total parameter count as the dominant procurement variables. The W17 watchlist's predicted closed-frontier coding response resolved as GPT-5.5 itself replacing GPT-5.3-Codex as the in-product Codex default rather than a separate SKU; Anthropic's response remains Opus 4.7 generally available with no -Codex variant — the era of named coding-tier sub-SKUs may be over for the leading two labs. The Pentagon also weighed in on Mythos before Anthropic did: a Friday May 1 eight-vendor classified-network procurement deal explicitly excluded Anthropic on supply-chain-risk grounds with Mythos cited as a separate national security moment, making capability-gating a federal-procurement disqualifier rather than only a research-org concern. For the May 2-9 window, the load-bearing catalysts are Anthropic's response to the AISI Mythos disclosure, Moonshot / DeepSeek free-token responses to Xiaomi's 100T Orbit Plan, and whether Google breaks its post-Gemini-3.1-Pro silence.
- Issue 01/April 2026 Recap/
April rewrote the floor: open weights crossed the closed frontier, and the safety-gated frontier became a procurement criterion.
April 2026 was the most productive month in frontier model history. Eleven model rows landed in the LLM Evolutionary Tree across eight vendors, including the first open-weights MoEs to score frontier-class on SWE-Bench Pro coding (Kimi K2.6 at 58.6 and GLM-5.1 at 58.4, both above GPT-5.4 at 57.7 and Claude Opus 4.6 at 57.3). DeepSeek V4 shipped in two configurations under MIT license with 1M-token context and 1.6T total parameters, moving on-prem coding from 'can we?' to 'which workload first?' Anthropic withheld its Mythos flagship on cyber-capability grounds after UK AISI confirmed autonomous offensive capability, an inflection that turns capability gating from a research-org concern into a procurement diligence requirement. The closed frontier responded: GPT-5.5 (Apr 23) and Claude Opus 4.7 (Apr 16) reset the closed-source ceiling, with adaptive-thinking and ultra-long-context as the new defaults. Read together, April was the month the canopy widened on three axes at once: open-vs-closed parity on coding, reasoning-as-default at every tier, and capability gating as a market signal.
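Several issues above quote total and active parameter counts for MoE releases, and Issue 02 frames the ratio between them as a procurement variable. A minimal sketch of that arithmetic follows, using only figures quoted in the archive. The function name `sparsity_ratio` is illustrative, and reading "30B-A3B" as 30B total with 3B active is an assumption based on common naming, not something the archive states.

```python
# (name, total params in billions, active params in billions),
# taken from the figures quoted in the issues above.
MODELS = [
    ("Mellum2", 12.0, 2.5),
    ("Command A+", 218.0, 25.0),
    ("Hy-MT2", 30.0, 3.0),  # assumption: "30B-A3B" = 30B total, 3B active
]


def sparsity_ratio(total_b: float, active_b: float) -> float:
    """Total-to-active parameter ratio for a sparse MoE (hypothetical helper)."""
    if active_b <= 0:
        raise ValueError("active parameter count must be positive")
    return total_b / active_b


ratios = {name: round(sparsity_ratio(t, a), 2) for name, t, a in MODELS}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

On these inputs the three models land at 4.8x, 8.72x, and 10.0x, all below the 11x-30x band that Issue 02 cites for a different set of models.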
The methodology.
- Every issue starts with the tree delta: the model rows added or updated in content/llm-tree/models.yaml since the prior issue.
- Frontier and open-weights releases are tagged by tier and architecture so procurement reads can scan to the relevant class.
- Architecture watch names patterns that crossed multiple vendors. One pattern, several exemplars, what it changes.
- Benchmark moves cite public leaderboards and vendor model cards as of publish date.
- Vendor signals cover the non-release moves: pricing, gating, deprecation, license changes.
- All sources are public. Independent analysis only.
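The tree delta in the first methodology point can be sketched as a comparison between two snapshots of the model rows. The archive does not show the schema of content/llm-tree/models.yaml, so the `id`, `tier`, and `architecture` fields, the sample rows, and the `tree_delta` function below are all hypothetical. The rows are written inline as dictionaries rather than parsed from YAML to keep the example dependency-free.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def tree_delta(previous: List[Row], current: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (added, updated) rows, keyed on the hypothetical 'id' field."""
    before = {row["id"]: row for row in previous}
    # A row is "added" if its id was absent from the prior snapshot.
    added = [row for row in current if row["id"] not in before]
    # A row is "updated" if its id existed before but any field changed.
    updated = [
        row for row in current
        if row["id"] in before and row != before[row["id"]]
    ]
    return added, updated


# Hypothetical snapshots standing in for two versions of the model rows.
previous = [
    {"id": "model-a", "tier": "frontier", "architecture": "dense"},
    {"id": "model-b", "tier": "open-weights", "architecture": "moe"},
]
current = [
    {"id": "model-a", "tier": "frontier", "architecture": "dense"},
    {"id": "model-b", "tier": "open-weights", "architecture": "hybrid-moe"},
    {"id": "model-c", "tier": "specialist", "architecture": "moe"},
]

added, updated = tree_delta(previous, current)
print("added:", [row["id"] for row in added])
print("updated:", [row["id"] for row in updated])
```

Keying on a stable identifier rather than list position means reordering rows in the file would not register as a change. Whether the real file has such a key is an assumption.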
Operate. Publish. Teach.