The Model Pulse

Issue archive.

A weekly read on the software side of the AI stack: model lineage, architecture, benchmarks, and vendor signals.

Each issue runs to about six minutes and is anchored to the LLM Evolutionary Tree, which the brief annotates each week. The cross-stack flywheel (capital, hardware, networking) lives in The AI Stack Weekly.

Subscribe via RSS

Published issues.

Issue 14/Week 30 of 2026/July 25, 2026
Opus-class intelligence just moved into the mid-tier price band, while Google reset the fleet-economics floor
Anthropic's Jul 24 Claude Opus 5 release is not a conventional flagship upgrade. It delivers vendor-reported state-of-the-art coding and knowledge-work results at $5/$25 per million tokens, the same list price as Opus 4.8 and roughly half the price of Claude Fable 5. That compresses the gap between the premium and mid-tier on both capability and price: buyers no longer need to pay the Fable premium by default for hard knowledge work. The model adds adaptive thinking, a beta that lets applications change tool definitions during a conversation without invalidating the prompt cache, and beta automatic API fallbacks when a safety classifier blocks a request. Fast mode is roughly 2.5x faster at 2x the token price, so latency is now an explicit purchasable tier rather than a separate model. Google attacked the same market from below on Jul 21. Gemini 3.6 Flash holds the $1.50 input price of 3.5 Flash, cuts output to $7.50, and Google says it uses about 17% fewer output tokens on comparable tasks. Gemini 3.5 Flash-Lite lands at $0.30/$2.50 and roughly 350 output tokens per second. The important number is effective completed-task cost, not the rate card: fewer generated tokens compound across every reasoning and tool loop in a fleet. Gemini 3.5 Flash Cyber is a different category — a security-specialist model limited to governments and select partners through CodeMender — and should not be mistaken for a generally available procurement option. Gemini 3.5 Pro still has not shipped, while Google says Gemini 4 pretraining has started; the roadmap is moving before the delayed flagship closes its launch. DeepSeek supplied the migration lesson. The legacy deepseek-chat and deepseek-reasoner aliases were retired Jul 24 at 15:59 UTC with no redirect, forcing applications onto deepseek-v4-pro or deepseek-v4-flash and making thinking a parameter rather than a separate endpoint. 
That resolves only part of last week's prediction: the alias retirement occurred, but the clean GA and pricing evidence required for a full hit remains incomplete. Kimi K3 is the opposite watch item: weights and license remain unpublished as of this issue, with Moonshot's promised Jul 27 date still ahead. Net: model procurement is becoming a routing problem. Use Opus 5 when the work needs frontier judgment, Flash when fleet economics dominate, and treat availability, safety fallback behavior, token efficiency, and cache continuity as part of the model specification.
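The point about effective completed-task cost can be made concrete with a little arithmetic. The sketch below uses the Gemini 3.6 Flash list prices and the roughly 17% output-token reduction quoted in this issue; the token counts per step and the number of steps per task are hypothetical, chosen only for illustration. It holds the price constant to isolate the token-efficiency effect.

```python
def step_cost(input_tokens, output_tokens, input_price, output_price):
    """Dollar cost of one model call; prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# From the issue text: Gemini 3.6 Flash list prices and claimed token reduction.
INPUT_PRICE = 1.50
OUTPUT_PRICE = 7.50
OUTPUT_REDUCTION = 0.17

# Hypothetical workload shape (assumptions, not measurements).
INPUT_TOKENS = 20_000   # tokens sent per reasoning/tool step
OUTPUT_TOKENS = 4_000   # tokens generated per step before the reduction
STEPS_PER_TASK = 10     # reasoning and tool-loop steps per completed task

baseline_task = STEPS_PER_TASK * step_cost(
    INPUT_TOKENS, OUTPUT_TOKENS, INPUT_PRICE, OUTPUT_PRICE
)
efficient_task = STEPS_PER_TASK * step_cost(
    INPUT_TOKENS, OUTPUT_TOKENS * (1 - OUTPUT_REDUCTION), INPUT_PRICE, OUTPUT_PRICE
)
saving = 1 - efficient_task / baseline_task

print(f"baseline:  ${baseline_task:.3f} per task")
print(f"efficient: ${efficient_task:.3f} per task")
print(f"saving:    {saving:.1%}")
```

Under these assumed token counts, the per-task cost falls by 8.5% with no change to the rate card. The sketch keeps input tokens fixed per step, which understates the effect wherever generated tokens are fed back into later steps as context.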
Download executive brief (PDF)
Issue 13/Week 29 of 2026/July 18, 2026
Kimi K3 topped an LMArena flagship board, priced itself onto the closed vendors' rate card — and shipped without weights. Meanwhile the largest US-origin open-weights model actually landed, and OpenAI showed what it keeps for itself.
Moonshot's Kimi K3 (Jul 16) is the week's headline and its own cautionary tale. The 2.8T-parameter MoE became the first open-weight-lab model to top an LMArena flagship board (#1 Frontend Code Arena at 1,679 preliminary, ahead of Fable 5 and GPT-5.6 Sol) and debuted #4 on the AA Intelligence Index at 57.1 — but it launched hosted-only, with weights promised by Jul 27 and no license published, so Artificial Analysis classifies it proprietary in the interim. And the price tells the strategic story: $3/$15 per M tokens, flat across the full 1M context, is roughly 3x the K2.6 tier and matches Claude Sonnet 5's standard rate card — though with Sonnet's $2/$10 intro price running through Aug 31, K3 actually sits ~50% above the live street price. The 'end of cheap Chinese AI' read circulated on launch day (the-decoder; Simon Willison called K3 the most expensive Chinese-lab model to date); the procurement translation is ours: budget models on measured cost-per-task, not on lineage — see the AI Stack Weekly's price-band-convergence read for the market-structure claim. The week's actual open-weights event came from Thinking Machines Lab: Inkling (Jul 15), a 975B/41B-active trimodal MoE under clean Apache 2.0 with BF16 and NVFP4 checkpoints and day-0 support in every major serving stack — the largest US-origin open-weights model to date, positioned explicitly as a fine-tuning base rather than a chat flagship. And OpenAI's GPT-Red disclosure (Jul 15) marks the capability-gating precedent to watch: an internal-only red-teaming model trained at flagship post-training scale that beat human red-teamers 84% to 13% on indirect prompt injection and hardened GPT-5.6 Sol to a 0.05% direct-injection failure rate. 
The operational read across all three: treat 'open' as a license-and-weights fact, not a launch label; make prompt-injection robustness a quantified line item in model selection; and expect the strongest capabilities to increasingly ship as vendor-internal advantages rather than products.
Download executive brief (PDF)
Issue 12/Week 28 of 2026/July 11, 2026
Four frontier releases in four days — and the week's real lesson came from the community: Hy3 went from Apache-2.0 drop to local deployment in 30 hours, while Databricks proved the harness now matters more than the model.
The release cadence hit a new peak: Tencent's Hy3 (Jul 6), Grok 4.5 (Jul 8), and GPT-5.6 GA plus Meta's Muse Spark 1.1 (Jul 9) landed inside four days, each attacking a different axis. GPT-5.6 closed the loop on the government-gating story — full GA after a 12-day CAISI-coordinated preview, with Sol leading the AA Coding Agent Index at 80 but trailing Fable 5 badly on SWE-Bench Pro (64.6% vs 80%), and Terra at $2.50/$15 confirming the closed-lab repricing cycle. Grok 4.5 made cost-per-task the explicit pitch: $2/$6 pricing with a claimed ~4.2x output-token efficiency advantage over Opus 4.8, co-trained with Cursor. But the structural story of the week came from below. Hy3 shipped under clean Apache 2.0 (reversing its April preview's regional exclusions), and the community pipeline that followed — first GGUF quants with 1M context in ~30 hours, a llama.cpp pull request converting the model's MTP layer into speculative-decoding for +40% local throughput, and the colibri project running 744B GLM-5.2 in 25GB of consumer RAM — is the fastest release-to-local cycle recorded for a model this size. Pair that with Databricks' merged-PR benchmark (GLM 5.2 statistically tied with Opus 4.8 at $1.28 vs $1.94 per task; the same model costing 2x more per task in one harness than another) and the procurement conclusion is unavoidable: model choice is becoming a routing decision inside a harness-and-serving strategy, not a vendor commitment. Route by measured cost-per-completed-task, benchmark the harness as an independent variable, and treat the local-deployment channel — which mainstream coverage largely missed this week — as a leading indicator of where open-weight economics land next.
Download executive brief (PDF)
Issue 11/Week 27 of 2026/July 4, 2026
Washington entered the release pipeline: GPT-5.6 launched gated to ~20 approved partners, Fable 5 came back, and Sonnet 5 split sticker price from cost-per-task.
Within five days the US government both gated a new frontier launch and un-gated a suspended one. OpenAI previewed the GPT-5.6 family (Sol, Terra, Luna) on June 26 to roughly 20 government-approved partner organizations — the first US frontier model launched under a government-managed access list — while Commerce withdrew the Anthropic export-control order on June 30 and Claude Fable 5 returned globally on July 1. The procurement implication is direct: frontier availability is now partly a regulatory variable, so contracts and architectures must assume any flagship can be gated at launch or suspended after it, and multi-model routing is availability insurance, not just cost optimization. The week's biggest GA release, Claude Sonnet 5 (June 30, 1M context, 85.2% SWE-bench Verified, default for Claude Free/Pro), delivered the second structural lesson: at max effort its token appetite pushes measured cost to $2.29 per task — ~15% above Opus 4.8 — despite a far lower per-token price, so sticker price and cost-per-task have formally diverged. Buyers should route by measured cost-per-completed-task with effort level as an explicit parameter, and re-run that math before Sonnet 5's introductory pricing lapses on August 31. Meituan's MIT-licensed LongCat-2.0 rounds out the week by adding demand-proven open-weight pressure — and a claimed NVIDIA-free trillion-scale training run — to the cost lane.
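Routing by measured cost-per-completed-task with effort as an explicit parameter might look like the sketch below. Everything in it is an assumption for illustration: the `Measurement` record, the model names, and all figures are hypothetical placeholders for numbers a team would collect from its own evaluation harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    model: str               # hypothetical model label
    effort: str              # effort level used for the run
    cost_per_attempt: float  # measured dollars per attempt
    success_rate: float      # fraction of attempts that complete the task


def cost_per_completed_task(m: Measurement) -> float:
    """Expected dollars spent per successfully completed task."""
    return m.cost_per_attempt / m.success_rate


def cheapest(measurements, min_success_rate):
    """Pick the cheapest (model, effort) pair that clears the quality bar."""
    eligible = [m for m in measurements if m.success_rate >= min_success_rate]
    if not eligible:
        raise ValueError("no configuration meets the required success rate")
    return min(eligible, key=cost_per_completed_task)


# Hypothetical measurements; replace with figures from your own harness.
MEASUREMENTS = [
    Measurement("model-a", "max", 2.29, 0.90),
    Measurement("model-a", "medium", 0.80, 0.70),
    Measurement("model-b", "high", 1.90, 0.88),
]

strict = cheapest(MEASUREMENTS, min_success_rate=0.85)
relaxed = cheapest(MEASUREMENTS, min_success_rate=0.60)
print(strict.model, strict.effort)
print(relaxed.model, relaxed.effort)
```

The same model appears twice at different effort levels, so the choice is made over configurations rather than over vendors. Re-running the selection after a price change only requires updating `cost_per_attempt`.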
Download executive brief (PDF)
Issue 10/Week 26 of 2026/June 27, 2026
No new model reset the board; the model-layer story moved to availability, cost, and serving economics.
W26 was a stabilization week for the model layer. Claude Opus 4.8 remained the practical closed-frontier leader, GPT-5.5 stayed the primary OpenAI challenger, and GLM-5.2 / DeepSeek V4 Pro continued to define the open-weight cost-pressure lane. The important shift was not another flagship release; it was that model selection is now being mediated by serving constraints. Groq's $650M inference-cloud raise, Micron's HBM4 revenue/ramp data, and NVIDIA's Vera Rubin / Spectrum-X Ethernet Photonics production language all point to the same procurement reality: frontier quality matters, but agentic production workloads are bottlenecked by where tokens run, how memory is allocated, and whether the workflow can be verified and budgeted. The buyer implication is to stop treating the leaderboard as the procurement plan. Keep closed flagships for highest-risk reasoning, benchmark GLM-5.2 / DeepSeek V4 Pro for routine coding and high-volume tasks, and evaluate inference providers on latency, capacity, geography, and fallback semantics.
Download executive brief (PDF)
Issue 09/Week 25 of 2026/June 20, 2026
Open weights took the lead while the closed frontier stalled — and the benchmark itself was rewritten.
W25 was a loud week for open weights and a quiet one at the closed frontier. Z.ai shipped GLM-5.2 under a genuine MIT license — a ~744B / ~40B-active sparse-attention MoE with 1M context that independent testing (VentureBeat) says beats GPT-5.5 on several long-horizon coding benchmarks at roughly one-sixth the cost, and that Artificial Analysis now cites as the leading open-weight model. MiniMax-M3's sparse-attention weights matured in-window with an arXiv report validating its efficiency claims, though its non-OSI Community License gates commercial use. The closed frontier, by contrast, marked time: no GA from OpenAI (GPT-5.6 remains rumor) or xAI, Gemini 3.5 Pro slipped from June to July, and Anthropic's Claude Fable 5 stayed government-suspended the entire week (Opus 4.8 is the working leader). The third shift was measurement itself — Artificial Analysis rebased its Intelligence Index to v4.1, re-weighting the industry's headline benchmark around agentic tasks, so scores are no longer back-comparable to v4.0. The procurement implication: open self-host is now a live coding option, not a hedge; teams should pilot MIT-licensed GLM-5.2, read every 'open' license carefully (GLM-5.2 MIT vs MiniMax Community), and re-baseline evaluations on the agentic v4.1 index while keeping closed-frontier fallbacks given the demonstrated availability risk.
Download executive brief (PDF)
Issue 08/Week 24 of 2026/June 13, 2026
A new closed frontier shipped and was switched off in the same week; the durable open progress was a diffusion efficiency path.
W24's model story had two halves. Anthropic shipped Claude Fable 5 — a new Mythos-class tier above Opus — and it debuted #1 on the independent Artificial Analysis Intelligence Index at 64.9, roughly five points ahead of GPT-5.5. Three days later (Jun 12), a U.S. government export-control directive forced Anthropic to disable Fable 5 and its safeguards-lifted Mythos 5 sibling for every customer, routing queries back to Opus 4.8. So the public closed-frontier leader for half a week is, at week's end, unavailable — the working default is again Opus 4.8. Meanwhile the open layer's real progress was efficiency, not a new crown: Google DeepMind released DiffusionGemma, an Apache-2.0 text-diffusion model that generates ~1,000+ tok/s on an H100, and Cohere shipped North Mini Code, a cheap self-host coding MoE; MiniMax-M3 was announced (AA Index 55) but its weights are still pending. Gemini 3.5 Pro remained not-GA. The procurement implication is sharper than another leaderboard reshuffle: a top-tier closed model can now be revoked by a third party, so production systems need a standardized eval harness and a hard fallback router, with cheap open/local models carrying routine work and the closed frontier reserved for high-value reasoning.
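A hard fallback router of the kind described here can be sketched in a few lines. This is a minimal illustration, not any vendor's API: the `ModelUnavailable` exception and the backend callables are hypothetical stand-ins for whatever client code a team already has.

```python
class ModelUnavailable(Exception):
    """Raised by a backend when its model cannot serve the request."""


def route(prompt, backends):
    """Try each (name, call) backend in priority order; return the first answer.

    `backends` is an ordered list of (name, callable) pairs. A backend signals
    that it is gated, suspended, or otherwise down by raising ModelUnavailable.
    """
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except ModelUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends unavailable: " + "; ".join(errors))


# Hypothetical stubs standing in for real client calls.
def suspended_flagship(prompt):
    raise ModelUnavailable("access suspended")


def working_default(prompt):
    return f"answer to: {prompt}"


served_by, answer = route(
    "summarize the incident report",
    [("flagship", suspended_flagship), ("default", working_default)],
)
print(served_by, "->", answer)
```

Only the availability error triggers a fallback; other exceptions propagate, so a genuine bug in a backend is not silently masked by the next model in the list.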
Download executive brief (PDF)
Issue 07/Week 23 of 2026/June 6, 2026
No new closed frontier shipped; the open/local agent substrate widened underneath it.
W23 did not produce the expected Gemini 3.5 Pro GA or a fresh Anthropic/OpenAI frontier release. That absence is the story: Claude Opus 4.8 remains the public closed-frontier leader for coding and agentic work, while Google kept Pro in the June watch window and the model layer's actual shipping activity moved down-stack. JetBrains released Mellum2, an Apache-2.0 12B/2.5B-active MoE designed for low-latency routing, RAG, summarization, validation, and sub-agent calls; NVIDIA released Cosmos 3 as an open physical-AI omni-model with Nano 16B and Super 64B variants; H Company released Holo3.1 with local computer-use sizes and quantized checkpoints. The procurement implication is sharper than another leaderboard reshuffle: production agent systems are becoming portfolios of models. Keep Opus/GPT/Gemini-class models for high-risk reasoning and codebase-scale orchestration, but push cheap, private, repeated sub-agent work into specialized open/local models. The tree delta therefore adds efficient-agent and physical-AI nodes rather than another general chatbot crown.
Download executive brief (PDF)
Issue 06/Week 22 of 2026/May 30, 2026
A quiet release week, a loud leaderboard: Anthropic retook the frontier with Opus 4.8.
W22 was the inverse of W21: almost nothing shipped, but the one release that did ship reset the top of the board. Claude Opus 4.8 (May 28) retook #1 on the Artificial Analysis Intelligence Index at 61.4, edging GPT-5.5's 60.2 and opening a >10-point SWE-Bench Pro lead (69.2%) over the rest of the generally-available field. It is Anthropic's first public #1 since GPT-5.5 launched in April. Crucially, the win came with no list-price change ($5/$25 per Mtok held flat) while fast mode dropped ~3x and the cacheable-prompt minimum fell to 1,024 tokens, so buyers got capability and unit-economics gains in one 41-day cycle. The week's loudest 'delta' was a non-event: Gemini 3.5 Pro did not ship and slipped to its June GA window, leaving Opus 4.8 unchallenged at the closed frontier. Open weights were genuinely empty in-window (the only adjacent drop was NVIDIA's tri-mode Nemotron-Labs-Diffusion on May 23, just before the window opened), while DeepSeek made its 75% V4-Pro price cut permanent, removing the May 31 cost-cliff buyers had modeled. The read for procurement: re-baseline coding-agent evals on Opus 4.8 now, but pin effort levels before you do, because effort control and parallel-subagent orchestration mean a single SKU now spans a wide cost/quality curve.
Download executive brief (PDF)
Issue 05/Week 21 of 2026/May 23, 2026
Six new tree rows. The loudest model-layer week of Q2 — Gemini 3.5 + Omni, Qwen3.7-Max in the global top 5, three open-weights drops that reset what 'open' means.
W21 (May 18-23, 2026) was the loudest model-layer week of Q2. Google's I/O turned out larger than expected — the long-anticipated 'Gemini 3.2 Flash' shipped as Gemini 3.5 Flash, paired with native multimodal video generation in Omni Flash and the Antigravity 2.0 standalone agent IDE; Alibaba's same-week Cloud Summit pushed Qwen3.7-Max into the global AA Intelligence Index top 5 (a first for a Chinese model at 56.6, ahead of Gemini 3.5 Flash); three open-weights drops — Cohere Command A+ (first Apache 2.0 frontier-adjacent MoE), Microsoft Fara1.5 (browser agents that out-score OpenAI Operator and Gemini Computer Use), Tencent Hy-MT2 (translation MoE that runs offline on a phone) — reset what 'open' means at the frontier-adjacent tier. The structural shift underneath all six launches is the same: every flagship was engineered for sustained agentic loops first and chat second, with 1M-token context now table stakes and speed-per-task on the Pareto frontier replacing peak-intelligence as the procurement metric. Gemini 3.5 Flash hits Terminal-Bench 76.2% / MCP Atlas 83.6% at ~280 output tokens per second — outperforming prior-gen Gemini 3.1 Pro on agentic benchmarks at 4x throughput. Qwen3.7-Max demonstrated a 35-hour autonomous run with 1,158 tool calls. The unit of competition shifted from raw intelligence to runtime monetization and ecosystem lock-in. NVIDIA's Q1 FY27 print (May 20: $81.6B revenue, Vera Rubin announced for Q3 ship, Vera CPU already in OpenAI / Anthropic / Oracle / xAI hands, $145B supply commitments) confirmed the compute supply chain that makes all this possible is still ramping, not peaking. Huang named Anthropic alongside the hyperscalers as a Blackwell deployment customer — the compute-side demand matches the equity bid behind Anthropic's $30B / $900B round closing imminently per Bloomberg (May 22). The specialist canopy widened structurally. 
Cohere Command A+ (218B / 25B active MoE, Apache 2.0, runs on 2x H100s) closed the open-vs-closed gap on enterprise RAG and tool-use with τ²-Bench Telecom jumping 37% to 85% generation-over-generation; AA-Omniscience Non-Hallucination #1 at 86%. Microsoft Fara1.5-27B (open-weight, Qwen3.5 fine-tune) scored 72% Online-Mind2Web — beating OpenAI Operator (58.3%) and Gemini Computer Use (57.3%) — collapsing the per-task cost for browser-agent fleets. Tencent Hy-MT2 at 30B-A3B runs offline on a phone after 1.25-bit quantization. Notable absences: no DeepSeek R3, no Kimi K3, no Llama 5, no Grok 4.4 / 4.5, and Anthropic's I/O-week response was infrastructure (self-hosted Managed Agent sandboxes, MCP tunnels) rather than a new model. For the May 25 - June 13 window, the load-bearing catalysts are Anthropic's round close, the Samsung-union ratification vote May 27-28, Microsoft Build (June 2-3), Computex Taipei + NVIDIA GTC Taipei (June 1-5), and any frontier-text counter-launch from Anthropic / OpenAI.
Download executive brief (PDF)
Issue 04/Week 20 of 2026/May 17, 2026
Frontier text took a breath; the specialist canopy widened — video, on-device multimodal, open world-models — and agent-runtime monetization became the story.
Frontier text models took a breath this week — no new GPT-class or Claude refresh between W19 and Google I/O (May 19-20), and LMArena top-of-board moved less than 1 Elo. But the specialist canopy widened sharply, with three model rows joining the tree across video / embodied, on-device multimodal, and open-source world modeling. Perceptron Mk1 collapsed video and embodied reasoning costs by 80-90% on May 12, pricing frontier-grade physical-AI perception under Gemini Flash Lite ($0.15 / $1.50 per Mtok, 85.1 EmbSpatialBench, 72.4 RefSpatialBench) and forcing a re-cost of any 2H roadmap that ships robotics, surveillance, or screen-watching agents. NVIDIA's open-weights SANA-WM (May 15, Apache 2.0) put a minute-scale 720p world model on a single RTX 5090 in 34 seconds — robotics, AV simulation, and synthetic-data teams can now in-house what they previously rented from closed video APIs. OpenBMB's MiniCPM-V 4.6 1.3B (May 11) reset the low end of multimodal SLMs at 262K context with native iOS / Android / HarmonyOS deployment, compressing on-device feature timelines from quarters to weeks. On the platform layer, the unit of competition shifted from model intelligence to agent-runtime monetization. Anthropic's Code with Claude conference (May 6-12) made Managed Agents — Dreaming (between-session memory consolidation), Multiagent Orchestration, Outcomes (rubric-graded), and signed Webhooks — a hosted service; Claude Platform on AWS hit GA May 11; xAI shipped Grok Build CLI on May 14 with a SuperGrok Heavy tier at $300/mo; OpenAI's realtime voice trio (GPT-Realtime-2 / Translate / Whisper, added W19) went broadly available May 11 with 128K context and adjustable reasoning effort. Architects building bespoke agent stacks should justify build vs buy explicitly in 2H plans. The W18 Pentagon vendor-disqualification thread compounded. 
UK AISI published paired evaluations May 13 confirming autonomous offensive-cyber capability is now doubling every 4.7 months (aligned with METR's 4.2-month SWE figure), and Anthropic emailed Max-20x subscribers a June 15 policy that moves Claude Code third-party agents off subscription rate limits — 12x-175x effective price increase per workload. Capability gating, federal procurement, and pricing structure are now durable vendor-risk dimensions; bench scores alone no longer settle a procurement decision. For the May 19-26 window, the load-bearing catalysts are Google I/O 2026 (Gemini 3.2 Flash / Pro confirmation, fabric session), NVIDIA Q1 FY27 (May 20), Microsoft Build (May 19-22), the Samsung HBM4 walkout starting May 21, and any Anthropic / OpenAI counter-launch following I/O.
Download executive brief (PDF)
Issue 03/Week 19 of 2026/May 8, 2026
Frontier text went quiet. Voice crossed into chain-of-thought. The week's biggest capability move was a compute deal.
W19 was the week the model layer split. Frontier text shifted into a maintenance cadence — no new flagship landed, the AA Intelligence Index ceiling held at GPT-5.5 (xhigh) = 60, and OpenAI's notable release was a tuned variant: GPT-5.5 Instant became the new ChatGPT default with 52.5% fewer hallucinations on high-stakes prompts. Google's Gemini 3.2 Flash surfaced in the iOS app and AI Studio without an announcement two weeks before I/O — leak-grade only. The center of gravity moved sideways into voice and into compute supply. OpenAI's Realtime trio (GPT-Realtime-2 + Translate + Whisper) shipped on May 7 with GPT-5-class reasoning baked into the audio stack — Big Bench Audio jumped 15.2 points, Audio MultiChallenge instruction-following 13.8 points. Inworld shipped Realtime TTS-2 on the same day with conversational turn-awareness and emotional steering. 'Voice model' is now a frontier tier with its own benchmarks (Big Bench Audio, Audio MultiChallenge, Speech Arena Elo), not a checkbox feature. The most consequential capability move was a compute deal. Anthropic + SpaceX (May 6) — 300+ MW at Colossus 1 in Memphis, ~220K NVIDIA GPUs within the month — paired with same-morning announcements that doubled Claude Code 5-hour rate limits, removed peak-hours throttling, and raised Opus API limits. The W19 read: the binding constraint on frontier deployment has flipped from model capability to power-and-GPU supply. The open frontier hardened on efficiency, not absolute capability. Zyphra ZAYA1-8B (May 6) hits AIME 2025 91.9% with 700M active parameters and Markovian RSA test-time compute — and is the first frontier-class MoE trained end-to-end on AMD MI300X + Pensando + ROCm via IBM Cloud. Allen AI EMO (May 8) shows 12.5% of experts can carry near-full performance via emergent semantic specialization. 
Three independent labs in two weeks have shipped reasoning-credible MoE models with ≤3B active parameters — architects pricing 70B-class inference should re-validate against the sub-billion-active band before committing 2027 capacity. For the May 12-22 window, the load-bearing catalysts are NVIDIA Q1 FY27 (May 20), Google I/O 2026 (May 19-20), Microsoft Build (May 19-22), the CoreWeave 10-Q (May 11-21), and the Anthropic round close.
Download executive brief (PDF)
Issue 02/Week 18 of 2026/May 1, 2026
Six new tree rows, two distribution earthquakes, and the open-weights ceiling closing to 6 points behind GPT-5.5.
W18 is the model layer's first true cadence test under the new weekly Pulse: six new tree rows across five vendors, two distribution earthquakes (Microsoft / OpenAI exclusivity formally ends, GPT-5.5 + Codex + Bedrock Managed Agents land natively on AWS), and the first independent UK AISI sabotage evaluation showing Opus 4.7 actually got safer than 4.6 while Mythos Preview showed a 65% reasoning-output discrepancy. The procurement read shifted in two ways: closed-frontier intelligence is now decoupled from cloud residency for the first time (Bedrock-native GPT-5.5 closes the Azure-only gap), and the open-weights ceiling closed to 6 points behind GPT-5.5 on the AA Intelligence Index — a year ago that gap was 13. The architecture signal: hybrid attention plus 11x-30x sparse-MoE ratios are now the default at both the open frontier (MiMo, DeepSeek, Kimi) and the specialist edge (OpenAI Privacy Filter, Laguna XS.2, Nemotron Nano Omni), so KV-cache and active-parameter footprint are diverging from total parameter count as the dominant procurement variables. The W17 watchlist's predicted closed-frontier coding response resolved as GPT-5.5 itself replacing GPT-5.3-Codex as the in-product Codex default rather than a separate SKU; Anthropic's response remains Opus 4.7 generally available with no -Codex variant — the era of named coding-tier sub-SKUs may be over for the leading two labs. The Pentagon also weighed in on Mythos before Anthropic did: a Friday May 1 eight-vendor classified-network procurement deal explicitly excluded Anthropic on supply-chain-risk grounds with Mythos cited as a separate national security moment, making capability-gating a federal-procurement disqualifier rather than only a research-org concern. For the May 2-9 window, the load-bearing catalysts are Anthropic's response to the AISI Mythos disclosure, Moonshot / DeepSeek free-token responses to Xiaomi's 100T Orbit Plan, and whether Google breaks its post-Gemini-3.1-Pro silence.
Download executive brief (PDF)
Issue 01/April 2026 Recap/April 25, 2026
April rewrote the floor: open weights crossed the closed frontier, and the safety-gated frontier became a procurement criterion.
April 2026 was the most productive month in frontier model history. Eleven model rows landed in the LLM Evolutionary Tree across eight vendors, including the first open-weights MoEs to score frontier-class on SWE-Bench Pro coding (Kimi K2.6 at 58.6 and GLM-5.1 at 58.4, both above GPT-5.4 at 57.7 and Claude Opus 4.6 at 57.3). DeepSeek V4 shipped in two configurations under MIT license with 1M-token context and 1.6T total parameters, moving on-prem coding from 'can we?' to 'which workload first?' Anthropic withheld its Mythos flagship on cyber-capability grounds after UK AISI confirmed autonomous offensive capability, an inflection that turns capability gating from a research-org concern into a procurement diligence requirement. The closed frontier responded: GPT-5.5 (Apr 23) and Claude Opus 4.7 (Apr 16) reset the closed-source ceiling, with adaptive thinking and ultra-long context as the new defaults. Read together, April was the month the canopy widened on three axes at once: open-vs-closed parity on coding, reasoning-as-default at every tier, and capability gating as a market signal.
Download executive brief (PDF)

The methodology.

— Every issue starts with the tree delta — the model rows added or updated in content/llm-tree/models.yaml since the prior issue.
— Frontier and open-weights releases are tagged by tier and architecture so procurement reads can scan to the relevant class.
— Architecture watch names patterns that crossed multiple vendors. One pattern, several exemplars, what it changes.
— Benchmark moves cite public leaderboards and vendor model cards as of publish date.
— Vendor signals cover the non-release moves: pricing, gating, deprecation, license changes.
— All sources are public. Independent analysis only.
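The tree delta in the first methodology point can be pictured as a diff between two snapshots of the model rows. The sketch below is an illustration only: it assumes the rows have already been parsed from content/llm-tree/models.yaml into dictionaries, and the `id`, `tier`, and `license` fields are hypothetical, since the file's actual schema is not shown here.

```python
def tree_delta(previous, current):
    """Return (added_ids, updated_ids) between two snapshots of model rows.

    Each snapshot is a list of dicts with a unique "id" key (an assumed schema).
    A row counts as updated when its id exists in both snapshots but any field
    differs.
    """
    prev_by_id = {row["id"]: row for row in previous}
    added = [row["id"] for row in current if row["id"] not in prev_by_id]
    updated = [
        row["id"]
        for row in current
        if row["id"] in prev_by_id and row != prev_by_id[row["id"]]
    ]
    return added, updated


# Hypothetical rows for illustration.
last_issue = [
    {"id": "model-x", "tier": "frontier", "license": "proprietary"},
    {"id": "model-y", "tier": "open-weights", "license": "pending"},
]
this_issue = [
    {"id": "model-x", "tier": "frontier", "license": "proprietary"},
    {"id": "model-y", "tier": "open-weights", "license": "Apache-2.0"},
    {"id": "model-z", "tier": "specialist", "license": "MIT"},
]

added, updated = tree_delta(last_issue, this_issue)
print("added:", added)
print("updated:", updated)
```

Comparing whole rows rather than a version field means a license or tier change surfaces as an update, which matches the vendor-signal items the methodology tracks.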

Operate. Publish. Teach.
