---
title: AI Shockwave Timeline
document: ai-shockwave-timeline
canonicalUrl: https://brianletort.ai/industry/shocks
schemaVersion: 2026.05.02
lastUpdated: '2026-05-29'
dateRange:
  start: '2024-02-15'
  end: '2026-05-28'
counts:
  shocks: 15
  archetypes: 5
  influenceLinks: 8
  sources: 15
archetypes:
  - efficiency
  - reasoning
  - context
  - open_frontier
  - interoperability
---

# AI Shockwave Timeline

*The events that reset frontier AI assumptions from 2024-02-15 to 2026-05-28, with before→after impact deltas. Last updated 2026-05-29.*

**Counts.** 15 shocks, 5 archetypes, 8 influence links, 15 sources.

## Definition

A "shock" is an event that materially reset a frontier assumption: how much capability can be bought per unit of compute, how much reasoning can be unlocked at inference time, or how far the bottleneck moved into memory, interconnect, and protocols. All metrics are official reported figures unless independently reproduced.

## Archetypes

- **Efficiency shock** (`efficiency`) — signature: $/GPU-hour, KV cache, tokens/sec, PFLOPS/W. The winning frontier increasingly comes from servability and cost per useful token, not only benchmark rank.
- **Reasoning shock** (`reasoning`) — signature: AIME, GPQA, Codeforces, thinking budget. Reasoning moved from "harder prompts" to an explicit computational budget allocated at inference time.
- **Context & interface shock** (`context`) — signature: context length, retrieval recall, latency, modalities. Context and real-time interaction became product features and an infrastructure burden at the same time.
- **Open-frontier shock** (`open_frontier`) — signature: params, context, license/distillation terms, ecosystem. The closed/open separation narrowed enough to change enterprise adoption and price expectations.
- **Interoperability & network shock** (`interoperability`) — signature: supported standards/partners, protocol, fabric bandwidth. The model stopped being the whole system; protocols and fabrics became the strategic choke points.

## Shocks

### Gemini 1.5 — 2024-02-15 (`gemini-1-5`)

**Long context becomes a production capability, not a demo.**

A new MoE architecture with 128K standard context, a 1M-token preview, and report evidence of near-perfect retrieval beyond 10M tokens.

Archetype: context. Layers: software. Magnitude: high (85/100). Confidence: high.

Impact:
  - _capability_ — Standard context window: 32K → 1M (↑ 31x) **[headline]**
  - _systems_ — Long-context retrieval recall: >99%
  - _capability_ — Tested retrieval ceiling: 10M tokens

Downstream — short: Long-document and long-video workflows become practical. medium: RAG assumptions weaken as more state fits in-context. long: Context length becomes a pricing and systems-architecture variable.

### NVIDIA Blackwell & GB200 NVL72 — 2024-03-18 (`nvidia-blackwell`)

**The rack becomes the accelerator.**

Fifth-generation NVLink and a 72-GPU rack exposing 130 TB/s of low-latency GPU communication, with claims of 30x real-time trillion-parameter inference vs the H100 era.

Archetype: efficiency. Layers: hardware, networking. Magnitude: high (88/100). Confidence: high.

Impact:
  - _economics_ — Real-time LLM inference: 1x (H100) → 30x (↑ 30x) **[headline]**
  - _economics_ — Cost & energy per inference: 1x → 0.04x (↓ 25x lower)
  - _systems_ — Intra-rack GPU bandwidth: 130 TB/s

Downstream — short: Hyperscaler capex pivots toward rack-scale systems. medium: Rack-scale "AI factory" design becomes standard. long: Interconnect bandwidth becomes a first-class competitive moat.

### DeepSeek-V2 — 2024-05-06 (`deepseek-v2`)

**KV-cache compression changes the economics of serving.**

A 236B-total / 21B-active MoE with Multi-head Latent Attention and DeepSeekMoE, reporting dramatic KV-cache and training-cost reductions.

Archetype: efficiency. Layers: software. Magnitude: high (84/100). Confidence: high.

Impact:
  - _economics_ — KV cache footprint: 100% → 6.7% (↓ -93.3%) **[headline]**
  - _economics_ — Training cost vs DeepSeek 67B: 100% → 57.5% (↓ -42.5%)
  - _systems_ — Max generation throughput: 1x → 5.76x

Downstream — short: Cheaper open serving on commodity GPU fleets. medium: KV compression and MLA become design references. long: Architecture shifts toward memory-aware inference efficiency.

### GPT-4o — 2024-05-13 (`gpt-4o`)

**Latency drops into conversation range.**

An end-to-end omni model for text, image, and audio with sub-second audio response latency and a 50% cheaper API than GPT-4 Turbo.

Archetype: context. Layers: software. Magnitude: high (82/100). Confidence: high.

Impact:
  - _capability_ — Audio response latency: ~2800 ms → 320 ms (↓ ~9x faster) **[headline]**
  - _economics_ — API price vs GPT-4 Turbo: 100% → 50% (↓ -50%)
  - _capability_ — Modalities in one model: text + image + audio

Downstream — short: Voice agents and live multimodal UX become credible. medium: Separate modality stacks look increasingly obsolete. long: Real-time multimodal inference becomes a baseline expectation.

### Llama 3.1 405B — 2024-07-23 (`llama-3-1-405b`)

**Open weights reach the frontier.**

The "first frontier-level open source AI model," with 128K context, training on over 15T tokens across more than 16,000 H100 GPUs.

Archetype: open_frontier. Layers: software. Magnitude: high (80/100). Confidence: high.

Impact:
  - _capability_ — Open-weight frontier parity: first frontier-level open model **[headline]**
  - _systems_ — Context window: 8K (Llama 2) → 128K (↑ 16x)
  - _economics_ — Training scale: >15T tokens / 16k H100

Downstream — short: Open-weight pilots expand inside enterprises. medium: Synthetic-data generation and distillation accelerate. long: Open weights place structural pricing pressure on closed APIs.

### OpenAI o1 — 2024-09-12 (`openai-o1`)

**Reasoning gets its own compute budget.**

A reasoning series designed to "spend more time thinking," with smooth gains from both train-time and test-time compute.

Archetype: reasoning. Layers: software. Magnitude: high (88/100). Confidence: high.

Impact:
  - _capability_ — AIME 2024 (pass@1): 13.4% (GPT-4o) → 74.4% (↑ +61 pts) **[headline]**
  - _capability_ — GPQA Diamond (pass@1): 77.3%
  - _systems_ — New scaling axis: test-time compute

Downstream — short: Separate "reasoner" tiers appear in model menus. medium: Thought-budget controls and model routing proliferate. long: Inference compute allocation rivals pretraining scale.

### Model Context Protocol — 2024-11-25 (`mcp`)

**Model-to-tool connectivity standardizes.**

An open standard for secure, two-way connections between AI tools and data sources, with a spec/SDK and reference servers.

Archetype: interoperability. Layers: networking, software. Magnitude: medium (65/100). Confidence: high.

Impact:
  - _systems_ — Model-to-tool integration: bespoke per-tool connectors → one open protocol **[headline]**
  - _capability_ — Reference servers at launch: Drive, Slack, GitHub, Git, Postgres, Puppeteer

Downstream — short: Bespoke connector duplication starts to look wasteful. medium: MCP server ecosystems expand across vendors. long: Tools and data become modular "ports" for any model.

### DeepSeek-V3 — 2024-12-26 (`deepseek-v3`)

**Open frontier quality at a visible GPU-hour cost.**

671B total / 37B active with FP8 mixed-precision training, multi-token prediction, and cross-node MoE communication overlap.

Archetype: efficiency. Layers: software, hardware, networking. Magnitude: high (90/100). Confidence: high. Builds on: deepseek-v2.

Impact:
  - _economics_ — Full training compute: 2.788M H800 GPU-hours **[headline]**
  - _economics_ — Decoding throughput (MTP): 1x → 1.8x
  - _systems_ — Training precision: BF16 → FP8 mixed

Downstream — short: API price pressure rises across the market. medium: Open infra stacks copy FP8, MTP, and comm overlap. long: "Model quality per GPU-hour" becomes a core frontier KPI.

### DeepSeek-R1 — 2025-01-20 (`deepseek-r1`)

**Open reasoning reaches parity and becomes distillable.**

R1 and R1-Zero showed large-scale RL could induce reasoning, with a cold-start multi-stage pipeline and six distilled smaller models.

Archetype: reasoning. Layers: software. Magnitude: high (95/100). Confidence: high. Builds on: deepseek-v3.

Impact:
  - _capability_ — AIME 2024 (pass@1): 79.8% **[headline]**
  - _economics_ — Reasoning openness: closed (o1) → open weights + 6 distilled
  - _capability_ — MATH-500: 97.3%

Downstream — short: A reasoning price war begins. medium: Dense distilled reasoners proliferate. long: Open reasoning becomes a research baseline and commodity.

### Claude 3.7 Sonnet — 2025-02-24 (`claude-3-7-sonnet`)

**Thinking budget becomes product UI.**

The "first hybrid reasoning model," with a visible extended-thinking mode, API control over thinking budget, and Claude Code.

Archetype: reasoning. Layers: software. Magnitude: medium (70/100). Confidence: high.

Impact:
  - _capability_ — Thinking mode: implicit → user-controlled budget to 128K **[headline]**
  - _systems_ — Agentic coding: Claude Code (terminal)

Downstream — short: Product teams expose thought-budget controls. medium: Agentic coding workflows mature. long: Unified fast+slow models replace fragmented model menus.

### Agent2Agent Protocol — 2025-04-09 (`a2a`)

**Agents become network peers.**

A protocol with 50+ launch partners using HTTP, SSE, and JSON-RPC, positioned as complementary to MCP for multi-agent tasks.

Archetype: interoperability. Layers: networking, software. Magnitude: medium (66/100). Confidence: medium.

Impact:
  - _systems_ — Cross-agent interoperability: single-vendor orchestration → vendor-neutral protocol **[headline]**
  - _capability_ — Launch partners: 50+

Downstream — short: Enterprise pilots for cross-agent workflows rise. medium: Vendor-neutral orchestration becomes easier. long: Agent meshes become an architecture category.

### NVLink Fusion — 2025-05-18 (`nvlink-fusion`)

**Custom silicon plugs into the NVIDIA fabric.**

Opened the NVLink ecosystem to semi-custom AI infrastructure, letting custom CPUs and ASICs pair with NVIDIA rack-scale systems.

Archetype: interoperability. Layers: hardware, networking. Magnitude: medium (64/100). Confidence: medium. Builds on: nvidia-blackwell.

Impact:
  - _systems_ — Custom silicon in NVIDIA fabric: NVIDIA CPUs/GPUs only → custom CPU/ASIC pairing **[headline]**
  - _economics_ — Scale-out networking: 800 Gb/s

Downstream — short: Sovereign and custom AI stack designs become plausible. medium: CPU/ASIC heterogeneity grows inside NVIDIA fabrics. long: The fabric becomes the platform, not only the GPU.

### NVIDIA Rubin & Vera Rubin NVL72 — 2026-01-05 (`nvidia-rubin`)

**Context memory and token economics dominate infra.**

72 Rubin GPUs and 36 Vera CPUs with 260 TB/s rack bandwidth, 20.7 TB GPU memory, and claims of 10x lower inference token cost than Blackwell.

Archetype: efficiency. Layers: hardware, networking. Magnitude: medium (72/100). Confidence: medium. Builds on: nvlink-fusion.

Impact:
  - _economics_ — Inference token cost vs Blackwell: 1x → 0.1x (↓ 10x lower) **[headline]**
  - _economics_ — GPUs to train MoE: 1x → 0.25x (↓ 4x fewer)
  - _systems_ — Rack bandwidth: 130 TB/s → 260 TB/s

Downstream — short: Roadmap resets for hyperscale and neocloud buyers. medium: Long-context and agentic inference infra is redesigned. long: Context-memory and storage fabrics join the model stack.

### DeepSeek-V4 Preview — 2026-04-24 (`deepseek-v4`)

**1M context becomes default and open.**

V4-Pro and V4-Flash with a 1M-token standard context (1.6T/49B active for Pro) at a fraction of the inference FLOPs and KV cache of V3.2.

Archetype: context. Layers: software, hardware. Magnitude: high (86/100). Confidence: medium. Builds on: deepseek-v3.

Impact:
  - _capability_ — Standard context window: 128K (V3) → 1M (↑ 8x) **[headline]**
  - _economics_ — Single-token inference FLOPs vs V3.2: 100% → 27% (↓ -73%)
  - _economics_ — KV cache vs V3.2 at 1M context: 100% → 10% (↓ -90%)

Downstream — short: Long-context agent apps become practical. medium: Hybrid full-context + retrieval replaces pure-RAG defaults. long: 1M context becomes a mainstream premium/open tier.

### Claude Opus 4.8 — 2026-05-28 (`claude-opus-4-8`)

**Frontier models run long, low-supervision agentic work reliably.**

A modest-but-tangible Opus upgrade focused on reliability and agentic judgment, launched with effort control and a Claude Code "dynamic workflows" mode that runs hundreds of parallel subagents for codebase-scale migrations.

Archetype: reasoning. Layers: software. Magnitude: medium (66/100). Confidence: high. Builds on: claude-3-7-sonnet.

Impact:
  - _capability_ — Unflagged code flaws vs Opus 4.7: 1x → 0.25x (↓ 4x fewer) **[headline]**
  - _economics_ — Fast-mode price vs prior models: 1x → 0.33x (↓ 3x cheaper)
  - _systems_ — Agentic scale (Claude Code): 100s of parallel subagents

Downstream — short: Effort control and parallel-subagent orchestration spread in agentic coding. medium: Long-running, low-supervision agent workflows move from demo to default. long: Frontier competition shifts from chatbot quality to autonomous-agent reliability.

## Influence edges

- openai-o1 → deepseek-r1: popularizes test-time scaling
- llama-3-1-405b → deepseek-r1: open-weight frontier precedent
- deepseek-r1 → claude-3-7-sonnet: open reasoning pressures hybrid productization
- mcp → a2a: complements (tool vs agent interop)
- nvidia-blackwell → deepseek-v3: rack-scale economics enable FP8 co-design
- gemini-1-5 → deepseek-v4: long-context design becomes mainstream
- nvidia-rubin → deepseek-v4: context-centric infra enables 1M serving
- deepseek-v2 → gpt-4o: memory-efficiency pressure on serving cost

---

Compact JSON: <https://brianletort.ai/industry/shocks/llm.json>. Raw YAML: <https://brianletort.ai/industry/shocks/shocks.yaml>. Canonical HTML: <https://brianletort.ai/industry/shocks>.