
Beyond Attention: The Future of Transformer Architectures

A 2026 Survey

By Byiringiro Thierry · 2026-03

transformers mamba state-space moe test-time-training


1. Abstract

The Transformer (Vaswani et al., 2017) won. Nine years on, every frontier model — GPT-5, Claude 4.6/4.7, Gemini 3, DeepSeek-R2 — is a Transformer at its core. But the architectural monoculture has cracks. Three forces are pulling the field toward post-Transformer designs:

  1. KV-cache memory is the real wall, not FLOPs. At realistic serving batch sizes, a 128K-context workload on a 70B model can spend more bytes on KV-cache than on the weights themselves.
  2. Sequential reasoning workloads (agents, tool-use chains, long documents) want constant-time state updates that quadratic attention cannot give.
  3. On-device inference, increasingly the high-volume use case, needs models that run at 1B–7B with frontier-class long-context behavior — impossible with vanilla attention.

This paper surveys the architectures gunning for the post-Transformer crown and argues that no single one will displace the Transformer outright. The dominant model of 2027 will be a hybrid stack.

2. The Transformer, by 2026, in five facts

Before surveying alternatives, calibrate. The vanilla Transformer in 2026 is not the 2017 paper:

  • Rotary positional encoding (RoPE) is the default — absolute and learned position embeddings are deprecated. Most production models use NTK-scaled or YaRN-extended RoPE for context extension.
  • SwiGLU is the default FFN activation. ReLU and GELU are deprecated in frontier-scale models.
  • RMSNorm is the default normalization. LayerNorm survives only in legacy code.
  • Grouped-query attention (GQA) is universal at frontier scale — KV heads = 8 is the default; multi-query attention (MQA, KV heads = 1) is now reserved for inference-optimized deployments.
  • Flash Attention 3 and related kernels (FlashDecoding, ThunderKittens) make naive attention compute essentially free up to ~16K context. The bottleneck has moved decisively to memory bandwidth, not arithmetic.

In short: the "vanilla" Transformer of 2026 is RoPE + SwiGLU + RMSNorm + GQA + Flash. That is the baseline every alternative below has to beat.
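
For concreteness, here is a minimal PyTorch-style sketch of two of these defaults, RMSNorm and a SwiGLU feed-forward block (naming and dimensions are illustrative, not taken from any particular codebase):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale only, no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Gated feed-forward: silu(x W_gate) * (x W_up), projected back down with W_down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```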

3. Linear-attention variants

The first wave of post-Transformer architectures replaced softmax(QKᵀ) with a kernel that factorizes:

$\text{Attention}(Q, K, V) \approx \phi(Q) \cdot (\phi(K)^\top V)$

where φ is a feature map (random Fourier features in Performer, low-rank projections in Linformer, kernel functions in TransNormer/RetNet). Because the φ(K)ᵀV term can be accumulated token by token, compute becomes linear in sequence length and the whole context is summarized in a constant-size state.
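
A minimal sketch of that recurrent view, using elu(x)+1 as one common choice of φ (single head, unbatched; the feature map and the normalization term vary across the papers above):

```python
import torch
import torch.nn.functional as F

def phi(x: torch.Tensor) -> torch.Tensor:
    # One common positive feature map; Performer, Linformer, and RetNet each use their own.
    return F.elu(x) + 1.0

def linear_attention_recurrent(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (seq_len, d). Causal linear attention via a running d x d state."""
    n, d = q.shape
    S = torch.zeros(d, d)          # running sum of phi(k_t) v_t^T
    z = torch.zeros(d)             # running sum of phi(k_t), used for normalization
    out = torch.zeros_like(v)
    for t in range(n):
        S = S + torch.outer(phi(k[t]), v[t])
        z = z + phi(k[t])
        qt = phi(q[t])
        out[t] = (qt @ S) / (qt @ z + 1e-6)
    return out
```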

The dirty secret: linear attention loses non-trivial quality on retrieval-heavy benchmarks. The associative-recall task — given a sequence of (key, value) pairs and a query key, return the value — is the canary in the coal mine. Vanilla attention nails it; linear attention degrades sharply once more than a couple of key-value pairs must be tracked.
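
A toy generator for the task, handy for probing any architecture in this survey (token layout and vocabulary size are my own simplification, not the benchmark's exact format):

```python
import random

def make_associative_recall(num_pairs: int, vocab: int = 64, seed: int = 0):
    """Emit a sequence of (key, value) pairs followed by a query key;
    the target is the value that was bound to that key earlier in the sequence."""
    rng = random.Random(seed)
    keys = rng.sample(range(vocab), num_pairs)            # distinct keys
    values = [rng.randrange(vocab) for _ in keys]
    sequence = [tok for pair in zip(keys, values) for tok in pair]
    query = rng.choice(keys)
    target = values[keys.index(query)]
    return sequence + [query], target

# e.g. make_associative_recall(4) -> ([k1, v1, k2, v2, k3, v3, k4, v4, k2], v2)
```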

RetNet's "retention" formulation and Mamba-2's SSD interpretation both partially fix this by introducing structured state matrices, but neither matches softmax attention's flexibility on the recall task. As of 2026, linear attention is mostly a secondary layer in hybrid stacks — never the only mechanism.

4. State-space models (SSMs)

Mamba (Gu & Dao, 2023) and its descendants are the most exciting post-Transformer family. The core idea: replace attention with a selective state-space recurrence:

$h_t = A_t h_{t-1} + B_t x_t, \qquad y_t = C_t h_t$

where A, B, and C are input-dependent (the "selective" part — the earlier S4 used time-invariant dynamics). Compute is linear in sequence length, the state is bounded, and training is parallelized via an associative (parallel) scan.
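
A minimal, unoptimized sketch of that recurrence for a single channel with a diagonal state (the production kernel fuses this into a hardware-aware parallel scan; discretization of A and B is omitted):

```python
import torch

def selective_ssm(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """x: (seq_len,) input channel; A, B, C: (seq_len, state_dim), all input-dependent.
    Implements h_t = A_t * h_{t-1} + B_t * x_t ;  y_t = <C_t, h_t>."""
    seq_len, state_dim = A.shape
    h = torch.zeros(state_dim)
    ys = []
    for t in range(seq_len):
        h = A[t] * h + B[t] * x[t]       # selective update: A_t, B_t depend on the input
        ys.append((C[t] * h).sum())      # readout through C_t
    return torch.stack(ys)
```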

2026 status of the SSM family:

  • Mamba-1 (Dec 2023) — original selective SSM. Proved SSMs could match Transformers on language modeling at <3B scale.
  • Mamba-2 (May 2024) — recast as structured state-space duality; unifies SSMs with linear attention. 8× throughput improvement.
  • RWKV-7 (2025) — receptance-weighted key-value; finite-state-machine interpretation. Currently the most production-deployed SSM (used in commercial chatbots for low-latency inference).
  • Hyena Hierarchy — long convolutions instead of attention. Competitive at 1B scale but doesn't generalize cleanly past 7B.

Where SSMs win: long-context (>64K tokens), low-latency single-token decode, on-device. RWKV-7 at 3B runs ~3× faster than Llama-3.2-3B on iPhone 15 Pro Max.

Where SSMs lose: copying tasks, exact retrieval, multi-step in-context reasoning. The compressed state is lossy, by design. A 70B Mamba cannot reliably perform a 30-step planning chain that a 70B Transformer can.

This is the unsolved problem for pure-SSM scaling.

5. Hybrid stacks

The pragmatic resolution: interleave attention and SSM layers. This is what the most successful 2025-2026 frontier-adjacent models did:

  • Jamba (AI21 Labs, March 2024) — 1:8 attention:SSM ratio. 52B total params, 12B active (MoE). 256K context.
  • Zamba2 (Zyphra, 2025) — 2:6 ratio. Competitive perplexity at half the parameter count of vanilla Mamba.
  • Samba (Microsoft, 2024) — sliding-window attention + selective SSM. Strong long-context performance.

The architectural argument: attention layers handle the precise retrieval (the things SSMs struggle with), while SSM layers carry the bulk of the sequential bandwidth (the things attention is wasteful on). The empirical evidence supports this: hybrid models outperform pure Transformers and pure SSMs at matched compute on most general benchmarks.

My bet: the 2027 frontier model will be a hybrid. Expect an attention:SSM ratio in the 1:4 to 1:8 range, with sliding-window attention (not full attention) for the attention layers.
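
A sketch of what such an interleave looks like as a layer-type schedule, here with one sliding-window attention layer out of every eight and SSM blocks elsewhere (the exact ratio and placement are tuning decisions, not fixed by any of the models above):

```python
def hybrid_schedule(num_layers: int, attention_every: int = 8) -> list[str]:
    """Return a layer-type list: one sliding-window attention layer per
    `attention_every` layers, selective SSM layers everywhere else."""
    return [
        "sliding_window_attention" if (i + 1) % attention_every == 0 else "ssm"
        for i in range(num_layers)
    ]

# hybrid_schedule(32) places attention at layers 8, 16, 24, 32 and SSMs elsewhere (a 1:7 mix).
```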

6. Mixture of Experts (MoE)

Orthogonal to the attention-vs-SSM debate: how to scale parameters without scaling per-token compute. The answer for the last three years has been MoE — route each token to a small subset of expert FFN modules.

  • DeepSeekMoE / DeepSeek-V3 — fine-grained experts (256+ small experts, top-8 routing). Demonstrated frontier-quality inference at <100B active params.
  • Mixtral 8×22B / 8×7B — coarse experts. Less expressive but easier to serve.
  • Switch Transformer descendants — load-balancing losses, capacity factors, expert dropout.
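
A minimal sketch of the per-token top-k gating itself (expert count and k mirror the fine-grained style described above but are illustrative; load-balancing losses, capacity limits, and the expert FFNs themselves are omitted):

```python
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Score every expert for each token, keep the top k, and renormalize their weights."""
    def __init__(self, dim: int, num_experts: int = 256, k: int = 8):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.k = k

    def forward(self, h: torch.Tensor):
        # h: (num_tokens, dim) -> expert_ids, weights: (num_tokens, k)
        logits = self.gate(h)
        weights, expert_ids = torch.topk(logits, self.k, dim=-1)
        weights = torch.softmax(weights, dim=-1)   # mixture weights over the chosen experts
        return expert_ids, weights
```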

Open problems in 2026:

  • Expert-imbalance failure modes. Even with auxiliary load-balancing losses, real workloads create hot experts. Production serving needs dynamic re-balancing.
  • Routing under context. Routing decisions made on a single token's hidden state, with no context awareness, are obviously sub-optimal. Test-time routing optimization (route-then-recompute) is an active research area.
  • Expert merge. When two experts converge to nearly-identical functions, they should be merged in-place. Current frontier MoE training does this manually; we need automated routines.

7. Test-Time Training (TTT)

The most genuinely novel direction is Test-Time Training (Sun, Kim, Kakade et al., 2024, with frontier-scale results arriving through 2026). Reframe the question: what if the model's hidden state itself learns during inference?

Standard Transformer: forward pass is deterministic, weights are frozen.

TTT: each token's processing includes a few SGD steps that update an inner model (a small MLP or transformer) on a self-supervised loss derived from the input. The "hidden state" becomes a learned model that updates online.

This generalizes linear attention (which falls out as the special case of a linear inner model with a particular self-supervised loss) and gives qualitatively new capabilities: continual learning within a single sequence, much stronger in-context behavior past 256K tokens, and natural extension to non-text modalities.
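
A heavily simplified sketch of the idea, with a linear inner model updated by one gradient step per token on a reconstruction-style loss (the actual objective, update rule, and mini-batching in the paper differ; this only illustrates "the hidden state is itself a trained model"):

```python
import torch

def ttt_layer(xs: torch.Tensor, inner_lr: float = 0.1) -> torch.Tensor:
    """xs: (seq_len, dim). The 'hidden state' is a linear model W, updated online."""
    dim = xs.shape[1]
    W = torch.zeros(dim, dim, requires_grad=True)
    outputs = []
    for x in xs:
        pred = W @ x
        loss = ((pred - x) ** 2).mean()            # illustrative self-supervised target
        (grad,) = torch.autograd.grad(loss, W)
        with torch.no_grad():
            W -= inner_lr * grad                   # one inner SGD step = one state update
        outputs.append((W @ x).detach())           # read out with the updated inner model
    return torch.stack(outputs)
```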

The 2026 status: works at 1B–3B; scaling laws past 7B are not yet established. Inference overhead is the killer — every token now includes inner-loop gradient steps, costing 2-4× a vanilla Transformer forward pass. Hardware-aware kernels (analogous to Flash Attention's role for vanilla Transformers) are the bottleneck.

If TTT scales, it is the most fundamental architectural change since 2017.

8. The 2027 stack — my prediction

Synthesizing the above, the dominant frontier architecture in late-2027 / early-2028 will be:

[Embedding + RoPE]
  ↓
[Block 1]  Mamba-2-style selective SSM  (constant-state, fast long-context)
[Block 2]  Sliding-window attention (1K window, no full O(n²))
[Block 3]  MoE FFN (DeepSeek-style, 256 experts, top-8 routing)
  ↓
[Block 4]  SSM
[Block 5]  Sliding-window attention
[Block 6]  MoE FFN
  ↓
  ... repeat × 24-32 ...
  ↓
[TTT inference-time correction over the last 4-8K tokens]
  ↓
[RMSNorm + lm_head]

Predicted properties:

  • 200B–1T total params; 25-50B active per token (MoE sparsity 4-5%)
  • 1M+ context, with sub-quadratic memory growth
  • 5-10× cheaper inference per token than 2026 frontier dense Transformers
  • Online learning within a sequence via TTT — enabling agent systems to "remember" within long-horizon tasks

9. Implications

For agent infrastructure. If TTT scales, "context engineering" — the labor of cramming task state into prompts — becomes less central. The model itself becomes a stateful agent. Tools like LangChain, which exist largely to wrangle that external state, lose much of their reason to exist.

For on-device inference. Pure SSM (RWKV-7-style) and small hybrid models become the on-device default. Frontier-quality at 1B–3B parameters in 2027 is plausible.

For frontier model economics. Inference costs continue to plummet, but only for hybrid/MoE/TTT architectures. Vanilla-Transformer providers ship inference at 5-10× the cost of the new stack — a margin pressure equivalent to AWS pricing out single-server competitors.

For evaluation. Benchmarks designed around 2017-2024 Transformers (and especially around the strict prompt/response, deterministic-forward-pass paradigm) become less informative. Continual-learning, multi-day-agent, and rolling-context benchmarks become first-class.

10. Conclusion

Attention is not dead. It's promoted — from "the whole architecture" to "one of three primitives in a hybrid stack." The 2027 dominant architecture will not be one new thing; it will be a careful blend of selective SSMs, sliding-window attention, MoE routing, and (probably) Test-Time Training. The frontier model that wins is not the one with the best attention mechanism — it's the one whose engineering team can ship the most complex hybrid pipeline without breaking it.

That, more than any specific mechanism, is the bet to make.


References

  1. Gu & Dao — Mamba: Linear-Time Sequence Modeling with Selective State Spaces (2023)
  2. Dao & Gu — Transformers are SSMs: Mamba-2 (2024)
  3. Peng et al. — RWKV-7 "Goose": Receptance Weighted Key Value (2025)
  4. AI21 Labs — Jamba: A Hybrid Transformer-Mamba Language Model (2024)
  5. Sun, Kim, Kakade et al. — Test-Time Training: A Foundation for Continual Learning (2024)
  6. DeepSeek-AI — DeepSeek-V3 Technical Report (2024)
  7. Shazeer — Fast Transformer Decoding: One Write-Head is All You Need (2019)
  8. Dao et al. — Flash Attention 3 (2024)