Casa de Brain

Latest Transformer progress in LLM — Attention / MoE tricks, complexity and effective long context

MiniMax — MiniMax-M1

Source: MiniMax-M1 technical report

Attention / MoE trick

Hybrid sparse Mixture-of-Experts backbone + Lightning Attention — described in the paper as an I/O-aware implementation of linear attention (an attention kernel engineered to reduce memory I/O and compute). Lightning Attention restructures the attention computation to avoid explicit QK^T materialization in many layers.
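Lightning Attention's kernel details are hardware engineering, but the linear-attention recurrence it builds on can be sketched in a few lines. The snippet below is a toy NumPy illustration, not the paper's implementation (the feature map `phi` and the normalization constant are illustrative choices): instead of materializing the N×N score matrix, it carries a d×d running state.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Causal linear attention without materializing the N x N matrix Q @ K.T.

    Keeps a running d x d state S = sum_j phi(k_j) v_j^T, so per-token cost is
    O(d^2) and total cost O(N * d^2) instead of O(N^2 * d).
    """
    N, d = Q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # positive feature map (toy choice)
    S = np.zeros((d, d))           # running sum of outer(phi(k_j), v_j)
    z = np.zeros(d)                # running sum of phi(k_j) for normalization
    out = np.zeros_like(V)
    for t in range(N):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out
```

Per token this costs O(d^2) rather than O(N·d), so total work grows linearly in N; the kernelized streaming result is identical to computing the same feature-mapped attention weights explicitly.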

Per-token complexity (paper claim / measured)

Paper frames Lightning Attention as near-linear (O(N·d)) in the regions where it is applied. Concrete measured claim: “compared to DeepSeek R1, M1 consumes 25% of the FLOPs at a generation length of 100K tokens”. The model also reports ~45.9B parameters activated per token in the reported M1 configuration (hybrid MoE activation stat). Effective scaling is significantly below quadratic at long N.

Practical context tested (paper numbers / benchmarks)

Native support / training context: 1,000,000 tokens (native training window for M1); inference extrapolation claims up to 4,000,000 tokens in the MiniMax-01 lineage. MiniMax-M1 reports long-context benchmark wins on RL/agentic tasks and explicit FLOP comparisons at 100K generation length.

Moonshot AI — Kimi K2 / K2.5

Source: Kimi K2 technical report

Attention / MoE trick

Ultra-sparse MoE combined with Multi-Head Latent Attention (MLA) (the paper explicitly states: “the architecture follows ... multi-head latent attention (MLA) similar to DeepSeek-V3”). MLA compresses keys/values into a lower-rank latent representation before interaction, avoiding explicit high-dimensional KV storage.

Per-token complexity (paper claim / measured)

MLA avoids full materialization of Key matrices during inference — the paper emphasizes that “Key matrices are not fully materialized during inference” for MLA and thus the implementation reduces KV memory bandwidth and communication (i.e., subquadratic memory/IO behavior in practice). The model also reports 32B activated parameters (i.e., MoE activated capacity) in the K2 configuration (1.04T total param scale / 32B activated). The paper couples MLA with selective recomputation and Muon QK-Clip to preserve stability for these efficient attention regimes.
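The “Key matrices are not fully materialized during inference” property can be illustrated with a low-rank KV sketch. The code below is a hypothetical NumPy toy (weight names and dimensions are made up, not Kimi's): the cache stores only a small latent per token, and the key up-projection is absorbed into the query so the full K matrix is never formed.

```python
import numpy as np

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64
rng = np.random.default_rng(0)
# hypothetical weights: down-project tokens to a latent, up-project to per-head keys
W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)

def cache_token(x_t):
    """Cache one d_latent vector per token instead of 2 * n_heads * d_head floats."""
    return x_t @ W_dkv

def scores_via_latents(q, latents):
    """Attention logits per head without materializing K: absorb W_uk into the query."""
    T = latents.shape[0]
    logits = np.empty((n_heads, T))
    for h in range(n_heads):
        W_h = W_uk[:, h * d_head:(h + 1) * d_head]  # (d_latent, d_head) slice
        q_abs = q[h] @ W_h.T                        # absorbed query, (d_latent,)
        logits[h] = latents @ q_abs                 # score directly against latents
    return logits
```

With these toy sizes the cache holds 64 floats per token instead of 2·8·64 = 1024 (a 16× reduction), and the score computation touches only the latent cache, which is the source of the KV memory-bandwidth savings the paper emphasizes.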

Practical context tested (paper numbers / benchmarks)

K2 reports strong long-context benchmark performance (multiple long-context reasoning / retrieval benchmarks) across 100K+ token reasoning tasks and compares against DeepSeek and other frontier models. Demonstrated strong long-chain reasoning stability.

Alibaba Cloud — Qwen3-Next

Source: Qwen3-Next release blog

Attention / MoE Trick

DeltaNet layers replace quadratic token interactions with linear sequence modeling, while periodic softmax layers restore full global attention precision. GQA reduces the KV memory footprint, and MoE limits the parameters active per token.
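The DeltaNet side of the hybrid can be sketched with the delta-rule recurrence it is named for. This is a toy NumPy version under assumed shapes (not Qwen's implementation, which adds gating and chunked parallel kernels): a fixed-size memory matrix is corrected per token instead of attending over all previous tokens.

```python
import numpy as np

def deltanet_layer(Q, K, V, beta):
    """Toy delta-rule linear sequence layer (the idea behind DeltaNet blocks).

    A fixed-size d_v x d_k state S is updated per token with a delta rule, so
    cost is O(N * d_k * d_v) with no N x N attention matrix.
    """
    N, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.zeros((N, d_v))
    for t in range(N):
        k = K[t] / (np.linalg.norm(K[t]) + 1e-6)    # normalized key
        pred = S @ k                                # memory's current value for k
        S = S + beta[t] * np.outer(V[t] - pred, k)  # delta-rule correction
        out[t] = S @ Q[t]
    return out
```

With beta = 1 and orthonormal keys the update writes each value exactly; in general beta interpolates between keeping and overwriting the memory, which is what distinguishes the delta rule from plain linear attention.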

Per-Token Complexity

With this hybrid layout, effective runtime scaling trends closer to linear at large N than in dense transformers.

Practical Context Tested

Mistral AI — Mixtral (Mixtral 8×7B)

Source: Mixtral (Mistral) paper

Attention / MoE trick

Sparse MoE (SMoE) applied to the FFN blocks: each layer holds 8 experts, and a router activates the top-2 per token.

Attention remains quadratic; efficiency gains come from conditional FFN compute.

Per-token complexity (paper claim / measured)

Per token, only a small subset of experts do work (so the large parameter capacity is available but active compute per token is much lower). Mixtral reports that a token “has access to 47B parameters, but only uses ~13B active parameters during inference” (paper figures). Attention cost stays dense O(N²) in the attention blocks, but FFN cost is top-k gated → lower effective FLOPs for the same parameter count.
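The routing arithmetic is simple enough to sketch. The toy function below (illustrative, not Mistral's code) softmax-weights the top-2 of 8 experts per token, so active FFN compute is roughly 2/8 of the total expert parameters.

```python
import numpy as np

def smoe_ffn(x, gate_W, experts, top_k=2):
    """Top-k gated sparse-MoE FFN block (Mixtral-style routing, toy version).

    Only top_k experts run per token: parameter capacity scales with
    len(experts), but active compute scales with top_k.
    """
    logits = x @ gate_W                        # (n_experts,) routing logits
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                               # softmax over selected experts only
    return sum(wi * experts[i](x) for wi, i in zip(w, top))
```

This is how "47B parameters, ~13B active" arises: the router sees all experts' parameters, but each token's forward pass executes only two of them.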

Practical context tested (paper numbers / benchmarks)

Mixtral models were trained and evaluated with 32K token context and reported performance/benchmarks at that context length (paper & model card).

DeepSeek — DeepSeekMoE / DeepSeek-V3 family

Source (primary): DeepSeekMoE paper & DeepSeek technical report

Attention / MoE trick

DeepSeekMoE emphasizes expert specialization (finer expert segmentation and shared experts) plus Multi-Head Latent Attention (MLA) and KV / latent projection strategies in later DeepSeek variants (DeepSeek-V3). The architecture focuses on routing and latent/low-rank KV representations to limit attention I/O and per-token memory.

Latent attention reduces explicit key/value dimensionality before interaction.

Per-token complexity (paper claim / measured)

DeepSeekMoE reports substantial compute savings versus conventional GShard MoE: e.g., smaller-scale DeepSeekMoE claims ~40% of computations vs GShard equivalents at specific scales, and larger configurations report using ~28.5% (or even 18.2% reported in scaling experiments) of computations of comparable GShard architectures — i.e., large reductions in per-token or per-step compute via refined routing & expert partitioning. DeepSeek-V3 also couples MLA to reduce KV memory bandwidth.
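The two DeepSeekMoE ingredients, shared experts that always run and many finer-grained routed experts, can be sketched as a small variation on standard top-k routing (hypothetical toy code, not the released model):

```python
import numpy as np

def deepseek_moe_ffn(x, shared, routed, gate_W, top_k):
    """DeepSeekMoE-style FFN sketch: a few shared experts are always active,
    while many small ('fine-grained') routed experts are top-k selected."""
    out = sum(e(x) for e in shared)            # shared experts: no routing
    logits = x @ gate_W                        # (len(routed),) routing logits
    top = np.argsort(logits)[-top_k:]
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    out = out + sum(wi * routed[i](x) for wi, i in zip(w, top))
    return out
```

Splitting each conventional expert into m smaller ones lets the router select m·k fine-grained experts for the same active compute, which is the finer segmentation the paper credits for its compute savings over GShard-style MoE.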

Practical context tested (paper numbers / benchmarks)

DeepSeek variants and DeepSeekMoE experiments are reported across scales (2B → 145B, with comparisons against a 67B dense baseline) and show comparable performance to denser baselines while reducing compute; DeepSeek-V3 papers/reporting include extended-context evaluations (hundreds of thousands of tokens in product/reporting comparisons).

Meituan — LongCat-Flash (LongCat-Flash family)

Source: LongCat-Flash technical report

Attention / MoE trick

Zero-computation Experts (a pool of experts that can be “no-op” and thus cost zero compute) + Shortcut-connected MoE (ScMoE) combined with Multi-Head Latent Attention (MLA) blocks. The architecture explicitly integrates zero-compute experts to let tokens dynamically consume variable compute budget.

Block-sparse attention limits token interaction to structured patterns. Zero-compute experts allow conditional capacity expansion without proportional compute cost.

Per-token complexity (paper claim / measured)

Paper gives explicit activation numbers: activates 18.6B–31.3B parameters (≈27B avg) per token out of a 560B total depending on token importance. The design objective is to lower average per-token compute by dynamically varying the number of active experts and by including zero-compute experts — measured throughput and TPS numbers are reported in the paper. The attention blocks use MLA + engineered FFN expert patterns to reduce memory/IO overhead (near-linear effective behavior in aggregate due to dynamic activation control).
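The zero-computation-expert idea reduces to adding identity experts to the routing pool. A hypothetical toy version (not LongCat's code) shows how per-token compute becomes variable:

```python
import numpy as np

def longcat_style_ffn(x, experts, gate_W, top_k=2, n_zero=2):
    """Router over real experts plus 'zero-computation' (identity) experts.

    When the router selects a zero expert, that slot costs no FFN compute,
    so active parameters per token vary with the token. (Toy sketch of the
    LongCat-Flash idea, not the released architecture.)
    """
    n_real = len(experts)
    logits = x @ gate_W                       # (n_real + n_zero,) routing logits
    top = np.argsort(logits)[-top_k:]         # chosen expert slots
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    out = np.zeros_like(x)
    real_calls = 0
    for wi, i in zip(w, top):
        if i < n_real:
            out += wi * experts[i](x)         # real expert: full FFN cost
            real_calls += 1
        else:
            out += wi * x                     # zero-computation expert: identity
    return out, real_calls
```

Averaged over tokens, this yields an activation band rather than a fixed activated-parameter count, matching the paper's 18.6B–31.3B per-token range.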

Practical context tested (paper numbers / benchmarks)

LongCat-Flash reports agentic and long-context benchmarks with measured inference throughput and loss curves under matched compute budgets; training and inference experiments include hundreds of thousands to ~1M token regimes in system evaluations and provide measured TPS/inference cost figures in the deployment section.

NVIDIA — Nemotron 3 family

Source: Nemotron 3 technical report

Attention / MoE trick

Hybrid Mamba-Transformer MoE architecture with LatentMoE (paper: LatentMoE improves quality without sacrificing throughput) and multi-token prediction (MTP) layers to accelerate long-form generation. The design converts some attention/FFN hotspots to Mamba-style efficient modules and mixes in selective attention layers for throughput.

Many traditional attention layers replaced with state-space (Mamba-style) modules for linear sequence modeling.
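The state-space replacement can be illustrated with the simplest diagonal recurrence (a toy that omits Mamba's input-dependent selectivity and parallel scan):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy diagonal state-space layer: h_t = A * h_{t-1} + B * x_t, y_t = C . h_t.

    Linear in sequence length with a fixed-size state, the property that lets
    Mamba-style blocks stand in for attention layers on long sequences.
    (Illustrative recurrence only; real Mamba gates A and B per input.)
    """
    N = len(x)
    d_state = A.shape[0]
    h = np.zeros(d_state)
    y = np.zeros(N)
    for t in range(N):
        h = A * h + B * x[t]   # elementwise (diagonal) state update
        y[t] = C @ h
    return y
```

Each step touches only the fixed-size state, so cost and memory are O(N · d_state) regardless of context length, versus O(N^2) pairwise interactions for softmax attention.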

Per-token complexity (paper claim / measured)

Nemotron 3 claims “best-in-class throughput” and explicitly targets inference-time tradeoffs: hybrid Mamba/MoE blocks enable higher throughput and effectively lower active per-token compute at scale. The report states support for up to 1M token context lengths and shows relative throughput improvements vs typical transformer MoEs (figures/tables in the report). LatentMoE is described as a way to gain accuracy without increasing inference cost.

Practical context tested (paper numbers / benchmarks)

Nemotron 3 reports evaluation on agentic and long-context tasks (reasoning, code, multi-turns) and provides throughput / tokens-per-second benchmarking; context length support claimed up to 1,000,000 tokens. The paper includes measured throughput comparisons (output tokens/s/GPU) vs other recent models.

OpenAI — GPT-OSS / GPT family (representative engineering-first approach)

Source (primary): GPT-OSS model card / OpenAI model docs; FlashAttention references.

Attention / MoE trick

OpenAI’s open-weight releases and large GPT family emphasize hardware-optimized dense attention kernels (the FlashAttention family), grouped / multi-query attention patterns (MQA / GQA), and, in some large releases, Mixture-of-Experts transformer variants (the gpt-oss model card notes mixture-of-experts configurations). The documented strategy is engineering/dense-first rather than inventing a new attention algebra, with extensive KV-cache engineering.

No architectural removal of quadratic attention; efficiency comes from IO-aware kernels and memory layout optimization.

Per-token complexity (paper claim / measured)

The formal complexity is dense softmax O(N²), but FlashAttention (and FlashAttention-2) reduce IO and memory reads/writes, yielding large practical speedups (paper reports 2–4× or more runtime speedups vs optimized baselines for long sequences). OpenAI documents emphasize grouped/MQA patterns and kernel optimizations to reduce practical KV memory and latency.
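The grouped/MQA pattern is a cache-shape trick rather than a new algorithm. The toy NumPy sketch below (shapes are illustrative, not any model's configuration) shows query heads sharing a smaller set of K/V heads; n_kv_heads = 1 recovers MQA.

```python
import numpy as np

def gqa_attention(Q, K, V, n_kv_heads):
    """Grouped-query attention: n_q query heads share n_kv_heads K/V heads.

    The KV cache shrinks by a factor of n_q / n_kv_heads while the attention
    math stays exact dense softmax. (Toy single-sequence sketch.)
    """
    n_q, N, d = Q.shape                # Q: (n_q_heads, N, d); K, V: (n_kv_heads, N, d)
    group = n_q // n_kv_heads          # query heads per shared KV head
    out = np.empty_like(Q)
    for h in range(n_q):
        kv = h // group                # which shared KV head this query head uses
        logits = Q[h] @ K[kv].T / np.sqrt(d)
        mask = np.triu(np.ones((N, N), dtype=bool), k=1)
        logits[mask] = -np.inf         # causal mask
        w = np.exp(logits - logits.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ V[kv]
    return out
```

The per-token compute is unchanged; the win is KV-cache size and memory bandwidth, which is why the pattern pairs naturally with IO-aware kernels like FlashAttention.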

Practical context tested (paper numbers / benchmarks)

Open model cards and product docs report large context support (tens to low-hundreds of thousands of tokens in practice depending on model/service) and FlashAttention demonstrations show significant speedups at long sequence lengths in training/inference benchmarks. The gpt-oss model card also documents deployment/throughput tradeoffs for their open releases.


In Conclusion

Two distinct efficiency strategies dominate the frontier:

(A) structural conditional compute (MoE + zero-compute experts / top-k routing) and,
(B) I/O / kernel engineering (MLA, Lightning Attention, FlashAttention + MQA/GQA).

Papers above show both approaches reduce practical per-token compute but by different mechanisms (selective activation vs IO-aware exact kernels).

Papers frequently combine methods (e.g., MLA + ScMoE + zero experts in LongCat-Flash; Lightning Attention + MoE in MiniMax; LatentMoE + Mamba in Nemotron 3) — gains come from joint architecture-plus-serving innovations, not a single silver-bullet kernel.

Reported practical windows range from 32K → 128K (Qwen3, many large open/dense variants) up to ~1M tokens (MiniMax lineage, Nemotron3 claims, LongCat-Flash reporting), with measured throughput and FLOP comparisons in the papers.

For long context, current SOTA models (up to 128K):

Beyond 128K context length, only four models survive (up to 1M):
Kimi-linear-48B-a3b-instruct, Gemini-3, Nova-2, and Grok-4.1

Context Arena (Needles 4)


Ring Attention

Ring Attention (often discussed under "Ring Transformers" or "Blockwise Transformers") is a technique for training and inference on extremely long sequences, up to millions of tokens, by distributing the attention computation across multiple GPUs or TPUs.
It avoids the quadratic memory bottleneck typical of standard Transformers, since no single device ever stores the full attention matrix.
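A single-process simulation captures the mechanics: split Q, K, V into P blocks ("devices"), keep each Q block in place while K/V blocks logically rotate around the ring, and merge each arriving block with an online softmax so the result is exact full attention. The NumPy toy below is non-causal and sequential; a real implementation overlaps the KV transfer with blockwise compute.

```python
import numpy as np

def ring_attention(Q_blocks, K_blocks, V_blocks):
    """Single-process simulation of Ring Attention (non-causal, toy).

    Each 'device' d keeps its query block; K/V blocks rotate around the ring.
    An online (streaming) softmax folds in each incoming block, so a device
    only ever holds block-sized score matrices, never the full N x N one.
    """
    P = len(Q_blocks)                       # number of ring participants
    outs = []
    for d in range(P):
        Q = Q_blocks[d]
        m = np.full(Q.shape[0], -np.inf)    # running row max
        l = np.zeros(Q.shape[0])            # running softmax denominator
        acc = np.zeros_like(Q)              # running numerator
        for step in range(P):
            src = (d + step) % P            # KV block arriving this ring step
            s = Q @ K_blocks[src].T / np.sqrt(Q.shape[1])
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)       # rescale previous partial sums
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V_blocks[src]
            m = m_new
        outs.append(acc / l[:, None])
    return np.concatenate(outs)
```

Because the online softmax is exact, the output matches dense softmax attention bit-for-bit up to floating-point accumulation order; only the memory layout changes.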