🚀 Modern LLM Inference: Workloads, Bottlenecks, and Optimization Techniques

🧠 From defining inference to lossy/lossless acceleration and dynamic serving: a full-stack view of how modern LLM inference actually works.

Course link

1. ๐Ÿ Introduction: What Is Inference and Why Itโ€™s Hard

โœ… What Is Inference?

Inference answers one fundamental question:

Given a trained model, how do we generate outputs given prompts?

But inference appears in many real scenarios:

  • 💬 Chatbots
  • 👨‍💻 Code assistants (Cursor allegedly produces 1B accepted LOC/day)
  • 📊 Batch data processing
  • 🧪 Evaluation & benchmarking
  • 🧠 Test-time compute (CoT / self-consistency)
  • 🤖 RLHF: sampling trajectories

Training happens once.
Inference happens forever, at massive scales (OpenAI serves ~100B words/day).

That's why inference efficiency is existential.

⚡ Key Inference Metrics

  1. TTFT (time to first token): responsiveness
  2. Latency: time to generate each token for a single request
  3. Throughput: total tokens/sec across all users

Training = parallel over sequence.
Inference = token-by-token, must process the history → inherently sequential.

➡️ This is what makes inference memory-limited.

2. Why Transformer Inference Is Memory-Bound

The Transformer's efficiency changes dramatically between training and inference because of how sequences are processed.

✅ Training: Parallel Across the Entire Sequence

During supervised training, all tokens are visible at once.
This allows massive parallelism through large matrix multiplications.

➡️ Compute utilization is high.
➡️ Training is compute-limited.

✅ Inference: Sequential, Auto-Regressive Generation

During inference, tokens must be generated one at a time: token_t depends on [token_1, token_2, …, token_t-1].

This sequential dependency prevents parallelism.

➡️ Generation cannot fully use GPU compute.
➡️ Inference becomes memory-limited instead of compute-limited.

✅ The KV Cache (Stored in HBM)

To avoid recomputing attention over the whole prefix, models cache the keys and values of every past token. The cache size is governed by:

  • B: batch size (number of sequences)
  • S: sequence length
  • L: number of layers
  • K: number of attention heads
  • H: dimension per head

For both keys and values, the cache stores an H-dim vector per (B, S, L, K) combination.
It massively increases memory access during generation.
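
To make the growth concrete, here is a minimal sketch that estimates KV-cache size from these dimensions (assuming bf16 storage and roughly Llama-2-13B-like shapes; the exact numbers are illustrative assumptions):

```python
def kv_cache_bytes(B, S, L, K, H, bytes_per_elem=2):
    """Estimate KV-cache size: 2 tensors (K and V) x B x S x L x K x H elements."""
    return 2 * B * S * L * K * H * bytes_per_elem

# Illustrative, roughly Llama-2-13B-like shapes: 40 layers, 40 heads, head dim 128.
size = kv_cache_bytes(B=64, S=4096, L=40, K=40, H=128)
print(f"{size / 1e9:.0f} GB of KV cache")   # ~215 GB for the whole batch: far more than the weights
```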

✅ Two Stages of Inference

Prefill (Parallelizable ✅ Compute-limited)

  • Processes the prompt
  • Behaves like training
  • High arithmetic intensity → fast

Generation (Sequential ❌ Memory-limited)

  • One token at a time
  • Requires frequent KV-cache reads
  • Low arithmetic intensity → slow

3. 🧮 Arithmetic Intensity: The Core Metric

✅ Definition

Arithmetic Intensity = FLOPs / Bytes Transferred

  • High → compute-limited (good)
  • Low → memory-limited (bad)

An H100's ratio of peak compute to memory bandwidth (its "accelerator intensity") is ≈ 295 FLOPs/byte in bf16.
You want arithmetic intensity > 295 to stay compute-bound.

✅ Example: Matrix Multiplication $X_{B \times D} \times W_{D \times F}$

  • FLOPs = $2BDF$
  • Bytes moved ≈ $2DF$ in bf16 (loading $W$ dominates when $B \ll D, F$)
  • Arithmetic intensity ≈ B

So:

  • If B ≥ 300 → compute-limited ✅
  • If B = 1 (a matrix-vector multiply) → intensity ≈ 1 ❌ really bad

This is exactly the situation during inference generation.
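
As a quick sanity check, here is a minimal sketch that computes the FLOPs-to-bytes ratio for this matmul (bf16 is assumed; the sizes are illustrative):

```python
def matmul_intensity(B, D, F, bytes_per_elem=2):
    flops = 2 * B * D * F                               # multiply-accumulates
    bytes_moved = bytes_per_elem * (B*D + D*F + B*F)    # read X, read W, write the output
    return flops / bytes_moved

print(matmul_intensity(B=1,   D=4096, F=4096))   # ~1   -> memory-bound (one decode step)
print(matmul_intensity(B=512, D=4096, F=4096))   # ~410 -> compute-bound (prefill / training)
```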

1๏ธโƒฃ MLP Layers (Feed-Forward Networks)

$$ \text{Intensity} \propto B \times T $$

  • Prefill: $T = S$ → large → compute-limited ✅
  • Generation: $T = 1$ → intensity = B → depends on concurrent requests

If batch size is small, MLP becomes memory-bound.

2๏ธโƒฃ Attention Layers (Self-Attention)

$$ \text{Attention intensity} \approx \frac{S \cdot T}{S + T} $$

✅ Prefill (T = S)

$$ \frac{S^2}{2S} = \frac{S}{2} \quad \text{(compute-limited if S is large)} $$

❌ Generation (T = 1)

$$ \frac{S}{S + 1} \approx 1 $$

Always ≈ 1 → deeply memory-limited.
And importantly:

⚠️ No dependence on batch size (B). Batching cannot help the attention layer during generation.


✅ MLP during generation

  • Intensity = B
  • Can be improved by adding more concurrent sequences

❌ Attention during generation

  • Intensity ≈ 1
  • Completely memory-bound
  • Batching does NOT help

Why batching fails for attention:

| Layer | Shared across batch? | Effect |
|---|---|---|
| MLP | Yes, same weights reused | B increases FLOPs per byte loaded → higher intensity |
| Attention | No, each sequence has its OWN KV cache | Memory reads scale with B, FLOPs scale too → B cancels out |

Thus:

Every sequence has its own unique KV cache.
Batching does not increase arithmetic intensity.

✅ Summary Table

| Stage | Layer Type | Arithmetic Intensity | Limitation | Does Batching Help? |
|---|---|---|---|---|
| Prefill | MLP + Attention | High (B×S or S/2) | Compute-limited ✅ | ✅ Yes |
| Generation (T=1) | MLP | = B | Memory-limited if B is small | ✅ Yes |
| Generation (T=1) | Attention | ≈ 1 | Memory-limited (worst) | ❌ No |

Inference suffers because:

  • 🔥 Prefill is compute-limited → fast
  • ❄️ Generation is memory-limited → slow
  • ❌ Attention dominates and has intensity ≈ 1
  • ❌ Batching cannot fix attention's memory costs
  • ✅ Therefore, the KV cache becomes the primary bottleneck.

This is why modern research focuses on:

  • shrinking the KV cache (GQA, MLA, CLA)
  • replacing it entirely (Mamba/SSMs)
  • paging it efficiently (vLLM)

Reducing attentionโ€™s memory traffic = the key to fast inference.

3. โฑ Latency vs Throughput: Core Trade-off

Using Llama 2 13B on H100:

✅ B = 1 → Best latency

  • ~8 ms/token
  • ~124 tok/sec

✅ B = 64 → High throughput

  • Much higher aggregate throughput
  • Higher per-token latency
  • Higher memory usage
  • KV cache grows linearly with B

Batching improves throughput but hurts latency.

✅ Want both?

Run M model replicas → multiply throughput without hurting latency.
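
A rough back-of-the-envelope model of this trade-off, as a sketch only: the 26 GB of bf16 weights, ~0.8 MB of KV cache per token, 2k-token contexts, and 3.35 TB/s of HBM bandwidth are all assumptions, and real kernels add overhead.

```python
def decode_step_ms(batch, hbm_tb_s=3.35, weight_gb=26, kv_mb_per_token=0.8, context=2048):
    """Memory-bound estimate: one decode step streams the weights once plus every sequence's KV cache."""
    bytes_moved = weight_gb * 1e9 + batch * context * kv_mb_per_token * 1e6
    return bytes_moved / (hbm_tb_s * 1e12) * 1e3

for B in (1, 64):
    ms = decode_step_ms(B)
    print(f"B={B:3d}: ~{ms:.1f} ms/step, ~{B / ms * 1e3:,.0f} tok/s total")
# B=1 gives roughly 8 ms/token; B=64 raises per-token latency but multiplies total throughput.
```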

📦 5. Techniques for Shrinking the KV Cache (Motivation, Methods, and Trade-offs)

The need to shrink the Key-Value (KV) Cache arises from a simple but critical fact:

✅ Inference (especially sequential generation) is fundamentally memory-limited, not compute-limited.
✅ KV cache size directly determines memory traffic → which determines latency and throughput.

Because the KV cache grows with sequence length, layers, heads, and hidden dimension, it becomes the dominant memory bottleneck.
Thus, shrinking the KV cache is one of the most effective architectural optimizations for speeding up inference without retraining an entirely new model.

Below are the main techniques used to reduce KV cache size.

5.1. Grouped-Query Attention (GQA)

GQA reduces the number of key/value heads, keeping the number of query heads the same.

✅ Key Ideas

  • Standard Multi-Head Attention (MHA): # Query heads = # Key heads = # Value heads = N
  • GQA: N query heads share K key/value heads (K < N)

Each KV head serves a group of Q heads → Grouped-Query Attention.

✅ KV Cache Reduction

Reducing the KV heads from N → K shrinks the cache by a factor of:

$$ \text{Reduction factor} = \frac{N}{K} $$

✅ Efficiency Gains

Fewer KV tensors → fewer memory loads → lower latency + higher throughput.

Example:
Switching a Llama 2 13B model to a 1:5 GQA ratio enabled batch sizes like B = 256, dramatically improving throughput.

✅ Accuracy

Nearly identical to full MHA.
Llama 3 adopts GQA for this reason.
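
A minimal sketch of the grouping itself (PyTorch-style, single layer; the head counts and dimensions are illustrative assumptions): each cached KV head is simply broadcast to the group of query heads it serves, so only n_kv_heads are ever stored.

```python
import torch

B, S = 2, 16                                  # batch, sequence length (toy sizes)
n_q_heads, n_kv_heads, d_head = 32, 8, 128    # 4 query heads share each KV head
group = n_q_heads // n_kv_heads

q = torch.randn(B, n_q_heads, S, d_head)
k = torch.randn(B, n_kv_heads, S, d_head)     # only n_kv_heads are cached...
v = torch.randn(B, n_kv_heads, S, d_head)     # ...so the KV cache is 4x smaller here

# Broadcast each KV head to its group of query heads, then run ordinary attention.
k_exp = k.repeat_interleave(group, dim=1)     # [B, n_q_heads, S, d_head]
v_exp = v.repeat_interleave(group, dim=1)
out = torch.nn.functional.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)
```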


4.2๏ธ. Multi-Head Latent Attention (MLA)

MLA reduces the dimensionality of the key and value vectors rather than reducing the number of KV heads.

Key Ideas

  • Original KV dimension: large (e.g., 16,384)
  • MLA projects KV vectors into a smaller latent dimension: e.g., 16,384 → 512
  • The KV cache shrinks in proportion to the dimension reduction:

$$ \text{Reduction} \propto \frac{\text{old dim}}{\text{new dim}} $$

DeepSeek V2 uses: 512 (latent) + 64 (for RoPE compatibility) = 576 total dims

5.2.1. Why MLA Compresses K/V but Not Q

In standard attention, logits are computed as:

$$ \text{logits} = Q \cdot K^\top $$

This requires Q and K to have the same dimensionality.

However, MLA reduces the dimensionality of K and V, not Q:

  • K/V are stored in the KV cache → major memory sink → must be compressed
  • Q is never stored in the KV cache → compressing it saves nothing

Therefore:

  • ✅ Compressing K/V reduces memory (KV cache)
  • ❌ Compressing Q does NOT reduce memory (Q is computed per token and discarded)
  • ✅ But Q must be projected into the same latent dimension as K for the multiplication

Thus the model keeps:

  • Q_full: full-dimensional query (e.g., 4096)
  • K_latent: stored in KV cache (e.g., 512 or 576)
  • V_latent: stored in KV cache (same latent dimension)
  • Q_latent: computed temporarily for attention

This leads to the next component.

5.2.2. Introducing a New Q→Latent Projection Matrix

Because Q_full cannot directly multiply with K_latent:

  • Q_full: 4096 dims
  • K_latent: 512 dims
  • โŒ Dot product impossible

MLA introduces a new learned projection:

$$ W_{Q \rightarrow \text{latent}} \in \mathbb{R}^{d_{\text{full}} \times d_{\text{latent}}} $$

to produce:

$$ Q_{\text{latent}} = Q_{\text{full}} \cdot W_{Q \rightarrow \text{latent}} $$

Now:

  • Q_latent: 512 dims
  • K_latent: 512 dims

Dot-product is valid:

$$ \text{logits} = Q_{\text{latent}} \cdot K_{\text{latent}}^\top $$

Important:
✅ Q_latent is NOT stored
✅ Only temporary
✅ Does not increase KV cache size
✅ MLA still "compresses only K/V" from a memory perspective

5.2.3. Why K/V Can Be Compressed but Q Must Stay Full

This follows deep attention-geometry principles:

  • Q encodes the layer's computation
    → must retain full expressiveness
    → compressing Q collapses layer depth and harms accuracy heavily

  • K/V encode token memory
    → can be compressed without destroying computational meaning
    → the attention distribution remains workable if geometric relationships are preserved

Thus MLA design:

  • ✅ Full-dim Q (keeps accuracy)
  • ✅ Compressed K/V (saves KV cache)
  • ✅ Alignment via a Q→latent projection

5.2.4. How RoPE Works With MLA (Full Explanation)

RoPE (Rotary Position Embedding) applies rotations to Q and K in their full dimension only:

$$ Q_{\text{rope}} = \text{RoPE}(Q_{\text{full}}), \qquad K_{\text{rope}} = \text{RoPE}(K_{\text{full}}) $$

Why not apply RoPE directly in latent space?

  • Projection destroys RoPE's sinusoidal subspaces
  • Latent space does not preserve the frequency structure
  • Positional geometry must be applied before dimensionality reduction

So MLA applies RoPE before compression:

  1. Compute Q_full, K_full
  2. Apply RoPE to Q_full and K_full
  3. Project K_full into latent space
  4. Project Q_full into latent space (for compute only)

5.2.5. Why MLA Adds Back Extra RoPE Dimensions (+64 Example)

When K_full (4096 dims) is projected into K_latent (e.g., 512 dims), the projection inevitably:

  • Loses high-frequency RoPE components
  • Blurs precise rotational structure
  • Weakens relative positional encoding

DeepSeek V2 fixes this by allocating extra RoPE-preservation dimensions:

$$ K_{\text{latent}} = K_{\text{compressed}}^{(512)} \oplus K_{\text{pos}}^{(64)} $$

  • Total latent K dimension becomes: 512 compressed dims + 64 RoPE-pos dims = 576 total dims
  • These 64 dims explicitly store RoPE harmonics that would otherwise be destroyed by projection.
  • Thus MLA latent K = compressed memory + preserved positional structure.
  • This keeps positional awareness strong even after compression.

Only K needs additional RoPE dimensions because:

  • RoPE affects attention geometry, and this geometry depends on Q and K being in the same positional space.
  • Q does NOT get compressed, so it keeps its full RoPE structure and does NOT lose positional information.
  • K DOES get compressed, so it loses some high-frequency RoPE information, which must be restored.

5.2.6. Complete MLA Pipeline (Concrete Numerical Example)

Assume:

  • Full dimension: 4096
  • Latent dimension: 512
  • Extra RoPE positional dims: 64

Step-by-step (a minimal code sketch follows the list):

  1. Compute full Q/K/V
    • Q_full = X ⋅ W_Q → (4096)
    • K_full = X ⋅ W_K → (4096)
    • V_full = X ⋅ W_V → (4096)
  2. Apply RoPE in full dimension
    • Q_rope (4096)
    • K_rope (4096)
  3. Project K/V into latent space
    • K_core = Linear(K_rope) → (512)
    • V_latent = Linear(V_full) → (512)
  4. Extract RoPE harmonics
    • K_pos = selected positional dims → (64)
  5. Form final latent K
    • K_latent = concat(K_core (512), K_pos (64)) → (576)
  6. Project Q_rope into latent space (temporary)
    • Q_latent = Linear(Q_rope) → (576)
  7. Compute attention in latent space
    • logits = Q_latent ⋅ K_latent^T
    • output = softmax(logits) ⋅ V_latent
  8. The KV cache stores only:
    • K_latent (576)
    • V_latent (512)
  9. Q_latent is discarded (never cached).
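
Here is the minimal sketch promised above (single token, single head, random weights, NumPy). The sizes are scaled down from the 4096/512/64 example, the rope() function is a placeholder, and treating the last d_pos post-RoPE dims as K_pos is a simplification for illustration only.

```python
import numpy as np

d_full, d_core, d_pos = 256, 32, 8            # stand-ins for 4096 / 512 / 64
d_latent = d_core + d_pos
rng = np.random.default_rng(0)

x = rng.normal(size=d_full)                   # current token's hidden state
W_q, W_k, W_v = (rng.normal(size=(d_full, d_full)) * d_full**-0.5 for _ in range(3))
W_k_down = rng.normal(size=(d_full, d_core)) * d_full**-0.5    # K compression
W_v_down = rng.normal(size=(d_full, d_core)) * d_full**-0.5    # V compression
W_q_down = rng.normal(size=(d_full, d_latent)) * d_full**-0.5  # Q -> latent (temporary)

def rope(v):
    return v                                  # placeholder standing in for the real rotation

# Steps 1-2: full-dimension projections, then RoPE
q_rope, k_rope, v_full = rope(x @ W_q), rope(x @ W_k), x @ W_v
# Steps 3-5: compress K/V; keep d_pos dims of positional signal for K
k_latent = np.concatenate([k_rope @ W_k_down, k_rope[-d_pos:]])   # cached
v_latent = v_full @ W_v_down                                      # cached
# Step 6: project Q into the same latent space (never cached)
q_latent = q_rope @ W_q_down
# Step 7: one attention logit for this token against one cached key
logit = q_latent @ k_latent / np.sqrt(d_latent)
```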

5.2.7. Final Takeaways

  • ✅ MLA compresses K/V for KV-cache savings
  • ✅ MLA introduces a new Q→latent projection for dimension alignment
  • ✅ Q remains full-dimensional for accuracy reasons
  • ✅ RoPE must be applied before the latent projection
  • ✅ MLA adds extra RoPE dims to recover positional information lost during compression
  • ✅ Only the K/V latents are stored; Q_latent is temporary

This design balances:

  • Memory efficiency (smaller KV cache)
  • Computational compatibility (dimension alignment)
  • Positional accuracy (RoPE preservation)
  • Model quality (full-dim Q expressiveness)

MLA is therefore an elegant solution that compresses what matters (KV cache) while preserving what must remain expressive (Q).


4.3๏ธ. Cross-Layer Attention (CLA)

CLA reduces KV cache size by sharing the same KV projections across multiple layers.

✅ Key Ideas

  • Standard Transformer: each layer has its own K and V projections.
  • CLA: reuse the same KV projections across layers.

✅ Analogy

  • GQA = share KV across heads (within a layer)
  • CLA = share KV across layers (vertical sharing)

Reduced memory consumption with minimal accuracy loss, improving the perplexity-vs-KV-size trade-off.
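
A minimal sketch of the sharing pattern (PyTorch-style; the layer count, sharing factor, and the SharedKVBlock name are illustrative assumptions): consecutive layers keep their own query projection but reuse one K/V projection, so only one KV-cache entry per group is needed.

```python
import torch.nn as nn

d_model, n_layers, share = 512, 8, 2            # every 2 consecutive layers share K/V

class SharedKVBlock(nn.Module):
    """One attention layer whose K/V projection may be owned by an earlier layer."""
    def __init__(self, kv_proj: nn.Linear):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)   # Q is always per-layer
        self.kv_proj = kv_proj                      # possibly shared with neighboring layers

layers = []
for i in range(n_layers):
    if i % share == 0:                              # a new K/V projection every `share` layers
        kv_proj = nn.Linear(d_model, 2 * d_model)
    layers.append(SharedKVBlock(kv_proj))           # layers i and i+1 reuse the same kv_proj

# KV-cache entries needed per token: n_layers / share instead of n_layers.
```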

โ“ Subsection: Why Cross-Layer Attention Shares K/V but Not Q

Cross-Layer Attention (CLA) reduces KV cache size by sharing Key (K) and Value (V) projections across layers, but it does not share Query (Q) projections.
This design choice is intentional and rooted in the fundamental roles that Q, K, and V play in attention.

1. Different Roles: Q = Computation, K/V = Memory

In self-attention:

$$ \text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V $$

Each projection matrix has a different semantic role:

K (Keys) and V (Values) represent sequence memory

  • Encode what each token is
  • Store token identity and content
  • Act like read-only memory
  • Should be consistent across layers
  • Safe to share across layers

Q (Queries) represent layer-specific computation

  • Determine what each layer wants from memory
  • Encode how the layer interprets the sequence
  • Are unique to each depth in the network
  • Must not be shared across layers
  • Sharing Q destroys the layer hierarchy

Why Sharing K/V Works

K and V:

  • Represent static properties of the input
  • Do not contain layer-specific meaning
  • Can be reused like a shared memory bank
  • Allow each layer to interpret the same memory differently using its own Q

Thus, sharing KV reduces cache size without damaging the computational diversity of the layers.

Why Sharing Q Breaks the Model

Sharing Q would force all layers to ask the same question about the memory.

This would:

  • Collapse the differences between layers
  • Cause layers to behave identically
  • Destroy depth expressiveness
  • Dramatically increase perplexity
  • Make the deep Transformer behave like a repeated shallow block

Empirical results show:

  • Sharing KV works well ✅
  • Sharing Q causes large accuracy drops ❌
  • Sharing QKV destroys the model ❌❌

Analogy: Memory vs. Instructions

  • K/V = memory
    Shared, stable, consistent across layers.

  • Q = instructions
    Each layer needs its own instructions to compute something different.

You can share memory across modules,
but not the program code.

Summary Table

| Projection | Role | Share Across Layers? | Reason |
|---|---|---|---|
| Q | Layer computation / interpretation | ❌ No | Must vary per layer to maintain depth & expressiveness |
| K | Token identity | ✅ Yes | Represents sequence memory, not layer-specific computation |
| V | Token content | ✅ Yes | Safe shared memory |
| All QKV | Entire attention block | ❌ No | Collapses layers, destroys the model |

CLA shares K/V because they are shared-memory representations of the sequence, but it cannot share Q because Q encodes layer-specific computation.
Sharing Q would collapse all layers into the same function, destroying depth and severely harming accuracy.

4๏ธ.4. Local Attention

Local attention restricts each token to attend only to the most recent K tokens.

✅ Key Ideas

Instead of full attention over all past tokens: Attend only to the last K tokens

✅ KV Cache Effect

Any KV entry older than the local window can be discarded.

This makes the KV cache size:

$$ O(\text{window size}) = \mathbf{O(1)} $$

instead of:

$$ O(\text{sequence length}) $$

✅ Pros

  • KV cache does not grow with sequence length → huge savings (see the sketch below)
  • Useful for very long contexts (100k+ tokens)
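
Here is the sketch mentioned above: a constant-size cache in plain Python (the window size and deque-based eviction are illustrative assumptions).

```python
from collections import deque

WINDOW = 4096                          # local attention window (assumed)

# One per layer: keep only the last WINDOW (key, value) pairs; older entries are evicted.
kv_window = deque(maxlen=WINDOW)

def append_token(k_vec, v_vec):
    kv_window.append((k_vec, v_vec))   # O(1); the deque drops the oldest entry automatically

# During decoding, attention reads only kv_window -> memory stays O(WINDOW), not O(sequence length).
```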

✅ Cons & Hybrid Layers

Local attention alone harms long-range reasoning.
Solution:

  • Interleave global attention layers periodically
    (e.g., 1 global layer for every 5-6 local layers)

✅ Summary: The Four Strategies for Shrinking the KV Cache

| Strategy | Main Idea | KV Cache Saved By | Trade-offs |
|---|---|---|---|
| GQA | Fewer KV heads | N/K head reduction | Minimal accuracy loss |
| MLA | Smaller KV dimensions | Dim reduction (e.g., 16k → 512) | Slight RoPE modifications |
| CLA | Share KV across layers | Multi-layer reuse | Architecture complexity |
| Local Attention | Attend to last K tokens only | Constant-sized KV | Weak long-range modeling; use hybrids |

Shrinking the KV cache is one of the most powerful levers for improving inference performance because:

  • ✅ Generation is memory-bound
  • ✅ The KV cache dominates memory reads
  • ✅ Reducing KV memory → directly reduces latency
  • ✅ Enables larger batch sizes → increases throughput

These "architectural shortcuts" must be carefully tuned to balance:

speed 🏎️ vs memory 💾 vs accuracy 📉.

But when done correctly, they unlock massive improvements in real-world LLM serving.

๐Ÿ“ 5. Architectural Alternatives to Transformers

The motivation for exploring architectural alternatives arises from a fundamental limitation: Transformers paired with autoregressive generation are inherently memory-limited during inference.
The architecture was originally optimized for training efficiency, not inference.
As a result, the full attention mechanism forces models to maintain a growing Key-Value (KV) cache, which becomes the primary bottleneck.

To overcome these structural constraints, researchers pursue radical architecture shifts that avoid or eliminate the sequential, memory-heavy properties of Transformers. Three main directions appear below:
✅ State-Space Models (SSMs)
✅ Linear Attention
✅ Diffusion Models for Language

6.1. State-Space Models (SSMs)

State-Space Models originate from signal processing and control theory, and were initially designed to model long sequences with O(N) (linear) complexity instead of O(Nยฒ).

✅ Core Motivation

The earliest successes of SSMs focused on long-context tasks, achieving remarkable efficiency. However, they initially struggled on general language modeling.

✅ Evolution of SSMs

  • S4
    Early SSM models like S4 implemented classic linear dynamical systems and excelled at synthetic long-range tasks.
  • Associative Recall Problem
    Early SSMs failed at associative recall: the ability to retrieve a specific memory far in the past.
    This capability is essential for language modeling but was missing in classical SSMs.
  • Mamba
    Introduced input-dependent parameters, allowing SSMs to dynamically adapt to sequence content.
    Mamba reached Transformer-level quality up to the 1B scale.
  • Jamba (52B MoE)
    Combines SSM and Transformer layers in a 1:7 ratio.
    Validated that SSM+Transformer hybrids can scale to tens of billions of parameters.

✅ Inference Benefit: O(1) State Instead of O(T) KV Cache

SSMs replace the growing KV cache (size O(T)) with a constant-size state (size O(1)).
This dramatically improves inference efficiency:

  • Memory stays constant
  • Latency does not grow with sequence length
  • More scalable for very long contexts
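
A toy sketch of why the state is O(1): a plain linear state-space recurrence in NumPy (the state size and matrices are illustrative, and real models such as Mamba make the dynamics input-dependent).

```python
import numpy as np

d_state, d_in = 16, 8
rng = np.random.default_rng(0)
A = rng.normal(size=(d_state, d_state)) * 0.1    # state transition (toy values)
B = rng.normal(size=(d_state, d_in))             # input projection
C = rng.normal(size=(d_in, d_state))             # output projection

h = np.zeros(d_state)                            # the entire memory: fixed size, O(1)
for x_t in rng.normal(size=(1000, d_in)):        # stream 1000 tokens
    h = A @ h + B @ x_t                          # update the state in place
    y_t = C @ h                                  # emit an output; nothing else is kept
# Memory never grows with sequence length, unlike a KV cache.
```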

6.2. Linear Attention: From Softmax to Associative Kernels (O(N) Inference)

Linear attention replaces softmax with an associative kernel so we can reorder computation, pre-accumulate K/V once, and evaluate queries in O(1) per token. It's not equal to softmax; it's an approximation designed for speed and constant memory.

🔎 What Problem Are We Solving?

Standard self-attention is:

$$ \text{Attn}(Q,K,V) = \text{softmax}(QK^\top)\,V $$

  • Requires forming an $N \times N$ score matrix (pairwise Q-K comparisons) → O(N²) compute and memory traffic.
  • In inference, even with a KV cache, each new token still attends to all past tokens → O(N) per step, O(N²) total.
  • Attention during generation is memory-bound; bandwidth dominates latency.

Goal: make attention associative so we can reorder the computation and avoid building the $N \times N$ matrix.

🧠 Core Idea: Replace Softmax with an Associative Kernel

Softmax uses the exponential kernel $e^{q^\top k}$, which is not associative (you can't regroup terms).
Linear attention approximates this kernel with one that factorizes:

$$ K(q,k) \approx \phi(q)^\top \phi(k) $$

If this holds, then:

$$ \text{Attn}(Q,K,V) \approx \frac{\phi(Q)\left(\phi(K)^\top V\right)}{\phi(Q)\left(\phi(K)^\top \mathbf{1}\right)} $$

Now we can reorder:

  1. Prefix accumulation: $S = \sum_{j=1}^{N} \phi(k_j)\,v_j^\top$ and $z = \sum_{j=1}^{N} \phi(k_j)$
  2. Query evaluation: $\text{out}_i = \dfrac{\phi(q_i)\,S}{\phi(q_i)\,z}$

  • Per-token cost: O(1)
  • Total: O(N)
  • Memory: constant state (no growing KV cache)

โš ๏ธ This is an approximation, not exact softmax.

🧩 Where Do the Feature Maps $\phi(\cdot)$ Come From?

We want $\exp(q^\top k) \approx \phi(q)^\top \phi(k)$. Common constructions:

  • Taylor / polynomial expansion (intuition):
    $e^{x} \approx 1 + x + \tfrac{x^2}{2!} + \cdots$.
    Each $(q^\top k)^m$ can be written as a dot product in a lifted space (e.g., $q^{\otimes m}$ with $k^{\otimes m}$).

  • Random features for softmax (Performer / FAVOR+):
    $\phi(x) = \exp(-\|x\|^2/2)\exp(Wx)$, with the rows of $W$ sampled from a Gaussian; a Monte Carlo approximation of $\exp(q^\top k)$.

  • Positive kernels (e.g., ReLU maps):
    Cheap $\phi(x) = \text{ReLU}(Wx) + 1$; a weaker approximation but fast.

The better the feature map, the closer we are to softmax behavior, at higher compute for $\phi$.

๐Ÿ” Why Reordering Matters for Inference

Softmax (exact): must compute $({q_t^\top k_j}{j\le t})$ every step โ†’ O(t) per step โ†’ O(Nยฒ) total.
Linear attention (approx.): maintain a running state: $[ S_t=S
{t-1}+\phi(k_t)V_t^\top,\quad z_t=z_{t-1}+\phi(k_t) ]$ then $[ \text{out}_t=\frac{\phi(q_t)S_t}{\phi(q_t)z_t} ]$ No loop over history โ†’ O(1) per step, O(N) total, constant memory.

🧮 Minimal Pseudocode (Single Head, Causal)

import numpy as np

def phi(x):
    # Simple positive feature map; Performer-style random features are a stronger choice
    return np.maximum(x, 0.0) + 1.0

N, d_k, d_v = 16, 8, 8                    # toy sizes
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(N, d_k)), rng.normal(size=(N, d_k)), rng.normal(size=(N, d_v))

S = np.zeros((d_k, d_v))                  # running state: sum of phi(k_t) v_t^T
z = np.zeros(d_k)                         # running normalizer: sum of phi(k_t)

for t in range(N):                        # streaming tokens (causal)
    phi_k = phi(K[t])                     # [d_phi]
    S += np.outer(phi_k, V[t])            # accumulate phi(k_t) v_t^T
    z += phi_k                            # accumulate for normalization

    phi_q = phi(Q[t])                     # [d_phi]
    out_t = (phi_q @ S) / (phi_q @ z + 1e-9)   # O(1) work per token

🆚 Linear Attention vs. FlashAttention (Online Softmax)

FlashAttention: exact softmax, with kernel tiling/fusion to cut HBM traffic; still O(N²) and still needs a KV cache. (A systems/kernel optimization.)

Linear Attention: changes the math (kernel approximation) → O(N), constant state, no KV cache; but approximate.

When to use:

  • Use FlashAttention when you need exactness.
  • Use Linear Attention when you need sub-quadratic scaling / streaming with acceptable approximation error.

โš–๏ธ Qualityโ€“Speed Tradeoffs

Pros: O(N) compute, constant memory, great for very long context & streaming; RNN-like updates.
Cons: Not identical to softmax; may need hybrids (e.g., mix linear + local + a few global softmax layers) to match quality.

💡 Intuition Recap

  • Kernel = similarity function. Softmax uses the exponential kernel $e^{q^\top k}$.
  • Associativity is the key. If $K(q,k) = \phi(q)^\top \phi(k)$, we can pre-sum K/V once and apply Q afterward, yielding O(N).
  • Linear attention $\neq$ softmax. It's a principled approximation enabling reordering and linear complexity.

โ“FAQ

Q: Is linear attention just โ€œreorderingโ€ softmax?
A: No. Reordering is only valid after replacing softmax with an associative kernel approximation.

Q: If K,V arrive one-by-one, do we still benefit?
A: Yes. We update a fixed-size state $((S, z))$ once per token and never revisit earlier tokens.

Q: Which $(\phi)$ should I use?
A: Performer/FAVOR+ random features are a strong default; Taylor/polynomial features are instructive; ReLU-style are fastest but weaker.

Q: Do current frontier LLMs use linear attention?
A: Most closed, general-purpose LLMs still use softmax attention (often with FlashAttention kernel

6.3. Diffusion Models for Language

Diffusion models break away entirely from the Transformer paradigm by abandoning autoregressive generation.

✅ Core Idea

Adapted from image generation:

  1. Start with full-sequence noise
  2. Iteratively denoise/refine for several steps
  3. Produce the complete output sequence

✅ Key Benefits

  • Parallel Token Generation
    All tokens are generated in parallel; no sequential dependency.
  • Iterative Refinement
    The model improves the sequence step-by-step instead of autoregressively.

✅ Inference Benefit: Massive Throughput

Because generation is fully parallel, GPUs can be fully saturated:

  • Higher tokens/sec than autoregressive Transformers
  • Excellent performance on coding benchmarks
  • Even with multiple refinement steps, total time can still be lower

This parallelism sidesteps the core bottlenecks of:

  • Autoregressive sequence dependence
  • KV cache growth
  • Memory-bound attention

✅ Summary

Architectural alternatives to Transformers represent a fundamentally different approach to solving inference bottlenecks:

  • SSMs remove the growing KV cache by maintaining a constant state
  • Linear/local attention reduces complexity with lightweight approximations
  • Diffusion models eliminate autoregression entirely by generating sequences in parallel

Unlike KV-cache optimization techniques like GQA or MLA, these models reimagine the entire computation and generation process, offering potentially dramatic improvements in inference efficiency and scalability.

💾 7. Quantization + Pruning

Quantization

Reducing precision:

  • bf16 (2 bytes)
  • int8 (1 byte)
  • int4 (0.5 bytes)

Variants:

  • LLM.int8() (handles activation outliers)
  • Activation-aware quantization (AWQ), and more aggressive int3/int2 schemes
  • SmoothQuant-style rescaling of activations and weights
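
A minimal sketch of the core idea: absmax (symmetric) int8 weight quantization in NumPy. This illustrates the precision-for-memory trade-off, not any specific library's scheme.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric absmax quantization: int8 values plus one floating-point scale per tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)                       # 1 byte/weight instead of 2 (bf16) or 4 (fp32)
err = np.abs(w - dequantize(q, s)).mean()     # small but nonzero: quantization is lossy
print(f"mean abs error: {err:.4f}")
```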

Pruning

  • Remove layers/heads/neurons
  • Retrain via distillation
  • Example: 15B → 8B with tiny accuracy loss

8. ✅ Speedups Without Accuracy Loss (Lossless)

🔮 8.1. Speculative Decoding

Use fast draft model (p) to generate multiple candidate tokens.
Use slow target model (q) to verify them in parallel.

Procedure:

  1. Draft model proposes tokens
  2. Target model scores all drafted tokens in one parallel pass (prefill-like, so it is fast)
  3. Accept/reject each token using the corrected probability min(1, q/p)

Used in:

  • Medusa (multi-token draft heads)
  • EAGLE (draft model leverages target model features)

Extreme example:
70B target + 8B draft → large speedup, no accuracy loss.
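
A minimal sketch of the accept/reject loop (toy distributions over a tiny vocabulary; the function name and the argmax drafting are illustrative assumptions, not a specific paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def verify(draft_probs, target_probs, proposed):
    """Accept each drafted token with prob min(1, q/p); on rejection, resample from the residual and stop."""
    accepted = []
    for t, tok in enumerate(proposed):
        p, q = draft_probs[t][tok], target_probs[t][tok]
        if rng.random() < min(1.0, q / p):
            accepted.append(tok)                      # kept tokens follow the target distribution
        else:
            residual = np.maximum(target_probs[t] - draft_probs[t], 0)
            accepted.append(rng.choice(len(residual), p=residual / residual.sum()))
            break                                     # everything drafted after a rejection is discarded
    return accepted

# Toy example: 4 drafted tokens over a 10-token vocabulary.
draft = rng.dirichlet(np.ones(10), size=4)            # p: draft model's per-position distributions
target = rng.dirichlet(np.ones(10), size=4)           # q: target model's, obtained in one parallel pass
proposed = [int(d.argmax()) for d in draft]
print(verify(draft, target, proposed))
```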

9. 🧵 Handling Dynamic Real-World Inference Workloads

Real-world LLM inference traffic is fundamentally different from training.
Training processes dense blocks of tokens with a fixed Batch Size × Sequence Length, but production workloads behave like a ragged array:

  • Requests arrive and finish at different times
  • Many requests share prefixes (system prompts, multi-sample decoding)
  • Sequence lengths vary widely, making padding inefficient

To handle this unpredictable, heterogeneous traffic efficiently, we need specialized systems-level techniques. Two of the most important are Continuous Batching and Paged Attention.

9.1. Continuous Batching (Iteration-Level Scheduling)

Continuous batching ensures high GPU utilization by never waiting for a batch to finish before adding new requests.

✅ The Idea

During decoding, generation occurs one token per step.
After generating a token for all active sequences:

  1. The worker returns control to the scheduler
  2. The scheduler adds new incoming requests immediately
  3. The next decode step runs with the expanded batch

This prevents the GPU from ever running "half empty."

✅ Selective Batching for Heterogeneous Lengths

Different sequences have different lengths. Traditional batching struggles because tensors must be same-sized. Continuous batching handles this via selective batching:

1. MLP Layers (Most Compute-Heavy)
  • MLP weights are shared across all sequences
  • Tokens from different sequences can be flattened and concatenated
    • Example: sequences of lengths 3, 9, 5 → flattened to 17 tokens
  • All tokens are processed together in one large batch
  • Huge throughput gains because MLPs dominate compute
2. Attention Layers
  • Attention cannot be merged, because each sequence has its own KV cache
  • Must process each sequence separately for attention heads
  • But because MLP is the bottleneck, selective batching still yields major gains

Result:
High throughput even when requests are messy, diverse, and unpredictable.
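
A minimal sketch of iteration-level scheduling (plain Python; the request format, step function, and MAX_BATCH limit are illustrative assumptions):

```python
from collections import deque

waiting = deque()        # requests that have arrived but are not yet running
running = []             # requests currently generating, one token per step
MAX_BATCH = 64

def decode_step(batch):
    """Placeholder for one forward pass that appends one token to every active sequence."""
    for req in batch:
        req["tokens"].append("<tok>")

def scheduler_iteration():
    # 1. Admit new requests at every step, never waiting for the batch to drain.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    # 2. Generate one token for every active sequence.
    if running:
        decode_step(running)
    # 3. Retire finished sequences immediately so their slots free up next step.
    running[:] = [r for r in running if len(r["tokens"]) < r["max_new_tokens"]]
```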

9.2. Paged Attention (vLLM-Style Memory Management)

Paged Attention solves the KV-cache fragmentation problem using ideas from operating-system paging.

✅ Why Fragmentation Happens

Traditional KV-cache allocation requires a single contiguous block for each request:

  • Internal fragmentation:
    If a user expects 512 tokens but only generates 80, the leftover KV space is wasted.
  • External fragmentation:
    Gaps appear between allocated regions, leaving unusable "white space."

This leads to massive memory waste when many sequences have different lengths.

✅ The Solution: Paging the KV Cache

Paged Attention divides each sequence's KV cache into non-contiguous blocks and places them wherever space is available.

1. Non-Contiguous KV Blocks

  • KV cache is broken into fixed-size blocks
  • Blocks can be stored anywhere in memory
  • No need for one large contiguous region
  • Completely eliminates internal & external fragmentation

2. Prefix Sharing (Copy-on-Write)

Many sequences share initial prefixes:

  • system prompts
  • chat histories
  • multiple decoding samples from the same prompt

Paged Attention enables block-level sharing of these prefixes:

  • Blocks include a reference counter
  • Multiple sequences point to the same KV blocks
  • When sequences diverge:
    • Only the modified block is copied
    • Others remain shared

This saves:

  • KV memory
  • Compute (prefix need not be recomputed)
  • Latency for multi-sample generation
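
A minimal sketch of the bookkeeping involved (plain Python; the block size, BlockManager class, and copy-on-write helper are illustrative assumptions inspired by the description above, not vLLM's actual API):

```python
BLOCK_SIZE = 16                               # tokens per KV block (assumed)

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # physical KV blocks, located anywhere in memory
        self.refcount = {}                    # block id -> number of sequences pointing at it

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block_table):
        """Share a prefix: a new sequence reuses the parent's physical blocks."""
        for block in block_table:
            self.refcount[block] += 1
        return list(block_table)              # per-sequence logical -> physical mapping

    def copy_on_write(self, block_table, idx):
        """Before writing into a shared block, give this sequence its own copy."""
        block = block_table[idx]
        if self.refcount[block] > 1:
            self.refcount[block] -= 1
            block_table[idx] = self.allocate()   # the KV data itself would be copied here
        return block_table[idx]
```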

9.3. Additional vLLM Optimizations

Paged Attention is part of a broader set of optimizations enabling high-throughput serving:

  • FlashAttention / FlashDecoding kernels
    Highly optimized attention kernels for fast memory-bound operations
  • Kernel fusion:
    Merges block read + attention into single kernels to reduce overhead
  • CUDA Graphs:
    Pre-captures execution graphs to reduce kernel launch overhead
  • Efficient scheduling:
    Ensures continuous batching with minimal latency

✅ Summary

Handling dynamic, real-world inference requires more than just model optimizations; it demands systems engineering:

  • Continuous batching keeps GPUs busy despite ragged, asynchronous workloads
  • Paged Attention eliminates KV-cache fragmentation and allows prefix sharing
  • Kernel-level optimizations (FlashAttention, CUDA Graphs) reduce memory traffic & overhead

Together, these techniques allow LLM serving frameworks like vLLM to achieve far higher memory efficiency and throughput on unpredictable, heterogeneous inference traffic.

10. ✅ Final Summary

Inference ≠ training.

Inference is:

  • ✅ sequential
  • ✅ memory-bound
  • ✅ dynamic
  • ✅ KV-cache dominated

Optimizations fall into 3 buckets:

✅ Lossy

  • KV cache reduction (GQA/MLA/CLA)
  • New architectures (Mamba, linear attention)
  • Quantization / pruning

✅ Lossless

  • Speculative decoding
  • Parallel verification
  • Medusa/EAGLE

✅ System-level

  • Continuous batching
  • Paged attention
  • Multi-replica parallelism

The biggest wins increasingly come from rethinking architectures (SSMs, hybrids, diffusion).