Transformers and Autoregressive Models
This document reviews the main themes and key takeaways from *Deep Learning Systems: Algorithms and Implementation* at Carnegie Mellon University, taught by J. Zico Kolter and Tianqi Chen. It summarizes the lecture on Transformers and attention mechanisms, focusing on how Transformers, initially developed for time series modeling, became a dominant architecture across deep learning. We cover the core concepts, motivations, advantages, limitations, and applications beyond time series.
1. Time Series Modeling: Two Approaches

Recurrent Neural Network (RNN) - Latent State Approach
- Concept:
  RNNs maintain a "latent state" that summarizes past information up to a given time point.
  "The latent state (h_t) acts as memory, accumulating information over time."
- Pros:
  - Potentially infinite history: can capture dependencies arbitrarily far in the past.
  - Compact representation: the entire history is condensed into a single state.
- Cons:
  - Long compute path: signals (and gradients) from the distant past may vanish or explode as they pass through the chain of hidden states.
  - Difficult to incorporate long-term dependencies in practice.
Direct Prediction Approach
- Concept:
  Directly maps input sequences to outputs without relying on latent states.
  "Predict each y_t directly as a function of the inputs x_{1:t}, without an intermediate state."
- Pros:
  - Shorter compute paths: information flows directly from inputs to outputs.
- Cons:
  - No compact state representation: the entire history is reprocessed for each prediction.
  - Finite history: limited by the input window size.
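The latent-state idea above can be sketched in a few lines of numpy (illustrative names and sizes, not the course's code): a single vector h_t is updated at each step and serves as the compressed summary of everything seen so far.

```python
import numpy as np

# Minimal sketch (assumed details) of an RNN's latent-state update:
# h_t = tanh(W_hh h_{t-1} + W_xh x_t). The whole history is compressed
# into one fixed-size vector, but information must pass through t steps.

rng = np.random.default_rng(0)
d_in, d_hidden, T = 4, 8, 10

W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1
W_xh = rng.normal(size=(d_hidden, d_in)) * 0.1

def rnn_latent_states(X):
    """Return the latent state h_t at every time step for inputs X of shape (T, d_in)."""
    h = np.zeros(d_hidden)
    states = []
    for x in X:
        h = np.tanh(W_hh @ h + W_xh @ x)  # new state depends only on old state and current input
        states.append(h)
    return np.stack(states)

X = rng.normal(size=(T, d_in))
H = rnn_latent_states(X)
print(H.shape)  # (10, 8): one latent state per time step
```

Note the long compute path: the influence of x_1 on h_T passes through T applications of W_hh, which is exactly where vanishing/exploding behavior arises.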
2. CNNs for Direct Prediction
- Concept:
  Temporal Convolutional Networks (TCNs) use causal convolutions to ensure outputs depend only on past and current inputs.
- Causal Convolutions:
  "Hidden states at time t depend only on states up to time t."
- Limitations:
  - Limited receptive field: each layer sees only a small window, so capturing long histories requires very deep networks.
- Solutions:
  - Dilated convolutions
  - Pooling layers
  Each solution has trade-offs, such as increased parameter count or sparser coverage of the inputs.
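A causal convolution with dilation can be sketched as follows (a minimal numpy illustration, not the lecture's code): the output at time t uses only x[t], x[t - dilation], and so on, never future inputs, and stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially without adding parameters per layer.

```python
import numpy as np

def causal_conv1d(x, w, dilation=1):
    """y[t] = sum_j w[j] * x[t - dilation * j], treating x at negative indices as zero."""
    T, k = len(x), len(w)
    y = np.zeros(T)
    for t in range(T):
        for j in range(k):
            idx = t - dilation * j  # only current and past inputs: idx <= t
            if idx >= 0:
                y[t] += w[j] * x[idx]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
print(causal_conv1d(x, np.array([1.0, 1.0])))              # [1. 3. 5. 7.]
print(causal_conv1d(x, np.array([1.0, 1.0]), dilation=2))  # [1. 2. 4. 6.]
```

With kernel size k and L layers of dilation 2^l, the receptive field is roughly (k - 1)(2^L - 1) + 1, which is the trade-off the dilation trick buys over plain stacking.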
3. Attention Mechanisms
- Concept:
  Attention assigns weights to states and combines them as a weighted sum over time.
  "Initially used in RNNs to combine latent states over all time points."
4. Self-Attention
- Concept:
  Attention in which the weights are determined by the inputs themselves, via queries, keys, and values.
  "Self-attention uses Q (queries), K (keys), and V (values) matrices."
- Operation:
  SelfAttention(Q, K, V) = softmax(QK^T / sqrt(d)) V
- Properties:
  - Permutation equivariance: permuting the input order permutes the outputs in the same way.
  - Global influence: every output attends to all time steps.
  - Constant parameter count: the entire sequence is mixed without increasing the number of parameters.
- Compute Cost:
  - O(T^2 d), which is difficult to reduce.
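The operation above is only a few lines of numpy (a single-head, batch-free sketch; variable names are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """SelfAttention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, with Q, K, V of shape (T, d)."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))  # (T, T) attention weights; each row sums to 1
    return A @ V                        # (T, d) weighted combination of the values

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (5, 8)
```

Forming the (T, T) matrix A is where the O(T^2 d) cost comes from, and permuting the rows of Q, K, V simply permutes the rows of the output, which is the equivariance property listed above.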

5. Transformer Architecture
- Concept:
  Uses self-attention and feedforward layers to process sequences.
  "Transforms inputs to hidden states through a series of blocks."
- Transformer Block:
  - Self-attention
  - Residual connections
  - Layer normalization
  - Feedforward network
- Parallel Processing:
  Processes all time steps in parallel (unlike RNNs).
- Advantages:
  - Full receptive field in a single layer.
  - Mixes the entire sequence without increasing the parameter count.
- Disadvantages:
  - In autoregressive tasks, plain self-attention lets outputs depend on future inputs, which must be prevented.
  - Permutation equivariance: no inherent notion of sequence order.
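The block's four components can be composed as follows. This is a minimal numpy sketch with assumed dimensions and a post-norm arrangement; the lecture's exact ordering (pre- vs post-norm) and parameterization may differ.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each time step's features to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(X, W_q, W_k, W_v, W1, W2):
    """One block: self-attention, residual + norm, ReLU feedforward, residual + norm."""
    d = X.shape[-1]
    A = softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(d))  # (T, T) attention over all steps
    Z = layer_norm(X + A @ (X @ W_v))                  # attention output + residual, normalized
    H = np.maximum(Z @ W1, 0) @ W2                     # position-wise feedforward (ReLU)
    return layer_norm(Z + H)                           # feedforward + residual, normalized

rng = np.random.default_rng(0)
T, d, d_ff = 6, 8, 16
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d)) * 0.1
H = transformer_block(X, W_q, W_k, W_v, W1, W2)
print(H.shape)  # (6, 8)
```

Every time step is processed by the same matrix operations at once, which is the parallelism advantage over an RNN's sequential state updates.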

6. Addressing Limitations
- Masked Self-Attention:
  "Zero weight is assigned to future steps to enforce causality."
- Positional Encodings:
  "Sinusoidal encodings are added to capture sequence order."
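Both fixes can be sketched concretely (illustrative numpy, assumed details): a causal mask adds -inf to the attention logits at future positions, so the softmax gives them exactly zero weight, and sinusoidal positional encodings are added to the inputs so that order information survives the otherwise permutation-equivariant attention.

```python
import numpy as np

def causal_mask(T):
    """(T, T) additive mask: 0 on and below the diagonal, -inf strictly above (future steps)."""
    return np.where(np.triu(np.ones((T, T)), k=1) == 1, -np.inf, 0.0)

def sinusoidal_pos_encoding(T, d):
    """P[t, 2i] = sin(t / 10000^(2i/d)), P[t, 2i+1] = cos(same angle); assumes d is even."""
    pos = np.arange(T)[:, None]            # (T, 1) positions
    i = np.arange(0, d, 2)[None, :]        # (1, d/2) even dimension indices
    angles = pos / (10000.0 ** (i / d))    # (T, d/2) angle per position/frequency
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

M = causal_mask(3)                   # add M to the (T, T) logits before the softmax
P = sinusoidal_pos_encoding(10, 8)   # add P to the (T, d) input embeddings
```

Since softmax(z + M) assigns e^{-inf} = 0 probability to masked entries, row t of the attention matrix covers only steps 1..t, restoring causality for autoregressive use.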
7. Transformers Beyond Time Series
- Vision Transformers (ViTs):
  Images are represented as sequences of patch embeddings.
- Graph Transformers:
  Capture graph structure using modified attention.
- Challenges:
  - Efficient computation of attention matrices
  - Effective positional embeddings
  - Mask matrix design
