📚 Transformers and Autoregressive Models

Course Link

This document reviews the main themes and key takeaways from **Deep Learning Systems: Algorithms and Implementation** at Carnegie Mellon University, taught by J. Zico Kolter and Tianqi Chen.


This document summarizes key concepts from the lecture on Transformers and attention mechanisms. The focus is on understanding how Transformers, initially developed for time series modeling, have become a dominant architecture in various deep learning applications. We explore core concepts, motivations, advantages, limitations, and applications beyond time series.

1. โณ Time Series Modeling: Two Approaches

🔄 Recurrent Neural Network (RNN) - Latent State Approach

  • Concept:
    RNNs maintain a "latent state" that summarizes past information up to a given time point.
    🧩 "The latent state (h_t) acts as memory, accumulating information over time."
  • Pros:
    • 📜 Potentially infinite history: Can capture arbitrarily long past dependencies.
    • 🗜️ Compact representation: Entire history condensed into a single fixed-size state.
  • Cons:
    • 🧮 Long compute path: Information from the distant past must pass through many hidden-state updates, so gradients can vanish or explode.
    • ❌ Difficult to incorporate long-term dependencies in practice.

🎯 Direct Prediction Approach

  • Concept:
    Directly maps input sequences to outputs without relying on latent states.
    🧮 "Predict each y_t as a function of the inputs x_1, …, x_t directly, without an intermediate state."
  • Pros:
    • ⚡ Shorter compute paths: Information flows from input to output in few steps.
  • Cons:
    • ⛔ No compact state representation: The relevant history must be reprocessed for each prediction.
    • 📏 Finite history: Limited by the input window size.

2. ๐Ÿ› ๏ธ CNNs for Direct Prediction

  • Concept:
    Temporal Convolutional Networks (TCNs) use causal convolutions to ensure outputs depend only on past and current inputs.
  • Causal Convolutions:
    โฐ โ€œHidden states at time t depend only on states up to time t.โ€
  • Limitations:
    • ๐Ÿ” Limited receptive field: Small receptive field, requiring deeper networks.
  • Solutions:
    • ๐Ÿ“ˆ Dilated convolutions
    • ๐ŸŠ Pooling layers
      Each solution has trade-offs like parameter increase or sparse inputs.
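
Causality can be implemented by left-padding the input before a 1-D convolution, so the output at time t never touches future inputs. The kernel convention below (kernel[i] multiplies the input delayed by i·dilation steps) is an illustrative choice:

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """Causal 1-D convolution: y[t] = sum_i kernel[i] * x[t - i*dilation],
    with x treated as zero before the sequence starts."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # zeros for "before time 0"
    # taps[i] is the input delayed by i*dilation steps
    taps = [xp[pad - i * dilation : pad - i * dilation + len(x)] for i in range(k)]
    return sum(w * t for w, t in zip(kernel, taps))

x = np.array([1.0, 2.0, 3.0, 4.0])
# kernel [0, 1] is a pure one-step delay: output depends only on the past
print(causal_conv1d(x, np.array([0.0, 1.0])))               # [0. 1. 2. 3.]
print(causal_conv1d(x, np.array([0.0, 1.0]), dilation=2))   # [0. 0. 1. 2.]
```

With dilation, the receptive field of one layer grows from k to (k-1)·dilation + 1 without adding parameters, which is why stacking layers with increasing dilation is the standard TCN trick.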

3. 🎯 Attention Mechanisms

  • Concept:
    Attention combines states by computing a weighted sum over time, with learned weights.
    🧑‍🏫 "Initially used in RNNs to combine latent states over all time points."

4. ๐ŸŒ Self-Attention

  • Concept:
    Attention where weights are determined by inputs (using queries, keys, and values).
    ๐Ÿ—๏ธ โ€œSelf-attention uses Q (queries), K (keys), and V (values) matrices.โ€
  • Operation:
    SelfAttention(Q, K, V) = softmax(QK^T / sqrt(d))V
  • Properties:
    • ๐Ÿ”„ Permutation Equivariance: Order of inputs doesnโ€™t affect result.
    • ๐ŸŒ Global Influence: Considers all time steps.
    • ๐Ÿ“Š Constant parameter count: Entire sequence processed without increasing parameters.
  • Compute Cost:
    • ๐Ÿ’ธ O(Tยฒd): Difficult to reduce.
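
The formula above translates directly into code. The projection matrices here are illustrative random weights; the final assertion checks the permutation-equivariance property numerically:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """softmax(Q K^T / sqrt(d)) V with Q, K, V projected from X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (T, T) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
T, d = 5, 4
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 4)

# Permutation equivariance: shuffling the inputs shuffles the outputs
# identically -- the operation has no built-in notion of order.
perm = rng.permutation(T)
assert np.allclose(self_attention(X[perm], W_q, W_k, W_v), out[perm])
```

Note the (T, T) `scores` matrix: that is exactly where the O(T²d) cost comes from.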


5. 🚀 Transformer Architecture

  • Concept:
    Uses self-attention and feedforward layers to process sequences.
    🔧 "Transforms inputs to hidden states through a series of blocks."
  • Transformer Block:
    • 🔍 Self-attention
    • ➕ Residual connections
    • ⚖️ Layer normalization
    • 🔨 Feedforward network
  • Parallel Processing:
    🏎️ Processes all time steps in parallel (unlike RNNs, which must step through time sequentially).
  • Advantages:
    • 🌐 Full receptive field in a single layer.
    • 🛠️ Mixes information across the entire sequence without increasing the parameter count.
  • Disadvantages:
    • ⏱️ Outputs can depend on future inputs, which breaks causality for autoregressive tasks.
    • 🔄 Permutation equivariance: No inherent notion of sequence order.
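
The block structure listed above can be sketched as a post-norm Transformer block. Single-head attention, the ReLU feedforward, and all weight shapes are illustrative simplifications:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(X, W_q, W_k, W_v, W_o, W_1, W_2):
    d = X.shape[-1]
    # self-attention, then residual connection + layer norm
    attn = softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(d)) @ (X @ W_v)
    Z = layer_norm(X + attn @ W_o)
    # two-layer ReLU feedforward, then residual connection + layer norm
    ff = np.maximum(Z @ W_1, 0) @ W_2
    return layer_norm(Z + ff)

rng = np.random.default_rng(0)
T, d, d_ff = 6, 4, 8
X = rng.normal(size=(T, d))
W_q, W_k, W_v, W_o = (0.5 * rng.normal(size=(d, d)) for _ in range(4))
W_1 = 0.5 * rng.normal(size=(d, d_ff))
W_2 = 0.5 * rng.normal(size=(d_ff, d))
out = transformer_block(X, W_q, W_k, W_v, W_o, W_1, W_2)
print(out.shape)  # (6, 4)
```

Every time step is processed by the same weights in one batched matrix product, which is the parallelism advantage over an RNN's sequential loop.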


6. ๐Ÿ›ก๏ธ Addressing Limitations

  • Masked Self-Attention:
    ๐Ÿ”’ โ€œZero weight assigned to future steps to enforce causality.โ€
  • Positional Encodings:
    ๐Ÿ“Š โ€œSinusoidal encodings added to capture sequence order.โ€
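
Both fixes can be sketched concretely: a causal mask adds -inf to the attention logits for future positions (so softmax assigns them zero weight), and sinusoidal encodings are added to the inputs. Shapes and the 10000 frequency base follow the standard formulation:

```python
import numpy as np

def causal_mask(T):
    # -inf above the diagonal -> softmax weight 0 on future positions
    return np.triu(np.full((T, T), -np.inf), k=1)

def sinusoidal_encoding(T, d):
    pos = np.arange(T)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
    return pe

T, d = 4, 6
scores = np.zeros((T, T)) + causal_mask(T)      # pretend all logits are equal
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# Row t attends uniformly over steps 0..t and never over the future
print(np.round(weights, 2))

pe = sinusoidal_encoding(T, d)  # would be added to the (T, d) inputs: X + pe
```

The mask restores causality for autoregressive use; the positional encodings break permutation equivariance on purpose, injecting the order information attention otherwise lacks.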

7. 📈 Transformers Beyond Time Series

  • Vision Transformers (ViTs):
    🖼️ Images represented as sequences of patch embeddings.
  • Graph Transformers:
    🕸️ Capture graph structure using modified attention.
  • Challenges:
    • 🧮 Efficient computation of attention matrices
    • 📍 Effective positional embeddings
    • 🧱 Mask matrix design
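
The patch-embedding step behind ViTs can be sketched as follows; the image size, patch size, and model dimension are illustrative:

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened into a row vector."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)   # (num_patches, p*p*C)

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8, 3))            # tiny illustrative "image"
tokens = patchify(img, p=4)                 # 4 patches of 4x4x3 pixels
W_e = rng.normal(size=(4 * 4 * 3, 16))      # linear projection to d_model=16
emb = tokens @ W_e                          # the Transformer's input sequence
print(emb.shape)  # (4, 16)
```

From here the Transformer sees the image exactly as it would a time series: a sequence of token embeddings, with positional embeddings supplying the spatial layout.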