TinyML GAN, Video, Point Cloud

Modern AI models are becoming increasingly large, demanding substantial computational resources and memory. This creates a gap between the computational demands of these models and the available hardware capabilities. Pruning addresses this gap by reducing model size, memory footprint, and ultimately, energy consumption.

Course link

I. 🌍 Overview

This lecture focuses on domain-specific optimization techniques for three computationally intensive areas of machine learning: Generative Adversarial Networks (GANs) for image generation, video understanding, and point cloud processing. The core principle is that these areas have inherent redundancies that can be exploited for efficiency gains. These redundancies include:

GANs: 2D spatial redundancy.
Videos: Temporal redundancy between frames.
Point Clouds: 3D spatial sparsity.

alt_text

The lecture explores techniques to reduce computation and memory usage in these areas, making them more deployable on resource-constrained devices.

II. ⚡ Efficient GANs

Generative Adversarial Networks (GANs) Background:

GANs consist of two networks: a generator (G) and a discriminator (D).
- Generator creates synthetic data.
- Discriminator tries to distinguish between real and generated data.
The goal is to train G to generate samples that D cannot distinguish from real data and train D to accurately distinguish real and generated data.
GANs are used for image generation.
There are two types of GANs:
1. Unconditional GANs: “Random noise feeds the G model generator with random noise to generate the output.”
2. Conditional GANs: “Both the discriminator and the generator are conditioned and provided labels, either class labels, segmentation maps, strokes, etc.”

GAN Compression:

Generative models like GANs are computationally expensive compared to recognition models, making compression a high priority.

“Such generative models are much more computationally expensive than these recognition models.”
The generator is the focus for compression because it is used at inference time.

“At generation time, only the generator will be used, the discriminator will not be used at inference time, so the focus of compression will be the generator part.”

Compression Techniques:

Three losses are used during compression:
1. Reconstruction Loss: Pixel-wise loss between generated and target images.
  
  “If we want to turn a horse into a zebra, we calculate the pixel-wise loss for each pixel to ensure it looks like a zebra.”
2. Distillation Loss: Matching intermediate feature maps of compressed and original models.
  
  “We match the intermediate feature maps, where the channel number is different, so we have to learn this one by one convolution to project between a large resolution with many channels and a smaller number of channels.”
3. Conditional GAN Loss: The original GAN loss to maintain the generative capability.
  
  “Finally, there’s the conditional GAN loss, the original GAN loss.”
A “once for all” approach is used by creating a super network with a wide range of channel numbers.

“We can use the ‘once for all’ approach between super networks, similar to what we did in lab three by changing the number of channels from small to large and randomly selecting the number of channels.”
Compression results in significant reductions in model size (e.g., 11x-21x) while maintaining good image quality.

“This is the GAN compression that makes it 11 times smaller and still achieves comparable quality. On the right side, the input is a horse and the output is a zebra. After compression by 21 times, the output is still roughly the same, which is pretty cool.”

AnyCost GAN:

AnyCost GAN addresses the need for faster image editing with GANs for interactive applications. The goal is to get very fast previews by reducing the computation needed while maintaining reasonable image quality.

“Given an input image, we want to project it into a hidden space, into a latent space.”
A sub-network (smaller, lower compute) is used for fast previews, and the full network is used for the final output.

“During editing, we run a low-cost smaller sub-network, and then we have a super network. During editing, we can run a smaller sub-network running at low cost.”
Training ensures the output is consistent across different resolutions and numbers of channels.

“We train the model to produce consistent output at different resolutions and different channel numbers.”
Multiple resolutions are supported using random sampling during training.

“At training time, we randomly sample different resolution outputs to the discriminator… the sampling approach is very helpful.”
Adaptive channel numbers are also sampled.

“The approach is to sample adaptive channel numbers during training.”
A distillation loss is added to maintain consistency between low and full channel outputs.
A generator-conditioned discriminator is introduced because a single discriminator does not provide effective feedback to different generator sizes.

“We propose this generator-conditioned discriminator so the discriminator should be aware of the configuration of the generator.”
Results: Interactive image editing capabilities, even on laptops.

Differentiable Augmentation:

GANs require large training datasets, which are expensive to collect and label. With limited data, the discriminator overfits, leading to poor generator performance.

“The discriminator will overfit to the small amount of training data, leading to worse generator performance because the discriminator is very poor, so the generator cannot work well.”
Traditional data augmentation (e.g., rotation, color jitter, translation, cutout) is used to prevent overfitting by generating more training examples, but it creates artifacts on generated images when applied to real images only.

“If we augment only the real images, the discriminator can take either the real or fake images, but the generated images suffer from the same artifacts.”
To combat this, augmentation transformations are applied to both real and generated images, and in a differentiable manner, so that gradients can be backpropagated to the generator.

“The solution is to apply augmentation to both the real and the fake images for both the discriminator and the generator.”
This approach significantly improves GAN performance with limited data.

“Immediately, it solves the problem of poor quality. The FID stays pretty stable.”

GAN Model/Technique	Description	Real-World Applications	Key Details
Unconditional GAN	Generates outputs (e.g., images) from random noise.	Basic image generation, though without control over specific attributes.	The generator is fed with random noise to produce outputs.
Conditional GAN	Both the generator and discriminator are conditioned on provided labels, such as class labels or segmentation maps.	Image editing based on user-provided conditions, like turning a horse into a zebra.	Enables more control over the generated output by conditioning the process on specific inputs or labels.
GauGAN	A conditional GAN model that generates images from strokes.	Drawing and real-time image rendering, e.g., drawing mountains, rivers, or grass.	Allows users to draw a sketch and have the GAN render a realistic image based on those strokes.
Pix2pix	A model that learns a mapping from input to output images.	Image-to-image translation, such as turning a horse into a zebra.	Trained with paired data to perform direct image-to-image transformations.
CycleGAN	A model used for unpaired image-to-image translation.	Translating images between domains (e.g., horses ↔ zebras) without paired training data.	Works with unpaired data, making it versatile for tasks where paired training data is not available.
AnyCost GAN	A GAN trained to produce consistent output at different resolutions and channel numbers.	Interactive image editing with fast previews and high-quality final images.	Utilizes a smaller sub-network for fast previews and the full model for high-quality outputs. Techniques include multi-resolution and adaptive channel training.
StyleGAN2	A GAN model where only the highest resolution is fed into the discriminator.	Generates high-quality images; also used as a baseline model for AnyCost GANs.	Uses a specific discriminator training method where only the highest resolution images are processed.
MSG-GAN	A GAN model where all resolutions are fed into the discriminator.	Generates high-quality images; also used as a baseline model for AnyCost GANs.	Uses a specific discriminator training method where all image resolutions are processed.
GAN Compression	Techniques for compressing the generator to speed up image generation.	Faster image generation for scenarios where large models would be too slow.	Reduces generator channels through neural architecture search, employing reconstruction, distillation, and conditional GAN losses.
Differentiable Augmentation	A technique to improve GAN training with limited data by augmenting real and fake images.	Effective GAN training with small datasets, preventing overfitting.	Applies augmentations (e.g., color jittering, translation) in a differentiable manner during generator and discriminator training.

🎥 III. Efficient Video Understanding

🌟 Background:

Applications: Critical for tasks like action recognition, scene understanding, and autonomous driving.
Core Challenge: Temporal modeling of changes across frames.

“The interesting new thing about video understanding compared with image is that we need to perform temporal modeling.”
Redundancy: Videos have significant static background redundancy.

“The table stays exactly the same; only the foreground (e.g., hand movements) changes.”

🖼️ 2D CNN-Based Methods:

How It Works:
- Sample frames, process them using 2D CNNs, and aggregate scores (e.g., average pooling, max pooling).
- Two-stream approach: Use spatial information and temporal info (e.g., optical flow), though optical flow computation is slow.
Strengths: Efficient and reuses existing 2D CNNs.
Weaknesses: Limited temporal modeling and slower performance with optical flow.

🔄 3D CNN-Based Methods:

How It Works:
- Convolve across spatial and temporal dimensions for joint modeling.
- Initialized using 2D CNN weights (e.g., ImageNet pre-trained) by inflation.
Strengths: Captures spatiotemporal information, including low-level details.
Weaknesses: High computation and storage costs.

⏳ Temporal Shift Module (TSM):

The Temporal Shift Module (TSM) is a technique for efficient video understanding that enables temporal modeling in 2D convolutional neural networks (CNNs) at a low computational cost. Here’s a breakdown of its key aspects:

Core Idea:

TSM’s core idea is to shift a portion of the channels in a feature map along the temporal dimension. This shift facilitates information exchange between neighboring frames, allowing the network to model temporal relationships without the need for computationally intensive 3D convolutions.

How it Works:

Shifting Channels:
TSM shifts a specific number of channels forward or backward in the temporal dimension. This involves moving the channel data from one time frame to another.
No Additional Parameters or FLOPs:
The shift operation itself doesn’t involve any learnable parameters or additional floating point operations (FLOPs). The data is just moved or the pointer is moved without moving the data. This makes it very efficient.
Information Mixing:
By shifting channels, the feature map at a given time step incorporates information from previous and/or subsequent time steps. A 1x1 convolution is then used to mix this information across the channel dimension.

Types of TSM:

Bi-directional TSM:
Shifts channels both from the past to the present and from the present to the future. This is suitable for offline video processing where the entire video sequence is available.
Uni-directional TSM:
Shifts channels only from the past to the present. This is used for online video understanding where processing must occur in a streaming fashion without access to future frames.

Implementation:

Ease of Integration:
TSM can be easily inserted into existing 2D CNN architectures. The shift operation can be implemented in just a few lines of code, as it does not require complex operations.
Shift Operation:
The shift operation is typically followed by a 2D convolution, which can be a standard convolution layer. It can be implemented by moving the pointer rather than moving the data, resulting in zero FLOPs and zero parameters.

Integration with 2D CNNs:

Offline TSM:
In offline processing, the frames are processed in batches, and TSM is applied to allow information exchange between frames. An identity layer is used to compensate for the loss of information due to shifting.
Online TSM:
For online processing, frames are processed sequentially in a streaming manner. TSM is implemented so that the information from previous time steps is carried forward. As each new frame is processed, some of its channels are shifted out and replaced with channels from the previous time step.

Advantages:

Efficiency:
TSM achieves performance comparable to 3D CNNs but with significantly lower computational costs, closer to the costs of 2D CNNs.
Flexibility:
TSM can be easily integrated into existing 2D CNN architectures, enabling the reuse of pre-trained models.
Effectiveness:
TSM effectively captures temporal dependencies in videos, leading to good performance in various video understanding tasks.

Performance:

Efficiency Comparison:
TSM consumes less computation than other models, such as the ECO family and Non-local I3D family, while also achieving better performance.
Higher Throughput & Lower Latency:
TSM has been shown to have higher throughput and lower latency compared to 3D CNNs, while maintaining competitive accuracy.
Scalability:
TSM can also be scaled up for large-scale video training, achieving a speed-up of 200x in training time.

Applications:

Action Recognition:
Identifying actions or activities within videos.
Fall Detection:
Detecting when a person has fallen in a video.
Video Recommendation:
Providing personalized video recommendations.
Autonomous Driving:
Improving detection with temporal cues in self-driving scenarios, where online processing is crucial.
Gesture Recognition:
Recognizing gestures, such as in the context of Google Maps navigation.

Scalability:

Low Latency for Low-Power Devices:
TSM has low latency for low-power deployment and can run in real-time on mobile devices.
Large-Scale Distributed Training:
TSM can be scaled up for large-scale distributed training, taking advantage of multiple GPUs to speed up training.

Dissecting TSM:

Channel Semantics:
TSM’s channels learn different semantics. For example, one channel might focus on moving something away, while another focuses on pushing something to the left.

🌍 IV. Efficient Point Cloud Understanding

🌟 Background:

What Are Point Clouds?
Collections of 3D points (x, y, z) with additional intensity values.

“Sparse compared to 2D images, with many empty regions when no object is present.”
Challenges:
- Irregular and sparse structure makes processing memory-intensive.
- Edge devices face additional computational resource constraints.

📦 Bottlenecks of Point and Voxel-Based Methods:

Voxel-Based Methods:
- Divides 3D space into grids (voxels), which regularizes data but increases memory usage and computation.
- Can cause information loss due to coarse quantization.
Point-Based Methods:
- Processes individual points directly, avoiding voxelization.
- Inefficient memory access during parallel computation.
  
  “We prefer continuous memory access without conflicts.”

⚡ Point-Voxel CNN (PVCNN):

Hybrid Approach: Combines strengths of voxel-based and point-based methods.
1. Voxel Branch: Aggregates neighborhood information with 3D convolutions.
  
  “Divides 3D space into voxels and convolves them for regular, structured data.”
2. Point Branch: Transforms individual point features to retain high resolution.
  
  “Focuses on pointwise operations to avoid information loss.”
Results: High efficiency and accuracy.

💨 Sparse Point-Voxel CNN (SPVCNN):

Introduces sparse convolution to the voxel branch:
- Reduces computation by processing only occupied voxels.
- Minimizes information loss from voxelization.

🤖 3D Neural Architecture Search:

Applies NAS on SPVCNN to optimize the architecture.
Uses a super network with elastic channel numbers and depth.
Achieves efficient architecture search with reduced computational costs.

🚘 Bird’s-Eye View (BEV) Fusion:

What It Does: Fuses multi-sensor data (e.g., cameras, LiDAR) in a shared bird’s-eye view space.

“Transforms sensor data into BEV for fusion.”
How It Works:
- Dense branch processes camera data; sparse branch processes LiDAR data.
- Supports multi-task learning (e.g., object detection and segmentation).
Results: Achieves top performance on benchmarks.

alt_text