GPU Architecture

GPU Architecture refers to the hardware design of Graphics Processing Units, particularly the NVIDIA GPUs used for deep learning. Modern GPUs are built around a hierarchy of Streaming Multiprocessors (SMs) that together contain thousands of parallel processing cores, several memory levels with vastly different bandwidths, and specialized units for matrix operations.

Massively Parallel, Memory-Constrained

GPUs achieve high throughput by running thousands of threads simultaneously. However, the real bottleneck is often memory bandwidth—the GPU can compute faster than it can fetch data. Optimization means keeping data in fast memory (SRAM) and minimizing trips to slow memory (HBM).

Key Components

Streaming Multiprocessors (SMs)

Streaming Multiprocessor

An SM is the fundamental compute unit on NVIDIA GPUs. Each SM contains:

  • CUDA Cores: General-purpose ALUs for scalar/vector operations (64-128 per SM)
  • Tensor Cores: Specialized matrix-multiply units for accelerated FP16/BF16 matmul (4-8 per SM)
  • Shared Memory (SRAM): Fast, programmer-managed cache (~192KB per SM)
  • Register File: Per-thread ultra-fast storage
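Per-GPU totals follow directly from the per-SM counts. A minimal sketch, using A100-like figures (108 SMs, 64 FP32 cores and 4 Tensor Cores per SM) purely for illustration:

```python
# Rough per-GPU totals from per-SM counts (A100-like figures, illustrative).
num_sms = 108            # streaming multiprocessors
fp32_cores_per_sm = 64   # CUDA cores per SM
tensor_cores_per_sm = 4  # Tensor Cores per SM

total_cuda_cores = num_sms * fp32_cores_per_sm      # 6912
total_tensor_cores = num_sms * tensor_cores_per_sm  # 432

print(total_cuda_cores, total_tensor_cores)
```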

Memory Hierarchy

Memory Level            Capacity         Bandwidth      Latency
Registers               ~256KB per SM    ~20 TB/s       Fastest
Shared Memory (SRAM)    ~192KB per SM    ~15-19 TB/s    Very fast
L2 Cache                ~40MB            ~4-5 TB/s      Medium
HBM (Global Memory)     40-80GB          ~1.5-2 TB/s    Slowest
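To make the bandwidth gap concrete, here is a sketch estimating how long it takes just to stream a 1 GiB tensor once at each level's bandwidth (numbers taken from the table above; illustrative only):

```python
# Time to stream a 1 GiB tensor once at each memory level's bandwidth.
# Bandwidth figures (bytes/s) come from the table above and are illustrative.
GiB = 2**30
tensor_bytes = 1 * GiB

bandwidth = {
    "registers":     20e12,  # ~20 TB/s
    "shared_memory": 19e12,  # ~19 TB/s
    "l2_cache":      5e12,   # ~5 TB/s
    "hbm":           2e12,   # ~2 TB/s
}

for level, bw in bandwidth.items():
    t_us = tensor_bytes / bw * 1e6  # microseconds
    print(f"{level:13s}: {t_us:8.1f} us")
```

The same tensor that streams in ~54 us from shared memory takes ~537 us from HBM, which is why kernels are structured to touch HBM as rarely as possible.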

HBM vs SRAM

Memory Bandwidth Gap

SRAM is roughly 10x faster than HBM, making data locality critical for performance.

HBM (High Bandwidth Memory):

  • Large capacity (40-80GB)
  • Slower access (~2 TB/s)
  • Used for storing model weights, activations, gradients

SRAM (Shared Memory):

  • Small capacity (~192KB per SM)
  • Much faster access (~19 TB/s)
  • Programmer-managed, used for tiling and data reuse
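The value of tiling can be shown with a simple traffic model. A sketch, assuming a square N x N matmul and a T x T output tile whose working set fits in shared memory (so every loaded element is reused T times):

```python
# Modeled HBM read traffic (in elements) for an N x N matmul.
def naive_traffic(n):
    # Each of the n^2 outputs streams a full row of A and column of B from HBM.
    return 2 * n**3

def tiled_traffic(n, t):
    # Each T x T output tile walks the K dimension in n//t steps,
    # streaming two t x t input tiles per step; reuse happens in SRAM.
    out_tiles = (n // t) ** 2
    per_tile = (n // t) * 2 * t * t
    return out_tiles * per_tile  # = 2 * n**3 / t

n, t = 4096, 64
print(naive_traffic(n) / tiled_traffic(n, t))  # 64.0: traffic shrinks by the tile size
```

The reduction factor equals the tile size T, which is why kernels pick the largest tile that still fits in the ~192KB of shared memory per SM.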

Memory Bandwidth Bottleneck

The Real Limiting Factor

For many neural network operations (elementwise ops, reductions, attention), the GPU spends more time waiting for data than actually computing. The key optimization strategy is maximizing data reuse—load once into SRAM, compute many operations, then write back.
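The payoff of "load once, compute many operations" can be sketched with a traffic model for a chain of elementwise ops (hypothetical sizes; assumes a fused kernel keeps intermediates in registers/SRAM):

```python
# Bytes moved through HBM for a chain of k elementwise ops on an
# N-element FP32 tensor (4 bytes per element).
def unfused_bytes(n, k, elem=4):
    # Each of the k kernels reads the full tensor and writes it back.
    return k * 2 * n * elem

def fused_bytes(n, k, elem=4):
    # One read and one write; intermediates never touch HBM.
    return 2 * n * elem

n, k = 10_000_000, 3  # e.g. multiply -> add -> relu
print(unfused_bytes(n, k) / fused_bytes(n, k))  # 3.0: fusion cuts HBM traffic k-fold
```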

Arithmetic Intensity

Arithmetic Intensity

Arithmetic intensity is the ratio of floating-point operations performed to bytes moved to and from memory (FLOPs/byte). It determines which resource limits an operation:

  • Low intensity (e.g., elementwise ops): Memory-bound → optimize memory access
  • High intensity (e.g., large matmuls): Compute-bound → use Tensor Cores
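The two regimes can be separated by comparing an operation's arithmetic intensity against the machine balance (peak FLOP/s divided by memory bandwidth). A sketch with A100-like FP32 numbers, which are illustrative:

```python
# Arithmetic intensity (FLOPs per byte of HBM traffic) vs. machine balance.
peak_flops = 19.5e12  # FP32 FLOP/s (A100-like, illustrative)
hbm_bw = 2.0e12       # bytes/s
balance = peak_flops / hbm_bw  # ~9.75 FLOP/byte; below this -> memory-bound

def ai_elementwise_add(n, elem=4):
    flops = n               # one add per element
    traffic = 3 * n * elem  # read two inputs, write one output
    return flops / traffic  # ~0.083 FLOP/byte, far below balance

def ai_matmul(n, elem=4):
    flops = 2 * n**3            # one multiply + one add per inner-product step
    traffic = 3 * n * n * elem  # read A and B once, write C (ideal reuse)
    return flops / traffic      # n / 6 FLOP/byte, grows with problem size

print(ai_elementwise_add(1 << 20) < balance)  # True: memory-bound
print(ai_matmul(4096) > balance)              # True: compute-bound
```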

Bottleneck Types

Bottleneck        Symptom                    Solution
Compute-bound     High SM utilization        Tensor Cores, better algorithms
Memory-bound      Saturated HBM bandwidth    Kernel Fusion, tiling, data reuse
Overhead-bound    Many small kernels         Fuse kernels, reduce launches
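A kernel's bottleneck type follows from a roofline-style lower bound: its runtime can be no less than the larger of its compute time and its memory time. A minimal sketch, again with illustrative A100-like peaks:

```python
# Roofline-style lower bound on kernel runtime (illustrative peak numbers).
def kernel_time(flops, bytes_moved, peak_flops=19.5e12, hbm_bw=2.0e12):
    compute_t = flops / peak_flops
    memory_t = bytes_moved / hbm_bw
    bound = "compute" if compute_t > memory_t else "memory"
    return max(compute_t, memory_t), bound

# Elementwise add on 2^26 floats: 1 FLOP and 12 bytes of traffic per element.
n = 1 << 26
t, bound = kernel_time(n, 12 * n)
print(bound)  # memory: bandwidth, not compute, sets the floor
```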

Key Properties

  • Parallelism: Thousands of threads execute simultaneously across SMs
  • SIMT Execution: Single Instruction, Multiple Threads—threads in a warp execute the same instruction
  • Coalesced Memory Access: Adjacent threads should access adjacent memory for efficiency
  • Occupancy: Ratio of active warps to maximum warps per SM
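Occupancy is often capped by per-thread register use rather than by the warp limit itself. A sketch, assuming A100-like per-SM limits (64 max warps, a 65,536-register file, 32 threads per warp; all illustrative):

```python
# Occupancy limited by the register file (assumed A100-like limits).
def occupancy(regs_per_thread, max_warps=64, regfile=65536, warp_size=32):
    regs_per_warp = regs_per_thread * warp_size
    active_warps = min(max_warps, regfile // regs_per_warp)
    return active_warps / max_warps

print(occupancy(32))   # 1.0  -> register use does not limit occupancy
print(occupancy(128))  # 0.25 -> heavy register use quarters occupancy
```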

Connections

  • Triton — Python-based language for writing efficient GPU kernels
  • Kernel Fusion — Technique to reduce memory I/O by combining operations
  • Sparton — Uses GPU architecture knowledge for efficient LSR training
