GPUs in AI: Understanding the design of NVIDIA GPUs from the ground up, with AI compute cluster considerations
Takeaways
- Understand the fundamentals of NVIDIA GPU design, written in the style of an Aleksa Gordic blog post
- Design a kernel, understand its subtle limitations, and prove them by examining the compiled SASS code (an illustrative kernel sketch follows this list)
- Analyze a toy GEMM workload on my 4060 GPU and determine the critical conditions under which it is memory bound versus compute bound
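To make the register-reuse idea concrete, here is a minimal, hypothetical CUDA sketch (not the kernel from the post; the kernel name, tile sizes, and launch shape are illustrative assumptions): each thread accumulates a small TM x TN micro-tile of C in registers, so every operand fetched from global memory is reused several times before being discarded.

```cuda
// Hypothetical register-tiled GEMM sketch (not the post's actual kernel).
// Each thread owns a TM x TN micro-tile of C and accumulates it in registers,
// so every value fetched from global memory is reused TM or TN times.
#include <cuda_runtime.h>

constexpr int TM = 4;  // rows of C computed per thread
constexpr int TN = 4;  // columns of C computed per thread

__global__ void gemm_register_tile(const float* A, const float* B, float* C,
                                   int M, int N, int K) {
    // Top-left corner of this thread's micro-tile of C (row-major matrices).
    int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * TM;
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * TN;

    float acc[TM][TN] = {};        // accumulators live in registers
    float a_reg[TM], b_reg[TN];    // staged operands, also in registers

    for (int k = 0; k < K; ++k) {
        for (int i = 0; i < TM; ++i)   // one column strip of A
            a_reg[i] = (row0 + i < M) ? A[(row0 + i) * K + k] : 0.0f;
        for (int j = 0; j < TN; ++j)   // one row strip of B
            b_reg[j] = (col0 + j < N) ? B[k * N + col0 + j] : 0.0f;
        for (int i = 0; i < TM; ++i)   // TM*TN FMAs per TM+TN global loads
            for (int j = 0; j < TN; ++j)
                acc[i][j] += a_reg[i] * b_reg[j];
    }
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            if (row0 + i < M && col0 + j < N)
                C[(row0 + i) * N + col0 + j] = acc[i][j];
}

// Example launch: a 16x16 thread block covers a 64x64 tile of C.
// dim3 block(16, 16);
// dim3 grid((N + 16 * TN - 1) / (16 * TN), (M + 16 * TM - 1) / (16 * TM));
// gemm_register_tile<<<grid, block>>>(dA, dB, dC, M, N, K);
```

With TM = TN = 4, each k-step performs 16 FMAs for 8 global loads, about four times the FLOPs per byte of a one-output-per-thread kernel; inspecting the compiled SASS is how you confirm the accumulators actually stay in registers rather than spilling to local memory.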
Full version on Substack → Curb your memory hierarchy
Summary
First, we walk through how the design of a GPU even came about, via an abstract analogy about doing dirty laundry for an entire city. In the process, we build up the skeleton of a GPU, leaning into the details an AI performance engineer (i.e., someone who writes CUDA kernels) would know, while also dipping into microarchitectural details ubiquitous across NVIDIA GPUs. We then discuss relevant examples of hardware failures that commonly occur during training and inference (drawn from experience in language modeling), and consider a toy memory-bound GEMM problem, which we optimize through more efficient register use. We then analyze GEMM itself using analytical equations and roofline principles, determining the critical arithmetic intensity at which my NVIDIA 4060 becomes compute bound. Finally, a comprehensive appendix exposes the underbelly of commonly neglected hardware: HBM vs. GDDR, where Direct Memory Access (DMA) engines are used and how they can contribute to memory-bound behavior, and a discussion of warps and utilization.
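As a back-of-the-envelope sketch of that roofline argument (using nominal spec-sheet figures for an RTX 4060 of roughly 15 TFLOP/s peak FP32 and 272 GB/s of GDDR6 bandwidth; these are assumed numbers, not the measurements from the full post), the critical arithmetic intensity and the intensity of an FP32 GEMM whose operands are each read from DRAM exactly once are:

$$
I_{\text{crit}} = \frac{P_{\text{peak}}}{B_{\text{mem}}} \approx \frac{15 \times 10^{12}\ \text{FLOP/s}}{272 \times 10^{9}\ \text{B/s}} \approx 55\ \text{FLOP/B},
\qquad
I_{\text{GEMM}} = \frac{2MNK}{4\,(MK + KN + MN)}\ \text{FLOP/B}.
$$

A GEMM is compute bound once $I_{\text{GEMM}} \gtrsim I_{\text{crit}}$; for square matrices this reduces to $N/6 \gtrsim 55$, i.e. roughly $N \gtrsim 330$ under these assumptions.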
