GPUs in AI: Understanding the design of NVIDIA GPUs from the ground up, with AI compute cluster considerations
Takeaways
- Understand the fundamentals of NVIDIA GPU design, written in the style of an Aleksa Gordic blog post
- Design a kernel, understand its subtle limitations, and prove them by examining the compiled SASS code (an illustrative kernel sketch follows this list)
- Analyze a toy GEMM workload on my 4060 GPU and determine the critical conditions under which it is memory bound versus compute bound
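To make the register-reuse idea concrete, here is a minimal, hypothetical CUDA sketch (not the kernel from the post; the kernel name, tile sizes, and launch shape are illustrative assumptions): each thread accumulates a small TM x TN micro-tile of C in registers, so every operand fetched from global memory is reused several times before being discarded.

```cuda
// Hypothetical register-tiled GEMM sketch (not the post's actual kernel).
// Each thread owns a TM x TN micro-tile of C and accumulates it in registers,
// so every value fetched from global memory is reused TM or TN times.
#include <cuda_runtime.h>

constexpr int TM = 4;  // rows of C computed per thread
constexpr int TN = 4;  // columns of C computed per thread

__global__ void gemm_register_tile(const float* A, const float* B, float* C,
                                   int M, int N, int K) {
    // Top-left corner of this thread's micro-tile of C (row-major matrices).
    int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * TM;
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * TN;

    float acc[TM][TN] = {};        // accumulators live in registers
    float a_reg[TM], b_reg[TN];    // staged operands, also in registers

    for (int k = 0; k < K; ++k) {
        for (int i = 0; i < TM; ++i)   // one column strip of A
            a_reg[i] = (row0 + i < M) ? A[(row0 + i) * K + k] : 0.0f;
        for (int j = 0; j < TN; ++j)   // one row strip of B
            b_reg[j] = (col0 + j < N) ? B[k * N + col0 + j] : 0.0f;
        for (int i = 0; i < TM; ++i)   // TM*TN FMAs per TM+TN global loads
            for (int j = 0; j < TN; ++j)
                acc[i][j] += a_reg[i] * b_reg[j];
    }
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            if (row0 + i < M && col0 + j < N)
                C[(row0 + i) * N + col0 + j] = acc[i][j];
}

// Example launch: a 16x16 thread block covers a 64x64 tile of C.
// dim3 block(16, 16);
// dim3 grid((N + 16 * TN - 1) / (16 * TN), (M + 16 * TM - 1) / (16 * TM));
// gemm_register_tile<<<grid, block>>>(dA, dB, dC, M, N, K);
```

With TM = TN = 4, each k-step performs 16 FMAs for 8 global loads, about four times the FLOPs per byte of a one-output-per-thread kernel; inspecting the compiled SASS is how you confirm the accumulators actually stay in registers rather than spilling to local memory.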
Full version on Substack → Curb your memory hierarchy
Summary
First, we walk through how the design of a GPU even came about, via an abstract analogy about doing dirty laundry for an entire city. In the process, we build up the skeleton of a GPU, leaning into the details an AI performance engineer (i.e., someone who writes CUDA kernels) would know, while also dipping into microarchitectural details ubiquitous across NVIDIA GPUs. We then discuss relevant examples of hardware failures that commonly occur during training and inference (drawn from experience in language modeling), and consider a toy memory-bound GEMM problem, which we optimize through more efficient register use. We then analyze GEMM itself using analytical equations and roofline principles, determining the critical arithmetic intensity at which my NVIDIA 4060 becomes compute bound. Finally, a comprehensive appendix exposes the underbelly of commonly neglected hardware: HBM vs. GDDR, where Direct Memory Access (DMA) engines are used and how they can contribute to memory-bound behavior, and a discussion of warps and utilization.
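As a back-of-the-envelope sketch of that roofline argument (using nominal spec-sheet figures for an RTX 4060 of roughly 15 TFLOP/s peak FP32 and 272 GB/s of GDDR6 bandwidth; these are assumed numbers, not the measurements from the full post), the critical arithmetic intensity and the intensity of an FP32 GEMM whose operands are each read from DRAM exactly once are:

$$
I_{\text{crit}} = \frac{P_{\text{peak}}}{B_{\text{mem}}} \approx \frac{15 \times 10^{12}\ \text{FLOP/s}}{272 \times 10^{9}\ \text{B/s}} \approx 55\ \text{FLOP/B},
\qquad
I_{\text{GEMM}} = \frac{2MNK}{4\,(MK + KN + MN)}\ \text{FLOP/B}.
$$

A GEMM is compute bound once $I_{\text{GEMM}} \gtrsim I_{\text{crit}}$; for square matrices this reduces to $N/6 \gtrsim 55$, i.e. roughly $N \gtrsim 330$ under these assumptions.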
