GQA and its associated inference tokenomics

Takeaways

  • A written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes it memory bound
  • A written proof of why attention can be compute bound, along with the critical operation that makes it compute bound
  • How GQA can help push decode towards the compute-bound regime
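The GQA effect in the last takeaway can be sketched with a back-of-envelope calculation (this is my own illustration, not taken from the full post; the function name and shapes are assumptions). During decode, the attention matmuls do roughly `4 · n_q · L · d` FLOPs per token, while the dominant memory traffic is reading the K and V caches, roughly `2 · n_kv · L · d` elements. The ratio shows arithmetic intensity scaling with `n_q / n_kv`, which is exactly the knob GQA turns:

```python
# Back-of-envelope arithmetic intensity of the decode-stage attention
# matmuls, assuming: KV cache length L, head dim d, n_q query heads,
# n_kv key/value heads, fp16 storage (2 bytes per element).

def decode_attention_intensity(n_q: int, n_kv: int, seq_len: int,
                               head_dim: int, bytes_per_elem: int = 2) -> float:
    # FLOPs: Q·K^T and A·V each cost ~2 * n_q * seq_len * head_dim
    flops = 4 * n_q * seq_len * head_dim
    # Bytes: dominated by streaming the K and V caches from HBM
    kv_bytes = 2 * n_kv * seq_len * head_dim * bytes_per_elem
    return flops / kv_bytes

# MHA (n_kv == n_q) vs GQA (n_kv < n_q): intensity scales as n_q / n_kv,
# so grouping KV heads pushes decode towards the compute-bound regime.
mha = decode_attention_intensity(n_q=32, n_kv=32, seq_len=4096, head_dim=128)
gqa = decode_attention_intensity(n_q=32, n_kv=8, seq_len=4096, head_dim=128)
print(mha, gqa)  # → 1.0 4.0
```

With 32 query heads grouped over 8 KV heads, the intensity rises 4x; it is still far below the FLOP/byte ratio of modern accelerators, which is why GQA only *pushes towards* the compute bound rather than reaching it.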

Full version on Substack → Grouped Query Attention and Tokenomics