Blog posts

2026

GQA, and its associated inference tokenomics

Takeaways

  • Written proof that attention’s arithmetic intensity leaves it memory bound in the decode stage, along with the critical operation that makes it memory bound
  • Written proof of why attention can be compute bound, along with the critical operation that makes it so
  • How GQA can help push decode toward the compute-bound regime
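
The decode-stage intensity argument can be sketched back-of-the-envelope (a simplified model that counts only KV-cache traffic; the head counts and fp16 byte width below are illustrative assumptions, not figures from the post):

```python
def decode_attention_intensity(n_q_heads, n_kv_heads, head_dim, seq_len,
                               bytes_per_elem=2):
    """Approximate FLOP/byte of one decode step of attention."""
    # FLOPs: QK^T and attention-weighted V, each ~2 multiply-add
    # FLOPs per (q_head, head_dim, seq_len) element -> 4x total.
    flops = 4 * n_q_heads * head_dim * seq_len
    # Dominant memory traffic: streaming the K and V caches from HBM.
    kv_bytes = 2 * n_kv_heads * head_dim * seq_len * bytes_per_elem
    return flops / kv_bytes

# MHA (n_q == n_kv): intensity stays O(1) FLOP/byte -> memory bound.
# GQA shrinks the KV cache, multiplying intensity by the group size.
print(decode_attention_intensity(32, 32, 128, 4096))  # MHA
print(decode_attention_intensity(32, 8, 128, 4096))   # GQA, group size 4
```

Note that intensity is independent of `seq_len` and `head_dim` here; only the ratio of query heads to KV heads moves the needle, which is the lever GQA pulls.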

Making memory bound kernels go brr on AMD’s MI300X

Takeaways

  • Analyzed a vector-add kernel on an MI300X using roofline analysis
  • Gathered perf-counter data that further validated the roofline assumptions, and defined bandwidth utilization as a good figure of merit
  • Mathematically found the ridge point, and demonstrated that the vector-add kernel in its current form will be memory bound regardless of N
  • Delved into the actual internals of the MI300X’s memory subsystem
  • Messed around with the memory subsystem to squeeze out a questionably tiny amount of speedup
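
The ridge-point argument can be sketched numerically (the peak figures below are published MI300X numbers, roughly 163.4 TFLOP/s FP32 vector and 5.3 TB/s HBM3 bandwidth, assumed here rather than taken from the post):

```python
# Assumed MI300X peaks (published figures, not measured in the post).
PEAK_FLOPS = 163.4e12  # FP32 vector, FLOP/s
PEAK_BW = 5.3e12       # HBM3 bandwidth, byte/s

def ridge_point():
    # Arithmetic intensity (FLOP/byte) at which a kernel crosses
    # from the memory-bound to the compute-bound side of the roofline.
    return PEAK_FLOPS / PEAK_BW

def vector_add_intensity(bytes_per_elem=4):
    # c[i] = a[i] + b[i]: 1 FLOP against 2 loads + 1 store per element,
    # independent of N -- the intensity never changes with problem size.
    return 1.0 / (3 * bytes_per_elem)

print(ridge_point())           # ~30.8 FLOP/byte
print(vector_add_intensity())  # ~0.083 FLOP/byte -- deep in memory-bound land
```

Because vector add’s intensity is a constant far below the ridge point, no choice of N can make it compute bound, matching the takeaway above.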

Cartpole: A somewhat deep introduction to RL and value-based learning

Takeaways

  • Trained an RL policy model from scratch, with the weights open-sourced and released on Hugging Face
  • Beyond-surface-level discussion of the training dynamics, with specific analysis of loss curves and rewards in W&B runs
  • Discussion of training failures, and how a target network can prevent training instability
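
The target-network fix can be sketched as follows (a minimal hypothetical wrapper, not the post’s actual training code):

```python
import copy

class TargetNetwork:
    """Sketch of the target-network trick from value-based RL:
    the bootstrap target comes from a frozen copy of the online
    network, re-synced only every `sync_every` steps."""

    def __init__(self, online_net, sync_every=1000):
        self.online = online_net
        self.target = copy.deepcopy(online_net)  # frozen copy
        self.sync_every = sync_every
        self.steps = 0

    def td_target(self, reward, next_state, gamma=0.99):
        # Using self.target (not self.online) breaks the feedback loop
        # where the network chases its own moving predictions --
        # a common source of instability in Q-learning with function
        # approximation.
        return reward + gamma * max(self.target(next_state))

    def step(self):
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target = copy.deepcopy(self.online)
```

The key design choice is that `td_target` never sees the freshest weights; the slight staleness is the price paid for a stable regression target.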

2025

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating-point representations
  • Delved deeper into the mathematics of IEEE-754 and common implementations of the standard, such as bfloat16, FP32, FP64, and FP8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof characterizing the cases in which a 32-bit integer will be more precise than floating-point schemes like FP32
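
The uint32-vs-FP32 claim can be demonstrated in a few lines (a standalone sketch, not the post’s proof):

```python
import struct

def fp32_round(n):
    # Round-trip a value through IEEE-754 binary32.
    return struct.unpack('f', struct.pack('f', float(n)))[0]

# binary32 has a 24-bit significand (23 stored + 1 implicit), so every
# integer up to 2**24 survives the round trip exactly...
print(fp32_round(2**24) == 2**24)      # True
# ...but just past it, consecutive integers start to collide:
print(fp32_round(2**24 + 1) == 2**24)  # True -- 16777217 rounds to 16777216
# An unsigned 32-bit integer represents every value below 2**32 exactly,
# which is why it can beat FP32 when hash values must survive bit-for-bit.
```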

2024

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

Takeaways

  • 150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.
  • Designed a full System-on-Chip from scratch on Intel 16 nm technology.
  • Overcame major integration challenges on a TileLink-based NoC architecture.
  • Verified the accelerator on real silicon, developing C software to run 2D convolutions.
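
The accelerated operation itself can be sketched as a plain-Python golden model (a reference for checking outputs, not the accelerator’s RTL or the post’s C software; like most ML “convolutions” it is technically a cross-correlation):

```python
def conv2d(image, kernel):
    """Naive valid-mode 2D convolution (no kernel flip) -- the kind of
    golden model typically used to check accelerator output against
    a trusted software result."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1  # valid-mode output shape
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0
            for ki in range(kh):
                for kj in range(kw):
                    acc += image[i + ki][j + kj] * kernel[ki][kj]
            out[i][j] = acc
    return out
```

A model like this is slow on purpose: its value in accelerator bring-up is being obviously correct, so any mismatch points at the hardware, not the reference.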