Cartpole: A somewhat deep introduction to RL and value based learning

less than 1 minute read

Published: January 03, 2026

Takeaways

Trained an RL policy model from scratch, with the weights open sourced and released on Hugging Face
Went beyond a surface-level discussion of the training dynamics, with specific analysis of loss curves and rewards in W&B runs
Discussed training failures, and how a target network can be used to prevent training instability
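
As a rough illustration of the target-network idea mentioned above, here is a minimal sketch in plain Python. It assumes a DQN-style setup in which the Bellman target is computed from a frozen copy of the parameters that is refreshed by a periodic hard copy. The names `td_target` and `TargetSync`, the dictionary-of-lists parameter layout, and the sync period of 500 steps are all hypothetical; the post itself may use a different update rule (for example a soft update).

```python
import copy


def td_target(reward, next_q_values, done, gamma=0.99):
    """Bellman target built from the *target* network's Q-values.

    next_q_values: Q(s', a) for every action, as produced by the frozen
    target network rather than the online network being trained.
    """
    if done:
        # Terminal transition: nothing to bootstrap from.
        return reward
    return reward + gamma * max(next_q_values)


class TargetSync:
    """Keeps a frozen parameter copy and refreshes it every `period` steps."""

    def __init__(self, online_params, period=500):
        self.period = period  # hypothetical sync interval
        self.target_params = copy.deepcopy(online_params)
        self.steps = 0

    def step(self, online_params):
        """Call once per training step; returns the current target params."""
        self.steps += 1
        if self.steps % self.period == 0:
            # Hard update: overwrite the target with the online weights.
            self.target_params = copy.deepcopy(online_params)
        return self.target_params
```

Because the target parameters stay fixed between syncs, the regression target stops moving on every gradient step, which is the usual argument for why this reduces instability.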

less than 1 minute read

Published: March 22, 2026

Quantified the training cost under ideal conditions, and how to account for reality
What failures to expect on a 128x MI300X AMD cluster when doing distributed pretraining
How cluster resiliency can address these failures, with a brief look at the work we do with Aegis
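
One way to picture the gap between ideal and realistic training cost is the back-of-the-envelope sketch below. It is not taken from the post: it assumes the common approximation of roughly 6 FLOPs per parameter per token for dense transformer training, and it models "reality" as a single goodput fraction (the share of wall-clock time spent on useful training rather than failures and restarts). The function names and every input value are hypothetical.

```python
def ideal_training_seconds(n_params, n_tokens, n_gpus, peak_flops_per_gpu, mfu):
    """Wall-clock estimate assuming no failures.

    Uses the ~6 * params * tokens FLOP approximation for dense transformers
    (an assumption, not a figure from the post). `mfu` is the fraction of
    peak FLOP/s the training job actually sustains.
    """
    total_flops = 6 * n_params * n_tokens
    return total_flops / (n_gpus * peak_flops_per_gpu * mfu)


def realistic_training_seconds(ideal_seconds, goodput):
    """Stretch the ideal estimate by the fraction of time that is productive.

    goodput = 0.9 means 10% of wall-clock time is lost to failures,
    restarts, and recomputation since the last checkpoint.
    """
    if not 0 < goodput <= 1:
        raise ValueError("goodput must be in (0, 1]")
    return ideal_seconds / goodput
```

Under this model, improving resiliency shows up directly as a higher goodput, which shortens the realistic estimate without changing the ideal one.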

less than 1 minute read

Published: March 10, 2026

Written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes attention memory bound
Written proof of why attention can be compute bound, along with the critical operation that makes attention memory bound
How GQA can help push decode toward the compute-bound regime
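
The decode-stage argument can be sketched numerically. The snippet below is a simplified model, not the post's proof: it counts only the two matrix-vector products per query head (scores and weighted sum, about 4 FLOPs per cached element) and only the bytes needed to read the K and V caches, ignoring the query, output, and softmax. The function name and the head counts in the usage note are hypothetical.

```python
def decode_attention_intensity(n_q_heads, n_kv_heads, seq_len, head_dim,
                               bytes_per_elem=2):
    """Approximate FLOPs per byte for one decode step of attention.

    FLOPs: each query head does q.K^T (2 * seq_len * head_dim) and
    probs.V (2 * seq_len * head_dim).
    Bytes: the K and V caches are each read once per KV head.
    """
    flops = 4 * n_q_heads * seq_len * head_dim
    kv_bytes = 2 * n_kv_heads * seq_len * head_dim * bytes_per_elem
    return flops / kv_bytes
```

In this model the sequence length and head dimension cancel, leaving `2 * n_q_heads / (n_kv_heads * bytes_per_elem)`. With 16-bit caches, standard multi-head attention (for example 32 query heads and 32 KV heads) lands at about 1 FLOP per byte, while sharing each KV head across 4 query heads (32 and 8) raises it to about 4, which is the direction GQA pushes decode.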

less than 1 minute read

Published: February 21, 2026

Analyzed a vector add kernel using roofline analysis on an MI300X
Gathered perf counter data that further validated the roofline assumptions, and defined bandwidth utilization as a good figure of merit
Derived the ridge point mathematically, and demonstrated that the vector add kernel in its current form will be memory bound regardless of N
Delved into the actual internals of the MI300X's memory subsystem
Messed around with the memory subsystem to show a questionably tiny amount of speedup
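
The roofline reasoning above can be reproduced in a few lines. This is a sketch under stated assumptions: the kernel is modeled as `c[i] = a[i] + b[i]` on 4-byte floats (one FLOP and three memory accesses per element), and the peak numbers in the usage note are approximate vendor figures that should be replaced with whatever the post or your own measurements use. The function names are hypothetical.

```python
def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOP/byte) where the two roofs intersect."""
    return peak_flops / peak_bandwidth


def vector_add_intensity(n, bytes_per_elem=4):
    """FLOP/byte for c[i] = a[i] + b[i] over n elements.

    One add per element; two loads and one store per element.
    """
    flops = n
    bytes_moved = 3 * n * bytes_per_elem
    return flops / bytes_moved


def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """A kernel left of the ridge point is limited by memory bandwidth."""
    return intensity < ridge_point(peak_flops, peak_bandwidth)
```

Since `n` cancels in `vector_add_intensity`, the kernel sits at 1/12 FLOP per byte for any problem size. Plugging in assumed peaks on the order of 163e12 FLOP/s (FP32 vector) and 5.3e12 bytes/s puts the ridge point near 31 FLOP per byte, so the kernel stays far to the memory-bound side.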

less than 1 minute read

Published: November 26, 2025

Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
Delved deeper into the mathematics of IEEE-754 and common floating-point formats such as bfloat16, fp32, fp64, and fp8
Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32
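
A small experiment makes the integer-versus-FP32 point concrete. The sketch below uses only the Python standard library to round a value to the nearest FP32 number, relying on the fact that FP32 has a 24-bit significand. The helper names `to_fp32` and `fp32_spacing` are hypothetical, and `fp32_spacing` ignores subnormals.

```python
import math
import struct


def to_fp32(x):
    """Round x to the nearest IEEE-754 binary32 value and return it."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


def fp32_spacing(x):
    """Gap between adjacent FP32 values around a positive normal x.

    frexp gives x = m * 2**e with 0.5 <= m < 1, so x lies in
    [2**(e-1), 2**e) and the 24-bit significand yields steps of 2**(e-24).
    """
    _, e = math.frexp(x)
    return 2.0 ** (e - 24)


def survives_fp32(n):
    """True if the integer n is exactly representable in FP32."""
    return to_fp32(n) == n
```

Every integer up to `2**24` survives the round trip, but `2**24 + 1` does not, and near the top of the unsigned 32-bit range adjacent FP32 values are 256 apart. A `uint32` keeps a spacing of exactly 1 across its whole range, which is why it can be the more precise choice when values such as hashes must be exact.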