Posts by Tags

arithmetic intensity

GQA, and its associated inference tokenomics

less than 1 minute read

Published:

Takeaways

  • Written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes attention memory bound
  • Written proof of why attention can be compute bound, along with the critical operation that makes it compute bound
  • How GQA can help push decode towards the compute bound
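
The bound argument in these takeaways can be sketched numerically. This is a minimal back-of-the-envelope model, not the post's proof: head counts, head dimension, and fp16 KV cache are illustrative assumptions.

```python
# Hedged sketch: arithmetic intensity of attention at decode time, and how
# GQA raises it. All shapes below are illustrative assumptions.

def decode_attention_intensity(n_q_heads, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # FLOPs per decoded token: each query head computes q @ K^T and scores @ V,
    # each ~2 * seq_len * head_dim multiply-adds.
    flops = n_q_heads * 2 * 2 * seq_len * head_dim
    # Bytes per decoded token: the K and V caches for n_kv_heads must be
    # streamed from memory once (the critical, bandwidth-dominated operation).
    bytes_moved = 2 * n_kv_heads * seq_len * head_dim * bytes_per_elem
    return flops / bytes_moved

# MHA: one KV head per query head -> intensity is a small constant (~1 FLOP/byte
# here), far below a GPU's ridge point, independent of sequence length.
mha = decode_attention_intensity(n_q_heads=32, n_kv_heads=32, head_dim=128, seq_len=4096)
# GQA with 8 KV heads: 4 query heads share each KV head -> 4x the intensity,
# pushing decode toward the compute bound.
gqa = decode_attention_intensity(n_q_heads=32, n_kv_heads=8, head_dim=128, seq_len=4096)
```

Note that the intensity is independent of `seq_len`: the KV-cache traffic grows in lockstep with the FLOPs, which is why decode stays memory bound at any context length without KV-head sharing.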

Making memory bound kernels go brr on AMD’s MI300X

less than 1 minute read

Published:

Takeaways

  • Analyzed a vector add kernel on an MI300X using roofline analysis
  • Gathered perf counter data that further validated the roofline assumptions, and defined bandwidth utilization as a good figure of merit
  • Mathematically found the ridge point, and demonstrated that the vector add kernel in its current form will be memory bound regardless of N
  • Delved into the actual internals of the MI300X’s memory subsystem
  • Messed around with the memory subsystem to show a questionably tiny amount of speedup
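
The ridge-point argument can be condensed into a few lines. The peak numbers below are rough illustrative placeholders, not measured MI300X figures from the post:

```python
# Hedged sketch: ridge point vs. the fixed intensity of vector add.
# Peak throughput and bandwidth are assumed round numbers, not vendor specs.

peak_flops = 160e12   # assumed peak FP32 throughput, FLOP/s
peak_bw = 5.3e12      # assumed peak HBM bandwidth, B/s

# Ridge point: the arithmetic intensity (FLOP/byte) at which a kernel
# transitions from memory bound to compute bound on the roofline.
ridge_point = peak_flops / peak_bw

def vector_add_intensity(n, bytes_per_elem=4):
    flops = n                             # one add per element
    bytes_moved = 3 * n * bytes_per_elem  # read a, read b, write c
    return flops / bytes_moved

# The n's cancel: intensity is 1/12 FLOP/byte for fp32, for every N,
# orders of magnitude under the ridge point -> memory bound regardless of N.
```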

asic design

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

less than 1 minute read

Published:

Takeaways

  • 150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.
  • Designed a full System-on-Chip from scratch on Intel 16 nm technology.
  • Overcame major integration challenges on a TileLink-based NoC architecture.
  • Verified the accelerator on real silicon, developing C software to run 2D convolutions.
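
As context for the last takeaway, the accelerator's C test software would be validated against a plain software 2D convolution. Here is a minimal reference version (cross-correlation form with valid padding, single channel, as commonly used in such tests; the actual post's software may differ):

```python
# Hedged sketch: naive single-channel 2D "valid" convolution, the kind of
# golden reference an accelerator's output would be compared against.

def conv2d(image, kernel):
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1  # valid padding: output shrinks
    out = [[0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            out[y][x] = acc
    return out
```

The four nested loops are exactly what a hardware accelerator unrolls and pipelines, which is where simulation speedups on that order come from.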

attention

GQA, and its associated inference tokenomics

less than 1 minute read

Published:

Takeaways

  • Written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes attention memory bound
  • Written proof of why attention can be compute bound, along with the critical operation that makes it compute bound
  • How GQA can help push decode towards the compute bound

bellman_optimality

Cartpole: A somewhat deep introduction to RL and value based learning

less than 1 minute read

Published:

Takeaways

  • Trained an RL policy model from scratch, with the weights open-sourced and released on Hugging Face
  • Went beyond surface-level discussion of the training dynamics, with specific analysis of loss curves and rewards in W&B runs
  • Discussed training failures, and how a target network can be used to prevent training instability
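
The target-network idea in the last takeaway can be sketched without any deep-learning framework. This shows the soft (Polyak-averaging) variant; the post's training may instead use periodic hard copies, and the dict-of-weights "network" here is a stand-in:

```python
# Hedged sketch: a target network slowly tracking the online network, so the
# bootstrap targets in the TD update stay stable instead of chasing themselves.

def soft_update(online, target, tau=0.005):
    # Polyak averaging: target <- tau * online + (1 - tau) * target.
    return {k: tau * online[k] + (1 - tau) * target[k] for k in online}

online = {"w": 1.0}
target = {"w": 0.0}
for _ in range(1000):
    target = soft_update(online, target)
# After many steps the target has crept most of the way toward the online
# weights, but never jumps -- which is what damps training instability.
```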

bfloat16

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32
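
The integer-precision claim is easy to demonstrate directly: FP32 has a 24-bit significand (23 stored bits plus an implicit leading 1), so above 2^24 it can no longer represent every integer, while a uint32 stays exact up to 2^32 − 1. A quick check, independent of the post's proof:

```python
# Hedged sketch: round-trip an integer through IEEE-754 binary32 and see
# whether it survives. Any change means the integer isn't representable.
import struct

def roundtrip_f32(n):
    # Pack as little-endian binary32, then unpack; packing rounds to the
    # nearest representable float (ties to even).
    return struct.unpack("<f", struct.pack("<f", float(n)))[0]

assert roundtrip_f32(2**24) == 2**24          # 16,777,216 is exact
assert roundtrip_f32(2**24 + 1) != 2**24 + 1  # 16,777,217 rounds away
```

For hash values, which are uniformly spread over the full 32-bit range, most values exceed 2^24, so a uint32 preserves them exactly where FP32 silently collapses neighbors together.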

cartpole_v1

Cartpole: A somewhat deep introduction to RL and value based learning

less than 1 minute read

Published:

Takeaways

  • Trained an RL policy model from scratch, with the weights open-sourced and released on Hugging Face
  • Went beyond surface-level discussion of the training dynamics, with specific analysis of loss curves and rewards in W&B runs
  • Discussed training failures, and how a target network can be used to prevent training instability

chipyard

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

less than 1 minute read

Published:

Takeaways

  • 150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.
  • Designed a full System-on-Chip from scratch on Intel 16 nm technology.
  • Overcame major integration challenges on a TileLink-based NoC architecture.
  • Verified the accelerator on real silicon, developing C software to run 2D convolutions.

command-queue

convolution

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

less than 1 minute read

Published:

Takeaways

  • 150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.
  • Designed a full System-on-Chip from scratch on Intel 16 nm technology.
  • Overcame major integration challenges on a TileLink-based NoC architecture.
  • Verified the accelerator on real silicon, developing C software to run 2D convolutions.

ddr

discount_factor

Cartpole: A somewhat deep introduction to RL and value based learning

less than 1 minute read

Published:

Takeaways

  • Trained an RL policy model from scratch, with the weights open-sourced and released on Hugging Face
  • Went beyond surface-level discussion of the training dynamics, with specific analysis of loss curves and rewards in W&B runs
  • Discussed training failures, and how a target network can be used to prevent training instability

fault-tolerance

fp32

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32

fp64

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32

fp8

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32

gddr

geforce_4060

gqa

GQA, and its associated inference tokenomics

less than 1 minute read

Published:

Takeaways

  • Written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes attention memory bound
  • Written proof of why attention can be compute bound, along with the critical operation that makes it compute bound
  • How GQA can help push decode towards the compute bound

hardware acceleration

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

less than 1 minute read

Published:

Takeaways

  • 150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.
  • Designed a full System-on-Chip from scratch on Intel 16 nm technology.
  • Overcame major integration challenges on a TileLink-based NoC architecture.
  • Verified the accelerator on real silicon, developing C software to run 2D convolutions.

hbm

mantissa

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32

memory-bounds

mha

GQA, and its associated inference tokenomics

less than 1 minute read

Published:

Takeaways

  • Written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes attention memory bound
  • Written proof of why attention can be compute bound, along with the critical operation that makes it compute bound
  • How GQA can help push decode towards the compute bound

mi300x

Making memory bound kernels go brr on AMD’s MI300X

less than 1 minute read

Published:

Takeaways

  • Analyzed a vector add kernel on an MI300X using roofline analysis
  • Gathered perf counter data that further validated the roofline assumptions, and defined bandwidth utilization as a good figure of merit
  • Mathematically found the ridge point, and demonstrated that the vector add kernel in its current form will be memory bound regardless of N
  • Delved into the actual internals of the MI300X’s memory subsystem
  • Messed around with the memory subsystem to show a questionably tiny amount of speedup

perf counters

Making memory bound kernels go brr on AMD’s MI300X

less than 1 minute read

Published:

Takeaways

  • Analyzed a vector add kernel on an MI300X using roofline analysis
  • Gathered perf counter data that further validated the roofline assumptions, and defined bandwidth utilization as a good figure of merit
  • Mathematically found the ridge point, and demonstrated that the vector add kernel in its current form will be memory bound regardless of N
  • Delved into the actual internals of the MI300X’s memory subsystem
  • Messed around with the memory subsystem to show a questionably tiny amount of speedup

precision

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32

pretraining

risc-v

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

less than 1 minute read

Published:

Takeaways

  • 150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.
  • Designed a full System-on-Chip from scratch on Intel 16 nm technology.
  • Overcame major integration challenges on a TileLink-based NoC architecture.
  • Verified the accelerator on real silicon, developing C software to run 2D convolutions.

roofline analysis

training-economics

value

Cartpole: A somewhat deep introduction to RL and value based learning

less than 1 minute read

Published:

Takeaways

  • Trained an RL policy model from scratch, with the weights open-sourced and released on Hugging Face
  • Went beyond surface-level discussion of the training dynamics, with specific analysis of loss curves and rewards in W&B runs
  • Discussed training failures, and how a target network can be used to prevent training instability

young-daly-checkpointing