Posts by Tags

arithmetic intensity

GQA, and its associated inference tokenomics

less than 1 minute read

Published:

Takeaways

  • Written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes attention memory bound
  • Written proof of why attention can be compute bound, along with the critical operation that makes it compute bound
  • How GQA can help push decode towards the compute bound
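
The bound argument in these takeaways can be sketched numerically. This is a minimal back-of-the-envelope model, not the post's proof: head counts, head dimension, and fp16 KV cache are illustrative assumptions.

```python
# Hedged sketch: arithmetic intensity of attention at decode time, and how
# GQA raises it. All shapes below are illustrative assumptions.

def decode_attention_intensity(n_q_heads, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # FLOPs per decoded token: each query head computes q @ K^T and scores @ V,
    # each ~2 * seq_len * head_dim multiply-adds.
    flops = n_q_heads * 2 * 2 * seq_len * head_dim
    # Bytes per decoded token: the K and V caches for n_kv_heads must be
    # streamed from memory once (the critical, bandwidth-dominated operation).
    bytes_moved = 2 * n_kv_heads * seq_len * head_dim * bytes_per_elem
    return flops / bytes_moved

# MHA: one KV head per query head -> intensity is a small constant (~1 FLOP/byte
# here), far below a GPU's ridge point, independent of sequence length.
mha = decode_attention_intensity(n_q_heads=32, n_kv_heads=32, head_dim=128, seq_len=4096)
# GQA with 8 KV heads: 4 query heads share each KV head -> 4x the intensity,
# pushing decode toward the compute bound.
gqa = decode_attention_intensity(n_q_heads=32, n_kv_heads=8, head_dim=128, seq_len=4096)
```

Note that the intensity is independent of `seq_len`: the KV-cache traffic grows in lockstep with the FLOPs, which is why decode stays memory bound at any context length without KV-head sharing.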

Making memory bound kernels go brr on AMD’s MI300X

less than 1 minute read

Published:

Takeaways

  • Analyzed a vector add kernel on an MI300X using roofline analysis
  • Gathered perf counter data that further validated the roofline assumptions, and defined bandwidth utilization as a good figure of merit
  • Mathematically found the ridge point, and demonstrated that the vector add kernel in its current form will be memory bound regardless of N
  • Delved into the actual internals of the MI300X’s memory subsystem
  • Messed around with the memory subsystem to show a questionably tiny amount of speedup
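
The ridge-point argument can be condensed into a few lines. The peak numbers below are rough illustrative placeholders, not measured MI300X figures from the post:

```python
# Hedged sketch: ridge point vs. the fixed intensity of vector add.
# Peak throughput and bandwidth are assumed round numbers, not vendor specs.

peak_flops = 160e12   # assumed peak FP32 throughput, FLOP/s
peak_bw = 5.3e12      # assumed peak HBM bandwidth, B/s

# Ridge point: the arithmetic intensity (FLOP/byte) at which a kernel
# transitions from memory bound to compute bound on the roofline.
ridge_point = peak_flops / peak_bw

def vector_add_intensity(n, bytes_per_elem=4):
    flops = n                             # one add per element
    bytes_moved = 3 * n * bytes_per_elem  # read a, read b, write c
    return flops / bytes_moved

# The n's cancel: intensity is 1/12 FLOP/byte for fp32, for every N,
# orders of magnitude under the ridge point -> memory bound regardless of N.
```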

asic design

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

less than 1 minute read

Published:

Takeaways

  • 150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.
  • Designed a full System-on-Chip from scratch on Intel 16 nm technology.
  • Overcame major integration challenges on a TileLink-based NoC architecture.
  • Verified the accelerator on real silicon, developing C software to run 2D convolutions.
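
As context for the last takeaway, the accelerator's C test software would be validated against a plain software 2D convolution. Here is a minimal reference version (cross-correlation form with valid padding, single channel, as commonly used in such tests; the actual post's software may differ):

```python
# Hedged sketch: naive single-channel 2D "valid" convolution, the kind of
# golden reference an accelerator's output would be compared against.

def conv2d(image, kernel):
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1  # valid padding: output shrinks
    out = [[0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            out[y][x] = acc
    return out
```

The four nested loops are exactly what a hardware accelerator unrolls and pipelines, which is where simulation speedups on that order come from.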

attention

GQA, and its associated inference tokenomics

less than 1 minute read

Published:

Takeaways

  • Written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes attention memory bound
  • Written proof of why attention can be compute bound, along with the critical operation that makes it compute bound
  • How GQA can help push decode towards the compute bound

bellman_optimality

Cartpole: A somewhat deep introduction to RL and value based learning

less than 1 minute read

Published:

Takeaways

  • Trained an RL policy model from scratch, with the weights open-sourced and released on Hugging Face
  • Went beyond surface-level discussion of the training dynamics, with specific analysis of loss curves and rewards in W&B runs
  • Discussed training failures, and how a target network can be used to prevent training instability
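
The target-network idea in the last takeaway can be sketched without any deep-learning framework. This shows the soft (Polyak-averaging) variant; the post's training may instead use periodic hard copies, and the dict-of-weights "network" here is a stand-in:

```python
# Hedged sketch: a target network slowly tracking the online network, so the
# bootstrap targets in the TD update stay stable instead of chasing themselves.

def soft_update(online, target, tau=0.005):
    # Polyak averaging: target <- tau * online + (1 - tau) * target.
    return {k: tau * online[k] + (1 - tau) * target[k] for k in online}

online = {"w": 1.0}
target = {"w": 0.0}
for _ in range(1000):
    target = soft_update(online, target)
# After many steps the target has crept most of the way toward the online
# weights, but never jumps -- which is what damps training instability.
```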

bfloat16

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32
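
The integer-precision claim is easy to demonstrate directly: FP32 has a 24-bit significand (23 stored bits plus an implicit leading 1), so above 2^24 it can no longer represent every integer, while a uint32 stays exact up to 2^32 − 1. A quick check, independent of the post's proof:

```python
# Hedged sketch: round-trip an integer through IEEE-754 binary32 and see
# whether it survives. Any change means the integer isn't representable.
import struct

def roundtrip_f32(n):
    # Pack as little-endian binary32, then unpack; packing rounds to the
    # nearest representable float (ties to even).
    return struct.unpack("<f", struct.pack("<f", float(n)))[0]

assert roundtrip_f32(2**24) == 2**24          # 16,777,216 is exact
assert roundtrip_f32(2**24 + 1) != 2**24 + 1  # 16,777,217 rounds away
```

For hash values, which are uniformly spread over the full 32-bit range, most values exceed 2^24, so a uint32 preserves them exactly where FP32 silently collapses neighbors together.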

cartpole_v1

Cartpole: A somewhat deep introduction to RL and value based learning

less than 1 minute read

Published:

Takeaways

  • Trained an RL policy model from scratch, with the weights open-sourced and released on Hugging Face
  • Went beyond surface-level discussion of the training dynamics, with specific analysis of loss curves and rewards in W&B runs
  • Discussed training failures, and how a target network can be used to prevent training instability

chipyard

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

less than 1 minute read

Published:

Takeaways

  • 150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.
  • Designed a full System-on-Chip from scratch on Intel 16 nm technology.
  • Overcame major integration challenges on a TileLink-based NoC architecture.
  • Verified the accelerator on real silicon, developing C software to run 2D convolutions.

command-queue

convolution

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

less than 1 minute read

Published:

Takeaways

  • 150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.
  • Designed a full System-on-Chip from scratch on Intel 16 nm technology.
  • Overcame major integration challenges on a TileLink-based NoC architecture.
  • Verified the accelerator on real silicon, developing C software to run 2D convolutions.

ddr

discount_factor

Cartpole: A somewhat deep introduction to RL and value based learning

less than 1 minute read

Published:

Takeaways

  • Trained an RL policy model from scratch, with the weights open-sourced and released on Hugging Face
  • Went beyond surface-level discussion of the training dynamics, with specific analysis of loss curves and rewards in W&B runs
  • Discussed training failures, and how a target network can be used to prevent training instability

fault-tolerance

fp32

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32

fp64

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32

fp8

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32

gddr

geforce_4060

gqa

GQA, and its associated inference tokenomics

less than 1 minute read

Published:

Takeaways

  • Written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes attention memory bound
  • Written proof of why attention can be compute bound, along with the critical operation that makes it compute bound
  • How GQA can help push decode towards the compute bound

hardware acceleration

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

less than 1 minute read

Published:

Takeaways

  • 150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.
  • Designed a full System-on-Chip from scratch on Intel 16 nm technology.
  • Overcame major integration challenges on a TileLink-based NoC architecture.
  • Verified the accelerator on real silicon, developing C software to run 2D convolutions.

hbm

mantissa

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32

memory-bounds

mha

GQA, and its associated inference tokenomics

less than 1 minute read

Published:

Takeaways

  • Written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes attention memory bound
  • Written proof of why attention can be compute bound, along with the critical operation that makes it compute bound
  • How GQA can help push decode towards the compute bound

mi300x

Making memory bound kernels go brr on AMD’s MI300X

less than 1 minute read

Published:

Takeaways

  • Analyzed a vector add kernel on an MI300X using roofline analysis
  • Gathered perf counter data that further validated the roofline assumptions, and defined bandwidth utilization as a good figure of merit
  • Mathematically found the ridge point, and demonstrated that the vector add kernel in its current form will be memory bound regardless of N
  • Delved into the actual internals of the MI300X’s memory subsystem
  • Messed around with the memory subsystem to show a questionably tiny amount of speedup

perf counters

Making memory bound kernels go brr on AMD’s MI300X

less than 1 minute read

Published:

Takeaways

  • Analyzed a vector add kernel on an MI300X using roofline analysis
  • Gathered perf counter data that further validated the roofline assumptions, and defined bandwidth utilization as a good figure of merit
  • Mathematically found the ridge point, and demonstrated that the vector add kernel in its current form will be memory bound regardless of N
  • Delved into the actual internals of the MI300X’s memory subsystem
  • Messed around with the memory subsystem to show a questionably tiny amount of speedup

precision

Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value

less than 1 minute read

Published:

Takeaways

  • Designed a kernel that runs into a bug; understanding and debugging the issue teaches you a lot about the limitations of floating point representations
  • Delved deeper into the mathematics of IEEE-754, and common implementations of the standard such as bfloat16, fp32, fp64, and fp8
  • Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
  • Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer will be more precise than floating point schemes like FP32

pretraining

risc-v

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

less than 1 minute read

Published:

Takeaways

  • 150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.
  • Designed a full System-on-Chip from scratch on Intel 16 nm technology.
  • Overcame major integration challenges on a TileLink-based NoC architecture.
  • Verified the accelerator on real silicon, developing C software to run 2D convolutions.

roofline analysis

training-economics

value

Cartpole: A somewhat deep introduction to RL and value based learning

less than 1 minute read

Published:

Takeaways

  • Trained an RL policy model from scratch, with the weights open-sourced and released on Hugging Face
  • Went beyond surface-level discussion of the training dynamics, with specific analysis of loss curves and rewards in W&B runs
  • Discussed training failures, and how a target network can be used to prevent training instability

young-daly-checkpointing