Kernels, and a cheeky IEEE-754 proof with somewhat practical debugging value


Published: November 26, 2025

Takeaways

Designed a kernel that ran into a bug. Understanding and debugging the issue teaches you a lot about the limitations of floating-point representations
Delved deeper into the mathematics of IEEE-754 and of common floating-point formats: fp32 and fp64, which the standard defines, and bfloat16 and fp8, which reuse its sign/exponent/mantissa layout
Explained why an unsigned 32-bit integer may be more precise than FP32 in workloads such as hashing
Concluded with a mathematical proof that characterizes the cases in which a 32-bit integer is more precise than floating-point schemes like FP32
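
To make the uint32-versus-FP32 point concrete, here is a minimal sketch (not taken from the full post) of what happens just past 2^24. FP32 has a 24-bit significand, so consecutive integers above 2^24 can no longer all be told apart, while a 32-bit unsigned integer still can. The helper name to_fp32 is made up for illustration; it only uses Python's standard struct module to round a value to FP32.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

a = 2**24        # 16777216, the last point where fp32 still resolves every integer
b = 2**24 + 1    # 16777217, needs 25 significant bits

fp32_a = to_fp32(float(a))
fp32_b = to_fp32(float(b))

# In fp32 the two distinct inputs collapse onto the same value.
collide_in_fp32 = fp32_a == fp32_b

# As 32-bit unsigned integers they stay distinct (masking to 32 bits changes nothing).
distinct_in_u32 = (a & 0xFFFFFFFF) != (b & 0xFFFFFFFF)

print(f"fp32({a}) = {fp32_a:.1f}")
print(f"fp32({b}) = {fp32_b:.1f}")
print("collide in fp32:", collide_in_fp32)
print("distinct in uint32:", distinct_in_u32)
```

For a hash, this kind of collapse means two different keys can silently become the same value once they pass through an FP32 variable.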

Full version on Substack → Curb your memory hierarchy

Summary

Starting out in kernels and facing inexplicable bugs? Yeah, that was once me. Check out this post for a walkthrough of concepts that are usually seen as obscure in the high-level programming world. We finish with a proof that has practical debugging implications: it helps you weed out integer/floating-point precision bugs.
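
As a rough debugging aid in the same spirit, the sketch below checks whether a given non-negative 32-bit integer survives a trip through FP32 unchanged. It relies on a standard fact rather than on the proof in the full post: within the uint32 range, an integer is exactly representable in FP32 when, after stripping trailing zero bits, what remains fits in 24 bits. The names fits_fp32_exactly and to_fp32 are hypothetical helpers for illustration.

```python
import struct

FP32_SIGNIFICAND_BITS = 24  # 23 stored mantissa bits plus the implicit leading 1

def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def fits_fp32_exactly(n: int) -> bool:
    """True if the uint32 value n is exactly representable in fp32."""
    if not 0 <= n <= 0xFFFFFFFF:
        raise ValueError("expected a value in the uint32 range")
    if n == 0:
        return True
    trailing_zeros = (n & -n).bit_length() - 1
    odd_part = n >> trailing_zeros
    return odd_part.bit_length() <= FP32_SIGNIFICAND_BITS

# Cross-check the bit-level rule against an actual fp32 round trip.
samples = [0, 1, 2**24, 2**24 + 1, 2**24 + 2, 0xFFFFFF00, 0xFFFFFFFF]
for n in samples:
    round_trips = int(to_fp32(float(n))) == n
    print(f"{n:>10}  rule={fits_fp32_exactly(n)}  round_trip={round_trips}")
```

If a kernel stores hashes or indices in FP32 and a value fails this check, the stored value is not the one you computed, which is a quick way to confirm or rule out this class of bug.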


Ansh Chaurasia

You May Also Enjoy

Cartpole: A somewhat deep introduction to RL and value based learning

Takeaways

Trained an RL policy model from scratch, with the weights open-sourced and released on Hugging Face

Went beyond a surface-level discussion of the training dynamics, with specific analysis of loss curves and rewards in W&B runs

Discussed training failures, and how a target network can prevent training instability

GPUs in AI: Understanding the design of NVIDIA GPUs from the ground up, with AI compute cluster considerations

Speeding Towards Silicon: Building a RISC-V Convolution Accelerator

Takeaways

150× speedup for RISC-V convolutions in simulation, and 65× speedup post-tapeout.

Designed a full System-on-Chip from scratch on Intel 16 nm technology.

Overcame major integration challenges on a TileLink-based NoC architecture.

Verified the accelerator on real silicon, developing C software to run 2D convolutions.