Making memory bound kernels go brr on AMD’s MI300X
Published:
Takeaways
- Analyzed a vector add kernel using roofline analysis on an MI300X
- Gathered perf counter data that validated the roofline assumptions, and defined bandwidth utilization as a good figure of merit
- Derived the ridge point mathematically, and showed that the vector add kernel in its current form is memory bound regardless of N
- Delved into the actual internals of MI300X’s memory subsystem
- Messed around with the memory subsystem to show a questionably tiny amount of speedup
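
To make the roofline argument in the takeaways concrete, here is a minimal sketch of the arithmetic in Python. It is not the code from the full post. The peak numbers are assumptions taken from the nominal MI300X datasheet figures (about 163.4 TFLOP/s FP32 vector and about 5.3 TB/s HBM3 bandwidth), the kernel is assumed to be a float32 `c[i] = a[i] + b[i]`, and all function names are hypothetical.

```python
# Roofline sketch for a float32 vector add: c[i] = a[i] + b[i].
# ASSUMPTION: nominal MI300X peaks from the public datasheet, not measured values.
PEAK_FP32_FLOPS = 163.4e12  # FLOP/s, assumed FP32 vector peak
PEAK_HBM_BW = 5.3e12        # bytes/s, assumed peak HBM3 bandwidth

# Per element: 1 add, two 4-byte loads (a, b) and one 4-byte store (c).
FLOPS_PER_ELEM = 1
BYTES_PER_ELEM = 3 * 4


def arithmetic_intensity(flops_per_elem, bytes_per_elem):
    """FLOP per byte moved. N cancels out, so this does not depend on problem size."""
    return flops_per_elem / bytes_per_elem


def ridge_point(peak_flops, peak_bw):
    """Arithmetic intensity at which the compute and memory roofs meet."""
    return peak_flops / peak_bw


def attainable_flops(ai, peak_flops, peak_bw):
    """Classic roofline: the lower of the compute roof and the memory roof."""
    return min(peak_flops, ai * peak_bw)


def bandwidth_utilization(n, elapsed_s, peak_bw, bytes_per_elem=BYTES_PER_ELEM):
    """Achieved bytes/s divided by peak bytes/s, for a measured kernel time."""
    return (n * bytes_per_elem / elapsed_s) / peak_bw


ai = arithmetic_intensity(FLOPS_PER_ELEM, BYTES_PER_ELEM)
ridge = ridge_point(PEAK_FP32_FLOPS, PEAK_HBM_BW)
roof = attainable_flops(ai, PEAK_FP32_FLOPS, PEAK_HBM_BW)

print(f"arithmetic intensity: {ai:.4f} FLOP/byte")
print(f"ridge point:          {ridge:.1f} FLOP/byte")
print(f"attainable:           {roof / 1e12:.2f} TFLOP/s")
print("memory bound" if ai < ridge else "compute bound")
```

Under these assumed peaks the kernel sits at 1/12 FLOP/byte, far to the left of a ridge point of roughly 31 FLOP/byte, and N never enters the ratio. That is why bandwidth utilization, rather than FLOP/s, is the more useful figure of merit here: pass a measured kernel time to `bandwidth_utilization` to see how close a run gets to the memory roof.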
Full version on Substack → Making Memory Bound Kernels go brr on MI300X
