GQA and its associated inference tokenomics

Takeaways

  • A written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes it memory bound
  • A written proof of why attention can be compute bound, along with the critical operation that makes it compute bound
  • How GQA can help push decode towards the compute-bound regime
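The GQA effect in the last takeaway can be sketched with a back-of-envelope calculation (this is my own illustration, not taken from the full post; the function name and shapes are assumptions). During decode, the attention matmuls do roughly `4 · n_q · L · d` FLOPs per token, while the dominant memory traffic is reading the K and V caches, roughly `2 · n_kv · L · d` elements. The ratio shows arithmetic intensity scaling with `n_q / n_kv`, which is exactly the knob GQA turns:

```python
# Back-of-envelope arithmetic intensity of the decode-stage attention
# matmuls, assuming: KV cache length L, head dim d, n_q query heads,
# n_kv key/value heads, fp16 storage (2 bytes per element).

def decode_attention_intensity(n_q: int, n_kv: int, seq_len: int,
                               head_dim: int, bytes_per_elem: int = 2) -> float:
    # FLOPs: Q·K^T and A·V each cost ~2 * n_q * seq_len * head_dim
    flops = 4 * n_q * seq_len * head_dim
    # Bytes: dominated by streaming the K and V caches from HBM
    kv_bytes = 2 * n_kv * seq_len * head_dim * bytes_per_elem
    return flops / kv_bytes

# MHA (n_kv == n_q) vs GQA (n_kv < n_q): intensity scales as n_q / n_kv,
# so grouping KV heads pushes decode towards the compute-bound regime.
mha = decode_attention_intensity(n_q=32, n_kv=32, seq_len=4096, head_dim=128)
gqa = decode_attention_intensity(n_q=32, n_kv=8, seq_len=4096, head_dim=128)
print(mha, gqa)  # → 1.0 4.0
```

With 32 query heads grouped over 8 KV heads, the intensity rises 4x; it is still far below the FLOP/byte ratio of modern accelerators, which is why GQA only *pushes towards* the compute bound rather than reaching it.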

Full version on Substack → Grouped Query Attention and Tokenomics