GQA and its associated inference tokenomics
Takeaways
- A written proof that the arithmetic intensity of attention leaves it memory bound in the decode stage, along with the critical operation that makes it memory bound
- A written proof of why attention can be compute bound, along with the critical operation that makes it compute bound
- How GQA can help push decode towards the compute bound
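The decode-stage intuition behind these takeaways can be sketched with back-of-the-envelope arithmetic intensity: each decode step streams the entire KV cache from memory while doing comparatively few FLOPs, and shrinking the number of KV heads (as GQA does) raises FLOPs per byte. The config numbers below are illustrative assumptions, not figures from the post, and the sketch ignores softmax FLOPs and weight traffic:

```python
# Back-of-the-envelope arithmetic intensity of attention at decode time.
# Assumes an fp16 KV cache; head counts and dims are toy examples.

def decode_attention_intensity(n_q_heads: int, n_kv_heads: int,
                               d_head: int, seq_len: int,
                               bytes_per_elem: int = 2) -> float:
    """FLOPs per byte of KV-cache traffic for one decode step."""
    # Q @ K^T and attn @ V: each costs ~2 * n_q_heads * d_head * seq_len FLOPs
    flops = 2 * 2 * n_q_heads * d_head * seq_len
    # Every decode step must read the full K and V caches from memory
    kv_bytes = 2 * seq_len * n_kv_heads * d_head * bytes_per_elem
    return flops / kv_bytes

# Multi-head attention: one KV head per query head
mha = decode_attention_intensity(n_q_heads=64, n_kv_heads=64,
                                 d_head=128, seq_len=4096)
# Grouped-query attention: 8 query heads share each KV head
gqa = decode_attention_intensity(n_q_heads=64, n_kv_heads=8,
                                 d_head=128, seq_len=4096)
print(mha, gqa)  # → 1.0 8.0
```

Note the sequence length and head dimension cancel: with an fp16 cache the intensity reduces to `n_q_heads / n_kv_heads`, so GQA raises decode's arithmetic intensity by exactly the group size, pushing it toward the compute bound.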
Full version on Substack → Grouped Query Attention and Tokenomics
