Fault-tolerant cluster-resiliency systems for distributed LLM pretraining on heterogeneous hardware
Published:
Takeaways
- How to quantify the training cost under ideal conditions, and how to adjust it for real-world conditions
- Which failures to expect on a 128x MI300X AMD cluster during distributed pretraining
- How cluster resiliency can address these failures, and a brief look at the work we do with Aegis
Full version on Substack → Pretraining Economics for Distributed Multi-Node LLM Training, and a justification for Cluster Resiliency
