Session Outline

Machine learning models at Meta are growing rapidly in scale to support recommendation and content understanding use cases. To keep up with this growth, we have re-architected the entire AI Infrastructure stack, from building specialized hardware with powerful GPUs and network devices to designing optimized distributed training algorithms using PyTorch. In this talk, I will discuss the challenges we encountered and the approach we took to operate this stack in production.

Key Takeaways

  • Challenges in production for ML Training Infrastructure
  • How operating ML Infra differs from standard Infra services
  • How to measure reliability and operate ML Infra at scale


Speaker Bio

Shivam Bharuka – Production Engineer | Meta

Shivam is an engineering leader at Meta, where he has been part of the AI Infrastructure team for the last three years. During this time, he has helped scale Meta's machine learning training infrastructure to support large-scale ranking and recommendation models serving more than a billion users. He is responsible for driving performance, reliability, and efficiency-oriented designs across the components of the ML training stack. Shivam holds a B.S. and an M.S. in Computer Engineering from the University of Illinois at Urbana-Champaign.

November 8, 16:00 to 16:30 (30 minutes)

