ML Training in Production at Meta

Session Outline

Machine learning models are growing rapidly in scale to support the recommendation and content understanding use cases at Meta. To keep up with this growth, we have re-architected the entire AI Infrastructure stack, from building specialized hardware with powerful GPUs and network devices to designing optimized distributed training algorithms using PyTorch. In this talk, I will discuss the challenges we encountered and the approach we took to operate this stack in production.

Key Takeaways

Challenges in production for ML Training Infrastructure
How operating ML Infra differs from standard Infra services
How to measure reliability and operate ML Infra at scale

————————————————————————————————————————————————————

Speaker Bio

Shivam Bharuka – Production Engineer | Meta

Shivam is an engineering leader at Meta, where he has been part of the AI Infrastructure team for the last three years. During this time, he has helped scale Meta's machine learning training infrastructure to support large-scale ranking and recommendation models serving more than a billion users. He is responsible for driving performance, reliability, and efficiency-oriented designs across the components of the ML training stack. Shivam holds a B.S. and an M.S. in Computer Engineering from the University of Illinois at Urbana-Champaign.

November 8 @ 16:00

16:00 — 16:30 (30′)

Day 1 | 8 Nov 2022 | INFRASTRUCTURE + DATA ENGINEERING STAGE


