The World Blew Up but We’re All Okay: How We Managed a Massive-scale Incident at Datadog

Wednesday, 11 October, 2023 - 11:00–11:40

Laura de Vesine and Laurent Bernaille, Datadog

Abstract: 

On March 8, 2023, Datadog experienced a massive global outage. In this talk, we will share the trigger for the incident and why it was a massive effort to recover from. We’ll cover in detail the technical lessons we learned from this event, with some highlights of particularly interesting technical challenges and solutions. Finally, we’ll discuss how we ran the incident response itself, successfully coordinating more than 500 engineers over 2+ days of continual response, and how we built an engineering organization capable of that feat (with minimal heroism).

Laura de Vesine, Datadog

Laura de Vesine is a 20+ year software industry veteran. She has spent the last 7 years in SRE working in incident analysis and prevention, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her cats nap on her diploma.

Laurent Bernaille, Datadog

Laurent Bernaille worked for several years as a consultant specializing in cloud, containers, and automation, helping organizations migrate to the public cloud and adopt containers. He is now a Principal Engineer at Datadog and works closely with infrastructure teams, which are responsible for setting up and scaling Kubernetes platforms. Laurent has given several talks on containers at conferences such as DockerCon, Open Source Summit, LISA, Velocity, and KubeCon.

BibTeX
@conference {292095,
author = {Laura de Vesine and Laurent Bernaille},
title = {The World Blew Up but {We{\textquoteright}re} All Okay: How We Managed a Massive-scale Incident at Datadog},
year = {2023},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
