All the times listed below are in Irish Standard Time.
08:45–09:00
Opening Remarks
The Liffey
Effie Mouzeli, Wikimedia Foundation, and Vanessa Yiu, UBS
09:00–10:30
The Liffey
eBPF Superpowers for SRE
Liz Rice, Isovalent
eBPF is a kernel technology that is enabling a new generation of infrastructure tooling, especially for observability, security, and connectivity - but why should SREs care? This talk explains what eBPF is, why its popularity has exploded in recent years, and why it's such a powerful platform.
You'll leave this talk with a mental model for what eBPF is as a platform, and ideas of useful open source eBPF-based projects that can help you solve various performance, connectivity, and security problems.
Liz Rice, Isovalent
Liz Rice is Chief Open Source Officer with eBPF specialists Isovalent, creators of the Cilium cloud native networking, security and observability project. She is the author of Container Security and Learning eBPF, both published by O'Reilly, and she sits on the CNCF Governing Board and on the Board of OpenUK. She was Chair of the CNCF's Technical Oversight Committee in 2019-2022, and Co-Chair of KubeCon + CloudNativeCon in 2018.
Building a 5-Exaflop Supercomputer for Meta-AI Research and Supporting Large-Scale Model Training with a Small Distributed Software and Production Engineering Team
Learn how Meta's latest AI Research SuperCluster with 16,000 GPUs was architected, built and operated by a small geographically distributed team of Software and Production Engineers (SRE) working closely together as one team.
We share insights from operating one of the biggest AI supercomputers, with 5 exaflops of compute power, InfiniBand interconnect, and a high-performance storage system coming together to train leading edge AI models from Meta, as well as the monitoring and observability needs that emerged from supporting large-scale model training (including the recently released Llama series).
Kalyan Saladi, Meta Platforms
Kalyan is a software engineer at Meta working in the AI Research Infrastructure team, with experience in production ML Infrastructure (FBLearner), large scale services reliability and performance. Before that they worked at VMware in virtualization, scheduling and distributed systems.
Chris Bray, Meta Platforms, Inc.
Chris has been a Production Engineer at Meta for over a decade, with a variety of experience from user-facing products like Graph Search and Instagram, acquisitions such as Oculus and WhatsApp, to Meta's various AI Infrastructure platforms for Training, Inference, and most recently Research.
10:30–11:00
Break with Refreshments
The Forum
11:00–12:30
The Liffey A
From Sysadmins to (almost) Flying Unicorns
Guillaume Hérail and Gilberto Müller, Sony Interactive Entertainment
It's 2023 and companies still struggle to make the switch to an SRE organization, and the definition of such a team usually varies from one team to another. Our company is no stranger to this issue but managed to find ways to improve the situation. Our talk focuses on how we moved from a system administration/operations team to a fully included SRE team. We will review what we did to improve both the culture in our organization and the overall reliability of our services.
Guillaume Hérail, Sony Interactive Entertainment
Guillaume is a French SRE lost in Germany, desperately trying to learn German. He lives in Berlin and has been working for PlayStation for the past 5 years and counting. In a past life, Guillaume used to be a cop but now prefers to chase bugs and improve our Cloud Gaming service reliability with a strong focus on Observability!
Gilberto Müller, Sony Interactive Entertainment
Gilberto started his career as a SysAdmin and serendipitously stumbled upon the Site Reliability Engineering domain, which was basically what he had been doing his whole career, but without the tooling or recognition. Gilberto loves the (for him perpetual) barbecue season, lives in Berlin, Germany with his family, has worked for Sony for 2.5 years and was a fan of LISA. Long live LISA!
Implementing SRE in a Telco with Reliability Enhancing Procedures
Florian Kammermann and Romain Bonjour, Swisscom
In 2021 our Telco had several severe outages. This was our chance to introduce SRE practices. The Organization had followed a traditional ITIL Operation model until then.
To simplify and scale the adoption of SRE practices, we came up with the idea of creating "Reliability Enhancing Procedures". They are Cookbook-style work instructions. The teams can work through the Cookbook themselves and improve the reliability of their Services. They are meant to scale across the organization and its teams.
The Reliability Enhancing Procedures should be based on SRE practices, but may also draw on other inspirations. Over time we introduced nine Reliability Enhancing Procedures. All these REPs have to be executed on our Services (hundreds of them).
It may just look like a program to increase reliability, but in the end it is a huge transformation initiative in how we manage the reliability of our services in an SRE style.
Florian Kammermann, Swisscom
Florian was a software and DevOps engineer for a long time. Two years ago he saw the opportunity to drive SRE adoption company-wide and took the role of Reliability Enterprise Architect. With a heavy heart, he said goodbye to terminal and code and became an advocate for site reliability engineering practices and data-driven operation.
Romain Bonjour, Swisscom
Romain is currently leading the site reliability engineering (SRE) transformation of mobile network infrastructure at Swisscom, implementing best practices, tools, and teachings around reliability of complex systems. He strongly believes in automated release engineering for both cloud native systems and legacy technologies as the key enabler for faster releases of high availability systems.
The Liffey B
Symptom-based Alerting for Machine Learning - What I Learned from Monitoring More than 30 Machine Learning Use Cases
Lina Weichbrodt, ML Freelance and Consulting
Traditional software monitoring best practices are not enough to detect problems with machine learning stacks. How can you detect issues and be alerted in real-time? This talk will give you a practical guide on how to do machine learning monitoring: which metrics should you implement and in which order of priority? Can you use your team's existing monitoring and dashboard tools, or do you need an MLOps platform?
Lina Weichbrodt, ML Freelance and Consulting
Lina has 10+ years of industry experience in developing scalable machine-learning models and bringing them into production. She currently works as a pragmatic machine-learning freelancer and consultant. She has helped clients in e-commerce, fintech, mobility, and travel to get value out of their AI projects. She previously worked at Zalando developing real-time, deep-learning personalization models for more than 32M users.
Reliable Data for Large ML Models: Principles and Practices
Mary McGlohon, Google
A Machine Learning model is only as good as the data it trains on. However, as Large Language Models (LLMs) become more sophisticated and commonplace, the scale and complexity of ML training data and model output have dramatically increased. This presents new challenges to the already-difficult work of ensuring high-quality, reliable ML data. But we need not fear: In addressing these challenges, many foundational SRE principles still apply, such as managing tradeoffs between flexibility and stability, accounting for human operations within a system, and defining clear reliability requirements between systems' owners.
In this talk we describe common data reliability challenges in ML, and how they manifest in LLMs compared to the "classic" supervised ML systems. Drawing from experience and SRE principles, we recommend best practices for assessing and managing ML data risks in your production systems.
Mary McGlohon, Google
Mary McGlohon is a Site Reliability Engineer at Google, who has worked on large-scale ML systems for the past 6 years. Prior to that, her career included data mining research, software development, and distributed pipeline systems. She completed a B.S. in computer science from the University of Tulsa and a Ph.D. in machine learning from Carnegie Mellon University. She is interested in making ML data higher quality, more observable, and easier to debug.
12:30–14:00
Luncheon
The Forum
Sponsored by Cortex
14:00–15:30
The Liffey A
New Grads Becoming New SREs: Catalyzing a “Circle of Life” in Ireland
Jennifer Petoff, Google Portugal, and Catalina Rete, Google Ireland
Site Reliability Engineering principles, best practices, and culture do not feature systematically in the undergraduate curriculum around the world. Nor do principles of non-abstract large system design. Despite this, students can be taught (and learn through experience) to be great SREs upon graduation.
This talk will equip SRE hiring managers with creative ways to build a pipeline of talent. We’ll share techniques that we’ve found to be effective in super-charging our SRE hiring pipeline from universities in Ireland.
Jennifer Petoff, Google Portugal
Jennifer Petoff is Director of Google Cloud Platform (GCP) & Technical Infrastructure (TI) Education and is based in Lisbon, Portugal. She leads training programs for Google's GCP and TI Engineering Teams. Jennifer is one of the co-editors of the best-selling book, Site Reliability Engineering: How Google Runs Production Systems; lead author of Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program; and is a regular speaker at DevOps and SRE conferences around the world.
Catalina Rete, Google Ireland
Catalina Rete is a Site Reliability Engineer in the AdsML SRE team in Dublin. Her team ensures the reliability of one of the largest machine-learning pipelines in the world. Catalina interned a total of three times with Google and has just celebrated her two-year Google anniversary since joining in a full-time position!
Scale Your Future: An Immersive Engineering Programme
Radha Kumari, Slack; Margarita Glushkova and Berkeli Halmyradov, Slack / Code Your Future
Software infrastructure and operations are hard to break into. Most computer science grads aren't well-prepared for the industry but pathways like internships do exist. For non-graduates the doors are locked and bolted.
This talk is about how Slack partnered with Code Your Future, a free, volunteer-run coding school for refugees, asylum seekers and other people excluded from education, to build a volunteer team of industry professionals and design an innovative curriculum and work placement to find and nurture promising new talent.
Join us as we hear first-hand what it was like for two of those new engineers, who have been through the course and the work placement: what they learned, what surprised them, what was difficult, and how their outlook on software engineering has changed.
We hope attendees will be inspired to invest in similar initiatives and unlock opportunities for under-represented individuals.
Radha Kumari, Slack
Radha is a Staff Software Engineer for the Demand Engineering team at Slack (Ireland) where she focuses on ensuring "bytes" move in and out of Slack as expected.
Outside work, she loves travelling around the world and has been to over 30 countries since 2013. She enjoys playing keyboard and also has a passion for collecting shoes.
Margarita Glushkova, Slack / Code Your Future
Margarita is a graduate of Code Your Future and now a Software Engineering Intern at Slack (UK), where she is part of the Demand Engineering team responsible for ingress traffic and service discovery. She has a passion for learning about complex distributed systems and how to keep them reliable and fast.
During her free time, she enjoys hiking with her dogs and exploring the English countryside.
Berkeli Halmyradov, Slack / Code Your Future
Berkeli is a passionate Software Engineering Intern at Slack (UK), where he is part of the Demand Engineering team responsible for ingress traffic and service discovery. A proud graduate of Code Your Future, Berkeli's commitment to continual learning drives his interest in exploring new technologies. In addition to his professional pursuits, he volunteers his spare time at Code Your Future, reinforcing his dedication to the tech community.
The Liffey B
Over, Under, Around, and Through: A Detailed Comparison of QUIC and HTTP/3 Application Mapping vs. Protocol Encapsulation
Lucas Pardue, Cloudflare
This talk explores how QUIC and HTTP/3 are used for a range of diverse use cases. It provides a rapid on-ramp to the technology concepts and focuses on practical considerations that help to effectively use, operate, manage, or debug these fairly new protocols. We'll dive into the different approaches that applications can take to using QUIC, by focusing on HTTP/3 and related MASQUE tunneling techniques. We'll cover QUIC features, caveats, opportunities, and potential pitfalls. By the end, the audience will be comfortable with QUIC and familiar with a range of tools and resources that will help with today's specific needs and tomorrow's developments.
Lucas Pardue, Cloudflare
Lucas is an engineer on the Cloudflare Protocols team, responsible for Layer 4 and Layer 7 traffic termination and optimization including TCP, TLS, QUIC, HTTP/2 and HTTP/3. He is also an active participant in open standards across a number of roles: Co-Chair of the IETF QUIC Working Group, author and contributor to several QUIC, HTTP and MASQUE related RFC standards, and board member of the UK Government's Open Standards Board.
Deploying and Debugging HTTP/3
Robin Marx, Akamai
Modern networking protocols like QUIC and HTTP/3 are not only complex, they're also almost fully encrypted, even at the transport layer. The combination of these two aspects makes them hard to debug and deploy, especially at scale.
This talk will focus on features like packet header protection, 0-RTT, alt-svc, resource prioritization, and connection migration and how they can cause issues when setting up, load balancing and firewalling real-world QUIC deployments. We will also discuss practical options for debugging the protocols, like curl, Wireshark, and the bespoke qlog and qvis tooling. All of this is supported by "tales from the trenches" from companies like Akamai, Google, Meta and Cloudflare.
So, if you've always wanted to know how Google DDoSed itself, with the only recovery option being to turn off QUIC for a week, join us for this technical deep-dive!
Robin Marx, Akamai
Dr. Robin Marx is a Web Performance Expert at Akamai Technologies. He focuses on the performance and workings of modern Web protocols like HTTP/2, HTTP/3 and QUIC and has been a contributor in the IETF QUIC working group for multiple years. His PhD research was focused on debugging and understanding these protocols.
Robin often talks about web performance at international conferences, making complex situations more accessible to the wider public. On the weekends, he likes to hit other people with longswords.
15:30–16:00
Break with Refreshments
The Forum
16:00–17:30
The Liffey A
The Engineer/Manager Pendulum Goes Mainstream
Charity Majors, honeycomb.io
Going back and forth between management and engineering is no longer an anomaly, but an accepted career path. Yet our institutions and expectations are still designed around the assumption that you will pick one track and stay there. Let's talk about what kind of structural changes and incentives we should make to modernize engineering career tracks -- and how to convince senior leadership it's worth doing.
Charity Majors, Honeycomb.io
Charity is the cofounder and CTO of honeycomb.io, the O.G. observability company, and the coauthor of O'Reilly books "Database Reliability Engineering" and "Observability Engineering". She writes about tech, leadership and other random stuff at https://charity.wtf.
Panel Discussion: Layoffs in Tech and The Day After
Moderator: Emil Stolarsky, Wave Mobile Money
For the past couple of years (and counting), we have witnessed tech companies of every size drastically reducing their positions and slowing down hiring. In this panel we will discuss how we got here, and what this means about the past and the future of SRE.
Emil Stolarsky, Wave Mobile Money
Emil is an SRE at Wave Mobile Money, helping make Africa the first cashless continent. Previously, he worked on caching, performance, and disaster recovery at Shopify, the internal Kubernetes platform at DigitalOcean, and everything in between at Cheddar. In addition to speaking at and organizing a number of conferences, he was a contributor to Seeking SRE and co-authored 97 Things Every SRE Should Know.
The Liffey B
Do Not Thrash the Node.js Event Loop
Matteo Collina, Platformatic
Deploying Node.js at scale is an art mastered by few. The most common problem is an exhaustion of resources that lets the application cause its own denial of service. The result is Node.js systems that are massively overprovisioned, wasting enormous amounts of computing and memory - keeping most of them idle. In this talk, we will do some math, discover hard truths, and implement a fix.
Matteo Collina, Platformatic
Matteo is the Co-Founder and CTO of Platformatic.dev with the goal to remove all friction from backend development. He is also a prolific Open Source author in the JavaScript ecosystem, and the modules he maintains are downloaded more than 17 billion times a year. Previously he was Chief Software Architect at NearForm, the best professional services company in the JavaScript ecosystem. In 2014, he defended his Ph.D. thesis titled "Application Platforms for the Internet of Things". Matteo is a member of the Node.js Technical Steering Committee focusing on streams, diagnostics and http. He is also the author of the fast logger Pino and of the Fastify web framework. Matteo is a renowned international speaker who has presented at more than 60 conferences, including OpenJS World, Node.js Interactive, NodeConf.eu, NodeSummit, JSConf.Asia, WebRebels, and JsDay, just to name a few. He is also co-author of the book "Accelerating Server-Side Development with Fastify", published by Packt. In the summer he loves sailing the Sirocco.
Scaling Chef Emotionally
Brett Pemberton, Slack
Slack has used Chef to configure our fleet since the get-go, including several periods of hypergrowth - things from the initial launch through to pandemic times. This growth also meant scaling the number of Engineers who wanted to successfully use Chef in their own way.
Most of the time we were able to sustain a scalable and flexible Chef Infrastructure and codebase to satisfy ourselves and our Engineers. Sometimes not.
We'll talk about the path we took scaling Chef to handle this, the emotions we felt and the decisions we had to make to be able to create a culture of safely pushing changes while still allowing ourselves to break things along the way.
Brett Pemberton, Slack
Brett has been professionally swearing at computers for the last two decades. He's spent the last 6 years doing so at Slack, with love. He only sometimes breaks things. He has recently moved to the Countryside and discovered that applying SRE Principles to a Flower Farm does not always bring success.
17:30–19:30
Conference Reception at the Sponsor Showcase
The Forum
Enjoy dinner and beverages while networking with other attendees and visiting the exhibits as we close out the first day of sessions!
09:00–10:30
The Liffey A
SRE for [cyber]security
Nicolas Fischbach, Google
In this talk I'll share lessons learned from leading the SRE teams at Google who look after our low-level infrastructure security, make "prod" more secure, deliver industry-leading Cloud Security products and address global resilience by preparing for low-probability high-consequences events. To make this practical, I'll apply an enterprise CISO/BoD lens to it (informed by my previous experience as a cybersecurity software vendor CTO/VPE).
Nicolas Fischbach, Google
Nico leads Security, Privacy, Resilience and Cloud AI SRE at Google. His global teams look after the most critical low-level security components that underpin all of Google's infrastructure as well as the Cloud Security & AI products and services in Google's enterprise offering. His team is also tasked with making "Prod" more secure as well as researching and preparing for Low Probability and High Consequence events. Before Google, Nico was the CTO and VP of Engineering at Forcepoint and Director of Architecture and Innovation at Colt Technology Services. Over the last two decades he has presented at numerous technical and business conferences in the cybersecurity and SP/telco domains.
Cloud, Kubernetes, and Service Networking - Taming the Turtles
Matt Turner
Networking in Kubernetes is a black art to most people. It mostly works, and you mostly don't have to care. However, for debugging issues - including day 2 performance and security issues - a correct mental model is crucial. Add the complexities of the underlying VPC, and a service mesh like Istio, and it’s hard to know where one ends and the next starts, let alone how they interact. And that’s before we talk about how they all use eBPF.
In this session, I'll show how all the layers work and interact, covering things like
- What's CNI vs kube-proxy?
- What's the "Kubernetes Networking Model" and how does it interact with cloud providers' VPCs?
- How are iptables and eBPF used by all these systems?
Matt Turner
Matt is a software engineer at Tetrate, working on Istio-related products, and loves sharing the latest tech and trends with everyone. He's been doing Dev, sometimes with added Ops, for over a decade. His idea of "full-stack" is Linux, Kubernetes, and now Istio too. He's given many talks and workshops on Kubernetes and Istio, and is co-organiser of the Service Mesh London meetup. He tweets @mt165 and blogs at https://mt165.co.uk
The Liffey B
Designing Matrix: A Global Decentralised End-to-End Encrypted Communication Network
Matthew Hodgson, Matrix.org
Matrix is an open source project that publishes an open standard protocol and reference implementations for secure, decentralised real-time communication, stewarded by the non-profit Matrix.org Foundation. Initially launched in Sept 2014, there are now over 106M users on the network, spread over ~100K server instances, providing a secure open source alternative to Teams/Slack/WhatsApp/Discord within a global network. Users include the governments of France, Germany, UK, US, Sweden, Poland, Ukraine and many others, as well as major open source organisations including Mozilla, the Wikimedia Foundation, KDE, GNOME, Debian, etc.
In this talk, we'll explain the unique challenges of designing a byzantine-fault tolerant distributed system that operates without transactional finality, optimised for relatively low latency real-time communication. We'll also dig into the specific approaches used to scale and operate large-footprint individual instances, and give a preview of upcoming work around portable identity, multi-homed accounts and eventually Peer-to-Peer Matrix.
Matthew Hodgson, Matrix.org
Matthew Hodgson is Project Lead and co-Founder of Matrix, an open source project for secure decentralised communication. His day job is CEO/CTO at Element, the startup founded by the core Matrix team to provide services and solutions around Matrix and so fund the project. In previous lives he led VoIP and SRE teams, and has a degree in Physics & Computer Science from the University of Cambridge.
Tracing the Journey into Distributed Tracing
Pedro Alves, Wayfair DE
Onboarding new technologies in large scale organizations can be quite the challenge. Distributed Tracing is one such technology: companies occasionally struggle with its rollout, and later with reaping the benefits that it promises. This talk describes the steps that led to a successful (although laborious) process to adopt Distributed Tracing in a large scale org. We'll go through those steps, and from there extract some guidelines on how to approach onboarding other technologies, so we don't reinvent the wheel every time.
Pedro Alves, Wayfair
Pedro has been developing back end code for webapps since 2008, across different business areas. Since 2018 his focus has been on all things SRE: helping teams improve their Observability, building tools to help engineers with alerting, solving complex bugs, scaling applications, and handling weird incidents.
10:30–11:00
Break with Refreshments
The Forum
11:00–12:30
The Liffey A
The World Blew Up but We’re All Okay: How We Managed a Massive-scale Incident at Datadog
Laura de Vesine and Laurent Bernaille, Datadog
On March 8, 2023 Datadog experienced a massive global outage. In this talk, we will share the trigger for the incident and why it was a massive effort to recover from. We’ll cover in detail the technical lessons we learned from this event, with some highlights of particularly interesting technical challenges and solutions. Finally, we’ll discuss how we ran the incident response itself, successfully coordinating more than 500 engineers over 2+ days of continual response, and how we built an engineering organization capable of that feat (with minimal heroism).
Laura de Vesine, Datadog
Laura de Vesine is a 20+ year software industry veteran. She has spent the last 7 years in SRE working in incident analysis and prevention, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her cats nap on her diploma.
Laurent Bernaille, Datadog
Laurent Bernaille worked for several years as a consultant specializing in cloud, containers, and automation and helped organizations migrate to the public cloud and adopt containers. He is now Principal Engineer at Datadog and works closely with infrastructure teams, which are responsible for setting up and scaling Kubernetes platforms. Laurent has given several talks on the topic of containers at conferences such as DockerCon, Open Source Summit, LISA, Velocity, and KubeCon.
When One Line Took Thousands of Websites Offline
Francisco Borges Aurindo Barros and Jack Henschel, CERN
This talk describes an incident where an innocuous change in a configuration management system caused a highly-visible unavailability of thousands of websites, which was followed by an intense recovery procedure. The talk covers the part of the infrastructure that prevented more widespread damage, the lessons learned (in terms of infrastructure design and operational procedures) as well as significant improvements that have been implemented since then. All of this happened on Kubernetes infrastructure, so the talk will dive into the topics of Kubernetes operators, automation, manual intervention, configuration management and backups.
Francisco Borges Aurindo Barros, CERN
Francisco Barros is an SRE at CERN. He likes to specialize in automating the repetitive, working with open source technologies, and helping to develop and maintain reliable and modern solutions. Currently he manages a Kubernetes-flavored cluster that handles all the CMS websites at CERN. He lives near Geneva and in winter likes to snowboard.
Jack Henschel, CERN
Jack Henschel is a Cloud Computing Engineer at CERN where he develops and administers several Kubernetes clusters, ensuring all components integrate smoothly with the rest of CERN's computing environment. His special areas of interest are systems performance, observability and efficiency. In his free time he likes exploring the French and Swiss Alps on foot and by bike.
The Liffey B
HTTP Headers that Make Your Website Go Faster
Thijs Feryn, Varnish Software
Slowdowns can be the death of any website. These problems, which often happen during traffic spikes, have a detrimental effect on the user experience and can result in loss of customers.
Fortunately, there is a simple and effective technique to mitigate the impact of traffic spikes: caching HTTP responses. However, most developers aren’t using it to its full potential.
While there are many caching implementations out there, HTTP already has conventional caching mechanisms built into the protocol, which are respected by most caching systems.
In this presentation, the audience will learn how to leverage some of those built-in mechanisms and apply them to a reverse caching proxy. The caching proxy that is featured in this presentation is Varnish Cache.
The audience will learn how to control the lifetime of cached objects, create cache variations, perform conditional requests, handle stale data, revalidate expired content, handle errors, and use HTTP placeholders.
Thijs Feryn, Varnish Software
As the Technical Evangelist at Varnish Software, Thijs Feryn focuses on web performance, software scalability, and content delivery. He demonstrates content-driven and technical messaging through presentations, videos, books, blog posts, social media posts, podcasts, and other media.
Thijs is a published author and wrote Getting Started with Varnish Cache and Varnish 6 by Example. As a public speaker, he has a track record of over 290 presentations in 22 different countries, where he is often praised for his energetic and engaging presentation style.
As an evangelist, Thijs is also active in many open-source communities, most notably the Varnish and PHP community. He has contributed to various communities for almost 20 years both technically and as an organizer and facilitator.
For more information, please visit https://feryn.eu/speaking
Cache Me If You Can: How Grafana Labs Scaled Up Their Memcached 42x & Improved Reliability
Danny Kopping, Grafana Labs
Our cloud database stores billions of files in object storage. We were encountering severe rate-limits when attempting to run large queries, and we needed a way to cache a large subset in a cost-effective manner without hurting performance.
In this talk, I'll describe how we're using memcached in combination with local SSDs provided by major cloud vendors to scale up our memcached to tens of terabytes, improving reliability, and saving money!
Liffey Hall 2
Workshop: Statistics for Engineers
Heinrich Hartmann, Zalando
Gathering all kinds of telemetry data is key to operating reliable distributed systems at scale. Once you have set up your telemetry systems and recorded all relevant data, the challenge becomes making sense of it and extracting valuable information. Statistics is the art of extracting information from data. In this talk we will discuss mathematical methods that will help you in your daily work as an SRE. Specifically, we will cover the following subjects:
- Visualization of telemetry data with charts, scatter plots and heatmaps
- Summarizing data with means, medians and percentiles
- Sampling telemetry data and the impact on RED (Rate, Error, Duration) metrics
In the talk we will cover the topics from both the theoretical and the practical side, providing examples for the most relevant use cases and technologies on production data.
12:30–14:00
Luncheon
The Forum
14:00–15:30
The Liffey A
Embracing the Multi-Party Dilemma: Incident Response Across Company Boundaries
Sarah Butt, SentinelOne, and Alex Elman, Indeed
For many companies, the past several years have represented a marked shift toward transferring operational responsibility of running critical services to third-party external vendors. This is particularly true as companies seek to scale and lower capital costs. From elastic compute capacity via external cloud providers, to vendors for CDN and Edge, DDoS protection and beyond, vendors have increasingly become vital components in software systems. This talk speaks to some of the complexity of managing vendor relationships and the insights gleaned when looking at incidents with vendor involvement across 3 separate companies. Through a dynamic called “The Multi-Party Dilemma,” we introduce concepts from research and real-world findings to help us all learn from incidents across organizational boundaries.
Sarah Butt, SentinelOne
Sarah is SentinelOne's Director of Site Reliability Engineering. She is fascinated by scale, complexity, systems thinking, and non-functional requirements, particularly those around reliability. You'll likely find her talking about topics such as resilience, observability, and incident management and response. Prior to working at SentinelOne, Sarah worked in both Salesforce and Dell's SRE organizations.
Alex Elman, Indeed
For the past twelve years, Alex Elman has been helping Indeed cope with ever-increasing complexity and scale. He is a founding member of the Site Reliability Engineering team. Alex leads the Resilience Engineering team focused on learning from incidents, chaos engineering, and managing incidents.
The Incident Is The Way: Using Your Incidents to Win Reliability Investment
Niall McCarthy, Afterpay
When organisations face a high severity incident they experience a moment of consensus: all parts of the organisation want to see a serious problem solved as quickly as possible. Behind an incident's alerts, on-call pages and status updates is an alignment that in any other context is fleeting and rarely repeated. In this talk we discuss how the choices we make during incident management, from who contributes, to how we describe harm, to how we communicate updates, can build an investment dialogue with decision-makers.
When distributed systems ensure everything is a SEV 01 to someone, building a common context for our responses converts our incidents into successful investment arguments.
Niall McCarthy, Afterpay
Niall is an engineering leader at Afterpay spending his days at the intersection of software operations and development. Niall leads Afterpay’s incident management, continually researching practices to use incidents to uplift and align teams and software. When not working, Niall is drinking coffee and failing to 100% an RPG.
That Time I Accidentally DDoS'd My Company
Mike Bongardino
This is a story about how shifting circumstances, key design decisions, and one overlooked line of code created a time bomb in our private cloud resources.
Mike Bongardino
Mike Bongardino is a software engineer, cloud architect, enthusiastic tinkerer, and advocate for the Oxford comma. Currently based in Brooklyn, NY.
The Liffey B
Artificial Intelligence: How Much Will It Cost You?
Todd Underwood, Google
We are now a full year into the latest AI revolution, this one in generative AI, or large language models. For many organization leaders and SREs the most relevant question that is rarely discussed is: what will it cost and will it be worth it? Important models only matter when they are integrated into some product and served to users. Large model training is incredibly expensive, as is large model serving. This talk looks at the history of serving cost curves for simpler applications (web applications!) and tries to understand what the future might bring. We will look at the possible future of the breathtaking costs of large language model training and serving.
Todd Underwood, Google
Todd Underwood is a Senior Engineering Director at Google. He leads ML capacity engineering in the office of the CFO at Alphabet. Before that, he founded and led ML Site Reliability Engineering, a set of teams that build and scale internal and external AI/ML services and are critical to almost every Product Area at Google. He was previously the Site Lead for Google’s Pittsburgh office. He recently published Reliable Machine Learning: Applying SRE Principles to ML in Production (O’Reilly Press, 2022).
Just the Cryptography You Need to Know for TLS
Lerna Ekmekcioglu, AWS
We use TLS every day, but have you ever wondered what cryptographic concepts TLS is built upon? In this talk, I'll go over asymmetric keys, symmetric keys, keystores, truststores, and mutual TLS, along with analogies that will make these concepts accessible to everyone.
Lerna Ekmekcioglu, AWS
Lerna Ekmekcioglu is a Senior Solutions Architect at AWS where she helps Global Financial Services customers build secure, scalable and highly available workloads. She has over 18 years of platform engineering experience including authentication systems, distributed caching, and multi-region deployments using IaC and CI/CD, to name a few. In her spare time, she enjoys hiking, sightseeing and backyard astronomy.
You Depend on DNS, This Is How It Works and You Won't Believe It
Philip Rowlands, Jane Street
What's everyone's favourite federated, distributed, eventually consistent, caching key-value store? That's right: DNS.
This live demo will answer such questions as Why DNS?, Why DNSSEC?, Why Punycode?, Why did a minor Chrome feature DDoS the root servers?, and Aren't 5 TLDs enough?
Come and marvel at how DNS has survived and adapted over the last 40 years. Yes, 40.
Philip Rowlands, Jane Street
Philip Rowlands has been an SRE since before he really understood what it was. Because he doesn't scale, he relies on software for leverage. He has worked over the years on automated telephony, Google Production SRE, Mainframe Linux, and more recently for various financial firms, all of which used DNS. He cannot juggle.
Liffey Hall 2
Speedrun through Splicing Sockets with Sockmap
Jakub Sitnicki, Cloudflare Inc.
Network proxies have one thing in common. They push data from one side to the other. If the proxy doesn’t touch the data, then it is desirable to offload the job of moving data from one network socket to another to the operating system.
Under Linux, applications can pipe data in batches between sockets with the splice() syscall. However this is not the only way to splice two sockets together!
Linux also offers another way to push packets between two TCP sockets without exiting to user-space, called sockmap. The mechanism is powered by logic built into the core network stack and a couple of BPF-based components to drive the operation.
In this talk we will go over everything a user needs to know to get started using BPF sockmap. We will also discuss sockmap features, internal design, as well as its caveats and limitations.
Jakub Sitnicki, Cloudflare Inc.
Jakub is a contributor to the networking and BPF subsystems in the Linux kernel. He is also a co-maintainer of the Linux BPF L7 framework, aka sockmap. At Cloudflare he is part of the team which maintains the company’s internal Linux kernel.
Sandboxing in Linux with Zero Lines of Code
Ignat Korchagin, Cloudflare
Linux seccomp is a simple, powerful tool to sandbox running processes and can significantly decrease damage in case the application code gets exploited. It provides fine-grained controls for the process to declare what it can and can’t do and in most cases has little performance overhead.
But to utilise this framework developers have to explicitly add sandboxing code to their projects, and developers usually either delay this or omit it completely. Moreover, the seccomp security model is based around system calls, but many developers, writing their code in high-level programming languages and frameworks, either have little knowledge or no experience with syscalls or just don’t have easy-to-use seccomp abstractions for their frameworks.
All this makes seccomp not widely adopted—but what if there was a way to easily sandbox any application in any programming language without writing a single line of code? This presentation discusses potential approaches with their pros and cons.
Ignat Korchagin, Cloudflare
Ignat is a systems engineer at Cloudflare working mostly on Linux, platforms and hardware security. Ignat’s interests are cryptography, hacking, and low-level programming. Before Cloudflare, Ignat worked as a senior security engineer for Samsung Electronics’ Mobile Communications Division. His solutions may be found in many older Samsung smart phones and tablets. Ignat started his career as a security researcher in the Ukrainian government’s communications services.
15:30–16:00
Break with Refreshments
The Forum
16:00–17:30
The Liffey A
9 Things You Should Do When Starting to Use SLOs
Sal Furino
SLOs are hard to get right!
There is much literature out there describing what SLOs are and how to calculate them. However, incubating SRE and observability culture at organizations can be challenging. Additional guidance is often needed to coach organizations in how to use SLOs. SLOs can empower conversations, enhance understanding of the mental model of a service, and clarify what actions to take towards its reliability.
As a Customer Reliability Engineer I’ve witnessed many organizations encounter stumbling blocks while attempting to implement SLOs. This talk will discuss 9 of the common issues that are encountered while attempting to implement SLOs and how to get the most value out of the SLOs you create.
Sal Furino
Sal Furino is a Customer Reliability Engineer. During his career he’s worked as a TPM, SRE, Developer, Sys Admin, and IT support. While not working he enjoys cooking, gaming, traveling, skiing, and golfing. Sal lives in Queens with his partner and has a BS in Applied Mathematics from Marist College.
Silent Spring: What if the GDPR Was Real?
John Looney
The talk is named after Rachel Carson's Silent Spring, where she imagined what would happen if the world kept using pesticides as it had been using them into the 1960s.
We will imagine what the world will look like, if companies do not adjust to the recent EUCJ rulings on the GDPR. For the last few years, many smart people have tried, and failed, to work out how multinationals can continue business as usual. The EU Commission, and the Irish DPC, have done everything they can to ensure the GDPR has not been enforced, by delaying legal actions and announcing wonderful future solutions.
John Looney
John Looney has been working in multinationals that handle large amounts of private & personal data for two decades, and has been thinking about the real-world implications of applying the spirit of EU human rights legislation on global datacenters. He's been involved with running SRECon since its inception.
The Liffey B
From Exceptional Maintenance to Automated Routine Operation: A Story of the Datacenter Switchover for Wikipedia
Giuseppe Lavagetto, Wikimedia Foundation
This is a tale about reducing toil by automating it away.
Once upon a time, there was a single core datacenter for Wikipedia. Later, we added a second core datacenter as a failover; the first time we moved traffic between them, it was a multiple-day operation undertaken by multiple engineers which required about 40 minutes of read-only time and an extended period of degraded performance.
Nowadays, we switch traffic multiple times per year. The newest team member is the conductor of the process, and along with a senior member as their sidekick, they carry out the process that requires less than two minutes of read-only time and virtually no degraded performance.
I'll introduce the tools we built (all open source) as well as the architectural approach we took to get to that point, which should also be applicable to other problems/architectures in the same space.
Giuseppe Lavagetto, Wikimedia Foundation
An astrophysicist by trade, I have been disguising myself for the last decade as an SRE for the Wikimedia Foundation, the non-profit that runs your favourite free online encyclopedia. My work focuses mostly on making our application layer flexible, automated and dynamic.
Panel Discussion: Unraveling the Interoperability Challenges Among SaaS Products
Moderator: Murali Suriar, Snowflake
Panelists: V Brennan, Slack; Dave O'Connor; George Kargiotakis, Elastic; Richard "RichiH" Hartmann, Grafana
Software as a Service (SaaS) products have revolutionized the way businesses operate. However, as organisations increasingly adopt multiple SaaS solutions to cater to diverse needs, a significant hurdle emerges - the lack of seamless integration and harmonious coexistence among these platforms. This panel event aims to shed light on the intriguing question of "Why SaaS products do not play nice with each other?"
V Brennan, Slack
V Brennan is a seasoned leader with a career in infrastructure management, currently serving as the Senior Director of Infrastructure at Slack in Dublin, Ireland. In this role, she oversees global teams specialising in Traffic, Service Mesh, and Datastores. Additionally, she bears the critical responsibility of managing Slack’s incident response efforts within the region, holding the position of one of Slack’s most tenured Incident Commanders. Prior to her role at Slack, V contributed her expertise as an Engineering and Product Manager at Spotify, stationed in the vibrant tech hub of Stockholm, Sweden. During her tenure, she led projects involving internal infrastructure and the development of internal services. A significant achievement was orchestrating the migration of this infrastructure from on-premises systems to Google Cloud. Beyond her technical accomplishments, V is passionate about leadership and the cultivation of high-performing teams. Her unwavering commitment to reliability and resilience serves as a constant backdrop in her endeavours, particularly when dealing with the challenges of scaling operations.
Dave O'Connor
Dave O'Connor is an SRE Leadership practitioner and Coach based in Dublin, Ireland. Dave spent 16 years as an SRE at Google, failing to prevent and even being complicit in the development of the function of SRE from its inception. He has also spent time leading the SRE/Production efforts at Elastic and Twilio. His interests include organisation and team development, leadership coaching, and gently telling you about problems you didn't know you had.
George Kargiotakis, Elastic
George is a Linux enthusiast and a Tech Lead for the Infrastructure SRE team at Elastic. With a passion for ensuring the reliability and scalability of complex systems, he has been helping Elastic design, implement, and maintain highly available and performant systems. George thrives on the adrenaline of network troubleshooting investigations, always ready to rise to the challenge and untangle the most intricate tech puzzles. Beyond the realm of ones and zeros, he's equally passionate about exploring the mysteries of the underwater world through scuba diving adventures.
Richard "RichiH" Hartmann, Grafana
Richard "RichiH" Hartmann is the Director of Community at Grafana Labs, a member of the Office of the CTO of Grafana Labs, a Prometheus team member, OpenMetrics founder, OpenTelemetry member, CNCF Technical Oversight Committee member, CNCF Governing Board member, and more. He also leads, organizes, or helps run various conferences from hundreds to 18,000 attendees, including KubeCon, PromCon, FOSDEM, DENOG, DebConf, and Chaos Communication Congress. In the past, he made mainframe databases work, ISP backbones run, kept the largest IRC network on Earth running, and designed and built a datacenter from scratch.
Liffey Hall 2
Improving Kafka Resilience - Gray Failures Mitigation
Michelle Valentinova, New Relic
We’ve had many problems with a single partially healthy broker causing disproportionate issues for Kafka processing that we never expected.
We cover in-depth the different scenarios that allowed this to happen and the configuration we had chosen with the best intentions that made these outages possible or even made them worse.
The outages range from shallow broker health checks combined with storage timeouts and a certain producer configuration, leading to a 20+ minute full service outage caused by a single partially healthy broker, to a scenario where simply trying to consume data from a broker in the same availability zone results in blocked processing after a broker reboots in the same AZ as the consumers. The most complex one is solving the issue with partially healthy brokers when producing with a partition key.
We will provide a summary of the changes that allowed us to make Kafka more resilient and still have it configured to meet business needs.
Michelle Valentinova, New Relic
Michelle Valentinova started her career as a Backend Web Developer. She refocused on Systems Engineering in 2011 and has worked in Amazon, Schibsted Media Group, and most recently New Relic. In New Relic, Michelle is a Senior Site Reliability Engineer in the Kafka Platform Team, making sure that teams have the best possible experience using the Kafka service. Kafka is a key part of the ingestion and processing pipeline in New Relic.
Panel Discussion: Open-source Development as a Full-time Pursuit
Moderator: Daria Barteneva, Microsoft
Panelists: Costa Tsaousis, Netdata; Matteo Collina, Platformatic; Michael Yakshin, Microsoft
Join us for a panel event focused on open-source development as a full-time pursuit. Panellists will share their firsthand experiences, challenges, and successes in transitioning from part-time contributors to full-time open-source advocates.
Daria Barteneva, Microsoft
Daria is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing organisational culture, processes, and platforms to improve service reliability and on-call experience. She has spoken at conferences on various aspects of reliability and human factors that play a key role in engineering practices, and has written for O'Reilly. Daria is originally from Moscow, Russia, spent 20 years in Portugal and 10 years in Ireland, and now lives in the Pacific Northwest.
Costa Tsaousis, Netdata
Costa is the Founder and CEO of Netdata. Since 1995, he has been actively working on internet-related startups. He has been a co-founder and C-level executive of many successful projects, including Internet Service Providers, Cloud Hosting Providers and Fintech startups. With a passion for innovation, he now leads Netdata, an open-source monitoring solution aiming to simplify and modernise infrastructure observability for everyone.
Matteo Collina, Platformatic
Matteo is the Co-Founder and CTO of Platformatic.dev with the goal to remove all friction from backend development. He is also a prolific Open Source author in the JavaScript ecosystem and the modules he maintains are downloaded more than 17 billion times a year. Previously he was Chief Software Architect at NearForm, the best professional services company in the JavaScript ecosystem. In 2014, he defended his Ph.D. thesis titled "Application Platforms for the Internet of Things". Matteo is a member of the Node.js Technical Steering Committee focusing on streams, diagnostics and http. He is also the author of the fast logger Pino and of the Fastify web framework. Matteo is a renowned international speaker after more than 60 conferences, including OpenJS World, Node.js Interactive, NodeConf.eu, NodeSummit, JSConf.Asia, WebRebels, and JsDay, just to name a few. He is also co-author of the book "Accelerating Server-Side Development with Fastify" edited by Packt. In the summer he loves sailing the Sirocco.
Michael Yakshin, Microsoft
Michael Yakshin is a Principal Software Engineering Manager at Microsoft in Dublin, leading a team focusing on client-side agent and SDKs for metrics part of Microsoft's Observability systems. In his past life, Michael wore many hats, including software development for web analytics, embedded controllers, specialized Linux distributions, security and reverse engineering. Free and open source software is Michael's passion, currently pouring it into Kaitai Struct, a free declarative language bringing understanding to binary structures, one byte at a time.
17:45–18:45
Lightning Talks
The Liffey A
- Status Pages 101
  Ashley Sawatsky, Rootly
- Three Phases to Better Observability Outcomes
  Eric D. Schabell, Chronosphere
- Replacing Tests with Quality Gates
  José Velez
- Gosh, We Now Have an SRE Team What Do We Do With It?
  Andrew Howden
- Building for Fun with Fewer Nines
  Elise Burke, Datadog
- We Need to Talk about SRE Burnout
  Ludo Frinea
- Introspect or Retrospect? Scaling the Knowledge beyond One Team Boundaries!
  Daria Barteneva, Microsoft
- How to Troubleshoot When Your Plane Keeps Turning Upside Down
  Joan O’Callaghan, Udemy
09:00–10:30
The Liffey A
Should I Use OTel (collectors), or Is Prometheus Good Enough?
Krisztian Fekete, solo.io
Everyone is talking about OpenTelemetry. It's a popular topic, even though most people are not aware of the differences between OpenTelemetry tools, APIs, and SDKs, so participating in these conversations can be confusing at times.
Others think that OpenTelemetry (the collector, in this case) is a silver bullet, and everyone needs to remove all of their Prometheus instances and telemetry agents, and deploy OTel collectors everywhere. Are they aware of the functionalities they miss or the failure modes they will introduce? Not necessarily.
As a former SRE I monitored tens of millions of users with the traditional Prometheus stack. In my current position I am working heavily with OTel to design telemetry pipelines at scale with customers. The goal of this talk is to make it easier for SREs to make informed decisions about the pros and cons for the tooling they are considering using for the job at hand.
Krisztian Fekete, solo.io
Krisztian is enthusiastic about observability and cloud infrastructures. He's now working at Solo.io as an engineer. Previously, he was working at LastPass as a senior DevOps/SRE engineer. He is building a self-hosted blog on top of Istio in his spare time. The main topics of the blog are aligned with his interests, while he is also using the platform to share operational anecdotes on running one of the most "over-engineered" blogs out there.
Implementing Open-source Observability within Maersk
Charlie Nederhoed and Martin Jaeger, A.P. Moller - Maersk
Maersk is the global leader in integrated supply chain logistics, helping customers to move goods around the world. Our mission is to become the global integrator of container logistics.
In the realm of large organizations, the challenges of standardizing observability and monitoring practices can be immense. This talk will dive into our journey of simplifying and standardizing observability across diverse hosting environments like Azure, AWS, GCP, on-prem and our edge locations.
We are part of the platform engineering team that is responsible for observability within the company. The team has built a 100% self-service observability platform, using Loki/Grafana/Tempo/Mimir (LGTM stack) as its foundation, and in-house built tools to govern ingestion and querying.
Charlie Nederhoed, A.P. Moller - Maersk
Charlie Nederhoed is a systems/SR engineer who has 10 years of experience in running production systems and the work that comes with it, such as automation, reliability and cloud infrastructure.
He has been at Maersk for the past 5 years, the last 4 of which have focused on Kubernetes and observability, and is now leading the development of the company’s observability platform.
Martin Jaeger, A.P. Moller - Maersk
Martin Jaeger is a software developer who has spent 17 years developing software, running production systems, building cloud infrastructure and teaching others what he learned along the way.
He fell in love with Prometheus and Grafana back in 2013 and is now building observability systems using open-source software.
Journey from Fluent Bit, Fluentd and Prometheus to OpenTelemetry Collector - Lessons Learned
Marcin "Perk" Stożek, Canonical
Telemetry collection seems very straightforward at first glance: just use Fluent Bit or Fluentd for logs, Prometheus for metrics and call it a day. But problems arise when we want the best performance or simplicity of the solution. It’s hard to maintain three different agents, with three different configurations and documentation in three different places. In the talk, Perk discusses why his team replaced Fluent Bit, Fluentd and Prometheus with OpenTelemetry Collector and the challenges they faced along the way. Most importantly - is there a happy ending?
Marcin "Perk" Stożek, Canonical
Product manager by day, developer by night. Remote work practitioner.
My dev journey started with a copy-book and a pencil for C64 picture bytes calculations. Since then I have been involved in many different projects - ranging from security in message queues to Hadoop; from monolith to microservices; from bare metal to Kubernetes; from software tester, through dev and ops to lead and manager.
I love automation and believe that Linux is the best thing since sliced bread. Well, maybe after Vim.
The Liffey B
Level 7 Egress Control in Kubernetes: Current Solutions, Future Standards
Joshua Fox, DoiT International
Until recently, you could not control outgoing traffic to given Fully Qualified Domain Names (FQDN) using Kubernetes Network Policies. You could use ordinary firewalls, but these are defined by IP, not domain. Even network firewalls that recognize domains do not work in terms of Kubernetes, for example restricting the namespaces and labels which are allowed egress.
Cilium and Istio do provide this ability, but require the complexity of an additional network layer.
Just now, Google Kubernetes Engine came out with a preview release of FQDN-aware egress control in Kubernetes Network Policies.
I will describe this and show how it fits into the effort now in progress in the Kubernetes Networking Special Interest Group to define FQDN egress control as a standard part of every compliant Kubernetes cluster.
Joshua Fox, DoiT International
Joshua Fox advises tech startups and growth companies about the cloud. Along with that, he writes open source, publishes technical articles, and speaks to cloud engineers as a Google Developer Expert.
His background includes a long career as a software architect in innovative technology companies.
He has a PhD from Harvard University and a BA in math from Brandeis.
Leveraging Unikernels and Kubernetes to (Transparently) Double Cloud Workload Performance
Felipe Huici, Unikraft GmbH
Large images and memory consumption, slow cold boots and autoscale, and underwhelming performance beg the question: is the way we do cloud deployments, while convenient, broken? In this talk we'll show that by leveraging unikernels (specialized virtual machines) via the Linux Foundation's Unikraft project (www.unikraft.org) it is entirely possible to have your cake and eat it too: (1) memory consumption of only a few MBs, boot times in the milliseconds and throughput 50-100% higher than Linux's using unmodified applications; (2) deployment via Kubernetes, either on-premises or on public clouds; (3) all the while taking advantage of the strong isolation of VMs. We will show how to get started (spoiler: a single command is enough to build and deploy) as well as a short demo.
In all, our hope is that Unikraft is a step towards efficient, cheaper and environmentally-friendly cloud deployments.
Felipe Huici, Unikraft GmbH
Dr. Felipe Huici is CEO and Co-Founder of Unikraft, a start-up dedicated to lightweight and open source virtualization tech. Previously, he worked as chief researcher at NEC Laboratories Europe. He has published in several top tier conferences such as SOSP, ASPLOS, OSDI, Eurosys, SIGCOMM, NSDI and CoNEXT, and has given talks at Open Source Summit, P99 and QCon, among others. Finally, Felipe is one of the founders and maintainers of the Linux Foundation Unikraft open source project.
Monoceros: Faster and Predictable Services through In-pod Load-balancing
Sotiris Nanopoulos and Ben Kochie, Reddit
Join us as we discuss how we optimized multi-process Python applications, decreasing CPU variance and achieving better CPU utilization while improving latency by 20%. We’ll share our journey from a Ruby-based system to creating Monoceros to tackle Python’s limitations in large scale Kubernetes deployments. Learn about our initial problems which led us to improve load-balancing and observability with Monoceros. We’ll wrap up with a comparison of SO_REUSEPORT load balancing and full load balancers. This talk is perfect for those interested in efficient system performance at scale.
Sotiris Nanopoulos, Reddit Inc
Sotiris is a software engineer focusing on building reliable, performant and secure cloud native networking infrastructure. Currently he is working at Reddit in the infrastructure transport team and before that he was contributing to Envoy Proxy on behalf of Microsoft. He is originally from Athens, Greece, but lives in Toronto, Canada.
Ben Kochie, Reddit Inc
Ben is a Principal Software Engineer for the infrastructure team at Reddit. He is passionate about scale, automation, open source, and observability. Ben has been working in software and systems engineering since 1996. He previously worked in SRE at Google, SoundCloud, and GitLab, and is currently located in Berlin, Germany.
10:30–11:00
Break with Refreshments
The Forum
11:00–12:30
The Liffey A
Continuous Profiling in the Cloud-Native Era
Matthias Loibl, Polar Signals
For years Google has consistently been able to cut down multiple percentage points in their fleet-wide resource usage every quarter, using techniques described in their “Google-Wide Profiling” paper. Ad-hoc profiling has long been part of the developer’s toolbox to analyze CPU and memory usage of a running process. However, through continuous profiling, the systematic collection of profiles, entirely new workflows suddenly become possible.
The speaker will start this talk with an introduction to profiling with Go and demonstrate via Parca - an open-source continuous profiling project - how continuous profiling allows for an unprecedented fleet-wide understanding of code at production runtime.
Attendees will learn how to continuously profile code to help guide building robust, reliable, and performant software and reduce cloud spend systematically in various languages.
Matthias Loibl, Polar Signals
Matthias Loibl is a Senior Software Engineer who works on cloud-native observability at Polar Signals, previously at Red Hat and Kubermatic, and is a maintainer of many projects like Prometheus, Thanos, Prometheus Operator and Parca. He enjoys working on Distributed Systems with Go and gRPC.
How to Use Prometheus's Native Histograms
Björn Rabenstein, Grafana Labs
Both histograms and Prometheus have a special place in the SRE toolbox. But their mutual relationship over the last decade has been somewhat strained. While mathematically sound, the “classic” Prometheus histograms (as we call them now) suffered from several issues that made their practical usage tricky, even outright painful at times. Fortunately, there is a new kid in town: native histograms!
Learn how you can start using this cutting edge feature right away (without hurting yourself), and how to use it to tackle typical SRE tasks, such as SLO tracking and troubleshooting. The talk is focused on the practical usage aspects, but a short overview and in particular helpful pointers to other sources will be provided to help you understand the theoretical background as well.
Björn Rabenstein, Grafana Labs
Björn “Beorn” Rabenstein is an engineer at Grafana Labs and a Prometheus developer. Previously, he was a Production Engineer at SoundCloud, a Site Reliability Engineer at Google, and a number cruncher for science.
The Liffey B
Overcoming Challenges in Serving Large Language Models
Theofilos Papapanagiotou, Amazon
Discover the secrets of hosting GPT-type models in a Kubernetes cluster with multi-GPU nodes. As the demand for custom GPT models grows, SREs are increasingly tasked with providing these capabilities in their organizations. We'll dive into the complexities of serving such models, including their large size and the need for GPU sharding and tensor parallelism. Learn about model file formats, model quantization techniques, and leveraging open-source tools like Huggingface Accelerate. Gain insights into the trade-offs between serving latency, prediction accuracy, and distributed serving, and explore best practices for optimizing resource allocation. Don't miss our live demo showcasing the performance and trade-offs of a GPT-based model. Empower yourself with practical knowledge to meet the demands of hosting language models effectively.
Theofilos Papapanagiotou, Amazon
Theofilos is an accomplished ML architect and an expert in serving large language models with a focus on scalability and performance optimization. With a strong background in ML infrastructure and MLOps principles, he brings a wealth of experience to the table. As a maintainer of the KServe project and contributor to Kubeflow, Theofilos is actively involved in advancing the field of model serving. His deep understanding of Kubernetes, GPU optimization, and open-source tools allows him to navigate the challenges of hosting custom GPT-based models with ease. Attend his talk to gain valuable insights, best practices, and practical knowledge that will empower you to scale and optimize your language models effectively.
The Value of Reliability
Niall Murphy, Stanza Systems
Niall Murphy will deliver an overview of how we measure and articulate the value of reliability in our organisations.
How do we evaluate down time? What are the highest value parts of your stack? How do you prioritise your engineering effort to best improve the set of likely outcomes?
Furthermore, when we know the value of our systems, how do we communicate it effectively to the rest of the business? What ways could there be to prioritise reliability work over feature work?
Niall Murphy, Stanza Systems
Niall is the CEO of Stanza Systems, has occupied various engineering and leadership roles in Microsoft, Google, and Amazon, and is the instigator of the best-selling & prize-winning Site Reliability Engineering, which he hopes at some stage to live down. His most recent book is Reliable Machine Learning, with Todd Underwood and many others.
12:30–14:00
Luncheon
The Forum
14:00–15:30
The Liffey A
A Dual Approach to Accountability Engineering
Anthony Sandoval, Reddit Inc.
Perhaps you've found yourself drawing teams on a whiteboard and thought: "Things would be better if SRE at my $companyName was structured like this."
You might be right, but you're probably wrong. It's likely not a restructuring you need, but rather securing agency and a diffusion of accountability. SRE's prevalence in an era dominated by microservices shouldn't be surprising. The DevOps movement spread responsibilities across previously established boundaries, and created a nebulous space riddled with confusion and a lack of clear ownership.
How can SRE teams be both philosophical and pragmatic?
Should SRE teams be principals or partners?
Does a performant SRE team promote or punt?
Please permit me to persuade you on your process of prioritization.
Anthony Sandoval, Reddit Inc.
Anthony has been leading infrastructure and SRE teams for the past decade at Groupon, GitLab, and Reddit, where his focus has been establishing SRE during periods of rapid organizational growth. Before that, he was an amateur technologist working in advertising, market research, and political consulting. He lives in the suburbs of Chicago with his wife, their two children, and very little dog.
Succeeding as the Lone SRE in a Small Team
Danny Kirchmeier, Outschool
Small team dynamics bring their own set of challenges: lack of resources, limited support structures, high pressures, and constant change. It's vital to set up a definition of personal success to be able to weather the storms or failures that come your way. This talk will cover one of the less intuitive but highly effective practices I have embraced.
Danny Kirchmeier, Outschool
Danny has 14 years working in startups and small teams, ranging from a tiny company of just 4 people to a more modest company with over 40 engineers. He's spent the last 8 years focusing on all things SRE related and loves having the opportunity to share and teach others both the technical and personal skills needed to succeed.
Deconstructing an Abstraction to Reconstruct an Outage
Chris Sinjakli, PlanetScale
We all rely on abstractions to build the applications we use day-to-day. It's easy for those abstractions to feel like impenetrable walls, hiding scary low-level parts of the system - especially for a complex piece of software like a database. That needn't be the case!
In this talk, we'll explore the aftermath of a complex outage in a Postgres cluster. We'll retrace the steps we took to reliably reproduce the failure in a local environment and pull out lessons about debugging complex systems along the way. At one point, we'll dive into the depths of how Postgres represents data on disk and realise that even unfamiliar layers of a system don't need to be scary.
Chris Sinjakli, PlanetScale
Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems.
All his programs are made from organic, hand-picked, artisanal keypresses.
The Liffey B
When Clouds Stop Raining Discounts: Surviving the Drought
Max Blaze, Duolingo
We seemed to be all set in terms of AWS cost optimization–we had just finished migrating our largest, most costly services to the less expensive Graviton architecture and our projected savings forecasts for the year were looking fantastic. That is until March, when drastic changes in the AWS Spot market overturned all of our underlying assumptions. Suddenly, our “cheaper” machines were costing more and most of our discounts completely disappeared by the end of April. The unthinkable had happened and we had no idea if it was temporary or the new normal.
This talk will explore the most likely underlying causes of the AWS Spot market instability, the tools and techniques we used to see the full view of its effects on discounts, and the major actions that we took to keep our infrastructure costs under control when faced with increasing uncertainty.
Max Blaze, Duolingo
Max is a Senior Staff Operations Engineer at Duolingo and is currently optimizing infrastructure for cost, stability, and compliance through standardization and automation. Before diving into the chaotic world of cloud computing, he managed application operations for multiple physical data centers in the healthcare sector. Max holds an MS and Certificate of Advanced Study in Telecommunications from the University of Pittsburgh, where he focused on network security and critical infrastructure.
Should an SRE Care About FinOps? Using Observability to Enable Resources Optimization
Rodrigo Serra, Itaú
We are going to share our whole FinOps journey on AWS with K8s, including how we achieved savings using open source tools.
Rodrigo Serra, Itaú
Rodrigo Serra is still a sysadmin and now an SRE focused on observability. He has 18 years of experience in technology working with big systems. He is a Principal SRE at Banco Itaú.
Looking at SRE Needs and Trends over Two Decades with a Single Service
Salim Virji, Google LLC and Murali Suriar, Snowflake
We all have experienced in our organisations the case where we build a quick solution to solve an immediate problem, and eventually find the software fulfilling other needs. This is the story of Chubby, Google's distributed lock service, and how it began as a mechanism to provide leader election for infrastructure and evolved rapidly to provide service discovery, config-file distribution, and other production-critical services.
During this talk, the presenter will explore the evolution and maturity of the field of Site Reliability Engineering through the lens of this specific piece of infrastructure software. The audience will hear foundational experiences with monitoring, caching, proxying, and isolation — and learn about our experiences, both good and bad. The audience will also hear suggestions for the direction that SRE practice will take in the near future.
Salim Virji, Google LLC
Salim Virji develops reliable engineering practices and processes for Google’s SRE organization, and has previously built distributed consensus and storage systems. Salim’s other interests include machine learning and composting.
15:30–16:00
Break with Refreshments
Level 3 Foyer
16:00–17:30
The Liffey
How to Make Your Automation a Better Team Player
Laura Nolan, Stanza
The role of SREs and other software operators has moved from making direct changes to running systems and configurations towards building tools and control planes that actually run our systems, or reusing automation systems built by others.
We haven't talked too much about this as a field, but it is a profound shift in the nature of our work. Now we need two skillsets: our traditional software and systems skills that we use to plan work and react to anomalies; plus a newer skillset that focuses on building and managing automated systems that make the overall system - software and humans - more reliable.
We are not yet good at building automation that is a good team player. This talk explores the problems in this domain and some ways that we can make progress.
Laura Nolan, Stanza
Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Laura is a Principal Engineer at Stanza, where she is building software to help humans understand and control their production systems. Laura is a member of the USENIX board of directors and a long-time SREcon volunteer. She lives in rural Ireland in a small village full of medieval ruins.
Dark Matter and Deep State: The Unseen Majority of Everything
Heidi Waterhouse, Waterhouse Consulting
We pay attention to new, novel, interesting, or surprising things. It's much harder for us to notice the ordinary and expected. This plays out in fields as diverse as government, astronomy, and SRE. I will talk through how important it is to pay attention to the space between the stars, the companies in flyover country, the laws that almost no one is upset about. If something is known and predictable, is it easy to automate or add to machine learning and predictive behaviors? What should we let the robots take care of, and what requires a human touch?
Heidi Waterhouse, Waterhouse Consulting
Heidi is an advocate for progressive delivery, organizational transformation, technical communication, and marketing you hate slightly less. She is a nerd about industrial psychology and patterns of progress. One of her favorite hobbies is talking to developers about things they already knew but had never thought of that way before. She sews all her conference dresses so that she's sure there is a pocket for the mic.
17:30–17:40
Closing Remarks
The Liffey
Effie Mouzeli, Wikimedia Foundation, and Vanessa Yiu, UBS