Florian Kammermann and Romain Bonjour, Swisscom
In 2021, our telco suffered several severe outages. This was our chance to introduce SRE practices; until then, the organization had followed a traditional ITIL operations model.
To simplify and scale the adoption of SRE practices, we came up with the idea of "Reliability Enhancing Procedures" (REPs): cookbook-style work instructions that teams can work through on their own to improve the reliability of their services. They are designed to scale across the organization and its teams.
The Reliability Enhancing Procedures are based on SRE practices, but also draw on other sources of inspiration. Over time we introduced nine Reliability Enhancing Procedures. All of these REPs have to be executed on our services (hundreds of them).
It may look like just a program to increase reliability, but in the end it is a huge transformation initiative that changes how we manage the reliability of our services, in an SRE style.
Florian Kammermann, Swisscom
Florian was a software and DevOps engineer for a long time. Two years ago he saw the opportunity to drive SRE adoption company-wide and took on the role of Reliability Enterprise Architect. With a heavy heart, he said goodbye to terminal and code and became an advocate for site reliability engineering practices and data-driven operations.
Romain Bonjour, Swisscom
Romain is currently leading the site reliability engineering (SRE) transformation of mobile network infrastructure at Swisscom, implementing best practices, tools, and teachings around the reliability of complex systems. He strongly believes in automated release engineering for both cloud-native systems and legacy technologies as the key enabler for faster releases of high-availability systems.
author = {Florian Kammermann and Romain Bonjour},
title = {Implementing {SRE} in a Telco with Reliability Enhancing Procedures},
year = {2023},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}