They Wrote The Books On It

Primary is backing a team of industry experts to reinvent SRE as we know it. Platform engineering is critical for large organizations, yet it remains inaccessible to most teams.

BY Brian Schechter

BY Tobias Citron

It's not every day that you get to back a team of founders who wrote the books on a massive and growing market.

In doing our diligence on Stanza, time and time again, we met engineering leaders who knew Niall Murphy and his team. When they heard that the team was building software to standardize best practices for SRE, they wanted to get in line.

So what is SRE, and why does it represent a massive opportunity? SRE (site reliability engineering) drives revenue. Think about Robinhood delivering its app to a customer: does the app load in 5 seconds or 5 milliseconds? Does it crash one time in 10 or one time in 1,000? The answers to these questions determine whether a customer chooses Robinhood or a competing app, and it’s the job of SRE to make the application, which relies on hundreds of interdependent microservices, databases, and servers, as fast, reliable, and secure as possible.

Google created the first SRE team in 2003, and the practice has since become a standard across tech. SRE also works: change failure rates fall by more than 2x after companies implement it. However, as infrastructure becomes more complex, avoiding downtime and security issues is harder than it’s ever been, meaning SRE has an increasingly large role to play. LinkedIn reported that SRE job openings increased over 70% in 2019.

Unfortunately, the true promise of SRE remains elusive for most software engineering teams. Over 90% of engineering organizations rely on ad-hoc, as-needed reliability solutions. That is because hiring SREs is prohibitively expensive for startups, so the work of an SRE gets handed off, suboptimally, to other engineers on the team. Overall, even though SRE is powerful, it’s generally inaccessible and underutilized, and the organizational costs are massive. System downtime costs close to $6,000 per minute. Across hundreds of applications and systems within a scaled organization, that can become a massive line item.

Enter Niall Murphy, who was previously Global Head of Azure SRE at Microsoft. Before joining Microsoft, Murphy worked on SRE at Google, where he wrote the book on SRE. Niall is starting Stanza to offer a new type of SRE-as-Software to the long tail of engineering organizations.

He is joined by old friends and colleagues from his days at Google and Azure. Blake Bisset was a leader of SRE at Azure after working with Niall at Google and taking a detour at Dropbox as Global Head of Reliability Engineering. Other former colleagues like Maggie Johnson-Pint and Joseph Bironas round out a founding team that couldn’t be more ideal for this startup if you drew it up from scratch.

There are now close to 30 million software developers worldwide, roughly the population of Texas. Reinventing SRE to make it accessible to a group that large is a big swing. We are talking about new code libraries, built-out observability tooling, and so much more to embed SRE into software development in a totally new way.

However, if anyone can pull it off, it’s this team. Stanza is on a mission to make reliability not just the privilege of the richest, biggest tech companies in the world, but a standard accessible to all companies building software. We couldn’t be more excited to be along for the ride.
