As a follow-up to Why Quality Automation Should Be Among Your 10 First Hires, I’d like to discuss why an SRE (Site Reliability Engineer) should be among your first 15 engineering hires.
With the evolution of cloud computing and SaaS technologies, combined with the practice of treating infrastructure as code (see Chef, Puppet, Salt, Ansible), the term DevOps has emerged as a way to view many aspects of operations as a software problem.
SRE is a position first created by Google to focus on site reliability challenges (which are many at Google’s scale). The list of responsibilities includes, but is not limited to, availability, performance, capacity planning, system design advice, monitoring, and alerting. In Google’s words: “In SRE, we flip between the fine-grained detail of disk driver I/O scheduling to the big picture of continental-level service capacity, across a range of systems and a user population measured in billions.”
If you are a SaaS company and have approximately 15 engineers, it’s likely that you are in your second or third year of existence. At this point, you likely have paying customers who demand a few nines of uptime, and the bar for site performance and resilience is high. Downtime costs more than money; it’s a knock on your credibility with customers that will take time to repair. Also, if your customers are not able to access your product, you may be in breach of contract, and those are never pleasant conversations to have.
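To make "a few nines" concrete, here is a minimal sketch of the arithmetic. The function name and the three targets are my own illustrative choices, not figures from any particular contract.

```python
# Yearly downtime that an availability target still permits.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed by the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets: two, three, and four nines.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten: roughly 5,256 minutes a year at 99%, about 526 at 99.9%, and under an hour at 99.99%.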
Engineering teams should lean on SREs at the beginning of product or feature design, before code reaches production. A good SRE can help formulate the Service Level Agreements (SLAs) and Service Level Indicators (SLIs) on which to model your code and solution (if those terms feel foreign to you, watch this excellent video).
These performance indicators can help engineers determine the best architectural model or design paradigm. For example, if the requirement is to ingest terabytes of data as quickly as possible (in, say, one minute), you can rule out technologies where such an operation would be prohibitively expensive, particularly those that make you pay for consistency and transactional guarantees you do not need.
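A back-of-the-envelope calculation shows how quickly a requirement like this narrows the options. The sketch below is only that: the 5 TB volume and the 0.5 GB/s per-node write rate are hypothetical numbers chosen for illustration, and the function names are mine.

```python
import math


def required_throughput_gb_per_s(data_terabytes: float, window_seconds: float) -> float:
    """Sustained ingest rate (GB/s) needed to land the data within the window."""
    gigabytes = data_terabytes * 1000  # decimal units: 1 TB = 1000 GB
    return gigabytes / window_seconds


def nodes_needed(throughput_gb_per_s: float, per_node_gb_per_s: float) -> int:
    """Smallest node count whose combined write rate covers the throughput."""
    return math.ceil(throughput_gb_per_s / per_node_gb_per_s)


# Hypothetical requirement: 5 TB ingested within one minute.
rate = required_throughput_gb_per_s(5, 60)
# Hypothetical datastore that sustains 0.5 GB/s of writes per node.
print(f"{rate:.1f} GB/s sustained, about {nodes_needed(rate, 0.5)} nodes")
```

Under these assumptions the requirement works out to roughly 83 GB/s and 167 nodes, which is the kind of number that removes a candidate technology from the list before any code is written.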
DevOps Engineer vs SRE?
In my view, SRE is a specialization within DevOps, not a competing function. In the early days, there was no real distinction between a DevOps engineer and an SRE.
A DevOps engineer typically focuses on provisioning a system or development environment. An SRE is more focused on the production environment. They will specialize in one or more production products, understand the runtime environment, engage in capacity planning, and manage the alerting system to monitor achievement of the previously defined SLAs and SLIs.
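As a rough illustration of what monitoring an SLI can look like, the sketch below computes an availability indicator from request counts and compares it with a target. The class, the function names, and the 99.9% target are assumptions made for the example; a real alerting system would pull these counts from its own metrics pipeline.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts observed over one measurement window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability_sli(counts: WindowCounts) -> float:
    """Fraction of requests in the window that succeeded."""
    if counts.total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return 1 - counts.failed_requests / counts.total_requests


def should_alert(counts: WindowCounts, target: float) -> bool:
    """True when the measured SLI has dropped below the agreed target."""
    return availability_sli(counts) < target


# 25 failures in 10,000 requests is 99.75% availability, below a 99.9% target.
print(should_alert(WindowCounts(10_000, 25), target=0.999))
```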
As your company grows, the two roles will diverge. DevOps engineers concentrate more on creation and testing, whereas SREs focus more on the running of systems. So an SRE will build a more predictable and debuggable environment and specialize in reducing MTTR (mean time to recovery) when problems occur.
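MTTR, read here as the mean time to recovery, is simple to compute once incidents are recorded with a start and a resolution time. The sketch below assumes that minimal record shape; the incident timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - started) across incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)


# Made-up incidents lasting 30, 60, and 90 minutes.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 15, 0)),
    (datetime(2024, 3, 20, 22, 0), datetime(2024, 3, 20, 23, 30)),
]
print(mean_time_to_recovery(incidents))
```

Tracking this number over time is one way to see whether the environment is actually becoming more predictable and debuggable.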
Bringing product and engineering teams together
Once again, if your company has 10 to 15 engineers, you are probably writing as much code as possible to ship features that delight customers with whatever your core differentiation is. This is a great time to bring product and engineering together and start formulating what your SLAs to your customers are and which Service Level Indicators you must satisfy as a company to meet those obligations.
Product teams will have a lot to say in the formulation of SLAs, as these directly impact customers and their user experience, and engineers will have to commit to and maintain such agreements. It’s the perfect time to collaborate, build a culture focused on the ultimate user experience, and make a habit of evaluating such indicators as a cross-team discipline. This is why I believe hiring an SRE earlier in your hiring plan will pay dividends in the long run.