Site Reliability Engineering Best Practices and Tools

September 14, 2018 

What is Site Reliability Engineering (SRE)?

The concept of Site Reliability Engineering has been around since 2003, when Ben Treynor founded Google's Site Reliability team. As Treynor describes it, SRE is what happens when a software engineer is asked to design an operations function.

The primary goal of SRE is to bridge the gap between development teams and sysadmin teams. SREs use software engineering techniques to keep the production environment stable and available while still delivering the new features and operational improvements that are needed.

SREs are software engineers with sysadmin knowledge who substitute automation for human labor and focus specifically on system reliability.


Key Site Reliability Engineering Skills

The skills required differ from organization to organization, because they depend largely on the type of application the organization runs and on how and where it is deployed and monitored. Beyond that, SREs need a strong focus on application monitoring and diagnostics. Apart from the specific technical skills that depend on an organization's practices, below are some non-technical and some basic technical skills to look for in an SRE -

Non-technical -

  • Problem solving
  • Teamwork
  • Working well under pressure
  • Translating technical matters into business language: knowing the technology and also how to present it to users or clients
  • Excellent written and verbal communication

Fundamental technical skills -

  • Version control
  • Knowledge of Linux (preferred)
  • Automating manual work
  • CI/CD knowledge
  • Troubleshooting

How Does SRE Work?

An SRE team is responsible for the availability, performance, effectiveness, emergency response, and monitoring of its services. In short, SRE is accountable for everything that keeps those services up and reliable for their users. To fulfill this, SREs work according to the following principles, which form the foundation of SRE -

  • Embracing Risk
  • Service Level Objectives
  • Eliminating Toil
  • Monitoring Distributed Systems
  • Automation
  • Release Engineering
  • Simplicity

Service-Level Objective (SLO)

Set a specific numerical target for system availability. This numerical value is called the Service-Level Objective (SLO); it defines a target level for the reliability of your service. A more reliable service costs more to operate, so the SLO should be set carefully.
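
To make the cost of each extra "nine" concrete, here is a minimal sketch that converts an availability SLO into the downtime it permits. The function name `allowed_downtime` and the 30-day window are illustrative assumptions, not part of any standard tool.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Return how much downtime fits in `window` while still meeting `slo_target`.

    `slo_target` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return window * (1.0 - slo_target)


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    for target in (0.99, 0.999, 0.9999):
        print(f"SLO {target:.2%}: {allowed_downtime(target, window)} of downtime allowed")
```

Over 30 days, a 99% target leaves about 7.2 hours of downtime, while 99.9% leaves about 43 minutes, which is why a tighter SLO is more expensive to operate.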

Service-Level Agreement (SLA)

An SLA is a promise to the users of a service that it will typically be available: the service's availability SLO should meet a certain level over a certain period, and if it fails to do so, some penalty applies, such as a refund. When defining the availability SLO in an SLA, be especially careful about which queries count as legitimate.

Service-Level Indicator (SLI)

An SLI is an indicator of the level of service actually provided. To know whether the system met its SLO over a past period, look at the SLI to get the measured service availability rate. If it falls below the defined SLO, there is a problem that must be solved: someone needs to make the system more available in some way.
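
As a rough illustration, an availability SLI can be computed as the share of successful requests and then compared with the SLO. The function names and request counts below are hypothetical; a real service would pull these numbers from its monitoring system.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully over the measured period."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


if __name__ == "__main__":
    sli = availability_sli(good_requests=999_412, total_requests=1_000_000)
    status = "met" if meets_slo(sli, slo_target=0.999) else "missed"
    print(f"SLI = {sli:.4%}, SLO {status}")
```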


Benefits of Site Reliability Engineering

  • Bridge the gap between developers and sysadmins.
  • Automate processes.
  • Continuously monitor and analyze application performance.
  • Plan and maintain operational runbooks.
  • Contribute to the overall product roadmap.
  • Manage on-call and emergency support.
  • Ensure software has good logging and diagnostics.

Why Does Site Reliability Engineering Matter?

A sysadmin is responsible for configuring complex computing systems, keeping them up, and keeping them reliable: performing all deployments, monitoring services, and responding when something goes wrong. According to Google's SRE book, this traditional approach creates a gap and conflict between developers and sysadmins because of their different skills. Developers want new features to reach users as frequently as possible, whereas sysadmins want to avoid breaking anything.


How To Adopt SRE?

Google was the first to embrace the SRE culture, but what works for Google may not work for every other organization. How SRE is adopted depends on factors such as organization size, the technology used, and culture. To adopt SRE, refer to the principles and practices that Google has defined and identify which of them the organization already follows.

Analysis -

  • Current situation of the organization
  • Challenges
  • Capabilities
  • Capability Gaps

Hiring the SRE team based on the analysis -

It is up to the organization to hire the most capable people, and only as many as the analysis shows are required.

Recommendations -

  • Which SRE principles are most suitable for your situation?
  • Which SRE practices are feasible for the organization?
  • Which can make the most significant difference, the soonest?
  • How can someone best integrate SRE into an organization’s culture?

Implementation -

  • How should one structure the SRE team?
  • What skill gaps have you got, and what are the best ways to fill them?
  • How can you bring the existing team up to speed with SRE?
  • What should you be looking for in new hires?

Site Reliability Engineering Best Practices

SRE is concerned with speed, performance, security, capacity planning, software and hardware upgrades, and availability, all of which contribute to the reliability that every organization wants to achieve.

Google has defined more than nine practices for SRE. They are grouped into categories below for easier understanding.

Controlling Operational Overload

Hire coders, because the primary duty of an SRE is to write code. About 5% of the ops work should go to the dev team, plus all overflow. Cap the SRE operational load at 50% (usually 30%). An on-call team should have a minimum of eight engineers at one location, handling a maximum of 2 events per shift.
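
The cap and the overflow rule can be expressed as simple arithmetic. The sketch below is only an illustration: the function names, the hours, and the per-shift event counts are made up, and the 50% cap and 2-event limit are taken from the figures above.

```python
OPS_LOAD_CAP = 0.50        # cap on the share of SRE time spent on operational work
MAX_EVENTS_PER_SHIFT = 2   # target maximum number of events per on-call shift


def ops_overflow_hours(ops_hours: float, total_hours: float,
                       cap: float = OPS_LOAD_CAP) -> float:
    """Hours of operational work above the cap, i.e. work to hand to the dev team."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return max(0.0, ops_hours - cap * total_hours)


def overloaded_shifts(events_per_shift: list[int],
                      limit: int = MAX_EVENTS_PER_SHIFT) -> list[int]:
    """Indexes of on-call shifts that handled more events than the limit."""
    return [i for i, events in enumerate(events_per_shift) if events > limit]


if __name__ == "__main__":
    # Hypothetical team: 400 working hours in the period, 230 of them on ops work.
    print("Overflow hours:", ops_overflow_hours(ops_hours=230, total_hours=400))
    print("Overloaded shifts:", overloaded_shifts([1, 3, 0, 2, 5]))
```

A non-zero overflow or a recurring overloaded shift is a signal to redirect work to the dev team or to invest in automation.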

SLA-Driven operation monitoring

  • Have an SLA for your service; it may vary for different services.
  • Measure and report performance against the SLA.
  • Use error budgets and gate launches on them.
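
One way to picture an error budget is as the number of failures the SLO still tolerates in the current period; launches are held back once it is spent. The following is a minimal sketch under that assumption. The `ErrorBudget` class, the `launch_allowed` helper, and the numbers are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float      # e.g. 0.999
    total_requests: int    # requests served in the current period
    failed_requests: int   # requests that violated the SLO

    def __post_init__(self) -> None:
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if not 0 <= self.failed_requests <= self.total_requests:
            raise ValueError("failed_requests must be between 0 and total_requests")

    @property
    def allowed_failures(self) -> float:
        """Number of failed requests the SLO tolerates in this period."""
        return self.total_requests * (1.0 - self.slo_target)

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative when overspent."""
        if self.allowed_failures == 0:
            return 1.0  # no traffic yet, so nothing has been spent
        return 1.0 - self.failed_requests / self.allowed_failures


def launch_allowed(budget: ErrorBudget, min_remaining: float = 0.0) -> bool:
    """Gate a launch: allow it only while some error budget remains."""
    return budget.remaining_fraction > min_remaining


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
    print("Launch allowed:", launch_allowed(budget))
```

The `min_remaining` parameter is a design choice: a team that wants a safety margin could require, say, 10% of the budget to remain before approving a launch.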

Ways to handle incidents and outages smoothly

  • Do a postmortem for every event.
  • Keep postmortems blameless: focus on process and technology, not on people.
  • Aim for a maximum of 2 events per on-call shift.

Management/budget policies that put value on operations

Hire your SREs and your developers from the same staffing pool and treat them all as developers. NOTE: Define the SLA, SLI, and SLO carefully and logically, since all the target values and agreements depend on them.


SRE Tools

Keep in mind that the SRE approach is not written in stone. It depends on the organization: look at which technologies the organization works with and adopt the required tools accordingly.