As more and more firms turn to the cloud to push out services across the globe, there is an increasing focus on the tools, technologies, and practices that facilitate the process. Systems must increasingly bear the weight of new customers and data integration while also maintaining a consistent level of service. To do this effectively, it is vital that the systems responsible for the service are monitored and that failures are mitigated through intelligent assessment of alerts and incidents.
The prediction and detection of complex failure conditions is a lofty task for IT operations teams, but emerging DevOps-related disciplines and practices are lighting the path of progress. Two key developments are the discipline of site reliability engineering
(SRE) and the uptake of AIOps.
The responsibility of SREs
When scaling up, service providers must prioritise uptime as the reliability of the service is, of course, essential to customers. But scaling up means systems become more complex, which raises the likelihood of incidents that could bring down the service.
While one alert in production may not reveal a serious fault, a multitude of alerts across multiple systems and networks may indicate that a more critical failure is imminent. Understanding this, and reacting accordingly, is a core principle of site reliability engineering.
Site reliability engineers accept that failure is likely to occur. They differ from traditional IT engineers in that they come from a software development background: they possess the skills to engineer failure out of systems, rather than simply minimising the risk of failure through mitigating changes in production. As a result, SREs are increasingly being placed alongside or within engineering teams, with ownership of services falling to them rather than to external operations teams.
The principles of SRE are drawn from industries in which the failure of systems has critical implications, such as in aerospace. Just as aeronautical engineers build more resilient airplanes using the data from black boxes, SREs create more resilient systems
through analysing what happened before, during, and after critical incidents in production.
Minimising failure is the aim of the game. The most effective way to do this is through the intelligent correlation of alerts created by monitoring tools to determine causal relationships. This is where the implementation of AIOps is becoming increasingly important.
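As a minimal sketch of what alert correlation can look like in practice, the snippet below groups alerts from the same service that arrive within a short time window into a single incident. The window length, services, and alert tuples are illustrative assumptions, not drawn from any specific AIOps product.

```python
# Toy time-window alert correlation. Alerts are (timestamp_seconds, service,
# message) tuples; alerts on the same service within WINDOW_SECONDS of each
# other are merged into one incident. All values are illustrative.
from collections import defaultdict

WINDOW_SECONDS = 60

def correlate(alerts):
    """Group alerts by service, then split each group into incidents
    whenever the gap between consecutive alerts exceeds the window."""
    by_service = defaultdict(list)
    for ts, service, message in sorted(alerts):
        by_service[service].append((ts, message))

    incidents = []
    for service, events in by_service.items():
        current = [events[0]]
        for prev, ev in zip(events, events[1:]):
            if ev[0] - prev[0] <= WINDOW_SECONDS:
                current.append(ev)
            else:
                incidents.append((service, current))
                current = [ev]
        incidents.append((service, current))
    return incidents

alerts = [
    (0, "checkout", "latency high"),
    (20, "checkout", "error rate up"),
    (500, "checkout", "latency high"),
    (10, "search", "timeouts"),
]
for service, events in correlate(alerts):
    print(service, len(events))
```

Real correlation engines also weigh topology and causal links between services, but even this simple grouping turns four raw alerts into three incidents for an engineer to triage.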
Automate to remediate
It is the role of SREs to determine the next steps once incident data has been analysed, but as systems increase in complexity and scale, this task becomes much more difficult. To assist in this effort, AIOps tools are increasingly employed to deal with big data: automating responses to alerts and, as the algorithms mature, remediating incidents outright.
The less context delivered with an alert, the more time engineers will need to resolve the issue, as the manual remediation process will likely involve interaction with other teams. Traditional operations teams do not have an end-to-end understanding of the applications that are live in the production environment. This means they will likely need to sit down with members of the development team who built the application that is the source of an alert, which of course extends the lead time of the remediation process. SREs, however, do have this understanding, which means they can focus on building in resilience rather than investigating alerts.
SREs can also build in automation for simpler tasks within a system when it comes to monitoring and fault prevention. Setting alert thresholds for key metrics, for example, is a simple way to automate responses. But to enable more sophisticated detection and remediation, more advanced techniques must be utilised.
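Threshold alerting of the kind described above can be sketched in a few lines. The metric names and limits below are illustrative assumptions; in production these would live in a monitoring system's rule configuration rather than in application code.

```python
# A minimal static-threshold alerter. Each metric in a sample is compared
# against a fixed limit; breaches produce alert strings. Metric names and
# thresholds are illustrative assumptions.
THRESHOLDS = {
    "cpu_percent": 90.0,
    "p99_latency_ms": 500.0,
    "error_rate": 0.05,
}

def check(sample):
    """Return an alert string for each metric that breaches its threshold."""
    return [
        f"ALERT: {metric}={value} exceeds {THRESHOLDS[metric]}"
        for metric, value in sample.items()
        if metric in THRESHOLDS and value > THRESHOLDS[metric]
    ]

print(check({"cpu_percent": 95.2, "p99_latency_ms": 120.0, "error_rate": 0.01}))
```

The weakness of this approach is exactly what the article goes on to describe: static thresholds fire on every individual breach, with no sense of which combinations of alerts actually signal an impending failure.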
Human decisions for business outcomes
By applying data collection, data modelling and data analytics techniques, and using machine learning algorithms to establish patterns, it is possible to cut through the cacophony of alerts produced across systems and automate more complex remediations.
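One small piece of that pattern-establishing work is grouping near-duplicate alerts so that a storm of similar messages reads as one signal. The sketch below is a deliberately simplified stand-in for real machine-learning clustering, using plain string similarity from Python's standard library; the similarity threshold and alert messages are illustrative assumptions.

```python
# Toy alert clustering by textual similarity: a (much simplified) stand-in
# for ML-based pattern recognition over alert streams. Each message joins
# the first cluster whose representative is similar enough, else starts a
# new cluster. Threshold and messages are illustrative.
from difflib import SequenceMatcher

SIMILARITY = 0.7

def cluster(messages):
    """Greedily group messages whose similarity ratio meets the threshold."""
    clusters = []
    for msg in messages:
        for group in clusters:
            if SequenceMatcher(None, group[0], msg).ratio() >= SIMILARITY:
                group.append(msg)
                break
        else:
            clusters.append([msg])
    return clusters

alerts = [
    "disk usage 91% on node-1",
    "disk usage 94% on node-2",
    "connection refused from payments-db",
    "disk usage 97% on node-3",
]
for group in cluster(alerts):
    print(len(group), group[0])
```

Here four raw alerts collapse into two patterns: a spreading disk-usage problem and a database connectivity fault, which is the kind of noise reduction that makes automated remediation tractable.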
But algorithms are only as good as the data that feeds them. Putting in place the right telemetry tools is vital to get this data, but determining where to divert resources once alerts and incidents have been analysed is still very much a business decision.
Providing a good service means fulfilling the service-level objectives (SLOs) agreed with customers, which guarantee a set amount of uptime. While traditional operations professionals might aim for maximum uptime, SREs will be more inclined to stress a service once an SLO has been fulfilled. In short, if a certain amount of downtime is permissible, SREs might use this quota to push change through to production at a higher velocity, risking failure to gain valuable insights.
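The downtime quota implied by an uptime SLO is a simple calculation, sketched below. The 99.9% figure and 30-day window are illustrative assumptions, not figures from the text.

```python
# How much downtime an uptime SLO permits per period. The SLO percentage
# and 30-day window below are illustrative assumptions.
def downtime_budget_minutes(slo_percent, period_days=30):
    """Minutes of permissible downtime per period for a given uptime SLO."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

budget = downtime_budget_minutes(99.9)
print(round(budget, 1))  # roughly 43.2 minutes over a 30-day window
```

It is this remaining quota that an SRE might deliberately spend on faster, riskier releases while the SLO is still comfortably met.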
Finding failure is a critical part of SRE, but not all failure is critical. Alerts flag issues, but the vast majority of alerts do not indicate an immediate threat to production. Alerts that relate to latency, for example, will be frequent, but this is to
be expected. Also, owing to SLOs, services are able to accommodate some latency issues. The most important thing for users is that the service is still running.
There is no perfect system. In a business context, a service that never fails would require infinite resources to monitor and maintain. SREs work towards perfection knowing it is unattainable, but AIOps tools and practices, combined with advanced monitoring, can get them close.