The human factor: How companies can prevent cloud disasters

Large companies work very hard to ensure that their services don’t degrade, and the reason is simple – significant failures damage your brand and drive customers to competing products with a better track record.

Building a reliable Internet service is a difficult technical problem, but it is also a human challenge for company leaders. Motivating engineering teams to invest in reliability work can be difficult because it is often perceived as less exciting than developing new features.

At scale, incentives dominate. The largest technology companies employ thousands of engineers and operate hundreds of web services. Over the years, they have developed clever ways to ensure that engineers build reliable systems. This article discusses human-engineering techniques that have proven successful at scale at some of the most successful technology companies in history. You can apply them in your organization, whether you are an employee or a leader.

Spin the wheel

The AWS Operational Review is a weekly meeting open to the entire company. At every meeting, a “wheel of fortune” is spun to pick a random AWS service out of hundreds for live review. The team being reviewed must answer pointed questions from experienced operational leaders about their dashboards and metrics. The meeting is attended by hundreds of employees, dozens of directors, and several vice presidents.

This motivates every team to maintain a basic level of operational competence. Even if the probability of a particular team being chosen is low (less than 1% at AWS), as a manager or technical lead you really don’t want to look clueless in front of half the company on the day your luck runs out.

It is important to review reliability metrics frequently. Leaders who take an active interest in operational health set the tone for the entire organization. Spinning the wheel is just one tool to achieve this.

But what do you do during these operational reviews? This brings us to the next point.

Define measurable reliability goals

You may say you want “high availability” or “five nines”, but what does that actually mean for your customers? Latency tolerance for live interactions (chat) is much lower than for asynchronous workloads (training a machine learning model, streaming video). Your goals should reflect what your customers care about.

When reviewing your team’s metrics, ask them to explain their measurable reliability goals. Make sure you understand – and they understand – why those goals were chosen. Then ask them to use their dashboards to prove that the goals are being met. Having measurable goals will help you prioritize reliability work in a data-driven way.
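To make this concrete, here is a minimal sketch of what a measurable availability goal can look like in practice. The 99.9% SLO target and the request counts are illustrative assumptions, not figures from any particular company:

```python
# A measurable reliability goal: check a service's availability over a
# review window against a hypothetical 99.9% SLO, and report how much of
# the "error budget" (failures the SLO tolerates) has been spent.

def availability(successful: int, total: int) -> float:
    """Fraction of requests served successfully over the window."""
    return successful / total if total else 1.0

def error_budget_remaining(successful: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative = SLO breached)."""
    budget = (1.0 - slo) * total   # failures the SLO allows this window
    spent = total - successful     # failures actually observed
    return (budget - spent) / budget if budget else 0.0

# Example: 10,000,000 requests this month, 9,995,000 of them succeeded.
ok, total, slo = 9_995_000, 10_000_000, 0.999
print(f"availability: {availability(ok, total):.4%}")                 # 99.9500%
print(f"budget left:  {error_budget_remaining(ok, total, slo):.0%}")  # 50%
```

Framing goals as an error budget, rather than a raw percentage, gives teams a data-driven way to decide when to pause feature work and invest in reliability.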

It’s also a good idea to focus on spotting problems. If you notice an anomaly in a team’s dashboards, ask for an explanation, but also ask whether their on-call team was notified of the problem. Ideally, you should know something is wrong before your customers do.
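One common way to “know before your customers do” is to page the on-call engineer when the recent error rate burns through the error budget faster than a sustainable pace. The sketch below assumes a 99.9% SLO and a 10x burn-rate threshold; both numbers are illustrative:

```python
# Page the on-call engineer when the short-window error rate consumes the
# error budget more than `max_burn_rate` times faster than the SLO allows.

def should_page(errors: int, requests: int, slo: float,
                max_burn_rate: float = 10.0) -> bool:
    """True if recent errors burn the budget faster than the threshold."""
    if requests == 0:
        return False
    error_rate = errors / requests
    budget_rate = 1.0 - slo  # error rate the SLO tolerates long-term
    return error_rate > max_burn_rate * budget_rate

# With a 99.9% SLO and a 10x threshold, error rates above 1% trigger a page:
print(should_page(errors=150, requests=10_000, slo=0.999))  # True  (1.5%)
print(should_page(errors=5, requests=10_000, slo=0.999))    # False (0.05%)
```

The burn-rate framing keeps alerts tied to the goal itself: a brief error spike that barely dents the budget stays quiet, while anything that threatens the SLO pages immediately.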

Embrace the chaos

One of the most revolutionary changes in cloud resiliency thinking is the concept of injecting faults into production. Netflix formalized this idea as “chaos engineering” — and the idea is as cool as the name suggests.

Netflix wanted to encourage its engineers to build fault-tolerant systems without resorting to micromanagement. They concluded that if system failure becomes the norm rather than the exception, engineers have no choice but to build failure-tolerant systems. It took some time to get there, but at Netflix, everything from individual servers to entire availability zones is routinely knocked out of production. Each service is expected to absorb such failures automatically without affecting service availability.

This strategy is expensive and complex. However, if you are shipping a product for which high availability is an absolute necessity, injecting failures into production is a very effective way to achieve a kind of “proof of concept.” If your product needs it, introduce it as early as possible. It will never be simpler or cheaper than it is today.
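The core mechanism is simple enough to sketch. The following toy example, loosely inspired by Netflix’s Chaos Monkey (the instance records, field names, and 5% kill probability are all invented for illustration), randomly “terminates” instances in a fleet; in a real deployment the termination would be a call to your cloud provider’s API:

```python
import random

def inject_failure(instances, probability=0.05, rng=None):
    """Randomly 'terminate' healthy instances; return the ids of the victims."""
    rng = rng or random.Random()
    victims = []
    for inst in instances:
        if inst["healthy"] and rng.random() < probability:
            inst["healthy"] = False  # in real life: call the cloud API here
            victims.append(inst["id"])
    return victims

# A fleet of 100 instances; seed the RNG so a drill is reproducible.
fleet = [{"id": f"i-{n:04d}", "healthy": True} for n in range(100)]
killed = inject_failure(fleet, probability=0.05, rng=random.Random(42))
print(f"terminated {len(killed)} of {len(fleet)} instances")
```

The point is not the code but the constant pressure it creates: when any instance can disappear at any moment, redundancy and automatic recovery stop being optional.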

If chaos engineering seems like overkill, you should at least require your teams to run “game days” (simulated outage drills) once or twice a year, or before any major feature launch. During a game day, three roles are assigned – the first person simulates a failure, the second fixes it without knowing in advance what is broken, and the third observes and takes detailed notes. The whole team should then come together and conduct a post-mortem on the simulated incident (see below). A game day will expose gaps not only in how your systems handle failures, but also in how your engineers handle them.

Conduct rigorous post-mortems

An organization’s post-mortems reveal a lot about its culture. Every leading technology company requires teams to write post-mortems after serious outages. The report should describe the incident, investigate its root causes, and identify preventive actions. Post-mortems should be rigorous and held to a high standard, but under no circumstances should they single out specific people as guilty. A post-mortem is a corrective exercise, not a punitive one. If an engineer made a mistake, there are underlying problems that allowed that mistake to happen. Perhaps you need better testing or better safeguards around critical systems. Analyze those systemic vulnerabilities and fix them.

Designing a robust post-mortem process could be the topic of its own article, but it’s safe to say that having one in place will go a long way toward preventing future downtime.

Reward reliability work

If engineers feel that only new features lead to raises and promotions, reliability work will fall by the wayside. Most engineers should contribute to operational excellence, regardless of seniority. Reward reliability improvements in performance reviews. Hold senior engineers accountable for the stability of the systems they oversee.

While this advice may seem obvious, it is surprisingly easy to overlook.

Conclusion

In this article, we have covered some essential tools for embedding reliability into your organization’s culture. Startups and early-stage companies often don’t treat reliability as a priority. This is understandable – your fledgling company must obsess over proving product-market fit to ensure survival. But once you develop a base of returning customers, the future of your business depends on maintaining their trust. People earn trust by being reliable. The same is true of web services.
