How you react when your systems fail may define your business

Just around 9:45 a.m. Pacific Time on February 28, 2017, websites like Slack, Business Insider, Quora and other well-known destinations became inaccessible. For millions of people, the internet itself seemed broken.

It turned out that Amazon Web Services was having a massive outage involving S3 storage in its Northern Virginia datacenter, a problem that created a cascading impact and culminated in an outage that lasted four agonizing hours.

Amazon eventually figured it out, but you can only imagine how stressful it might have been for the technical teams who spent hours tracking down the cause of the outage so they could restore service. A few days later, the company issued a public post-mortem explaining what went wrong and which steps they had taken to make sure that particular problem didn’t happen again. Most companies try to anticipate these types of situations and take steps to keep them from ever happening. In fact, Netflix came up with the notion of chaos engineering, where systems are tested for weaknesses before they turn into outages.

Unfortunately, no tool can anticipate every outcome.

It’s highly likely that your company will encounter a problem of immense proportions like the one that Amazon faced in 2017. It’s what every startup founder and Fortune 500 CEO worries about — or at least they should. What will define you as an organization, and how your customers will perceive you moving forward, will be how you handle it and what you learn.

We spoke to a group of highly-trained disaster experts to learn more about preventing these types of moments from having a profoundly negative impact on your business.

It’s always about your customers

Reliability and uptime are so essential to today’s digital businesses that enterprise companies developed a new role, the Site Reliability Engineer (SRE), to keep their IT assets up and running.

Tammy Butow, principal SRE at Gremlin, a startup that makes chaos engineering tools, says the primary role of the SRE is keeping customers happy. If the site is up and running, that’s generally the key to happiness. “SRE is generally more focused on the customer impact, especially in terms of availability, uptime and data loss,” she says.

Companies measure uptime according to the so-called “five nines,” or 99.999 percent availability, but software engineer Nora Jones, who most recently led Chaos Engineering and Human Factors at Slack, says there is often too much of an emphasis on this number. According to Jones, the focus should be on the customer and the impact that availability has on their perception of you as a company and your business’s bottom line.

Someone needs to be calm and just keep asking the right questions.

“It’s money at the end of the day, but also over time, user sentiment can change [if your site is having issues],” she says. “How are they thinking about you, the way they talk about your product when they’re talking to their friends, when they’re talking to their family members. The nines don’t capture any of that.”

Robert Ross, founder and CEO at FireHydrant, an SRE as a Service platform, says it may be time to rethink the idea of the nines. “Maybe we need to change that term. Maybe we can popularize something like ‘happiness level objectives’ or ‘happiness level agreements.’ That way, the focus is on our products.”

When things go wrong

Companies go to great lengths to prevent disasters to avoid disappointing their customers and usually have contingencies for their contingencies, but sometimes, no matter how well they plan, crises can spin out of control. When that happens, SREs need to execute, which takes planning, too; knowing what to do when the going gets tough.

FireHydrant lands $1.5M seed investment to bring order to IT disaster recovery

FireHydrant, a NYC startup wants to help companies recover from IT disasters more quickly, and understand why they happened with the goal of preventing similar future scenarios from happening again. Today, the fledgling startup announced a $1.5 million seed investment from Work-Bench, a New York City venture capital firm that invests in early stage enterprise startups.

In addition to the funding, the company announced it was opening registration for its FireHydrant incident management platform. The product has been designed with Google’s Site Reliability Engineering (SRE) methodology in mind, but company co-founder and CEO Bobby Ross says the tool is designed to help anyone understand the cause of a disaster, regardless of what happened, and whether they practice SRE or not.

“I had been involved in several fire fighting scenarios — from production databases being dropped to Kubernetes upgrades gone wrong — and every incident had a common theme: ​absolute chaos​,” Ross wrote in a blog post announcing the new product.

The product has two main purposes, according to Ross. It helps you figure out what’s happening as you attempt to recover from an on-going disaster scenario, and once you’ve put out the fire, it lets you do a post-mortem to figure out exactly what happened with the hope of making sure that particular disaster doesn’t happen again.

As Ross describes it, a tool like PagerDuty can alert you that there’s a problem, but FireHydrant lets you figure out what specifically is going wrong and how to solve it. He says that the tool works by analyzing change logs, as a change is often the primary culprit of IT incidents. When you have an incident, FireHydrant will surface that suspected change, so you can check it first.

“We’ll say, hey, you had something change recently in this vicinity where you have an alert going off. There is a high likelihood that this change was actually causing your incident. And we actually bubble that up and mark it as a suspect,” Ross explained.

Screenshot: FireHydrant

Like so many startups the company developed from a pain point the founders were feeling. The three founders were responsible for solving major outages at companies like Namely, DigitalOcean, CoreOS, and Paperless Post.

But the actual idea for the company came about almost accidentally. In 2017, Ross was working on a series of videos and needed a way to explain what he was teaching. “I began writing every line of code with live commentary, and soon FireHydrant started to take the shape of what I envisioned as an SRE while at Namely, and I started to want it more than the video series. 40 hours of screencasts recorded later, I decided to stop recording and focus on the product…,” Ross wrote in the blog post.

Today it integrates with PagerDuty, Github and Slack, but the company is just getting started with the three founders, all engineers, working on the product and a handful of Beta customers. It is planning to hire more engineers to keep building out the product. It’s early days, but if this tool works as described, it could go a long way toward solving the fire fighting issues that every company faces at some point.

Amazon reportedly acquired Israeli disaster recovery service, CloudEndure for around $200M

Amazon has reportedly acquired Israeli disaster recovery startup, CloudEndure. Neither company has responded to our request for confirmation, but we have heard from multiple sources that the deal has happened. While some outlets have been reporting that the deal was worth $250 million, we are hearing that it’s closer to $200 million.

The company provides disaster recovery for cloud customers. You may be thinking that disaster recovery is precisely why we put our trust in cloud vendors. If something goes wrong, it’s the vendor’s problem, and you would be right to make this assumption, but nothing is simple. If you have a hybrid or multi-cloud scenario, you need to have ways to recover your data in the event of a disaster like weather, a cyber attack or political issue.

That’s where a company like CloudEndure comes into play. It can help you recover and get back and running in another place, no matter where your data lives, by providing a continuous backup and migration between clouds and private data centers. While CloudEndure currently works with AWS, Azure and AWS, it’s not clear if Amazon would continue to support these other vendors.

The company was backed by Dell Technologies Partners, Infosys and Magma Venture Partners, among others. Ray Wang, founder and principal analyst at Constellation Research, says Infosys recently divested its part of the deal and that might have precipitated the sale. “So much information is sitting in the cloud that you need backups and regions to make sure you have seamless recovery in the event of a disaster,” Wang told TechCrunch.

While he isn’t clear what Amazon will do with the company, he says it will test just how open it is. “If you have multi-cloud and want your on-prem data backed up, or if you have backup on one cloud like AWS and want it on Google or Azure, you could do this today with Cloud Endure,” he said. “That’s why i’m curious if they’ll keep supporting Azure or GCP,” he added.

CloudEndure was founded in 2012 and has raised just over $18 million. It most recent investment came in 2016 when it raised $6 million led by Infosys and Magma.