Rapid response: How we fixed our on call process to avoid engineer burnout

Intercom’s mission is to make internet business personal. But it’s impossible to be personal when your product is broken. Uptime is critical to the success of our business, not just because our customers are paying us, but also because we heavily dogfood our own product. If our product is down, we acutely feel our customers’ pain.

Uptime is influenced by many factors, such as the software architecture and the quality of day to day operations. However, quite often it comes down to having a human on call, responding to an alert from PagerDuty. On call work like this can be a powerful customer-oriented activity that connects engineers to the value customers get from your product. It can also be a great learning and growth opportunity – after all, outages and errors can be complex events to understand and remediate.

But at the same time, being on call out of office hours is inherently disruptive to your life. You need to be ready to respond quickly and competently to an alert about something being broken. Even without being paged, being on call creates anxiety – I know from personal experience that it is very disruptive to sleep, even if nothing actually breaks. Being on call regularly can lead to burnout, apathy or a general desire to never see a computer again.

The history of on call at Intercom

Back in the early days of Intercom, our CTO Ciaran was the entirety of the on call team, both in and out of the office. As Intercom grew, we built an operations team to help Ciaran out. Soon after, new teams started building a lot of new features and services, and they assumed full on call responsibilities.

This felt natural at the time, as it was a lightweight way to scale our on call team and was consistent with our values that emphasized the importance of ownership. Without deliberately planning it, we had ended up with four or five teams that were regularly being paged out of hours. The remaining product teams had a handful of alarms that rarely, if ever, resulted in an out of hours page.

We realized that we had ended up with an on call setup we weren’t proud of, and had a number of critical problems that we wanted to solve, such as:

  • There were too many people on call at any moment in time – our infrastructure wasn’t so large that it required at least five engineers to have their weekends disrupted.
  • The quality of our alarms and on call procedures was inconsistent across teams, and we were using ad hoc review processes for new and existing alarms. Runbooks (the procedures to follow when an alarm fires) were mostly conspicuous by their absence.
  • There were inconsistent expectations for engineers depending on which team they ended up working on. For example, only the original operations team had any form of compensation for doing on call shifts other than time off in lieu.
  • There appeared to be a general level of tolerance for unnecessary out of hours pages.
  • Finally, it doesn’t always suit everybody to do this type of work. Life circumstances can mean that on call shifts are just too disruptive for some people.

Finding the right on call process

We decided to create a new virtual team that would take all out of hours on call work from every team. The team would consist of volunteers, not conscripts, from any team in the engineering organization. Engineers would rotate out of the virtual team after six months or so, having done a handful of weeks on call. Thankfully, we had no problems getting enough volunteers to start the virtual team.

The team then agreed upon and defined what acceptable alarms and runbooks look like, and described an acceptance process for moving alarms over to the new on call team. They defined all our alarms in code using a Terraform module, and started using peer review for every change. We put in place a level of compensation that we were happy with for taking a week’s worth of on call shifts. We also created a “Level 2” escalation team made up of engineering managers to be a single point of escalation for the on call engineer.
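An acceptance process like the one described above could be partly automated. The following is a minimal sketch in Python, not Intercom's actual tooling: the post does not list the acceptance criteria, so the three checks (an owning team, a runbook and a peer reviewer), the `Alarm` fields and the example alarms are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm record; the field names are illustrative only."""
    name: str
    owning_team: str
    runbook_url: str = ""
    reviewed_by: str = ""


def acceptance_problems(alarm: Alarm) -> list[str]:
    """Return the reasons an alarm is not ready to hand over (empty list = ready)."""
    problems = []
    if not alarm.owning_team.strip():
        problems.append("no owning team")
    if not alarm.runbook_url.strip():
        problems.append("no runbook")
    if not alarm.reviewed_by.strip():
        problems.append("not peer reviewed")
    return problems


# Made-up example alarms: one complete, one still missing a runbook and a review.
ready = Alarm(
    name="queue-depth-high",
    owning_team="messaging",
    runbook_url="https://wiki.example/runbooks/queue-depth-high",
    reviewed_by="a.reviewer",
)
draft = Alarm(name="disk-space-low", owning_team="datastores")

print(acceptance_problems(ready))  # []
print(acceptance_problems(draft))  # ['no runbook', 'not peer reviewed']
```

A check of this kind could run alongside peer review, so that an alarm without a runbook is flagged before it can page the out of hours team.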

It took a few months of hard work to iron out our process, but our on call went from being spread across more than 30 engineers to just 6 or 7. Our engineering teams still do on call for their features and services during office hours, which is when things tend to break the most, but our out of office on call is fully owned by a dedicated set of volunteers.

What we learned

After we launched our virtual on call team, we expected a large amount of follow-up work after an on call shift, such as researching the causes of alarms and collaborating on solving the problem that caused the page. However, our engineering teams took very strong ownership of anything that caused a page, and any follow-up generally has had prompt action. We also haven’t had to threaten the nuclear option – handing an alarm back to the team where it originated due to lack of follow-up and forcing them to do out-of-hours on call again.

Our formal escalation process has rarely been used. More commonly, the on call engineer is informally helped out by engineers who happen to be online at that time, notably those in our San Francisco office. Numerous problems have been repaired or mitigated through on the fly collaboration and teamwork.

Engineers in our San Francisco office have joined the team fully and gone beyond just ad hoc support. There is a degree of added overhead, but spreading team membership across multiple offices has been very successful for us – it’s a great way to build relationships and depth of knowledge about the technology stack which we all work on.

The experience of being an engineer in Intercom is now way more consistent across our teams, and we can confidently advertise Systems Engineer positions on our Careers website stating that there is no on call required, unless you really want to do it.

Along with foundational work stabilizing and scaling out our datastores, the consistent focus on resolving pages has resulted in the number of out-of-hours pages dropping to fewer than 10 a month, a number we’re very proud of.
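A monthly figure like this can be tracked with a small script. The sketch below is illustrative only: the post does not define office hours, so the 09:00 to 18:00 weekday window is an assumption, the timestamps are made-up sample data, and times are assumed to already be in the on call engineer's local timezone.

```python
from collections import Counter
from datetime import datetime

# Assumed office hours; the post does not define them.
OFFICE_START_HOUR = 9
OFFICE_END_HOUR = 18


def is_out_of_hours(ts: datetime) -> bool:
    """True for weekends and for weekday times outside the assumed office hours."""
    if ts.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return not (OFFICE_START_HOUR <= ts.hour < OFFICE_END_HOUR)


def out_of_hours_pages_per_month(pages: list[datetime]) -> dict[str, int]:
    """Count out of hours pages, keyed by "YYYY-MM"."""
    counts = Counter(ts.strftime("%Y-%m") for ts in pages if is_out_of_hours(ts))
    return dict(counts)


# Made-up sample page timestamps.
pages = [
    datetime(2024, 3, 4, 2, 15),   # Monday 02:15, out of hours
    datetime(2024, 3, 5, 11, 0),   # Tuesday 11:00, office hours
    datetime(2024, 3, 9, 14, 30),  # Saturday 14:30, out of hours
    datetime(2024, 4, 1, 23, 5),   # Monday 23:05, out of hours
]

print(out_of_hours_pages_per_month(pages))  # {'2024-03': 2, '2024-04': 1}
```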

We are continuing to work to maintain and improve our on call team, and as Intercom grows we may need to revisit the decisions – what works today may not work the next time our team doubles in size. That said, this work has been very positive for our engineering organization, significantly improving our engineers’ quality of life, the quality of our on call response and, above all, the customer experience.

If this sounds like the sort of environment that you would enjoy, we are actively hiring – check out our openings.

Careers at Intercom

The post Rapid response: How we fixed our on call process to avoid engineer burnout appeared first on Inside Intercom.

Why your engineering processes need to solve real problems

I came to Intercom from a company with a culture of heavyweight engineering processes. It was a well-oiled machine with battle-tested and often updated procedures.

From an engineering perspective, it successfully kept you focused on coding. Tasks were always well-described in Jira, with clearly defined expectations. Designs came in and were exported to HTML so you didn’t have to worry about using Sketch. You did your job, then moved the task to QA. If something came back, it was always with a good description of what wasn’t working.

When I started at Intercom, however, I was surprised at how lightweight the weekly engineering processes felt compared to my previous company. No estimations. No Jira. No separate QA team. Initially, I felt overwhelmed. I wondered why it looked this way, why everyone just aligned and no one tried to structure the processes as I was used to.

The main reason is that in both of these companies, there were different problems to solve, even though it looked similar on the surface. Intercom is very much a product-first company, and very heavyweight processes can be too much of a constraint in a product-first company. In this sort of environment, the processes have to serve the development of the product, rather than the product developing out of predetermined processes.

At Intercom, we have a very strong culture of solving the right problems. We are ruthless in defining what the true problem is, how we can solve it using a small, well-scoped project (or a cupcake, as we like to call them) and what it might eventually look like if the cupcake proves to be successful. In short, we ask what the problem is and how you will measure that it’s solved. And we don’t just use this approach when working on our products – we try to apply the same approach whenever we want to add new processes or adjust existing ones.

The subconscious benefit of processes

In any organization, processes are important and beneficial. They streamline the workflows, help people make fewer mistakes and bring some degree of comfort – having a good set of processes can create the sense that work has already begun to proceed.

In this way, processes are usually comfortable, in the sense that they are institutional habits. We are stretched in our jobs already, so work that is aligned to a process is similar to a habit. The process is already de-risked and is thoroughly thought through, and ideally has a proven track record of successes. It removes a lot from your plate to let you focus on what’s important. It’s compelling to have less on your plate, right?

Solving the problem you have

Whenever you are designing a new process, the most important and the hardest part will be to clearly define the problem you are attempting to solve. It’s crucial not to skip this step. If you don’t identify the problem clearly, then you need to ask yourself why you are even starting. Proceeding without a clearly defined problem can be a sign of a worrying tendency for bureaucracy – and this can often be the first step towards alienating your best people.

Instead, processes must be agile. They are innovative. They let you move fast. They take a cognitive overhead off your plate to let you focus on the most important things. But only if you solve proper problems with them.

I am sure that you can easily find at least a couple of problems you would like to get rid of. It can be something as big as “we are making mistakes with the people we hire”, which means that you need a better recruitment process. In software consulting, the problems are predictability and accountability for your customers. At Intercom, it’s making the absolute best product.

Define the success criteria

When you have a good understanding of the problem, define the success criteria for your process. Don’t start with the process, start with what success looks like. Starting from success gets rid of your biases around the design (what you are familiar with, what you are comfortable with, etc.) and focuses instead on the best outcome possible. This defines the true success of the process. Remember, the usage of the process in itself is not a measure of success – usage without value is a clear failure.

It’s easy to fall into the “usage is success” trap in situations of high discomfort. If you feel uncomfortable with the current level of structure around you, you start thinking about improving the structure and introducing new processes. But if processes don’t solve real problems and are not being constantly improved to meet the success criteria, they make people stop innovating and harm your culture.

Update your process periodically

It’s important to update or get rid of old processes once they have outlived their usefulness, rather than remaining reliant on them. The whole exercise of designing a process is based on solving the problem. However, this problem is present right now, at the time you design the solution – the problem won’t remain static, and therefore the process shouldn’t either.

To make sure that you are not solving the wrong problems, you must encourage everyone using the process to challenge the status quo. In order to achieve that, you have to make sure that your processes are easy to change.

Master your habits and processes

Processes should be beneficial and helpful without becoming burdened by bureaucracy. They can help you innovate, move fast and keep focused. However, you need to remember that every company is trying to solve different problems and therefore needs different processes. The worst case scenario is when you try to apply processes that don’t solve problems or don’t serve the goal of the company.

Like habits, some processes are good, some are bad, and some outlive their usefulness. And like habits, processes can be hard to change. But remember that successful companies, like successful people, are defined by their ability to develop and change their habits, rather than becoming beholden to them.

If this sounds like the sort of environment that you would enjoy, we are actively hiring – check out our openings.

The post Why your engineering processes need to solve real problems appeared first on Inside Intercom.