How the COps team deals with system outages

By Team member, 06/09/2019

99.99% of the time, the systems at Cuvva run smoothly.

But things can - and occasionally do - go wrong. Outages happen. And when they do, we need to be ready to deal with them.

We like to think we're getting pretty good at it. But it's far from easy. Figuring out the right process takes time. And sometimes it tests our teamwork and communication skills to the limit.

Here's how it works, how it will work in the future, and how we got to this point.

What we used to do (and why it wasn't enough)

During office hours, it's always been pretty straightforward to handle system outages.

That's because we have an ever-growing team with a massive range of skills - from Customer Operations (COps) to Marketing, Data, Product and more.

And we're a talkative bunch. We communicate all day and all night on Slack.

If something breaks while we're all in the office, we can get the right people on the case as soon as possible, because they're usually sitting just a few feet away. And there are plenty of helpful sorts who are willing to jump on customer support and keep our customers in the loop.

It always worked quite well.

But when things went wrong outside office hours, it was a bit of a nightmare.

The dreaded midnight outage

Let's say we had an outage at midnight.

Because we offer round-the-clock customer support, with an average 1-minute response time, a good chunk of the team works remotely, or overnight. This means we can respond to our customers as quickly as possible.

But if something broke, those COps got a bit isolated. They could get in touch with engineers to get the problem fixed, but there wasn't much help with the workload on customer support. Because everyone else was in bed.

So that one person would face a sudden deluge of customers, all with the same issue, all at once.

At the same time, that already overworked COp would have to pass the message on to the relevant people in the team - all while trying to keep up with customer support.

So we took another look at the process.

Understanding problems and finding solutions

To figure out a new process, we needed to map out exactly what happened when systems stopped working.

We knew that the problems were usually spotted first by the COps team. A sudden spike in the number of people contacting support with the same problem is the first sign of trouble.

Even though we're trained to deal with difficult situations and handle multiple conversations, that kind of volume can still catch you off guard. Because it feels a bit like talking to 10 people on the phone at once.
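
To give a rough idea of what "a sudden spike in the number of people contacting support with the same problem" could look like in practice, here's a minimal sketch in Python. It's an illustration only, not a description of the tooling used at Cuvva: the find_spikes function, the issue labels, the 10-minute window and the threshold of 5 are all assumptions made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def find_spikes(messages, now, window=timedelta(minutes=10), threshold=5):
    """Return the issue labels reported at least `threshold` times
    in the `window` leading up to `now`.

    `messages` is a list of (sent_at, issue_label) pairs. The labels are
    hypothetical: in real life someone (or something) has to tag each
    conversation with the problem the customer is describing.
    """
    recent = Counter(
        label for sent_at, label in messages if now - window <= sent_at <= now
    )
    return sorted(label for label, count in recent.items() if count >= threshold)


if __name__ == "__main__":
    midnight = datetime(2019, 9, 6, 0, 0)
    inbox = [(midnight - timedelta(minutes=m), "can't buy a policy") for m in range(7)]
    inbox.append((midnight - timedelta(minutes=2), "change of address"))
    print(find_spikes(inbox, midnight))
```

The fixed threshold keeps the sketch simple. A quieter or busier support channel would probably need a different number, or a comparison against the usual volume for that time of day.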

Narrowing down


The first thing COps do is narrow down the problem. There's no set-in-stone way of doing this, but we usually want to know:

  • Is it happening on a particular platform or channel? iOS, Android, the website?

  • Is it likely to be our problem or a third party's?

  • Can people still buy policies?

  • Which types of insurance are affected?

  • Have we rolled out any tweaks or updates that might have caused it?

Once we've picked out a few trends, we'll feed them back to the team leaders. And they'll pass them on to the people we need to fix the problem.
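
The narrowing-down questions above could be captured as a small structured note, so that whoever picks up the problem gets the same facts in the same order. This is a hypothetical sketch, assuming a made-up TriageNotes class and field names. As the text says, there's no set-in-stone way of doing this.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TriageNotes:
    """Hypothetical record of the answers to the narrowing-down questions."""

    platforms: List[str] = field(default_factory=list)        # e.g. iOS, Android, website
    third_party_suspected: bool = False                       # ours, or someone else's?
    can_buy_policies: bool = True                             # can people still buy policies?
    insurance_types: List[str] = field(default_factory=list)  # which types are affected
    recent_changes: List[str] = field(default_factory=list)   # tweaks or updates rolled out

    def summary(self) -> str:
        """Build a short plain-text summary to hand to the team leaders."""
        lines = [
            "Platforms affected: " + (", ".join(self.platforms) or "unknown"),
            "Likely cause: " + ("third party" if self.third_party_suspected else "our systems"),
            "Customers can buy policies: " + ("yes" if self.can_buy_policies else "no"),
            "Insurance types affected: " + (", ".join(self.insurance_types) or "unknown"),
            "Recent changes: " + (", ".join(self.recent_changes) or "none known"),
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    notes = TriageNotes(platforms=["iOS"], can_buy_policies=False)
    print(notes.summary())
```

Anything that hasn't been answered yet shows up as "unknown", which makes the gaps obvious to the person reading the summary.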

(This is one of the key things about the COps team at Cuvva - being able to work with other teams. Whether that's backend engineers, designers, data analysts or app developers, we need to understand the product and the business well enough to spot issues and get them fixed.)

But while all this is going on, the poor COps are still dealing with a huge volume of messages from (understandably) frustrated customers.

So we turn to the wider team for help.

All hands on deck

We train every member of the team to offer customer support. And that means we can always rely on a few helping hands when there are a lot of customers in need of answers.

We'll also put something on social media (Twitter's a great way to reach as many people as possible when there's a problem), and we update our status page, which is a handy link for customers.

To make sure we're still replying to customers quickly - and to make sure we're still writing as clearly, concisely and warmly as we always try to - we'll usually draft a last-minute saved response. This is a pre-written message that tells customers everything they need to know about the problem quickly and comprehensively.

Once things have died down a little, we spend a bit of time going back through old conversations with affected customers to say sorry and explain what happened. It's an important part of treating customers fairly.

The day after, we'll sit down as a team to figure out:

  • What went wrong
  • How we can stop it happening again

How we've changed things - and what the future holds

To make these processes smoother and simpler, we now have a dedicated team of on-call engineers who can fix issues, and we've hired more on-call COps to offer customer support outside of office hours.

We've also set up clear "escalation procedures" to figure out who needs to be told about system outages.

And we'll be training more COps to use social media and update the status page.

The E-COp role

Finally - and most excitingly for us - we've made a new role: the Embedded COp, or "E-COp". This is a member of the COps team who works directly with the product teams - they're "embedded" within them.

The E-COp hosts meetings and uses feedback from our customers to help decide how we build products.

It's a great way to improve the app (and our own systems), and make sure we're building products that are perfect for the people using them.

The role rotates between members of the COps team every 8 weeks, and it's a great chance to work closely with our product teams.

As the nature of customer support changes, with new products, technology, processes and changing expectations, we always need to work hard at getting better.

Outages will still happen, now and again. Things will break. They always do. But every time it happens, it's a chance to learn. It shows us how we can improve our systems, be more efficient and make sure our customers have the best possible experience.

Want to be part of it? Take a look at our job openings.
