99.99% of the time, the systems at Cuvva run smoothly.
But things can - and occasionally do - go wrong. Outages happen. And when they do, we need to be ready to deal with them.
We like to think we're getting pretty good at it. But it's far from easy. Figuring out the right process takes time. And sometimes it tests our teamwork and communication skills to the limit.
Here's how it works, how it will work in the future, and how we got to this point.
During office hours, it's always been pretty straightforward to handle system outages.
That's because we have an ever-growing team with a massive range of skills - from Customer Operations (COps) to Marketing, Data, Product and more.
And we're a talkative bunch. We communicate all day and all night on Slack.
If something breaks while we're all in the office, we can get the right people on the case as soon as possible, because they're usually sitting just a few feet away. And there are plenty of helpful sorts who are willing to jump on customer support and keep our customers in the loop.
It always worked quite well.
But when things went wrong outside office hours, it was a bit of a nightmare.
Let's say we had an outage at midnight.
Because we offer 24/7 customer support, with an average 1-minute response time, a good chunk of the team works remotely, or overnight. This means we can respond to our customers at all hours of the day.
But if something broke, those COps got a bit isolated. They could get in touch with engineers to get the problem fixed, but there wasn't much help with the workload on customer support. Because everyone else was in bed.
So that one person would face a sudden deluge of customers, all with the same issue, all at once.
At the same time, that already overworked COp would have to pass the message on to the relevant people in the team - all while trying to keep up with customer support.
So we took another look at the process.
To figure out a new process, we needed to map out exactly what happened when systems stopped working.
We knew that the problems were usually spotted first by the COps team. A sudden spike in the number of people contacting support with the same problem is the first sign of trouble.
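To make the "sudden spike" idea a bit more concrete, here's a minimal sketch of how you might flag one automatically. It isn't a description of our actual tooling: the issue tags, the baseline counts, the `factor` and `min_count` thresholds and the function name are all assumptions made up for illustration.

```python
from collections import Counter


def detect_spikes(recent_tags, baseline_per_window, factor=3.0, min_count=5):
    """Return issue tags that look like a spike in the current time window.

    recent_tags: one (hypothetical) issue tag per support conversation
        opened in the current window, e.g. the last 15 minutes.
    baseline_per_window: dict of tag -> typical number of conversations
        per window. Tags missing from the dict are treated as 0.
    factor, min_count: illustrative thresholds, not tuned values.
    """
    counts = Counter(recent_tags)
    spikes = []
    for tag, count in counts.items():
        typical = baseline_per_window.get(tag, 0)
        # Flag only when there are enough conversations to matter
        # and the count is well above what is normal for this tag.
        if count >= min_count and count > factor * typical:
            spikes.append(tag)
    return sorted(spikes)


recent = ["payment_failed"] * 12 + ["login"] * 2
baseline = {"payment_failed": 2, "login": 2}
print(detect_spikes(recent, baseline))
```

In practice a human usually notices first, but a check like this could back them up when only one person is online.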
Even though we're trained to deal with difficult situations and handle multiple conversations, that kind of volume can still catch you off guard. Because it feels a bit like talking to 10 people on the phone at once.
The first thing COps do is narrow down the problem. There's no set-in-stone way of doing this, but we usually want to know:
Is it happening on a particular platform or channel? iOS, Android, the website?
Is it likely to be our problem or a third party's?
Can people still buy policies?
Which types of insurance are affected?
Have we rolled out any tweaks or updates that might have caused it?
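The questions above could be captured as a small triage note that gets filled in as answers arrive. This is only a sketch under assumptions: the `TriageNote` class and its field names are hypothetical, and there's no set-in-stone format behind them.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class TriageNote:
    """One field per triage question; None means 'not known yet'."""
    platforms: Optional[List[str]] = None        # e.g. ["iOS", "Android", "website"]
    third_party_suspected: Optional[bool] = None  # our problem or a third party's?
    can_buy_policies: Optional[bool] = None       # can people still buy policies?
    insurance_types: Optional[List[str]] = None   # which types of insurance are affected?
    recent_changes: Optional[List[str]] = None    # any tweaks or updates rolled out?

    def open_questions(self) -> List[str]:
        """Names of the questions nobody has answered yet, in field order."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


note = TriageNote(platforms=["iOS"], can_buy_policies=False)
print(note.open_questions())
```

Keeping "not known yet" separate from "no" means a half-finished note can still be handed to a team leader without looking more certain than it is.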
Once we've picked out a few trends, we'll feed them back to the team leaders. And they'll pass them to the people we need to fix the problem.
(This is one of the key things about the COps team at Cuvva - being able to work with other teams. Whether that's backend engineers, designers, data analysts or app developers, we need to understand the product and the business well enough to spot issues and get them fixed.)
But while all this is still going on, the poor COps are still dealing with a huge volume of messages from (understandably) frustrated customers.
So we turn to the wider team for help.
We train every member of the team to offer customer support. And that means we can always rely on a few helping hands when there are a lot of customers in need of answers.
We'll also put something on social media (Twitter's a great way to reach as many people as possible when there's a problem), and we update our status page, which is a handy link for customers.
To make sure we're still replying to customers quickly - and to make sure we're still writing as clearly, concisely and warmly as we always try to - we'll usually draft a last-minute saved response. This is a pre-written message that tells customers everything they need to know about the problem quickly and comprehensively.
Once things have died down a little, we spend a bit of time going back through old conversations with affected customers to say sorry and explain what happened. It's an important part of treating customers fairly.
The day after, we'll sit down as a team to figure out:
To make these processes smoother and simpler, we now have a dedicated team of on-call engineers who can fix issues, and we've hired more on-call COps to offer customer support outside of office hours.
We've also set up clear "escalation procedures" to figure out who needs to be told about system outages.
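As a rough illustration of what an escalation procedure boils down to, here's a tiny sketch that turns two facts about an outage into a list of people to tell. The rules, role names and function name are invented for this example and are not our real procedure.

```python
from typing import List


def who_to_notify(office_hours: bool, can_buy_policies: bool) -> List[str]:
    """Return the (hypothetical) roles to tell about an outage, in order."""
    # Assumption: an on-call engineer is always the first to hear about it.
    notify = ["on-call engineer"]
    if office_hours:
        # Assumption: in the office, team leaders coordinate the response.
        notify.append("COps team leaders")
    else:
        # Assumption: out of hours, extra on-call COps help with support.
        notify.append("on-call COps")
    if not can_buy_policies:
        # Assumption: customers being blocked from buying warrants a public update.
        notify.append("status page and social media owner")
    return notify


print(who_to_notify(office_hours=False, can_buy_policies=False))
```

Writing the rules down as plainly as this, even on paper, means the person who spots the outage at midnight doesn't have to work out who to wake up.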
And we'll be training more COps to use social media and update the status page.
Finally - and most excitingly for us - we've made a new role: the Embedded COp, or "E-COp". This is a member of the COps team who works directly with the product teams - they're "embedded" within them.
The E-COp hosts meetings and uses feedback from our customers to help decide how we build products.
It's a great way to improve the app (and our own systems), and make sure we're building products that are perfect for the people using them.
The role rotates between members of the COps team every 8 weeks, and it's a great chance to work closely with our product teams.
As the nature of customer support changes, with new products, technology, processes and changing expectations, we always need to work hard at getting better.
Outages will still happen, now and again. Things will break. They always do. But every time it happens, it's a chance to learn. It shows us how we can improve our systems, be more efficient and make sure our customers have the best possible experience.
Want to be part of it? Take a look at our job openings.