In the constant search for better, more reliable, more available systems and services, customers are demanding ever higher levels of service availability. Measured as the “number of nines”, the Service Level Availability, or SLA, has become the key measure of how good a service is.
The number of nines
Most cloud providers offer 99.9% availability, calculated on a monthly basis. That sounds great, until you work out that in a 30-day month, your service could be down for around 43 minutes. It’s known in the industry as a TITSUP event – Total Inability to Sustain Usual Performance (although I think the acronym may have come first).
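The “number of nines” arithmetic is easy to sketch. Here’s a minimal illustration (the helper function is my own, not any provider’s formula), assuming a 30-day billing month:

```python
# Convert an availability percentage into the downtime a monthly SLA permits.
# Assumes a 30-day month (43,200 minutes); real contracts vary.

def allowed_downtime_minutes(availability_pct: float, month_days: int = 30) -> float:
    """Minutes of downtime per month that a given availability allows."""
    minutes_in_month = month_days * 24 * 60
    return minutes_in_month * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {allowed_downtime_minutes(nines):.1f} minutes/month")
```

At 99.9% that works out to about 43 minutes a month; each extra nine cuts the allowance by a factor of ten.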
Of course, if that outage happens to be at 2am, many organisations won’t even notice. On the other hand, the service could be down for very much longer – and all you’ll get is a credit note.
In fact, service credits legally can’t be “punitive” in nature – if they are, the contract risks being voided. So it’s unlikely that you’ll receive adequate compensation for an extended outage – and if the system or service in question is something your business relies on, the cost of an outage will be high.
The obvious example is British Airways’ recent outage, but for every IT disaster that makes the news, there will be many more that cause pain but aren’t publicised.
It is true, however, that IT service providers generally achieve better availability than most in-house IT organisations – but that’s cold comfort to an embarrassed CIO who has to explain to the board why the company couldn’t trade last week.
In order to make systems and services ever more available, companies and service providers try to “double up” on equipment: servers get two power supplies, disks get RAID, network cards come with two (or four) cables attached. Wide area network links get deployed in pairs. All of this costs money of course, and it’s tempting to do without the backup in order to keep costs under control – right up to the point where it stops working because a single component failed.
The dirty little secret of the IT world is that whilst component failure is often blamed for outages, the failed “component” is most often a human being trying to do the right thing, but getting it wrong. BA’s recent experience is testament to this.
So what are the major causes of IT outages? Hardware is incredibly reliable nowadays – I can’t recall the last time we saw a server actually fail. More often it’s a hard drive that goes – statistics from Backblaze indicate that the worst drive models fail at rates approaching 10% a year. But hard drives are cheap, and fitting two or even three is easy to do.
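A quick sketch shows why that cheap redundancy is worth it. This assumes a 10% annual failure rate per drive (the worst-case figure quoted above) and independent failures – a simplification, since real drive failures can be correlated and a mirror must survive the rebuild window:

```python
# Rough illustration of why mirroring cheap drives pays off.
# Assumes independent failures and a 10% annual failure rate per drive.

afr = 0.10  # annual failure rate of one drive

single = afr          # one drive: 10% chance of losing the data in a year
mirrored = afr * afr  # mirrored pair: both must fail (ignoring rebuild time)

print(f"single drive:  {single:.1%} chance of failure per year")
print(f"mirrored pair: {mirrored:.2%} chance both fail in the same year")
```

Even under these crude assumptions, a second drive turns a 10% annual risk into roughly 1% – which is why “doubling up” is the default move.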
The way that availability is calculated (a method that originated in the manufacturing sector and was adopted wholesale by the aviation industry) is to combine the average time between component failures with the average time to repair a failure to work out the availability percentage. This means that availability percentages can be misleading – if my aircraft is reliable for 10 years but, when it breaks, is broken for two days, my availability percentage looks really good (it’s actually 99.95%) – yet I’m still unable to fly for quite a while.
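That calculation – mean time between failures divided by the sum of mean time between failures and mean time to repair – can be checked in a couple of lines (the function name is my own shorthand):

```python
# availability = MTBF / (MTBF + MTTR), as described above.

def availability(mtbf_days: float, mttr_days: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_days / (mtbf_days + mttr_days)

# The aircraft example: reliable for 10 years, then grounded for two days.
a = availability(10 * 365, 2)
print(f"{a:.4%}")  # roughly 99.95%, despite a two-day grounding
```

The percentage hides the shape of the failure: one long outage and many short ones can produce exactly the same number.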
When service providers commit to high SLAs like this, there are two ways they can do it: design a technical solution that is meant to meet or exceed the SLA, or design a cheaper, less reliable system, and accept that they will have to offer credits from time to time. The correct technical solution can be very expensive to build – with everything duplicated or even tripled, the costs mount up quickly.
How to cheat your way to a better SLA
A cheaper way for the service provider to offer a high SLA is to charge an extra 10%, then pay up to 10% back in service credits if the service happens to be down that month.
This may be a good wheeze to get the costs down, but to my mind it’s not giving the customer what they’re paying for – a reliable, good service that works.
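The economics of this wheeze are easy to show with illustrative numbers (the prices and breach probability below are invented for the example, not taken from any real contract):

```python
# Illustrative numbers only: charge a 10% uplift, refund up to 10% of the
# monthly charge in any month the SLA is missed.

base_price = 1000.0  # monthly charge before the uplift (hypothetical)
uplift = 0.10        # extra charged to "fund" the higher SLA
p_breach = 0.5       # assume the SLA is missed in half of all months

monthly_charge = base_price * (1 + uplift)
extra_revenue = base_price * uplift
expected_credit = monthly_charge * uplift * p_breach

print(f"extra revenue per month:  {extra_revenue:.2f}")
print(f"expected credit paid out: {expected_credit:.2f}")
```

Even missing the SLA half the time, the provider pockets more in uplift than it expects to pay back in credits – without making the service any more reliable.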
Designing a technical solution to meet the availability requirement is all well and good, but we now come to the dirty little secret: hardware failures don’t generally cause outages – people do.
Most IT outages arise as a result of a change gone wrong. Amazon managed to kill its own service when an engineer mistyped a command, while Level3 knocked most of its US customers’ phone lines offline for 15 hours last year. The list goes on.
There are two ways to manage any risk – reduce the impact (mitigation) or reduce the likelihood (elimination). The first is easier – most retailers insist on a freeze on all changes in the run-up to Christmas, while many universities forbid any changes during the busy clearing period.
Reducing the likelihood matters just as much, though: implementing proper change control is hard, but when you have hundreds of customers and many thousands of users relying on an IT service, it becomes essential.
The right way to do it is to plan all changes in advance, issuing planned maintenance notifications so that customers know what you’re doing, what the impact will be, and how long it will last. Then you must make sure that every change plan covers how the change will be implemented, how you’ll know that it worked – and what will be done if it doesn’t. As you’d expect, ITIL, the gold standard for in-life management of IT systems, goes into a lot more detail.
In particular, change plans need to be written and reviewed before they can go ahead. This is obviously more expensive than just getting on with it in the short term, but in the long term it’s the only way to avoid expensive, embarrassing and career-limiting mistakes.
When you’re looking at using a company to provide a service that needs to work, you need to evaluate that supplier carefully. Is their design good enough to actually achieve the SLA they’re promising? Are they cutting corners and crossing their fingers, or will they truly deliver the availability you need? And do they do change management properly?
If you don’t check them out properly, you may find that you’re due lots of service credits – but you still haven’t got what you wanted: systems that work.