Dealing with data failure

In a world of cyber threats and unreliable power feeds, how do we ensure our applications remain available and that data flows up and down to the cloud?

Summer is here (unless you are reading this blog six months after it was published – in which case, Happy Christmas) and, in a repeat of the heat of summer 2016, we have news of a data centre failure due to a power outage.

There has been a great deal of speculation and furore in various news outlets around the reasons for the latest failure, which, whilst smaller in scale than last summer’s issues in Docklands, is far more newsworthy as it brought BA to a standstill across the world.

My interest was piqued because recent industry focus, pre- and post-WannaCry, has been on cyber security, and yet here we are with a good old-fashioned physical issue. With all the attention on security, DDoS and protecting networks, the physical infrastructure underpinning data and applications has come to be taken somewhat for granted.

We assume that there will be perimeter security to prevent unauthorised access to racks, hardware, patches and servers. We assume there will be a managed environment to keep everything clean, dust free, dry, and warm – but not too warm. Yet recent events show that even state-of-the-art data centres can fail.

Which brings us to the question posed above: “In a world of cyber threats and unreliable power feeds, how do we ensure our applications remain available and that data flows up and down to the cloud?”

The answer lies in maintaining a focus on resilient connections, diverse routing and the intelligence to react to changing situations. Software-defined networking (SDN) offers increasing capability to detect and react to outages and brownouts so that connectivity stays available. Ultimately, that connectivity must lead to resilient data, stored and managed in a resilient manner.

This means that the data centres must be separated, with similarly diverse connectivity, so that a single incident will not bring down both facilities. They need to be far enough apart not to be simultaneously affected by common natural hazards such as floods or fires, and they must avoid common or shared infrastructure such as connectivity or power.
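As a rough illustration of that separation test, the sketch below flags a pair of sites that sit too close together or share a power feed or carrier. The `Site` fields, the feed and carrier labels, and the 50 km threshold are all assumptions made for the example, not figures from any standard; the right separation depends on the local risks.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Site:
    """A hypothetical description of one data centre."""
    name: str
    lat: float                                  # latitude in degrees
    lon: float                                  # longitude in degrees
    power_feeds: FrozenSet[str] = frozenset()   # labels of upstream power feeds
    carriers: FrozenSet[str] = frozenset()      # labels of connectivity providers


# Assumed threshold for illustration only.
MIN_SEPARATION_KM = 50.0


def distance_km(a: Site, b: Site) -> float:
    """Great-circle distance between two sites using the haversine formula."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlambda = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def diversity_problems(a: Site, b: Site) -> List[str]:
    """List the reasons why a single incident could affect both sites."""
    problems = []
    if distance_km(a, b) < MIN_SEPARATION_KM:
        problems.append("sites are closer than the minimum separation")
    if a.power_feeds & b.power_feeds:
        problems.append("sites share a power feed")
    if a.carriers & b.carriers:
        problems.append("sites share a connectivity provider")
    return problems
```

An empty list means the pair passes these three checks; it says nothing about other shared dependencies, such as staff or suppliers.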

Whilst getting the basics right, we must also stay vigilant against cyber threats by keeping a focus on the usual suspects:

  • Prepare for GDPR, and the impact it will have on network security requirements
  • Identify and guard against the cyber threats the business is facing
  • Monitor those security threats of most concern
  • Plan ahead for network security and how to stay ahead of the game

You can get these basics right and ensure resilience is in place by asking the following questions:

  • What happens if the primary data centre fails?
  • Can the network react if the main link to the primary data centre goes down?
  • What happens if the link is not lost, but suffers severe degradation?
  • Is there access to cloud service providers if the main link to the cloud is unavailable?
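The second and third questions can be made concrete with a small sketch of the decision a network controller has to take: treat a badly degraded link (a brownout) the same as a lost one, and move to the next link in priority order. The `LinkHealth` fields and the loss and latency thresholds are hypothetical and chosen only for illustration; this is not the API of any particular SDN product.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LinkHealth:
    """A hypothetical health reading for one link."""
    name: str
    up: bool            # False means the link is lost outright
    loss_pct: float     # measured packet loss, in percent
    latency_ms: float   # measured round-trip latency, in milliseconds


# Assumed thresholds; real values depend on the applications being carried.
MAX_LOSS_PCT = 2.0
MAX_LATENCY_MS = 150.0


def is_usable(link: LinkHealth) -> bool:
    """A link is usable if it is up and not severely degraded."""
    return (
        link.up
        and link.loss_pct <= MAX_LOSS_PCT
        and link.latency_ms <= MAX_LATENCY_MS
    )


def choose_link(links: Sequence[LinkHealth]) -> Optional[LinkHealth]:
    """Return the first usable link in priority order, or None if none qualifies."""
    for link in links:
        if is_usable(link):
            return link
    return None
```

If `choose_link` returns `None`, every path is either down or degraded, which is the situation the first and last questions in the list are really asking about.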

As further details emerge about the BA failure, it appears that the fault lies with human error. This raises three questions:

  1. In a mission-critical environment like this, there are systems and processes to support the people on the ground and guard against simple mistakes. These checks and balances are not new: they pre-date ITIL by decades and were put in place specifically to help people avoid precisely this type of cumulative error, where the aftermath far exceeds the original issue. So, why were these checks and balances not effective?
  2. Given the mission-critical nature of the data being stored, why did a geographically diverse back-up data centre not kick in to provide continuity of service for the airline?
  3. When outsourcing key areas to specialist contractors, how could these specialists have acted with the naivety shown in this case?

Finally, a simple litmus test – when reviewing IT and IS systems and processes, perhaps we should all follow BA’s lead and ask whether we would rely on these same systems and processes to keep us in the air?