We look at some key pitfalls in disaster recovery, such as failing to plan, not testing the plan, not protecting backups, poor communication and neglecting the human element.
Disaster recovery (DR) is the ability to return to “business as usual” operations after an IT failure, natural disaster, or other unexpected event, and is a key function of IT.
After all, the IT department is responsible for maintenance of core business systems and for protecting their data, for providing desktop or other personal computers, networks and more often than not today, voice communications.
But disaster recovery planning is a business-wide challenge and responsibility. Organisations depend ever more on their data, and IT is becoming more and more adept at providing access to that data anywhere in the world.
Against this, IT departments need to deal with ever larger volumes of data, as well as users and customers who are less tolerant of downtime, and a growing number of bad actors who see attacking data as a way to bring down organisations for financial gain.
ISO/IEC 27031, the international standard for ICT readiness for business continuity, sets a framework for organisations’ disaster recovery plans.
But given the increasing complexity of both business operations and IT systems, there are plenty of traps for the unwary.
DR pitfall 1: Failure to plan
The greatest failing is to fail to plan for disaster recovery at all.
A DR plan need not be complex. In the case of a small business or branch office, it might comprise little more than regular backups to disks stored offsite or, increasingly, to the cloud, and a plan for how to access the data and restore applications if the worst happens.
For larger organisations, a plan will go into far more detail about which applications are protected, how they will be recovered, and arrangements for alternative workspaces for staff, such as in this example from IBM.
Tony Lock, an analyst at Freeform Dynamics, stresses that a plan should state in what order various platforms must be recovered. “Sometimes this is obvious from application or service requirements, but where a major site recovery is required, then internal politics may also come into play,” he says. “There is also the question of who can initiate a DR action and under what circumstances.”
Further problems occur when organisations have a DR plan, but it is too limited in scope. Here, IT and the board can be lulled into a false sense of security. In such cases, there is a DR plan, but it fails to cover all applications and, vitally, their interdependencies.
“Only about 38% of applications are protected by a DR plan,” cautions Phil Goodwin, an analyst at IDC. “Most organisations provide DR for mission-critical applications, but then move on to other projects. The result is often that these mission-critical applications are missing data or connections to less critical applications. And the entire environment can’t be stood up rapidly enough.”
The plan must also set the recovery point objective (RPO) and recovery time objective (RTO) – how far back the organisation needs to go to obtain a clean and stable set of applications and data, and how quickly that needs to happen.
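In practice, checking an RPO comes down to simple arithmetic on timestamps: is the newest restorable copy recent enough? The following Python sketch is a hypothetical illustration of that check; the function and variable names are invented for this example and are not drawn from any particular DR tool or plan.

```python
from datetime import datetime, timedelta

def meets_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the newest restorable backup falls within the RPO window,
    i.e. the data that would be lost is no older than the organisation accepts."""
    return (now - last_backup) <= rpo

# Hypothetical scenario: a nightly backup taken at 03:00, checked at 09:00
last_backup = datetime(2024, 1, 2, 3, 0)
now = datetime(2024, 1, 2, 9, 0)

print(meets_rpo(last_backup, now, timedelta(hours=24)))  # True: 6h old, within a 24h RPO
print(meets_rpo(last_backup, now, timedelta(hours=1)))   # False: too old for a 1h RPO
```

The RTO is the complementary measurement on the restore side: the elapsed time from invoking the plan to services being usable again must stay within the target the business has agreed.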
DR pitfall 2: Failure to test
The next, and perhaps most common, pitfall is failing to test. A frequently cited statistic is that 23% of organisations never test their DR plans, with a further 29% testing just once a year.
Whether an annual test is adequate will depend very much on the size and nature of the business. But a plan that is never tested is really only one step up from having no plan at all.
“The other big issue concerns testing of disaster recovery processes,” says Freeform Dynamics’ Lock. “This is essential because until you test DR you really cannot be certain it will work, or whether all systems that should have been protected have been.”
Ensuring a robust testing regime needs strong leadership from the CIO. Effective DR testing can be disruptive and expensive. But failing to recover from a disaster will be more expensive still.
“The problem can be that either business users or budget holders may be reluctant to allow testing to take place,” warns Lock. This is why strong advocacy from IT leaders is so important.
Closely related to failing to test the DR plan is failing to update it. A disaster recovery plan is a living document. As the business changes through growth, acquisition, business process changes or technology updates, DR requirements and methods will change too. A detailed plan that sits on a shelf will not be effective.
If the organisation does test the plan, CIOs need to ensure that any lessons learned – and there will be lessons learned – are used to update the plan. The updated plan needs to be tested, and the cycle repeated.
DR pitfall 3: Failure to protect backups
Malware, and especially ransomware, is one of the reasons DR has moved back up the agenda in the past few years.
Protecting systems against ransomware in particular means creating an air gap between production systems and backup copies, or using immutable storage technologies, not least because attackers have learned to target data backups first. Some organisations have returned to tape as a relatively low-cost way to move data offsite.
Unfortunately for DR teams, this is not always easy. Business continuity plans and shorter recovery time objectives rely on continuous data protection.
“But you can’t airgap on a continuous basis,” warns IDC’s Goodwin. Instead, organisations might need to accept 12-24 hours of data loss as the price for clean data.
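Whatever the backup medium, one simple safeguard is to verify that an offline or offsite copy is intact before relying on it in a recovery. The sketch below is a hypothetical Python illustration of that idea, comparing SHA-256 checksums of a backup and its copy; the function names are invented for this example rather than taken from any backup product.

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Stream a file through SHA-256, so large backup files
    do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(source: Path, offsite: Path) -> bool:
    """True if the offsite copy is byte-identical to the source backup."""
    return sha256(source) == sha256(offsite)
```

A check like this catches silent corruption or tampering in transit, but it is no substitute for the air gap itself: a checksum stored alongside an online backup can be altered by the same attacker who encrypts the data.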
DR pitfall 4: Command, control and communication problems
In a disaster recovery situation, clear lines of communication and a clear idea of who is in control are vital.
Organisations also need to decide who can invoke the DR plan, and make sure all the key staff can continue to communicate during an outage. A robust DR test will usually expose any failures in command and control, and crisis communications should be part of the plan for larger businesses.
But there is a need for ongoing communication around DR and business continuity, too.
“Users have a perhaps unrealistic expectation of instant recovery for everything, and it is easy for things to go wrong as pressure mounts,” says Lock.
Clear communications will help manage expectations about which data and systems can be recovered, in which order, and how quickly, adds IDC’s Goodwin.
DR pitfall 5: Neglecting human factors
IT departments, naturally enough, focus their DR planning on systems and data. But effective plans need to also cover where and how people will work if the main business location is compromised.
It might be that staff can work from home initially, but how long can they sustain that?
Do some staff need desktop computers, or more bandwidth than domestic or mobile connections can provide? What about meeting spaces, and what about the physical and mental wellbeing of the team? Keeping up morale in the event of a disaster is often as important as the technical aspects of the recovery plan.
DR pitfall 6: Overlooking the cloud
Cloud computing is making some aspects of disaster recovery much easier, especially with the growth of online backup services.
But the cloud can add complexity to IT operations, especially in hybrid and multicloud environments.
Also, the ability of business lines to spin up their own cloud resources, or to buy software-as-a-service (SaaS) applications, means IT might no longer have a full picture of the organisation’s IT infrastructure. And does the plan include what to do if a cloud service goes down?
Research by Spiceworks found that just 28% of organisations included cloud or hosted services in their DR plans. And relying on the cloud provider’s own backup and business continuity plans is not enough.
There might be little the cloud provider can do, for example, if a user accidentally deletes data.
And a partial failure – for example of an onsite data store that serves a cloud-based application – can be harder to recover from than a conventional stack where the data and applications are in the same place.
But thorough testing should show up any weaknesses in recovery plans for cloud infrastructure too.