Dec 7, 2020

Disaster Recovery (DR) Testing: The Why, What and Who

Carleton Hall, Solutions Engineer

The importance of disaster recovery (DR) and business continuity plans can’t be overstated. Here on the ThinkIT blog, we’ve covered how to get started, the basics of making a plan, table-top exercises to help your staff test your DR plan and more. In this piece, I’ll explore DR testing: why it matters, what elements need to be considered and whether you should handle testing on your own or outsource it to a third-party DRaaS provider. We’ll also review the options for running a failover test.

Why DR Testing: Imagine the Worst-Case Scenario

Those of us who were in the DR business on 9/11 remember the devastating, tragic story of Cantor Fitzgerald, a financial services firm that lost 658 of its 960 employees that morning. The company occupied floors 101 to 105 of the North Tower of the World Trade Center. At the time, I’m sure IT disaster recovery and business continuity were not top of mind for those involved in the painful tragedy.

From an IT and business continuity standpoint, however, Cantor Fitzgerald was able to get systems online 48 hours after the attacks. Working with Comdisco, a DR company at the time, the firm had its remaining employees answering phones, handling email and trading stocks within five days of the attacks. Today, Cantor Fitzgerald is still in business with about 10,000 employees. The company also followed through on the promises CEO Howard Lutnick made to families and surviving employees.

What: DR Testing Essential Elements

How did Cantor Fitzgerald survive? They had a true, tested and documented DR and business continuity plan. A combination of internal experts and assets and a third-party vendor helped the company create and test a detailed, scripted recovery plan and methodology.

Let’s briefly explore two key elements that need to be considered in any DR testing process, whether you choose to handle testing yourself, outsource it to a third party or use a combination of both.

Recovery Point Objectives

A Recovery Point Objective (RPO) is the maximum amount of data, measured in time, that the company can afford to lose. You meet it by preserving critical data with point-in-time backups or real-time replication to off-site media, like tape or online storage. Determine the impact of data loss in time (how long since the last good backup) and money (how many transactions and how much revenue will be lost because of it), and preserve your data accordingly. This is the “easier” element of DR testing, although nothing in IT or DR is exactly easy.
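To make the time-and-money framing concrete, here is a minimal sketch of that calculation in Python. The function names, the transaction rate, the revenue per transaction and the four-hour RPO are all hypothetical values chosen for illustration, and the sketch assumes transactions arrive at a steady rate.

```python
from datetime import datetime, timedelta


def data_loss_exposure(last_good_backup, failure_time,
                       transactions_per_hour, revenue_per_transaction):
    """Estimate what is lost if a failure hits at failure_time.

    Assumes (for illustration only) a steady transaction rate.
    """
    window = failure_time - last_good_backup
    hours = window.total_seconds() / 3600
    lost_transactions = hours * transactions_per_hour
    return {
        "window": window,
        "lost_transactions": lost_transactions,
        "lost_revenue": lost_transactions * revenue_per_transaction,
    }


def meets_rpo(window, rpo):
    """True when the data-loss window fits inside the RPO."""
    return window <= rpo


# Hypothetical numbers: nightly backup at 2:00, failure at 8:00.
exposure = data_loss_exposure(
    last_good_backup=datetime(2020, 12, 7, 2, 0),
    failure_time=datetime(2020, 12, 7, 8, 0),
    transactions_per_hour=500,
    revenue_per_transaction=40.0,
)
rpo = timedelta(hours=4)  # hypothetical target
print(exposure["window"], exposure["lost_revenue"], meets_rpo(exposure["window"], rpo))
```

With these made-up inputs the window is six hours, which misses a four-hour RPO. In practice that result would argue for more frequent backups or replication for that data set.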

Recovery Time Objective

The tougher challenge in testing is determining the Recovery Time Objective (RTO): how long it will take to restore enough functionality to keep the business running, and how quickly employees, vendors and customers can be tracked down and connected to those DR systems. Once that has been established, the next important step is to determine how long it will take to get all that data back into the production environment and reconnect people once the disaster is over.
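One way to check an RTO during a test is to time each recovery step and compare the total against the target. The sketch below assumes the steps run one after another; the step names, durations and four-hour RTO are hypothetical.

```python
from datetime import timedelta


def recovery_time(steps):
    """Total elapsed time, assuming the steps run sequentially."""
    return sum((duration for _, duration in steps), timedelta())


# Hypothetical timings recorded during a DR test.
steps = [
    ("restore core systems", timedelta(hours=3)),
    ("reconnect remote users", timedelta(hours=1, minutes=30)),
    ("validate transactions", timedelta(minutes=45)),
]

rto = timedelta(hours=4)  # hypothetical target
total = recovery_time(steps)
overrun = max(total - rto, timedelta())  # zero when the RTO is met
slowest_step = max(steps, key=lambda step: step[1])[0]

print(f"total={total}, overrun={overrun}, slowest step: {slowest_step}")
```

Recording the slowest step gives the next test an obvious place to focus. If some steps can run in parallel, a simple sum will overstate the recovery time.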

Who: DIY and Third-Party DR Considerations

One of the more logical IT initiatives to throw in the cloud or hand over to a third-party vendor is disaster recovery. After all, the old joke from CIOs down to IT Directors is “DR is #4 on my top three ‘To-Do List’ right now.”

Personally, however, I would caution against giving all the keys away to a third-party provider. An outside provider is less likely to have the passion for your business that a proud employee has, and they will not know the intricacies of the business, such as its revenue drivers and IT applications.

On the other side, choosing a good third-party vendor and solution will save you time and probably money, preventing you from buying and owning two environments. A third-party DRaaS provider can also manage the infrastructure behind the scenes, allowing your team to focus on production, customers and vendors.

My advice? Choose a solution provider that allows some co-management with your team. You should work hand in hand with your third-party DR vendor. Make them an extension of your team and manage them as you would any employee or critical application in your environment. Leverage them for what they are best at, while at the same time holding yourself and your organization accountable for your vendor’s participation in and seamless execution of your DR plan.

Running a Failover Test

You have many options for what and how to test. Some companies will do a full-blown failover of the entire environment, while others test subsets of their environment in a crawl, walk, run methodology. In either case, I find it most effective to work with a provider to isolate your test environment from the replication environment. That way, you can continue to replicate valuable information while you are testing. If a disaster strikes while you are testing, you will still be able to achieve your required RPOs and RTOs.

The test environment should be validated with transactions and remote connectivity from users and departments, as if the production data center were no longer accessible. Here you will find (and actually may hope to find) holes in your plan and be able to document improvements and changes from your previous test. A DR Plan is an ever-improving, ever-evolving, living, breathing document.
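Documenting changes from the previous test can be as simple as comparing pass/fail results for each validation check. Here is a minimal sketch; the check names and results are hypothetical, and a real plan would record far more detail per check.

```python
def compare_tests(previous, current):
    """Compare two DR test runs.

    Each argument maps a check name to True (passed) or False (failed).
    """
    fixed, still_failing, new_holes = [], [], []
    for name, passed in sorted(current.items()):
        failed_before = previous.get(name) is False
        if passed and failed_before:
            fixed.append(name)
        elif not passed and failed_before:
            still_failing.append(name)
        elif not passed:
            # Failed now, but passed or was not tested last time.
            new_holes.append(name)
    return {"fixed": fixed, "still_failing": still_failing, "new_holes": new_holes}


# Hypothetical results from two consecutive tests.
previous = {"vpn login": False, "order entry": True, "email": False}
current = {"vpn login": True, "order entry": False, "email": False, "phones": False}

report = compare_tests(previous, current)
print(report)
```

The "new_holes" list is the part to celebrate: those are gaps found in a test rather than in a real disaster.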

A failed or less-than-perfect DR test is not necessarily a bad thing. You are of course striving for a perfect test leading to a perfect failover in a real disaster. But the idea of testing is to find holes in your plan, update your DR Plan with what has changed since the previous test and continue to improve your recovery plan.

You should definitely choose a third-party DR provider who bundles in test time (one or two tests per year at a minimum) and who also provides documentation and runbooks back to you following a test. You both need to be in sync, both in case a disaster strikes and to make the next test a success. There is nothing worse than reinventing the wheel with your third-party provider at the start of every new test.

Always test, keep your DR systems up to date with your critical production systems and test again. Disasters don’t occur very often, but when they do, the effects can be devastating. Be ready.
