October 2001

This Is Not a Test

By Scott Hirsh

For those of us in the United States entrusted with a company’s information resources, the events of September 11 changed everything. Before, our business continuity or disaster recovery plans were primarily concerned with so-called “acts of God.” Now we must plan for the most improbable human acts imaginable. Who among us, prior to September 11, had a plan that took into account multiple high-rise office buildings being destroyed within minutes of each other? As you read this, the insurance industry is revising its assumptions. Likewise, we must now reconsider our approach to managing and protecting the assets for which we are responsible. Never before has the probability of actually needing to execute our recovery plans been so great.

As of this writing there have already been numerous business continuity and disaster recovery articles in the computer press. By now we understand the distinction between keeping the business going – not just IT, but also the whole business – and recovering after some (hopefully minor) interruption. And we’ve covered the issue of risk, where all the trade-offs and costs are negotiated. This whole topic was explored anew in the last few months, but it is still worthwhile to emphasize some early lessons of the attacks, from which we are still recovering.

It Had Better Work

Worst Practice 1: Trying to Fake It — I was visiting a friend’s datacenter recently and was told about a recent audit. The company had spent the entire audit trying to fake its way past all the criteria: disaster recovery preparedness, security, audit trails and so on. At the risk of sounding like your parents, whom does this behavior really hurt? An audit is an ideal opportunity to validate all the hard work required to run a professional datacenter, so that should you ever be subjected to attack, electronic or otherwise, you know your datacenter will survive.

If you didn’t get it before, you’d better get it now: faking it is unacceptable. Chances are, at some point you will be required to do a real, honest-to-goodness recovery. And if you think you’re safe just because a hijacked airliner is unlikely to hit a building like yours, think again. The threats to your datacenter are diverse and numerous. And, by the way, violent weather, earthquakes and other natural disasters haven’t gone away either.

Worst Practice 2: Not Testing — Once you’re serious about continuity and recovery, you won’t just plan; you’ll test that plan often. There are plenty of reasons to exercise your recovery capability regularly: you build the ability to react quickly in a crisis, you catch changes in your environment since the last test, and you accommodate changes in staff since the last test. A real recovery is a terrible time to do discovery.
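
One low-effort way to keep a test honest is to script the verification step instead of eyeballing it. The sketch below, in Python, compares a restored test copy against its production source file by file; the paths and the checksum comparison are hypothetical illustrations of the idea, not a prescription for any particular environment.

    # Sketch: verify that a test restore actually matches production.
    # Paths are hypothetical; adapt to your own environment.
    import hashlib
    import os

    def inventory(root):
        """Map each file's relative path to an MD5 digest of its contents."""
        files = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                digest = hashlib.md5()
                with open(full, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        digest.update(chunk)
                files[rel] = digest.hexdigest()
        return files

    def compare(source_root, restored_root):
        """Return files missing from the restore and files whose contents differ."""
        src, dst = inventory(source_root), inventory(restored_root)
        missing = sorted(set(src) - set(dst))
        changed = sorted(p for p in src if p in dst and src[p] != dst[p])
        return missing, changed

    if __name__ == "__main__":
        missing, changed = compare("/prod/appdata", "/testrestore/appdata")
        print("missing from restore:", len(missing))
        print("differ from production:", len(changed))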

Worst Practice 3: Not Documenting — One of the biggest problems with disasters is that they arrive without warning. That’s why so many tests are a waste of time: anyone can recover when they know exactly when and how it will happen. The truly prepared can recover when caught by surprise. Since you won’t get any warning – except, perhaps, with some natural disasters – you’ll want current, documented procedures. And since you’ll probably be on vacation (or wish you were) when disaster strikes, make sure those recovery procedures are off-site and available. If you’re the only one who knows what to do, then even if you never take a day off there still won’t be enough of you to go around at crunch time.

Increasing the Odds of Recovery

Worst Practice 4: Taking Too Long — At this point in technology, there are two main ways to deal with a disaster: fail-over and reconstruction. With fail-over, you replicate data between your main site and a recovery site. These sites can be relatively near each other – across town or perhaps in an adjoining state – or far away. This kind of remote clustering, if you will, is what the largest and most critical institutions use, and the cost is considerable. However, the cost of not doing it is considerably more.
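
Fail-over is only as good as the currency of the replicated copy, so it pays to watch how far the recovery site lags the primary. The Python sketch below assumes a hypothetical heartbeat file that the replication job updates at the recovery site after each cycle; the path and the threshold are placeholders, not features of any real product.

    # Sketch: alarm if the replicated copy falls too far behind the primary.
    # The heartbeat file and the 15-minute threshold are assumptions.
    import os
    import sys
    import time

    HEARTBEAT = "/replica/appdata/.last_sync"   # touched after each replication cycle
    MAX_LAG_SECONDS = 15 * 60

    def replication_lag(path):
        """Seconds elapsed since the replication job last completed."""
        return time.time() - os.path.getmtime(path)

    if __name__ == "__main__":
        lag = replication_lag(HEARTBEAT)
        if lag > MAX_LAG_SECONDS:
            print("WARNING: recovery copy is %.0f minutes behind" % (lag / 60))
            sys.exit(1)
        print("recovery copy is current (%.0f seconds behind)" % lag)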

Reconstruction is more about recovery than continuity. I am guessing that the vast majority of e3000 shops base their recovery plans on recalling tapes from a vault (e.g., Iron Mountain) to a recovery site, then restoring their data either to a bare machine or one on which only MPE has been installed. This was certainly true for my own operation, as my management always deemed this less expensive method “adequate.”

But that was then. Today the amount of data that must be reloaded is so massive that the time to recover renders this method all but worthless. True, your plan can call for restoring only a critical subset of data (not the entire data warehouse). But even current data can now stretch into the terabytes once you include the applications, utilities and the rest.
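
It is worth doing that arithmetic before a disaster does it for you. The back-of-the-envelope Python sketch below estimates elapsed restore time from a data volume and an assumed aggregate tape throughput; both figures are placeholders, so substitute numbers you have actually measured in your own shop.

    # Sketch: back-of-the-envelope restore-time estimate.
    # Both inputs are assumptions; plug in your own measured figures.
    def restore_hours(data_gb, mb_per_sec_per_drive, drives):
        """Elapsed hours to restore data_gb at the given aggregate throughput."""
        total_mb = data_gb * 1024
        seconds = total_mb / (mb_per_sec_per_drive * drives)
        return seconds / 3600

    if __name__ == "__main__":
        # Example: 2 TB of critical data, two drives sustaining 10 MB/s each.
        print("estimated restore: %.1f hours" % restore_hours(2048, 10, 2))

Even with those fairly generous assumptions, two terabytes works out to roughly 29 hours of continuous restoring before anyone has verified a single file.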

So the point here is to make sure your recovery methodology is practical from a business standpoint, as well as a technical standpoint. You don’t want to be in the position of estimating “just three more days” before you’re up and running.

Worst Practice 5: Not Recovering a Complete Environment — As the state of the art advances, some technology is left behind. We’ll keep it succinct here: If you need to keep an old technology alive, you may need to provide some or all of the solution yourself. Don’t expect the recovery site to stock or maintain every peripheral ever made just because you have one esoteric requirement. And don’t forget to keep backup copies of any obsolete software packages as well.

Another aspect of this issue, recently discovered at a customer site, is that diverse platforms are now highly integrated. It’s not enough to recover just the e3000; the non-e3000 systems that share data feeds must also be recovered, and don’t forget any outside data sources either. Again, if you’re faking it, you can declare victory once you’ve reconstructed an e3000 at the recovery site. In reality, that only counts if the e3000 can support the business on its own, without any external feeds.
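
One way to keep that integration visible is to maintain a plain inventory of each system and the data feeds it depends on, then cross-check it against what the recovery plan actually covers. The Python sketch below does that with entirely made-up system and feed names; the value is in the cross-check, not in these particular entries.

    # Sketch: flag data feeds that the recovery plan does not cover.
    # All system and feed names below are hypothetical.
    FEEDS = {
        "e3000-order-entry": ["warehouse-unix", "edi-gateway", "bank-lockbox"],
        "warehouse-unix":    ["e3000-order-entry"],
        "edi-gateway":       [],   # external trading partners, outside our control
    }

    COVERED_BY_PLAN = {"e3000-order-entry", "warehouse-unix"}

    def uncovered_dependencies(feeds, covered):
        """Return (system, feed) pairs where a recovered system relies on an unrecovered feed."""
        gaps = []
        for system, sources in feeds.items():
            if system not in covered:
                continue
            for source in sources:
                if source not in covered:
                    gaps.append((system, source))
        return gaps

    if __name__ == "__main__":
        for system, source in uncovered_dependencies(FEEDS, COVERED_BY_PLAN):
            print("%s depends on %s, which the plan does not recover" % (system, source))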

Worst Practice 6: Ignoring the Human Factor — Even the best plans don’t execute themselves. Keep in mind who will be doing what, and how things will get done if key individuals are unable to perform their tasks. Families come first, and properly so; we mustn’t lose sight of our humanity in times of crisis. Any recovery is hard work. That counts double when there are casualties.

Reassess Your Assumptions

Worst Practice 7: A Defeatist Attitude — If you’ve been subjected to the “fake it” mentality, you’re probably demoralized. After all, who among us just wants to go through the motions? Well, it’s now a whole new world, and you have a really good shot at doing things right. But you need to forcefully make your case to those who didn’t take contingency planning seriously in the past. By the time you read this there may be stories about companies that unfortunately couldn’t recover from the September 11 attacks. We can emerge from this atrocity stronger if we do some honest introspection. Every rational businessperson should now be willing to do proper planning. If you can get over the bad practices of the past, you can position yourself and your business to be survivors.

Worst Practice 8: Datacenter Placement — As much as I enjoyed the view from my 29th-floor datacenter, it’s pretty obvious now that datacenters don’t belong in certain places – high-rise buildings among them. Beyond the prohibitive cost of floor space, there are safety and security issues that weren’t apparent until recent events.

I have visited many co-location facilities in the past year, and they all had several things in common:

1. They were in the low-rent district.

2. They were very difficult to find, as they were essentially unmarked.

3. They were very secure (at least relative to downtown datacenters), both physically and electronically.

4. They were redundant up the wazoo.

If this does not describe your datacenter, then perhaps it’s time to consider relocation. Let’s face it, even if there are good reasons why your datacenter needs to be right downtown, I’ll bet your recovery site is in the middle of nowhere. That should tell you something.

Hope for the Best

We’re currently in reactive mode. We’ve now seen one type of unimaginable act: airliners used as missiles. For those unlucky enough to be on the front lines of that atrocity, there was no way to plan for that series of events. And it’s likely that the next event will also be difficult to imagine, and hence to plan for. So even the best plan requires a great deal of luck, because no plan survives widespread devastation beyond your control. We should be honest about the aspects of business continuity and recovery that are within our control, and be truly prepared. But we can still hope we never actually need to use those plans, not the way we did after September 11. At least that’s the hope.

Scott Hirsh (scott@acellc.com), former chairman of the SYSMAN Special Interest Group, is an HP Certified HP 3000 System Manager and founder of Automated Computing Environments (925.962.0346), HP-certified OpenView consultants who advise on OpenView, Maestro, Sys*Admiral and other HP e3000 and HP 9000 automation and administration practices.


Copyright The 3000 NewsWire. All rights reserved.