At the risk of being too topical, which gives this month's installment a short shelf life, I can't help but think of the HP 3000 during this season of unpredictable electrical power service in California. Some argue that the build-up of Internet-related businesses, including xSPs and other companies that build large datacenters, is a contributing factor to this crisis. Regardless of the root cause, many of us innocent bystanders may become collateral damage before a long-term energy solution is implemented. And that brings us back to the land of high availability.
We all know that the HP 3000 has a long history of high availability when cared for properly. But through no fault of our own, we may be faced with downtime in an era of 24x7 expectations. So what are the alternatives for the HP 3000 System Manager who is unlucky enough to be caught in the crossfire of electrical utility deregulation?
First, to bring us up to date, we need to draw a distinction between disaster recovery and business continuance. Disaster recovery is reactive: something bad has happened and I must now recreate my operating environment in order to pick up where I left off when the utility pulled the plug. Disaster recovery implies downtime, although we try to optimize our recovery plan (which of course you have) to make that downtime as small as possible. Business continuance, on the other hand, means somehow keeping your operating environment up and running, with perhaps a few minutes pause, when something bad happens.
That something bad, in the case of the recent crisis, is an interruption of electrical power to the datacenter. Those who are lucky have backup generators that can span the outage. Those who are less lucky have UPSs that can gracefully shut down their systems when a power interruption occurs. And those who are foolish just let their systems flop, hoping for the best. (This is an abuse of faith in the HP 3000's robust architecture. Even if your system comes back up in good shape, it's never okay to leave your system vulnerable to power interruptions.)
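The "graceful shutdown" a UPS buys you comes down to one decision: start shutting down while enough battery remains to finish cleanly. A minimal sketch of that logic follows; the `UpsStatus` fields and the ten-minute threshold are illustrative assumptions, not any real UPS vendor's interface.

```python
# Hypothetical sketch: when should a UPS-protected system begin a graceful
# shutdown? Fields and thresholds are illustrative, not a real UPS API.
from dataclasses import dataclass

@dataclass
class UpsStatus:
    on_battery: bool              # True once utility power is lost
    battery_minutes_left: float   # vendor-reported runtime estimate

def should_shut_down(status: UpsStatus, shutdown_window_min: float = 10.0) -> bool:
    """Begin shutdown while enough battery remains to finish it cleanly."""
    return status.on_battery and status.battery_minutes_left <= shutdown_window_min

# On utility power, keep running; on battery with minutes left, shut down.
assert not should_shut_down(UpsStatus(on_battery=False, battery_minutes_left=60))
assert should_shut_down(UpsStatus(on_battery=True, battery_minutes_left=8))
```

The point of the threshold is that waiting until the battery is exhausted is the same as letting the system flop; the shutdown window must be at least as long as your slowest clean shutdown.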
So the need for an appropriate response to an outage scenario far more likely than an act of God (an earthquake, say) is more compelling than ever. I know that whenever I performed a disaster recovery drill I prayed to the datacenter gods that I would never need to recover for real. Now, it seems, the odds are against us. Guess what? You need to ratchet your contingency planning up a notch closer to reality.
The first place to revisit is your backup strategy. Not only must you be sure that backup is working the way you think it is, but also that your restore procedures are tested and foolproof. Believe it or not, I watched in horror not too long ago as a client grabbed the wrong tape and began to use it for a simple (but important) restore. Their focus was on the fact that the restore was for @.@.@. My question was: why couldn't you manage to mount the correct tape? Let's face it, using the wrong tape for a restore is never a good idea!
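A foolproof restore procedure refuses to start until the mounted tape is the one the recovery plan calls for. Here is a minimal sketch of that guard; the label format and the idea of reading a label off the mounted volume are hypothetical stand-ins, not a real MPE/iX interface.

```python
# Hypothetical sketch: gate a restore on the mounted tape's label matching
# the expected one. Label values here are illustrative only.
def verify_tape(expected_label: str, mounted_label: str) -> None:
    """Raise before any restore begins if the wrong tape is mounted."""
    if mounted_label != expected_label:
        raise RuntimeError(
            f"Wrong tape mounted: expected {expected_label!r}, got {mounted_label!r}"
        )

verify_tape("FULLBK-0315", "FULLBK-0315")       # correct tape: restore may proceed
try:
    verify_tape("FULLBK-0315", "FULLBK-0308")   # wrong tape: restore never starts
except RuntimeError as err:
    print(err)
```

The design choice is to fail loudly before the restore touches anything, rather than discover mid-restore that last week's tape is overwriting this week's data.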
Taking the backup issue one step further, has your data grown so large that you couldn't accomplish a full restore quickly enough to save your business? I applaud those who have backup down to a science, but what if it takes you 72-plus hours to get to your recovery site with your tapes and fully rebuild your system? Are you then, in fact, backing up for reasons other than recovery?
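The 72-hour question is worth answering with arithmetic before a crisis answers it for you. A back-of-envelope sketch, using purely illustrative figures (500 GB of data, a drive streaming 5 MB/s), looks like this:

```python
# Back-of-envelope sketch: how long would a full restore take?
# The data size and drive throughput below are illustrative assumptions.
def restore_hours(data_gb: float, drive_mb_per_sec: float, drives: int = 1) -> float:
    """Hours to restore data_gb, streaming across `drives` tape drives."""
    seconds = (data_gb * 1024) / (drive_mb_per_sec * drives)
    return seconds / 3600

single = restore_hours(500, 5)              # one drive: more than a day of tape
parallel = restore_hours(500, 5, drives=4)  # four drives: under eight hours
```

Run the numbers for your own data volume and drive count; if the answer exceeds the downtime your business can survive, your backups are serving some purpose other than recovery.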
For companies whose data is too humongous to contemplate rebuilding, the focus moves to business continuance. Generally speaking, this translates to replicating your data in real time or near real time to another location, ideally one with a different set of risk factors (say, an area that doesn't experience rolling blackouts, or one not on the same fault line as your primary datacenter). For performance reasons, this type of replication is best handled at the storage system level, but there are products (Quest comes to mind) that will accomplish the task in software. With two sites/systems mirrored, you transition from the primary site to the recovery site in the event of an interruption. This can be done within minutes, without getting on an airplane and spinning tapes. For large, mission-critical applications, chances are you are already doing this for at least one system in your datacenter. It's just a matter of time before this requirement trickles down to your run-of-the-mill HP 3000.
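Near-real-time replication means the secondary is always slightly behind the primary, so a sane failover procedure checks that lag before promoting the recovery site. A minimal sketch of that gate, with an assumed five-minute tolerance (the threshold and the notion of a measured lag are illustrative, not from any particular replication product):

```python
# Hypothetical sketch of a failover gate: promote the replicated secondary
# only if its lag behind the primary is within tolerance. The threshold is
# an illustrative assumption, not any product's default.
def ok_to_promote(replication_lag_sec: float, max_lag_sec: float = 300.0) -> bool:
    """A secondary more than a few minutes behind risks losing transactions."""
    return replication_lag_sec <= max_lag_sec

assert ok_to_promote(45)        # near-real-time mirror: safe to fail over
assert not ok_to_promote(3600)  # an hour behind: investigate before promoting
```

The acceptable lag is a business decision, not a technical one: it is the number of minutes of committed transactions you are willing to lose in exchange for a fast failover.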
To work in the inevitable sports metaphor, when it comes to protecting your operation, the best offense is a good defense. Translation: a good defense is protecting your operation so that you never have to rebuild it from scratch (that's the offense part), a proposition that is becoming increasingly impractical. At the low end of the cost spectrum is protection for the physical and environmental aspects of your operation: a backup generator, structural retrofits to withstand more numerous and severe acts of God. These are relatively small investments against relatively minor threats. For the big one, there is remote data replication, with all the associated procedures for promoting your secondary site to acting primary.
Just because we manage the most reliable systems on the market doesn't mean we can afford to rest on our laurels. Alas, even though we've done everything right to ensure system availability, the bums turned around and messed up our infrastructure (and raised our rates in the process, to add insult to injury). Who would have thought that one day we would find our HP 3000s even more reliable than the electrical grid? That was not what I expected to find on the other side of that bridge to the 21st century.
Scott Hirsh, former chairman of the SIG-SYSMAN Special Interest Group, is a partner at Precision Systems Group, an authorized HP Channel Partner which consults on HP OpenView, Maestro, Sys*Admiral and other general HP 3000 and HP 9000 automation and administration practices.
Copyright The 3000 NewsWire. All rights reserved.