Worst Practices November 2000: Blind Update

Blind Update

By Scott Hirsh
Precision Systems Group

Update season is now upon us, as HP switches from carrot to stick in its attempt to move us along to the latest version of MPE. The version of the operating system we’re now running will no longer be “supported,” inducing fear in managers who imagine systems turning into pumpkins at the stroke of midnight on operating system judgment day. Support calls will go unanswered, patches will no longer be provided… basically, you’re on your own, pal.

To make matters worse, third-party software companies are piling on, vowing no support for the new version of their software on last year’s model of MPE. As in the case of a certain important software package for the health care industry, you’re now looking at updates to a major application and the operating system at roughly the same time. This is not a pleasant thought at a time of year when many businesses conduct annual employee reviews.

As any project manager knows, change equals risk. And risk is something we HP 3000 system managers have been very successful in minimizing. Yet all of us must eventually apply updates and patches to our systems, and this process of change has a way of turning stable environments into unstable ones. As usual, the shops that incorporate change without incident do not do so by accident. They do their homework, they plan, they test, they provide for contingencies, and then they execute. This kind of fundamental task should be simple — after the third or fourth time, anyway — but somehow it’s not. So to reinforce what you already know, let’s review some of the pitfalls of updating your perfect, stable environment.

Computer System as Ecosystem

According to Webster, a system is an “interdependent group of items forming a unified whole.” That means if you make a change in one place, chances are you will be affecting something in another place. This goes double for MPE, the HP 3000’s operating system, where an update affects everything from applications to peripherals. Therefore, worst practice #1 is not verifying compatibility of every critical application and hardware component with the revision of the operating system you plan to install.

When a system manager fails to thoroughly research the operating system’s compatibility with existing hardware and software, the results tend to be doubly tragic because an operating system update is typically performed on a weekend, when many third-party software vendors either do not or cannot provide an emergency fix. Furthermore, as Murphy would have it, the problem typically surfaces in the 20th hour of non-stop work. Take it from me: a desperate, sleep-deprived system manager speaking to a vendor’s answering service is not a pretty sight.

Cycle of Stupidity

We all make mistakes — once. But some of us never learn, because we focus on the particulars and not on the general theme. For example, the first time we updated we forgot to check the compatibility of a third-party utility. Well, we won’t let that happen again. So next time we make sure to call that vendor to make sure their product is compatible with the latest release of MPE. However, this time it’s a different third-party utility we missed, perhaps one acquired since the last time we updated.

By focusing on a specific product, we miss the larger theme: keep a current inventory of every application and device, ready for the inevitable next update. Only by learning from past mistakes and adjusting the update process itself do we avoid making the same mistake again.
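One way to keep such an inventory honest is to hold it as data rather than in someone’s memory. The following Python sketch is illustrative only: the `InventoryItem` fields, the item and vendor names, and the release labels "CURRENT" and "TARGET" are all hypothetical placeholders, not anything prescribed by HP or by a vendor. It simply lists every item whose vendor has not yet confirmed the release you plan to install.

```python
from dataclasses import dataclass, field


@dataclass
class InventoryItem:
    """One application or device that must survive the update (hypothetical schema)."""
    name: str
    kind: str      # "application" or "device"
    vendor: str
    # OS releases for which the vendor has confirmed compatibility
    confirmed_releases: set = field(default_factory=set)


def unverified_items(inventory, target_release):
    """Return the names of items not yet confirmed for target_release."""
    return [item.name for item in inventory
            if target_release not in item.confirmed_releases]


# Placeholder data: names and release labels are made up for illustration.
inventory = [
    InventoryItem("backup utility", "application", "Vendor A", {"CURRENT", "TARGET"}),
    InventoryItem("job scheduler", "application", "Vendor B", {"CURRENT"}),
    InventoryItem("line printer", "device", "Vendor C"),
]

# Everything printed here still needs a call to the vendor before update weekend.
print(unverified_items(inventory, "TARGET"))
```

Anything acquired since the last update shows up on the list automatically, provided it was added to the inventory when it arrived.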

No Testing, No Sympathy

Many of us still think that one cannot truly test a new operating system. After all, every environment is a unique set of hardware, applications and user behavior. Maybe so, but no testing almost guarantees that something will be missed. Sure, some system managers get lucky through a combination of thorough compatibility research and exchanges on the Internet. But luck is not consistent, and this is computer science, after all. So the question remains, what do you do about testing?

First of all, what are you testing for? Here’s my short list:

• Within the constraints of the Communicator, the system works as before the update. In-house and third-party applications run properly, commands execute without error, printers print, etc.

• Performance. Those with marginal configurations may need to consider a hardware upgrade.

• System resources, disk space in particular.

• Interoperability. I can still talk to other systems and platforms and they can still talk to me.

• Any new devices.

The next big hurdle is where to test. The options have not changed since your last update:

Your Very Own Test System

To this day, it is an uphill battle to acquire a test system. But we should never give up the fight, because a wide variety of used hardware is available at reasonable prices. Eventually, you will find a used system inexpensive enough for even the most clueless manager to accept. But this must be a true “crash and burn” machine, separate from the development system that programmers use, which has to be maintained at production service levels. Chances are, this machine will crash and burn on occasion. After all, the whole point of testing is to find those fatal bugs before they hit production.

Rent

The same people who bring you “previously owned” systems will also rent hardware. If you didn’t discover this during Y2K testing, you probably never will. Just know that renting may be a viable option if you only perform updates with a gun to your head.

Two Birds With One Stone

Any serious company is concerned about business continuity. That means a hot or cold site for critical systems, which usually includes HP 3000s. Nobody ever said that you could only use your backup site for disaster recovery drills. Indeed, the SunGards and Comdiscos of the world would be more than happy to let you test a new operating system at their site — for a fee. But if you plan it right, you can do both a disaster recovery drill and an operating system test at the same time. Then you get two tests for the price of one.

Plan A

Plan. There’s that four-letter word again. But plan you must, as there is so much more to an update than following the directions in the update guide. There’s the aforementioned compatibility research; any additional hardware requirements, in the unlikely event that the latest revision requires more CPU, memory or disk space; picking a time to actually do the update; notifying users that the system will be unavailable; etc.

The masochists reinvent the wheel every time they update, with one common excuse being that updates don’t occur very often (every five years whether I need to or not). Those of us who are risk averse (and pain averse) create a plan, then adjust it based on experience. If you can’t be bothered to plan for your sake, do it for me, the guy who inherited your mess.

Plan B

Okay, let’s say you’ve done your homework, you’ve tested, you have a plan for implementing the update. Then, in the process of updating, you hit an unanticipated show stopper. What now?

You go to your backup plan, that’s what. If your plan assumes that everything will go smoothly, what you have is a bad plan. Always, always, always account for trouble. This may be as minor as leaving out a few steps, or as major as a complete back-out of the entire process. This also means you must factor in enough time to undo what you have done. Good testing should make a complete back-out a rare occurrence, but proper testing is a learning process. So do yourself a favor and practice reinstalling an old operating system.
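The time arithmetic is worth doing before the weekend starts. As a rough sketch, with hypothetical figures (a 48-hour window, a 10-hour reinstall and restore, a 2-hour safety margin), the point of no return can be worked out like this. The function name and its parameters are assumptions made for illustration, not part of any update guide.

```python
def latest_backout_start(window_hours, backout_hours, safety_margin_hours=1.0):
    """Hours into the update window by which a back-out must begin.

    Past this point there is no longer enough time to restore the old
    operating system before users expect the machine back.
    """
    deadline = window_hours - backout_hours - safety_margin_hours
    if deadline <= 0:
        raise ValueError("window too short to update and still back out")
    return deadline


# Hypothetical weekend: 48-hour window, 10-hour restore, 2-hour margin.
print(latest_backout_start(48, 10, 2))  # prints 36
```

If the update is not demonstrably healthy by that hour, the decision to back out has already been made for you. Measure the back-out time during a practice reinstall rather than guessing at it.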

Communicate

Critical communication occurs on several different levels:

• Between applications staff and systems staff. The programmers will need to validate that applications still work properly on the new operating system.

• Between the datacenter and users. The users must be notified in advance that changes are coming, and how they affect system availability.

• Between you, the customer, and your vendors. Confirm compatibility; know when they are and are not available to help you; find any on-line resources that may save your bacon; and plan for timely delivery of media.

• Among you and your fellow system managers. Learn from those who have gone before you, contribute your experience to save others from any pain you may have experienced or to share success stories.

Follow-Up

Some issues, like performance, will not be fully resolved until the update is in and normal user loading occurs. Consequently, your job is not complete when the last item in the update guide is checked off. Instead, you must continue to monitor the system for any unintended side effects. And you should also monitor the usual sources (the 3000-L newsgroup/mailing list, in particular) for any news about bugs and patches.

There is no good reason that an operating system update should be the root canal of system management. All it takes is organization, communication, common sense and a little luck.

Scott Hirsh, former chairman of the SIG-SYSMAN Special Interest Group, is a partner at Precision Systems Group, an authorized HP Channel Partner which consults on HP OpenView, Maestro, Sys*Admiral and other general HP 3000 and HP 9000 automation and administration practices.