Worst Practices: A Walk in the Minefield
It happens to the best of us. We're working on a system, doing something seemingly innocuous, when suddenly catastrophe strikes, leaving us confused and unsure of our abilities as System Managers.
I found myself in one of those situations recently, one that originated in a real worst practice. So in the tradition of Jerry Pournelle, the "here's my column detailing how I screwed up and what I did to pull my fanny out of the fire" school of self-promotion, I share with you my walk in the HP 3000 minefield du jour. Oh, and please pardon my limp. The mine I stepped on relieved me of a few toes.
Configure Now, Pay Later
By now everyone is familiar with the term "planned obsolescence," the marketing time bomb that ensures that whatever you buy today will not have lasting value. In the system management realm, there is a similar technique where a system is either underconfigured for short-term needs, or worse, configured in such a way that a collision with certain limitations of the operating system is possible if not probable. Whenever we, as system managers, configure in limitations or unwittingly tempt the operating system gods, we are planting mines that sooner or later someone will step on.
Sure, there are often extenuating circumstances (management wouldn't spend the money required to configure properly for growth) or lack of information (undocumented or unbelievable operating system bug). Nevertheless, your system is a disaster waiting to happen.
Examples of this include:
Underconfiguration: Running out of disk space during prime business hours; insufficient memory that hurts performance; lack of fault tolerance that eventually results in downtime.
Distribution of capabilities: SM all around! Everyone can do everything! Is it any wonder why production files and/or the system configuration change mysteriously?
Setup for failure: You configure in such a way that your system and its peripherals do not work properly under all circumstances.
Nice Day for a Stroll
My misfortune, on a system with 1,100 users logged on, was to pick door number 3. The system in question, configured by HP no less, has three DLT drives whose LDEVs are 1710, 1711 and 1713. In the process of implementing a new tape library system, I mounted an ANSI labeled tape, replied to the mount request at the console and kaboom. System halt, just like that.
Guess what? The labeled tape facility of MPE/iX to this very day is still in Compatibility Mode and cannot address LDEVs greater than 255. And when you try? System abort 2559 as soon as you hit return on your reply. According to the Response Center, there is a two-year-old SR on this issue, but HP has no plans to rewrite the labeled tape facility into Native Mode. So as long as these (or your) LDEVs are greater than 255, a labeled tape operation will continue to crash the system.
So where's the worst practice?
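For anyone who wants to check their own configuration before stepping on the same mine, here is a minimal sketch in Python of the audit I should have done. It is not an HP utility and does not read the system configuration: the function name `unsafe_tape_ldevs` and the constant `MAX_LABELED_TAPE_LDEV` are hypothetical, and you supply the list of tape LDEVs by hand. The only fact it encodes is the 255 limit described above.

```python
# Hypothetical audit sketch: flag tape LDEVs above the limit that the
# Compatibility Mode labeled tape facility can address.
# The limit comes from the column; the names here are illustrative only.
MAX_LABELED_TAPE_LDEV = 255


def unsafe_tape_ldevs(tape_ldevs, limit=MAX_LABELED_TAPE_LDEV):
    """Return the LDEVs greater than the limit, sorted ascending."""
    return sorted(ldev for ldev in tape_ldevs if ldev > limit)


if __name__ == "__main__":
    # The three DLT drives on the system in question.
    dlt_drives = [1710, 1711, 1713]
    for ldev in unsafe_tape_ldevs(dlt_drives):
        print(f"LDEV {ldev}: do not use for labeled tapes")
```

On the system in question, all three DLT drives would be flagged.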
Whoever configured the system (HP in this case) planted the seeds of a system halt. I just happened to be the poor slob who stepped on this configuration land mine (at the cost of some credibility, not to mention almost two hours of downtime for 1,100 users). You would expect HP to be aware of this issue, and not configure devices in such a way that a system halt may result during certain normal operations.
What's worse: LDEV numbers less than 255 were available for the configuration of these DLT drives, so putting these devices in harm's way wasn't even necessary. And by the way, what happened to the convention of configuring tape devices as low LDEV numbers?
Vendor Worst Practice #1: Leaving an operating system land mine where unsuspecting users (like yours truly) could step on it. This worst practice is compounded by three additional ugly aspects. First, not advertising that this issue exists (a post to HP3000-L prompted not one "I knew that" response); second, treating what on all other operating systems is a normal function (labeled tapes) like a stepchild feature unworthy of the effort of a Native Mode conversion (is there any other platform that is alive and kicking where labeling tapes is considered exotic?); and third, not returning an error message or other non-fatal means of informing the user that LDEVs greater than 255 are not supported.
Vendor Worst Practice #2: Not including a warning in the tape library documentation that labeled tape operations are only safe on LDEVs less than 255. Thanks for the limp.
And for full disclosure, this consultant (and X-Files fan) should have remembered the System Manager's motto: "trust no one." Not HP, not the tape library vendor, not even the client. As the expert who supposedly knows everything there is to know about the HP 3000 and MPE/iX, I should have known this problem exists. No excuses. Next time I check the Electronic Support Center and the HP3000-L archives first!
Fortunately, as this was obviously an obscure problem, the client was understanding about the outage. Realizing they had continued exposure to system halts, they have decided to reconfigure the DLTs to LDEVs in the 20-30 range in January 2000, then to resume the use of labeled tapes. Everyone is a little more wary of operating system surprises, and HP's system configuration activity will be scrutinized much more closely from now on.
Scott Hirsh, former chairman of the SIG-SYSMAN Special Interest Group, is a partner at Precision Systems Group (510.435.4529), an authorized HP Channel Partner which consults on HP OpenView, Maestro, Sys*Admiral and other general HP 3000 and HP 9000 automation and administration practices.
Copyright The 3000 NewsWire. All rights reserved.