Heavy duty work on a harbinger
By Steve Hammond
Kids call him the boogeyman, football players in training camp call him the Turk, WWII combatants called him a gremlin, and countless teen slasher movies have been written with him as their raison d'être. He is a way of giving a face to the misfortune and evil that exist in the world. Whoever he is, he visited me on a Thursday evening a couple of months ago.
The harbinger of doom came via a call on my cell phone from my evening operator as I walked to my car in the parking lot. "Hey, Steve," it started, but I could tell by his tone of voice he wasn't calling me for my bread pudding recipe. "I was getting ready to start the backup and the 987 went down." The combination of the phrases "getting ready to start the backup" and "went down" in the same sentence is never what you want to hear. It's never "right after the backup finished, it went down." No, it's always prior to the backup.
Since the 987 is running minimal production applications, I waited until I got home to call my hardware support vendor. He dialed into the computer, got it back up, did some checking and told me it looked like a disk was going bad. The worst-case scenario was a full reload. He thought the system would stay up long enough for us to try to get a full backup and to do some investigation.
We knew the bad device (LDEV 47) and it was now time for me to don my cape and my shirt with the red SysMan on the front and get to work. It was also time to crack open my toolkit and start some heavy duty work with MPEX.
Since 47 was the bad device, I could look at the files on LDEV 47 and see if anything really important was there. And yes, I realize that if the device was bad, the file labels could be corrupt already and any information could be unreliable. I forged ahead, thinking that if the labels were corrupt, the files were toast anyway and the full backup would be of no help. But if the problem had not caused file corruption (as was ultimately the case), any information gathered at this time would be useful in determining the damage when the system was fully functional again. The first command I ran was:
Well, by the time 30 pages scrolled past, I realized I needed to narrow my parameters. Next try was:
The last full dump had been five days ago, so the TODAY-5 looks for anything modified in the last five days. For some bizarre reason I can no longer remember, I once had to do a TODAY-1000 and it worked! Who knows what limits there are for MPEX?
Again, I got a larger number of files than I anticipated: over 300. Who knew? This system really only has one production system left on it. Maybe LDEV 47 was going bad because it was doing all the work. Finally, I created a file called TARG that consisted of:
Those were the four accounts that were crucial to getting the production systems up and running once the drive was replaced. So this time, I tried:
The result was a manageable 15 to 20 files. I copied the results into an e-mail, sent it to my programmer and asked that he check the integrity of the files.
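The narrowing described above boils down to three successive filters: which device a file lives on, how recently it was modified, and which account it belongs to. The Python sketch below only illustrates that logic; it is not MPEX syntax. The `FileEntry` record, the sample inventory, the dates and the account names other than FINANCE are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical inventory record; not a real MPE file label."""
    name: str        # file.group.account
    ldev: int        # logical device the file resides on
    modified: date   # last-modified date


def account_of(entry):
    """Return the account portion of a file.group.account name."""
    return entry.name.split(".")[-1]


def files_at_risk(inventory, ldev, today, days_back, accounts=None):
    """Files on `ldev` modified within `days_back` days of `today`,
    optionally restricted to a list of accounts."""
    cutoff = today - timedelta(days=days_back)
    hits = [f for f in inventory if f.ldev == ldev and f.modified >= cutoff]
    if accounts is not None:
        wanted = {a.upper() for a in accounts}
        hits = [f for f in hits if account_of(f).upper() in wanted]
    return hits


# Hypothetical sample data, for illustration only.
today = date(2024, 1, 10)
inventory = [
    FileEntry("ORDERS.DATA.FINANCE", 47, date(2024, 1, 9)),
    FileEntry("LEDGER.DATA.FINANCE", 47, date(2023, 12, 1)),
    FileEntry("TEMP1.PUB.SCRATCH", 47, date(2024, 1, 8)),
    FileEntry("CUST.DATA.SALES", 12, date(2024, 1, 9)),
]

# Second attempt: the bad device, changed since the last full dump.
recent = files_at_risk(inventory, ldev=47, today=today, days_back=5)

# Third attempt: the same filter, limited to the crucial accounts.
crucial = files_at_risk(inventory, ldev=47, today=today, days_back=5,
                        accounts=["FINANCE", "SALES"])
```

Each added condition only ever shrinks the list, which is why the account filter was the step that finally made the output small enough to hand to a programmer.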
At this point, we toyed with the idea of moving those files off LDEV 47 to another device with the ALTFILE command. We could have created another indirect file (called TARG2) with all the file names in it and done:
%ALTFILE ^TARG2;dev=(either a device number or a device class)
Then the files would have been off the bad drive, but if they were already damaged, moving the file does not heal it, so we chose to just do the full backup and determine the situation when it was complete.
We were lucky and the full backup completed successfully. At that point, before the repair technician got there, I did a little more MPEX remediation. This work was in anticipation of bringing the system back after the repairs. Even though my 987 is not heavily used, at most times I have 11 jobs scheduled. These jobs move data between the 987 and servers on the network for various applications and are usually put into the scheduled queue by the same job, running the night before, issuing the STREAM...;IN=1,0,0 command just before EOJ. I gave up long ago trying to keep track of all these jobs and put them into my STARTUP.PUB.SYS job because the programmers would change them so often. Instead, when I have the luxury, just before I take the system down, I issue:
That saves all the details I need to restart the jobs when the system comes back up, which I do with %DOSAVED;file=filename. The other related command is %SHOWSAVED;file=filename, which shows the jobs that are saved and will be executed.
As luck, or bad luck, would have it, these efforts were moot since the drive had to be replaced, so a reload was required. That's right: no private volumes, no mirrors; management has been no-frills on the 987 for years. But I like to do one more thing when I reload my system. I do it in steps. The first is a restore that does:
Followed by restore *FULLDUMP;@.@.@-@.ROBELLE-@.@.SYS-@.@.VESOFT;...
Sometimes I add the REGO account to the first restore, but the reason is selfish. I like to monitor the reload as it progresses, and the tools offered by Vesoft, Robelle and Adager help me do that and check the condition of files, databases and account structure. It's nice to know that the databases in the FINANCE account are not damaged before I give the system back to the users. And on the other side of the fence, if I find a corrupt database in FINANCE, I can be fixing it while the restore continues through the rest of the alphabet.
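The two-step reload can be pictured as splitting the account list into a tools pass and an everything-else pass. The Python sketch below illustrates only that split; it is not a RESTORE command. It assumes the first pass covers exactly the accounts the second pass excludes (SYS, VESOFT and ROBELLE), and the `plan_restore` helper and the PAYROLL account are hypothetical.

```python
def plan_restore(all_accounts, tool_accounts):
    """Split accounts into a first pass (system and tool accounts)
    and a second pass (everything else, in alphabetical order)."""
    tools = {a.upper() for a in tool_accounts}
    first = sorted(a for a in all_accounts if a.upper() in tools)
    second = sorted(a for a in all_accounts if a.upper() not in tools)
    return first, second


# Hypothetical account list, for illustration only.
accounts = ["FINANCE", "SYS", "PAYROLL", "VESOFT", "ROBELLE", "REGO"]

first_pass, second_pass = plan_restore(accounts, ["SYS", "VESOFT", "ROBELLE"])
```

Adding REGO to the tool list moves it into the first pass, so the database tools are available as soon as the first restore finishes.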
I could say the disk drive failure was a learning experience, but it wasn't. It was a pain in the butt, and I still rue the day a former boss got that great deal on disk drives. But that's a column on a totally different topic.
Steve Hammond, who works for a professional association in Washington, DC, thinks he may have just written a Worst Practices column for Scott Hirsh.
Copyright The 3000 NewsWire. All rights reserved.