I’m an IT guy, and I’ve seen my share of colossal failures in the workplace over the years. Recently there have been some “IT disaster” threads at Ars Technica and Reddit which got me thinking about my own disaster stories. Here are four of my favorites. Note that all but the last one come from the same job at a third-party IT company I used to work for.
THE RAID ARRAY FROM HELL
I was sent to a real estate firm to swap out a failing drive in a RAID 5 array. Thanks to the LEDs on each drive, I quickly spotted the drive I was to replace. I opened the RAID utility on the server to make extra-sure I was replacing the correct drive. The software verified that yes, the drive with the blinking LED is failing. I removed the old drive, put the new drive into the slide, and placed it in the array. The software recognized the new disk and asked if I wanted to rebuild the array. I clicked yes, and for the next 20 seconds or so everything seemed normal. But then the server BSOD’d. When I tried to reboot it I got the dreaded “SYSTEM DISK NOT FOUND” error message.
Come to find out, this server was one of the first my boss built himself after he started his company. For reasons only he knows, he installed Windows 2000 Server onto the RAID 5 array itself. Now this isn’t a “disaster” per se. The RAID software should have been able to rebuild itself without taking down the entire array. But installing an operating system onto a RAID 5 array is just something I’ve never seen done, ever. I’ve only worked with small and medium-sized businesses (SMB). In an SMB environment, you’d typically install Windows Server onto a regular hard drive or possibly a RAID 1 array. You then create the RAID 5 array as a separate disk to store vital data. And you do it this way because the operating system files just aren’t that valuable, and installing Windows on a standard (or RAID 1) drive is significantly less complicated (as a general IT rule, the fewer points of failure or complexity the better). If you have no idea what I’m talking about, imagine installing Windows on a regular hard drive, and putting all your important data on a heavy-duty, “guaranteed to never fail” external hard drive. If the Windows drive dies, it’s no big thing to go to Best Buy, get a new hard drive, reinstall Windows, then reinstall the external drive, right? Same theory, different implementation. And this real estate agency had tried to become as paperless as possible, so everything was on the server… which was now dead.
The icing on the cake was that the owner, an attorney with zero sense of humor but a giant sense of ego, flipped out because… “[my boss at the IT company] told me that we didn’t need backups because of this RAID thing!” I tried explaining that RAID is not a backup, just a way to make hard drives more fault tolerant. But he seemed to be of the opinion that my boss told him otherwise. Which put me in a pickle. Anyone who’s worked in IT knows that you can say one thing, even in as simple English as possible, and clients hear another. So it’s possible that my boss said no such thing, but the client interpreted it as such. On the other hand, I knew my boss would tell clients anything he felt they wanted to hear to make a sale. Perhaps my boss was afraid that the client wouldn’t sign the contract if he added a $1,200 tape drive into the mix. Maybe my boss was planning to sell him some kind of tape or online backup later on. Whatever the case, I had a dead server and a highly pissed off attorney to deal with. And it wasn’t pretty. I took the server back to the office and rebuilt it from scratch – not installing Windows Server on the RAID 5 array this time. My boss claimed to have recovered more than half the data off the old array… but the recovery software only pulled the file names; the actual files themselves were just a bunch of binary gibberish. So the firm started over from scratch.
LESSONS FROM THIS ORDEAL: RAID is not a backup. Don’t lie to clients, and make them understand, no matter what you have to do, what they’re signing up for.
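If you’re wondering why RAID isn’t a backup, here’s a minimal Python sketch of the XOR parity idea behind RAID 5. It’s a toy: the three-drive layout, the block contents and the xor_blocks helper are all made up for illustration, and real controllers stripe and rotate parity in ways this ignores. It shows that parity can rebuild one dead drive, but it will just as faithfully “protect” garbage once garbage has been written.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 3-drive array: two data blocks plus parity.
drive_a = b"DEED"
drive_b = b"LEAS"
parity = xor_blocks([drive_a, drive_b])

# Drive B dies. Its block can be rebuilt from the two survivors.
rebuilt_b = xor_blocks([drive_a, parity])
assert rebuilt_b == drive_b

# Now the data itself gets trashed (bad write, virus, user error).
# The array dutifully recomputes parity for the trashed block.
original_b = drive_b
drive_b = b"\x00\x00\x00\x00"
parity = xor_blocks([drive_a, drive_b])

# A rebuild now gives back the trashed block, not the original.
rebuilt_b = xor_blocks([drive_a, parity])
assert rebuilt_b != original_b
```

In other words, the array only knows how to reproduce whatever is currently on the disks. Getting yesterday’s files back takes a separate copy, which is what a backup is.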
* * *
CHERRY PICKING DOESN’T WORK
One of our clients was an attorney who owned a firm with 7-8 other attorneys. This guy loved gadgets. His favorite was a 60″ plasma TV that was hidden in the conference room; when you pushed a button on a remote, the TV would rise up out of a credenza, like something out of a James Bond movie. It might seem ho-hum now, but it was really cool (and God knows how expensive) back in 2003.
The attorney read lots of computer magazines, and his office was littered with eWeek, Maximum PC and several other enthusiast rags. So when the time came to get a new server, he decided to build one from scratch. He spent weeks researching each and every part, determined to buy the very best parts he could. A standard, $25 LiteOn CD-ROM drive wouldn’t do: he had to have the $200 Plextor XXXTREME SCSI DVD drive. Standard server RAM wouldn’t do, either. He had to have gold-plated XXXTREME RAM inspected by the chief engineer of Samsung himself. I swear to God, if Corsair made non-conductive, high-performance titanium motherboard risers, this guy would have bought them, even if they were $300. Each.
As anyone who has dealt with a lot of PC hardware can tell you, just because every single part of a computer is top-notch, that doesn’t mean that those components will work well together. Although this guy had built a server with the most expensive, highest-quality motherboard Asus made and the highest-rated SCSI card available on the open market anywhere in the world… they just didn’t get along that well. One time I was sent there to add some type of PCI card to the server. Because the PCI slot was covered by the SCSI cables, I powered the server off, removed the SCSI cable, installed the device, put the SCSI cable in the exact same location, tested that everything was tight and secure, and powered the server back on. I was rewarded with:
Windows 2000 could not start because the following file is missing or corrupt:
Awesome! I tried power-cycling the server several times, and it eventually booted. Everything seemed A-OK from that point on, so I went back to the office and warned everyone about the flaky SCSI card on the server.
A few weeks later I was sent back, this time to install (ironically enough) a tape drive. I powered the server off, installed the SCSI tape drive, and powered the server back on… and got the same message as before. I frantically tried power-cycling the server and tried various combinations with and without the new tape drive. But the server just wouldn’t boot. I soon became so worried that I called my boss, and he and one of his senior tech guys came out to have a look.
I knew from the error message that the Registry had somehow become corrupted, and that you could easily make the server boot again if you restored the hives from the backups Windows keeps on the drive. I even found and printed out an MS KB article that said as much. But my boss didn’t want to hear that. He was a big believer in the “shotgun approach” to IT repair. He wiped the server and reinstalled Windows Server from scratch. And, of course, since the new domain had a different SID than the old domain, all the desktops had to be joined to the new domain. I knew full well how to “recycle” old profiles instead of forcing users to have all new profiles… but my boss was so pissed at me by that point (even though I hadn’t really done anything wrong) that he insisted that we copy everyone’s profile to another server, manually join them to the new domain, and copy everything back. It took us almost all night long to do this – I remember the sun starting to peek up from the horizon as I left. And the attorney was pissed.
LESSONS FROM THIS ORDEAL: Unless you have some specific need that Dell, HP or IBM can’t meet, always buy your servers from a major OEM. They’ve tested that shit and know that it all works together.
* * *
THE SOFTWARE UPDATE FROM HELL
One thing that drove me absolutely insane about my old boss was that he’d frequently send people out to fix or upgrade things they’d never dealt with before. “Oh, it’s easy,” he’d say. “Just do this and this and you’re good.” Needless to say, I was very uncomfortable working with software or hardware I’d never seen before on clients’ production machines. It seemed thoughtless at best and downright dangerous at worst. Here’s a great example why:
We had a client in High Point, North Carolina. That’s about an hour and a half away. These folks used Timberline, accounting software optimized for construction firms (the original software maker was bought out, and the new “Sage 300 Construction and Real Estate” does much more, like project planning, estimates, etc. At the time, it was mostly an accounting thing).
My boss decided to send me up there to upgrade their software from something like version 7.1.233 to 7.1.366. I told my boss that I wasn’t comfortable with this, having never heard of the software before that very minute. As always, he said that it was no problem, and he even had a co-worker come with me to make sure it went smoothly.
My co-worker drove, so I was able to read Timberline’s upgrade instructions two or three times on the trip there. And, once we arrived, I did a walk-through on the server to make sure that there were no surprises. So finally, around 30 minutes in, I ran the update. It said it was successful, so… cool.
My co-worker and I then tried updating all the clients. But there was a problem. When you started Timberline, the client software said “HEY! THERE’S AN UPDATE! DO YOU WANT TO INSTALL IT?” You’d click yes, and the update would install. Then it would tell you that Timberline needed to be restarted. But when you restarted, you’d get the same “HEY! THERE’S AN UPDATE” message. The clients were all caught in an update loop, and it seemed impossible to fix.
I’d backed up all the Timberline files on the server, so I deleted the updated ones, restored the old ones, then ran the update again. No dice; same problem. I tried calling Timberline support, but by now it was past 21:00, and support was closed. I called my boss, but he had no ideas. My co-worker and I futzed with the $#&@# software until midnight, and never could get it working again, even when we tried going back to the old configuration.
Needless to say, the client was cheesed: not only were they forced to stay with us until midnight, their entire business had come to a halt. With no other ideas and a client close to becoming physically hostile to me, I felt I had no choice but to leave. I called my boss (waking him up, always a morale booster) and told him what had happened.
He went up there the next morning, but had the same problem. So (unbeknownst to me) he called Timberline support and found out that, for some reason, you can only install updates from a client computer. Installing them on the actual server, which is what I was doing, didn’t work. So he rolled back to my original backup, installed the update from a client machine… and all was good. He then installed the updates on the other clients, and all was still good.
When we finally saw each other back at the office, he mildly chewed me out for screwing it up so badly. I showed him the upgrade instructions – the actual instructions Timberline had included with the upgrade disc, printed on heavy card stock – that made no mention of having to install the update from a client computer. My boss acted as if he knew all along that that’s what you had to do. I just wanted the ordeal to be over, so I didn’t ask why he hadn’t mentioned that the night before. It wasn’t until a few days later when I returned to the office and found a follow-up email from Timberline support that I found out he hadn’t known what to do, either.
LESSONS FROM THIS ORDEAL: Training your employees is good.
* * *
OUTLOOK GONE WRONG
This wasn’t exactly a “disaster”, but the fallout from it has irritated me for years.
Outlook 97 was a truly awful app. It was so bad that Microsoft quickly released Outlook 98 and made the update available for free on its website. I worked for a company that had around 1,500 users who were licensed for Outlook 97. But the other desktop support guy and I made a pact to install Outlook 98 on as many computers as possible, since installing it really cut down on our help desk calls.
And so, for several months, things were mostly groovy with Outlook. But then the IT bosses decided to upgrade all desktops to Outlook 2000 (and just Outlook 2000; the rest of the suite would remain Office 97). They pushed out the update via Group Policy without really testing it. And that’s when I discovered a fairly major problem: most users had Outlook set up to use the internal text editor. But a large minority had Outlook set up to use Word as the text editor. And, after the upgrade, those people were uniformly getting this error message when opening a new email:
Microsoft Word is set to be your e-mail editor. However, Word is unavailable, not installed, or is not the same version as Outlook. The Outlook e-mail editor will be used instead. An OLE registration error occurred. The program is not correctly installed. Run Setup again for the program.
I looked online and discovered that it was a known issue that Microsoft was looking at. At the time, there was no fix available, other than to use the internal editor. So I sent an email to everyone else on the local IT staff that basically said as much. Thirty minutes later I got a reply from my boss which said something like “I don’t really appreciate you sending out that email, Jim. You should be looking for SOLUTIONS, not telling us all your PROBLEMS.”
I was floored. I was a lowly contractor, and my co-worker and I were used to being treated like dirt. The IT guys who were actually employed by the company normally only concerned themselves with high-end issues, leaving the “little stuff” to guys like us. But this was simply beyond the pale. I had sent an email to my boss as a friendly “hey, there’s a problem with the rollout, and it can’t be fixed but MS should have a hotfix soon” kind of heads-up, and here’s my boss being a jerk about it. What’s more, he was one of those type-A personalities (endemic in the IT management world) who like to say “it’s always better to ask forgiveness than permission”. And here I was, basically “asking” for forgiveness (for a problem I didn’t even cause!) and getting chewed out for it.
I didn’t last much longer at that job. For one thing, my boss kept turning into a bigger and bigger jerk. He seemed to relish asking me for something impossible (“hey, would you install Windows 2000 on this Alpha server?”) and then becoming “outraged” when it didn’t work. And the company had moved from their original location, too. Although it was only 5 or 6 miles from the old location, it almost doubled my commute to an hour and fifteen minutes to an hour and a half (one way) every day. So we “mutually agreed” that I wouldn’t come back when my contract ended. The joke was on him, though: two months after I quit/was fired, the company shut down that location and moved it to Mexico.
LESSONS FROM THIS ORDEAL: You’re a manager? Don’t be a dick.