I’m an IT guy, and I’ve seen my share of colossal failures in the workplace over the years. Recently there have been some “IT disaster” threads at Ars Technica and Reddit, which got me thinking about my own disaster stories. Here are four of my favorites. Note that all but the last one come from the same job at a third-party IT company I used to work for.
THE RAID ARRAY FROM HELL
I was sent to a real estate firm to swap out a failing drive in a RAID 5 array. Thanks to the LEDs on each drive, I quickly spotted the drive I was to replace. I opened the RAID utility on the server to make extra-sure I was replacing the correct drive. The software verified that yes, the drive with the blinking LED was the failing one. I removed the old drive, put the new drive into the slide, and placed it in the array. The software recognized the new disk and asked if I wanted to rebuild the array. I clicked yes, and for the next 20 seconds or so everything seemed normal. But then the server BSOD’d. When I tried to reboot it, I got the dreaded “SYSTEM DISK NOT FOUND” error message.
Come to find out, this server was one of the first my boss built himself after he started his company. For reasons only he knows, he installed Windows 2000 Server onto the RAID 5 array itself. Now, this isn’t a “disaster” per se. The RAID software should have been able to rebuild the array without taking the whole thing down. But installing an operating system onto a RAID 5 array is just something I’ve never seen done, ever. I’ve only worked with small and medium-sized businesses (SMBs). In an SMB environment, you’d typically install Windows Server onto a regular hard drive or possibly a RAID 1 array. You then create the RAID 5 array as a separate disk to store vital data. You do it this way because the operating system files just aren’t that valuable, and installing Windows on a standard (or RAID 1) drive is significantly less complicated (as a general IT rule, the fewer points of failure or complexity, the better). If you have no idea what I’m talking about, imagine installing Windows on a regular hard drive, and putting all your important data on a heavy-duty, “guaranteed to never fail” external hard drive. If the Windows drive dies, it’s no big thing to go to Best Buy, get a new hard drive, reinstall Windows, then reconnect the external drive, right? Same theory, different implementation. And this real estate agency had tried to become as paperless as possible, so everything was on the server… which was now dead.
The icing on the cake was that the owner, an attorney with zero sense of humor but a giant sense of ego, flipped out because… “[my boss at the IT company] told me that we didn’t need backups because of this RAID thing!” I tried explaining that RAID is not a backup, just a way to make hard drives more fault tolerant. But he seemed to be of the opinion that my boss had told him otherwise. Which put me in a pickle. Anyone who’s worked in IT knows that you can say one thing, even in the simplest English possible, and clients hear another. So it’s possible that my boss said no such thing, but the client interpreted it as such. On the other hand, I knew my boss would tell clients anything he felt they wanted to hear to make a sale. Perhaps my boss was afraid that the client wouldn’t sign the contract if he added a $1,200 tape drive into the mix. Maybe my boss was planning to sell him some kind of tape or online backup later on. Whatever the case, I had a dead server and a highly pissed-off attorney to deal with. And it wasn’t pretty. I took the server back to the office and rebuilt it from scratch, this time without installing Windows Server on the RAID 5 array. My boss claimed to have recovered more than half the data off the old array… but the recovery software only pulled the file names; the actual files themselves were just a bunch of binary gibberish. So the firm started over from scratch.
LESSONS FROM THIS ORDEAL: RAID is not a backup. Don’t lie to clients, and do whatever it takes to make sure they understand what they’re signing up for.
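If you’re wondering why RAID is not a backup, here is a rough Python sketch of the parity idea behind RAID 5. It is a toy, not how any real controller works: the three “drives” are just four-byte strings I made up, and real RAID 5 also rotates the parity across all the drives, which I’ve left out. The helper names (xor_blocks, make_parity, rebuild) are my own inventions for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_parity(data_blocks):
    """The parity block is simply the XOR of every data block in the stripe."""
    return xor_blocks(data_blocks)

def rebuild(surviving_blocks, parity_block):
    """Recover ONE missing block by XOR-ing the survivors with the parity."""
    return xor_blocks(surviving_blocks + [parity_block])

# Hypothetical stripe: one block on each of three data drives.
stripe = [b"DEED", b"LOAN", b"SALE"]
parity = make_parity(stripe)

# Drive 2 dies. The other two drives plus parity bring its block back.
recovered = rebuild([stripe[0], stripe[2]], parity)

# Now something (a bad rebuild, a virus, a user) overwrites that block.
# The array dutifully recomputes parity for the garbage.
stripe[1] = b"\x00\x00\x00\x00"
parity = make_parity(stripe)
recovered_after_damage = rebuild([stripe[0], stripe[2]], parity)
```

In the first case, recovered is the original block from the dead drive, which is exactly what RAID 5 is for. In the second, recovered_after_damage is the garbage, because parity only ever describes what is on the disks right now. There is no older copy to go back to, and that older copy is what a backup gives you.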
