Bad mistakes make better programmers
Some say life is made of moments and that people are the sum of their experiences. Looking back on my IT career brings back a flood of memories, including some moments and experiences I will simply refer to as lessons. The great part about lessons is, once learned, they become very useful guideposts, marking the boundaries between prudence and recklessness. The not-so-great part of lessons is that they are learned primarily from mistakes and learned best from the mistakes that hurt the worst.
Don’t get me wrong; we all make mistakes — just ask my ex-wives. My advice on making them is very simple: don’t make the same mistake twice. Learn, grow, and do better next time. When you look at it, life at the macro level is really no different from coding (or any other pursuit, really), in that your ability to avoid making errors improves with each misstep.
This maxim applies not only to your own mistakes but equally to those of others. As I used to tell my students, don’t make my mistakes; be creative and make your own unique errors that teach you (and everyone you share your experience with) something new. This is why I like to share stories — these are some of the difficult lessons that I, or someone I helped, learned through the School of Hard Knocks, where I received my advanced education in troubleshooting, debugging, and workarounds.
Redundancy, American style
The first topic I want to talk about is redundancy. To be clear, I’m not talking about British-style redundancy, where suddenly it’s time for an emergency CV update. I’m simply talking about having two or more of any critical component — hardware, software, people — that, if suddenly unavailable, would stop the business. Redundancy is key to maintaining fault tolerance. This is true at every level, from individual hardware components (RAID, multiple NICs) to server clusters, from software backups and secondary data centers to the on-call support roster. If the Army taught me anything, it’s to prepare for the worst while simultaneously hoping for the best.
Preparing for the worst requires, first, imagining the worst. This phase is essentially brainstorming doomsday scenarios to answer the usually rhetorical question, “What’s the worst that could happen?”, with the added twist of, “Then what?” In other words, how can I prevent, or at least be ready to mitigate or recover from, the disaster if the worst does happen? Yes, never forget that the D in DR means DISASTER.
One way to be ready is to have a backup, or at worst a backup plan, for dealing with the unplanned unavailability of any critical element. The business must continue to operate whether we have lost a hard drive, a server, or a system administrator. While I don’t wish to delve into the intricacies of risk management, the crux of it lies in classifying risks by likelihood and severity. Since planning for every eventuality is impossible, the idea is to prioritize the most likely and the most severe, balanced against the cost of mitigation.
For example, RAID and software backups protect against data loss from a corrupted or full disk (pretty likely, medium severity) at a reasonable cost, while establishing and testing full enterprise failover in case an entire data center goes offline (not very likely, very high severity) comes with a much higher price tag. In any case, it’s up to the business to evaluate each identified risk factor and balance it against the cost of an outage versus cost of prevention, mitigation, or recovery.
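To make that prioritization a bit more concrete, here is a minimal sketch of a likelihood-times-severity risk register. The scoring scale, the example entries, and the cost figures are all hypothetical rather than a prescribed methodology; the point is simply that exposure weighed against mitigation cost tells you where to spend first.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    likelihood: int       # 1 (rare) .. 5 (almost certain)
    severity: int         # 1 (annoyance) .. 5 (business stops)
    mitigation_cost: int  # 1 (cheap) .. 5 (very expensive)

    @property
    def exposure(self) -> int:
        # Classic likelihood x severity score
        return self.likelihood * self.severity

# Hypothetical risk register entries, for illustration only
register = [
    Risk("Corrupted or full disk", likelihood=4, severity=3, mitigation_cost=1),
    Risk("Single node down in a three-node cluster", likelihood=3, severity=2, mitigation_cost=2),
    Risk("Entire data center offline", likelihood=1, severity=5, mitigation_cost=5),
]

# Spend the budget where exposure is highest relative to what mitigation costs
for risk in sorted(register, key=lambda r: r.exposure / r.mitigation_cost, reverse=True):
    print(f"{risk.name}: exposure={risk.exposure}, mitigation cost={risk.mitigation_cost}")
```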
What could possibly go wrong?
What, indeed? Even though some have accused me of being pessimistic, my focus on risk is never about fearmongering. Due to my training and previous experiences, my inherent nature is to identify risks with the goal of developing prevention and mitigation strategies. In the Army, when setting up a defensive perimeter, we were trained to identify and prepare three fighting positions: primary, secondary, and fallback.
In the IT world, you can think of the defensive perimeter as your IT infrastructure and applications, along with the associated standards, policies, and procedures for preventing an outage. Likewise, the primary position represents situation normal — defenses are holding and all is well. In a situation where one server of a three-node cluster goes down, this would not be unexpected — think of this as your secondary fighting position. You’ve prepared for this; all you need to do is hunker down and fight it out until the battle is over (the server is back up) and retake any lost ground (achieve full recovery). Lastly, you have your fallback position, where your primary defenses have collapsed and your secondary position has also been overrun. For our cluster, this means we’ve lost a second server and can’t reach quorum, rendering the cluster useless. This is where planning and preparing are crucial. They make the difference between working all weekend to re-install and re-configure, and simply spinning up three new VMs (or containers), applying the images, and going out for a Friday night pub crawl.
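To show what “simply spinning up three new VMs (or containers), applying the images” might look like in practice, here is a minimal sketch that drives the Docker CLI from Python. The image name and node names are placeholders, and it assumes a known-good image already exists; a real recovery would also restore data and rejoin the cluster.

```python
import subprocess

# Placeholder image and node names -- not a real deployment
IMAGE = "registry.example.com/cluster-node:known-good"
NODES = ["node1", "node2", "node3"]

def rebuild_node(name: str) -> None:
    """Remove any dead container with this name and start a fresh one from the image."""
    subprocess.run(["docker", "rm", "-f", name], check=False)  # ignore "no such container"
    subprocess.run(
        ["docker", "run", "-d", "--name", name, "--restart", "unless-stopped", IMAGE],
        check=True,
    )

if __name__ == "__main__":
    for node in NODES:
        rebuild_node(node)
```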
The kick heard around the IT department
Maybe I’m just lucky, but I’ve seen some extremely intelligent people render multiple fail-safes useless through a single misstep. For instance, redundancy is rooted in the concept of eliminating all single points of failure — where if one piece fails then the whole system is down. To put it another way, it doesn’t help to back up your database to the same disk as your primary. If you lose the disk hosting your DB, you also lose your backup, so back up to at least a different physical drive if not a separate server. Sounds basic, right? I wish it were.
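As a rough illustration (not a substitute for proper backup tooling), a backup script could refuse to run if the target lives on the same device as the database files. This sketch assumes a POSIX system and hypothetical paths, and it only catches the same-filesystem case; volume managers can still hide two paths on one physical spindle.

```python
import os

# Hypothetical paths, for illustration only
DB_DIR = "/data/oracle"
BACKUP_DIR = "/backup/oracle"

def same_device(path_a: str, path_b: str) -> bool:
    """On POSIX systems, st_dev identifies the device/filesystem a path lives on."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

if same_device(DB_DIR, BACKUP_DIR):
    raise SystemExit("Backup target shares a device with the database -- pick another disk.")
```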
Case in point: I was meeting with the IT leadership of a company that shall not be named to plan a proposed integration project when system administrators suddenly started panicking and I heard someone shouting, “Siebel is down! Siebel is down!” Bear in mind, this was back in the days when hosting your own data center was commonplace, so the servers involved were in the basement, and the team quickly got to the root cause: the Oracle database backing Siebel had abruptly lost power, apparently in the midst of a disk write, corrupting the database files, which had to be restored from backup.
Naturally, the question arose of how the database server could suddenly lose power when it had redundant power supplies and the data center itself had both battery backup and a diesel generator. In this instance, the root cause boiled down to two minor slip-ups, which, separately, would probably have gone unnoticed. However, in conjunction, they ruined the weekend for several individuals. So, what were these two mistakes?
It takes two, baby
Although not unheard of, it’s rare for any single error to take down an entire system (or worse). Naturally, working in IT as long as I have, I’ve seen some doozies: the backhoe that took out our fiber-optic cable, and therefore our Internet connection, at a dot-com startup; the production Trading Networks server that went down when the DB disk filled up because the webMethods administrator thought the DBA was handling the archiving, and vice versa. But, I digress. Back to the story.
The first thing that happened in this case was that whoever ran the cables plugged both power supplies of the database server not just into the same circuit, not even into the same outlet, but into the same power strip. Whether by design or with the intention of fixing it later, this simple act created a ticking time bomb. From there, all it would take to bring down the server was a power surge tripping the fuse on the power strip. Before that could happen, a network technician accidentally kicked the cable of the power strip, unplugging it and crashing the database harder than the Fast & Furious finale.
Lessons learned
Though I look back with a sense of humor and can joke about it now, I learned some very valuable lessons from incidents like this and others.
Literally anything can happen: Nobody expects to wake up to a misconfigured firewall blocking communications between the DMZ and backend servers. We were all surprised at not being able to log into the Integration Server because the Active Directory admin forgot to tell the webMethods admin about the server migration. Prepare as best you can; plan for what you can anticipate, prevent whatever you can to the best of your ability, mitigate the damage, and restore operations.
It’s the little things: There are extremely good reasons for being a stickler about details. Proper cable management would have prevented the “Siebel down” incident. Simple stakeholder communication would have stopped network or server migrations from disrupting the business. Double-check cable runs. Audit infrastructure and software systems to understand potential impacts of any changes, and involve stakeholders to prevent hiccups.
Housekeeping saves heartache: It’s mind-boggling how many times I’ve been called on to “fix webMethods” only to find out the root cause is either a full database or a shortage of disk storage. People tend to forget about tasks such as data archiving, log file cleanup, and the like until it’s too late. Automate backups, cleanups, health checks, and whatever else you can to detect and recover from unforeseen events more rapidly (a minimal example of such a check appears below).
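As a minimal sketch of that kind of automation, the following disk-space check uses only the Python standard library and could run from cron. The mount points and threshold are assumptions, and a real setup would feed the warning into your monitoring system rather than print it.

```python
import shutil

# Hypothetical mount points and threshold -- adjust for your environment
WATCHED_PATHS = ["/", "/opt/softwareag", "/var/log"]
WARN_AT_PERCENT_USED = 85

def check_disk(path: str) -> None:
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent_used = usage.used / usage.total * 100
    if percent_used >= WARN_AT_PERCENT_USED:
        # In practice, send this to monitoring/alerting instead of stdout
        print(f"WARNING: {path} is {percent_used:.0f}% full")

if __name__ == "__main__":
    for p in WATCHED_PATHS:
        check_disk(p)
```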
While you can’t anticipate everything, you can be prepared for most eventualities by simply sticking to the basics.
Until next time, viva, webMethods!
#lessons #mistakes #backups #redundancy #integration