Downtime Postmortem: the nitty-gritty nerd story

September 11 | the_kellbot

A few curious, technically minded folks have wondered what exactly happened during our epic 30-hour downtime. We learned a lot, and we've made some changes to help us going forward. Here's the briefest summary I can manage of what was a very long process.

TL;DR

A bug in the virtualization software caused the system to loop on itself until it ran out of resources and stopped responding. We rolled out a new VPS on different hardware and lived happily ever after.

The full story, as told by our Facebook status updates

Wednesday night was a little rough. For a few days leading up to the downtime, we noticed some huge spikes in system load. The whole system felt like it was grinding to a halt (spoiler: it was).

Some poking around in the logs showed us that we were getting a fair number of brute force attacks on the admin login page. These are pretty common, and we have good security practices, so we weren't concerned about a breach, but the extra traffic was overloading things. By Wednesday night it was becoming a regular problem. We installed a few safeguards (including one called fail2ban, which blocks any IP address that racks up too many failed logins), things calmed down, and I went to bed.
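
For the technically curious: a fail2ban jail for that kind of brute force looks roughly like the sketch below. The filter name, log path, and thresholds here are placeholders for illustration, not our exact configuration.

```sh
# Hypothetical jail for repeated failed logins against wp-login.php.
# "wordpress-login" assumes a matching filter in filter.d/; paths and
# thresholds are illustrative.
cat >> /etc/fail2ban/jail.local <<'EOF'
[wordpress-login]
enabled  = true
port     = http,https
filter   = wordpress-login
logpath  = /var/log/apache2/access.log
maxretry = 5
findtime = 600
bantime  = 3600
EOF
service fail2ban restart
```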

Thursday morning I woke up to a panicked Ariel, who said we'd been down for eight hours. Our server setup was pretty fancy. Static content was served by nginx (and replicated on a CDN), while dynamic content was served by Apache. Memcached and MySQL were also running on the same machine. The whole system was designed by a WordPress wizard, and it was a hot rod setup running on giant hardware. Unfortunately, not all of the support staff at our host, LiquidWeb, knew how to manage it. I myself am really more of a software person, and not much of a command line jockey. If our server was a Lamborghini, none of us knew how to drive stick.

When we did finally get in touch with a more senior sysop, Travis, it was clear things were not right on our machine. We further tweaked the firewalls to keep out unwanted traffic, but the server wouldn't stay up long enough to see what was going on. Finally, we set the firewall to block all traffic on all ports, allowing only Ariel, myself, and our sysop Travis through. The Empire went totally dark.
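
Going dark like that is conceptually simple: default-deny inbound traffic, then whitelist a handful of addresses. Something along these lines, with documentation-range placeholder IPs standing in for our real ones (the actual rules lived in LiquidWeb's firewall tooling):

```sh
# Drop everything inbound by default, keep loopback and existing connections,
# then allow only the three whitelisted IPs through.
iptables -P INPUT DROP
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
for ip in 198.51.100.10 198.51.100.11 198.51.100.12; do
    iptables -A INPUT -s "$ip" -j ACCEPT
done
```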

We enabled and disabled everything imaginable trying to isolate the problem, with no luck. We'd make changes, open the firewall, and then immediately drown in the flood of traffic. Then all of a sudden we were back, for a short while at least.

The mysterious uptime happened after Travis lowered Apache's maximum number of simultaneous HTTP connections to 5 (it defaults to 256). Any number above 5 and the machine promptly crashed under the load. So only 5 people at a time could load the site. Urgh.
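
In Apache terms, that cap looks something like the snippet below (a sketch for the prefork MPM; the file name and our actual httpd.conf layout are assumptions):

```sh
# Emergency cap on simultaneous Apache workers: 5 instead of the stock 256.
cat > /etc/httpd/conf.d/emergency-maxclients.conf <<'EOF'
<IfModule prefork.c>
    ServerLimit  5
    MaxClients   5
</IfModule>
EOF
service httpd restart
```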

Travis noticed some of our MySQL queries weren't completing, and I theorized that MySQL was to blame. We spun up a new virtual server just for MySQL, moved the database to it, and pointed the website at that.
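
Mechanically, that's a dump, an import, and a one-line change in wp-config.php. Hostnames, database names, and credentials below are placeholders:

```sh
# Dump the database on the old box, then load it onto the dedicated MySQL VPS.
mysqldump -u root -p empire_db > empire_db.sql
mysql -h db.example.internal -u wpuser -p empire_db < empire_db.sql
# Finally, point WordPress at the new host in wp-config.php:
#   define('DB_HOST', 'db.example.internal');   // was 'localhost'
```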

We opened up the firewall and waited to see which server fell over. The web server "won." After some inspection, MySQL was given a clean bill of health. I have never been so sad to hear that a database wasn't corrupted. Back to square one.

At this point we were approaching 20 hours of downtime, and Travis's shift was ending. He handed us over to Chris, who with his fresh eyes noticed something we hadn't: the calls were coming from inside the house. The massive load being generated was coming mostly from our own IP address.
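
It's the kind of thing a one-liner against the access log makes painfully obvious (the log path is a placeholder):

```sh
# Tally requests per client IP and show the heaviest hitters.
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head
# When your own server's IP tops the list, the load is self-inflicted.
```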

We initially suspected a rogue plugin. Plugins are a double-edged sword: they save you hundreds of hours of development time, but if they're poorly written or maintained they can put a serious drain on system resources. We started with the most suspect plugins, and when disabling those didn't change anything we got a little more hardcore. And a lot more worried. Even reverting to a stock WordPress theme didn't help (but did look super weird). "Shut down all the garbage mashers on the detention level! No, shut them all down!" We stripped WordPress down to a bare install. Nothing changed. I started wondering whether the Amish accepted converts.
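
The bluntest version of "shut them all down" for WordPress is to move the plugins directory aside, which deactivates everything in one go (the path is a placeholder):

```sh
# With the directory gone, WordPress marks every plugin as deactivated.
mv /var/www/html/wp-content/plugins /var/www/html/wp-content/plugins.disabled
# To undo: move the directory back and reactivate plugins one at a time from wp-admin.
```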

Chris did intense server surgery, including rebuilding the Linux kernel and recompiling Apache. I started preparing a new server in case we had to leave the old one for dead. As I was struggling to get all our fancy things set up on the new server, Chris gave us some very honest advice: stick with a machine that's as close to stock as possible so it's easier to get help maintaining it. I scrapped the new server and started rolling out another one, this time keeping everything super simple.

When the rebuild failed, Chris started suspecting hardware issues and moved our virtual server to a new physical host machine. Same VPS instance, new metal. We opened the floodgates and the server promptly fell over again. The good news was that the replacement server I'd been building was now ready. It was time to leave the old one behind. A third tech, Jack, helped tweak the new server to run well under the high load we experience.

Because the new server had a new IP address, we had to update the DNS records, a change that takes an agonizingly long time to propagate out to everyone. I watched Chartbeat while Chris watched the server load. Traffic slowly built back up to normal levels, and the server stayed up. I re-enabled plugins, fixed some things that were off, and we watched and waited. We were back up.
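
Watching propagation mostly means pestering public resolvers until they hand back the new address (the domain here is a placeholder):

```sh
# Ask a public resolver what it currently has for the A record;
# re-run until it returns the new server's IP.
dig +short example.com A @8.8.8.8
watch -n 60 'dig +short example.com A @8.8.8.8'
```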

Chris, who had pulled a double shift at LiquidWeb to help us, tried to figure out what had gone wrong. The log entries were full of calls to gettimeofday(), from requests originating from our own server, over and over and over. Although we don't have any 100% conclusive answers, both Chris and Travis suspect it was due to a bug in the virtualization software. There was clock drift between the host and guest operating systems, which on low-traffic servers isn't a big deal and gets periodically corrected. But ours was under such high load (between the brute force login attempts and our own legit traffic) that it couldn't correct fast enough, and requests just kept piling up.
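
For the curious, here's a rough sketch of how that symptom can be spotted on a guest. The PID is a placeholder, and this isn't the exact diagnosis Chris ran:

```sh
# Attach to a busy Apache worker and count its syscalls for a few seconds
# (Ctrl+C to stop); on an affected guest, gettimeofday() dominates the tally.
strace -c -p 12345
# Check which clocksource the guest kernel is using; unstable sources under
# virtualization are the usual suspects for this kind of drift.
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
```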

It was a pretty miserable day and a half, and the poor Offbeat Bride Tribe still has some lingering wonkiness. There are some silver linings though:

  • Our server is now way easier to administer. This will make both working with it and getting help easier.
  • The new machine actually serves pages faster, thanks to Jack's insight that we'd be better served by FastCGI than DSO for PHP handling. FastCGI runs faster at the expense of RAM, but since we have a ton of RAM on our server it's a trade-off well worth making. (There's a rough sketch of the setup after this list.)
  • We left behind a lot of old cruft on our server, which had been running since 2006.
  • As ridiculous as it looked, seeing the site on the default WordPress theme was actually pretty instructive and we'll be making some tweaks based on what we saw.
  • We now have a ready-to-go server image for the Empire in case we ever need to roll out another one.
  • Our case is now being used in the LiquidWeb training materials. The first tech assigned to our case was pretty unhelpful, but Travis, Chris, and Jack picked up the slack and stayed with us until we were back up and running.
  • I may be getting a walrus.
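
As promised above, the FastCGI switch boils down to handing .php requests to an external php-cgi process instead of an Apache-embedded (DSO) interpreter. A minimal, illustrative mod_fcgid setup might look like this; the paths and file names are assumptions, not our exact config:

```sh
cat > /etc/httpd/conf.d/php-fcgid.conf <<'EOF'
LoadModule fcgid_module modules/mod_fcgid.so
<Directory /var/www/html>
    Options +ExecCGI
    AddHandler fcgid-script .php
    FcgidWrapper /usr/local/bin/php-cgi .php
</Directory>
EOF
service httpd restart
```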

Thus endeth what I sincerely hope is the worst downtime in the history of the Empire; past, present, and future. The new environment (and much more recent version of PHP) means I will likely be chasing down bugs nonstop for the next week, but overall we're in a good place. Godspeed, little server.

    • Kellbot is on-staff so there's no need for fundraising, since paying her is part of the Empire's budget! 🙂 …that said, I do know that a certain Tribesmaid is crocheting a walrus puppet for her right now.

  1. This was pretty great. I didn't quite understand it, but I got the story. Like when I was a kid and my mother, aunts, and grandmother told all the good gossip in Spanish; if you sit still and listen really hard you can figure out what went down (the first words I remember actively learning in Spanish: "novio/a," "escándalo," "divorcio"). Anyway, what I came up with is that you guys had to move houses because you had a haunted clock.

    …did I miss an important noun somewhere?

    • This comment made me laugh way harder than probably warranted, but it's been a super long day.

      Thanks, Kellbot, for sharing the nitty-gritty. I am that crazy nerd person who really enjoys reading it (even if I don't always understand).

  2. "He handed us over to Chris, who with his fresh eyes noticed something we hadn't: the calls were coming from inside the house."

    AAAARRRGGGHHH it sounds like a babysitter horror movie!

