Postmortem: Failing Over to our Backup Environment

If you’ve been watching our Twitter and status page, you’ll have seen that we’ve had major server issues over the last two days. As a FoxyCart store, you deserve a full and candid explanation of the downtime, and we want to tell you what measures we’ve taken to ensure that your store stays up and running.

Timeline

On the afternoon of Wednesday, 3 October 2012, our monitoring systems notified us that the primary application server in our Dallas datacenter had gone offline. After confirming that the server was not responding, we called our hosting company’s support and learned that an unexplained virtual disk configuration change had caused the issue, and that they were scrambling to repair it.

At this point, our DNS failover had already taken over (which happens automatically as soon as a major issue is detected), so any customers would have seen an “Our eCommerce functionality is currently down for maintenance” message. FoxyCart store admins were redirected to status.foxycart.com, where we posted updates about the issue.
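To illustrate the kind of automatic trigger described above, here is a minimal sketch of failover decision logic. It is not our actual DNS failover configuration: the `FailoverMonitor` class, the `threshold` value, and the rule of "fail over after several consecutive failed checks" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Hypothetical monitor: trips after `threshold` consecutive failed checks."""

    threshold: int = 3            # assumed value, chosen only for illustration
    consecutive_failures: int = 0
    failed_over: bool = False

    def record(self, healthy: bool) -> bool:
        """Record one health-check result and return whether we are failed over."""
        if healthy:
            # A single good check resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                # Once tripped, stay failed over until someone fails back manually.
                self.failed_over = True
        return self.failed_over


if __name__ == "__main__":
    monitor = FailoverMonitor(threshold=3)
    for result in [True, False, False, False]:
        print(result, "->", monitor.record(result))
```

Requiring several failures in a row avoids flipping traffic on a single dropped check, and keeping the failed-over state sticky means failing back stays a deliberate decision.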

According to the hosting company, the server had automatically restarted after the initial error. However, due to a configuration issue, it would not boot successfully.

At this point, we were very concerned about our ability to quickly get FoxyCart up and running again on the Dallas hardware. Brett, Luke, and I called an immediate conference to discuss our options. After the hosting company confirmed that this issue could not be resolved as quickly as we needed, we were left with one option: fail over to our backup environment.

The good things, at this point:

We had a backup full application environment (web server, database server, firewall, web application firewall, etc.) running in Arizona, and our database contents had already been replicating from the Texas environment.
I was already testing our Arizona datacenter, with the plan to fail over to it within the next two weeks in order to perform much-needed upgrades on our Dallas hardware.

Given those two things, we were able to fail over to the new servers and get FoxyCart customers once again selling in less than an hour.

The bad:

Dragons, as in “here be dragons.” We had to move really fast without the luxury of full testing.
We had definite issues on the Arizona servers that we hadn’t yet tested for. Subscriptions ran about 8 hours late. We saw sporadic application errors from the moment we switched traffic to the failover environment (largely due to a faulty switch between the web and db servers). For a while, sessions were still being served from Redis at the Dallas datacenter, which slowed the site down. Non-English locales weren’t compiled properly, causing errors for some non-English stores.
Various other small issues that we worked out over the course of the next day.

I am immensely proud that we were able to get our systems up and going again as quickly as we could, and without having a full and complete plan. We really came together in the heat of the moment. While we did have issues, we kept working on them and were able to quickly sort them out.

What we did right: We had a failover system in the first place. It wasn’t 100%, but it was ready enough that we were able to get things moved and operational. Our hosting company brought Dallas back online about an hour after its initial failure, at which point we were already up and running. We have extensive monitoring on both our primary and failover environments, which helped us make sure the new systems were every bit as good as the old.

What we did wrong: We hadn’t finished getting our failover system ready before the downtime. We hate downtime, and we knew that we had a single point of failure in our Dallas DC, but we were working on the backup environment alongside other projects. We should have focused solely on getting our failover system “DONE” and ready to handle a takeover.

Conclusion, with Unexpected Silver Lining

As your ecommerce provider, we want to provide you with the best service, which means a service that stays up and handles problems quickly. While we did handle this downtime, we could have done better. To that end, we’re making these things our top infrastructure priorities:

Upgrades to our Dallas infrastructure
Complete tests to validate that the failover and primary environments are identical.
A complete and fully tested plan for failing over and failing back between our data centers.
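As a rough idea of what a parity check between environments could look like, here is a minimal sketch that compares two snapshots of environment settings and reports every difference. The function name, the snapshot keys, and all values below are hypothetical and chosen only to echo the kinds of issues described in this post; they are not our real configuration.

```python
MISSING = "<missing>"  # sentinel for a setting present in only one environment


def diff_environments(primary: dict, failover: dict) -> dict:
    """Return {setting: (primary_value, failover_value)} for every mismatch."""
    differences = {}
    for key in sorted(set(primary) | set(failover)):
        primary_value = primary.get(key, MISSING)
        failover_value = failover.get(key, MISSING)
        if primary_value != failover_value:
            differences[key] = (primary_value, failover_value)
    return differences


if __name__ == "__main__":
    # Hypothetical snapshots; hostnames and keys are made up for illustration.
    dallas = {
        "compiled_locales": ["de_DE", "en_US", "fr_FR"],
        "session_store": "redis.dallas.example",
    }
    arizona = {
        "compiled_locales": ["en_US"],
        "session_store": "redis.dallas.example",
        "cron_enabled": False,
    }
    for setting, (p, f) in diff_environments(dallas, arizona).items():
        print(f"{setting}: primary={p!r} failover={f!r}")
```

A check like this only flags differences. Deciding which ones are intentional (such as a session store that should point at the local datacenter) still takes a human review.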

We will let you know as we complete these items. It’s important to us that you know what’s happening “behind the scenes” so that you can know your store is in good hands.

As a result of moving to our newer Arizona servers, we were able to upgrade the RAM and CPU of our application and database servers. We had already planned to do this, and seized the opportunity to beef up our servers’ specs. Where we had seen some slowdowns on the old hardware, the new systems are amazingly fast, handling all of the same traffic without breaking a sweat. I’ve never seen a more responsive FoxyCart.

Thank you for being a FoxyCart customer. We appreciate you, and we’re glad to have you on board as we grow and make these exciting changes.

Regards,
Fred Alger
IT Director, FoxyCart
