We've learnt our lesson
On September 26th, 2011, ProsePoint Express suffered its first major unscheduled outage.
An intermittently faulty disk caused the service to go down for 30 minutes. In the scramble that followed, we quickly restored the service without really knowing the root cause. After more investigation, we concluded that we could no longer be confident ProsePoint Express would keep running uninterrupted, so we conducted emergency maintenance to replace the faulty disk (actually, we replaced the server containing the faulty disk, but the effect is the same). That resulted in an additional 50 minutes of unscheduled outage, for a total of 80 minutes.
After the emergency maintenance, ProsePoint Express returned to normal operation.
This outage was a rude shock. Sure, it can happen to anyone - even Amazon and Google have had highly visible outages in the past. We just didn't think it'd happen to us. We have always prided ourselves on our uptime, so this came as an unwelcome surprise. It also exposed gaps in our recovery planning.
Before going further, I'd like to express my apologies to our ProsePoint Express users for the downtime. Please be assured your data was safe throughout the event.
In hindsight, it shouldn't have happened, but it did. It has. So we have to learn from it.
Since the event, we have been busy planning and taking measures to make sure it doesn't happen again. Now that we have a concrete plan, it's time to communicate it to our users (which is also why we've taken this long to write this up: we hadn't yet decided what to do).
ProsePoint Express currently runs out of a data centre in New Jersey. We will maintain a secondary service in a geographically separate data centre (i.e. some distance away) as a 'hot spare'. The secondary service will mirror the main data centre (albeit with a bit of delay). In the event of an outage of ProsePoint Express (whether due to scheduled maintenance, a hardware failure, or a denial-of-service attack), we will switch to the secondary service. Hence, all user sites will remain available and online whilst we work on recovery.
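To make the switch-over idea concrete, here is a minimal sketch in Python of one way such a decision could be made. It is an illustration only, not a description of our actual tooling: the `FailoverController` class, the `failure_threshold` of three consecutive failed health checks, and the manual `fail_back` step are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Hypothetical controller that decides which service handles traffic."""

    failure_threshold: int = 3      # assumed: failed checks needed before switching
    consecutive_failures: int = 0
    active: str = "primary"

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health check of the primary and return the active service."""
        # Once on the secondary, stay there until an operator fails back,
        # so recovery work on the primary is not interrupted by live traffic.
        if self.active == "secondary":
            return self.active

        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "secondary"
        return self.active

    def fail_back(self) -> None:
        """Manually return traffic to the primary after recovery is complete."""
        self.consecutive_failures = 0
        self.active = "primary"
```

Requiring several consecutive failures avoids switching on a single blip, and keeping the return to the primary manual means traffic only moves back once recovery is known to be finished.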
With this plan, user sites will be far more available. It doubles our hosting costs, but that's a price we're willing to pay. We can tolerate ProsePoint Express being down, but we can't have our user sites being down.
Over the next few days and weeks, we will be implementing the infrastructure for this. We do not want another outage to affect ProsePoint Express or our users.
Thank you for reading,