Basecamp interruption of service
|
|
All the 37signals properties were offline for two hours this morning between 10AM and 12PM CST (16:00 to 18:00 GMT) as our load balancer blew out and knocked out the network connection for all our servers. No data was lost and the machines all kept running, but they weren’t accessible from the internet. We’re very, very sorry for this interruption of service. While we were able to report on the progress of this interruption through our status site, http://status.37signals.com (all the products and 37signals.com pointed to that site during the majority of the outage), that’s a small consolation when you want to access data stored on our services right now. It was just not good enough. While we don’t have a formal service-level agreement (SLA), we still want to compensate anyone who felt they were negatively affected in their work because of this outage. Please write to support@37signals.com and we’ll get that taken care of. Naturally, we’re going to have a long, serious talk with our service provider (Rackspace). They’re supposed to be the best in the business, but in this instance they failed us, so we in turn failed you. We’ll do everything we can to make sure that something as simple as a load balancer (or firewall or switch or any other network equipment) going bad does not cause two hours of downtime. Again, we’re truly sorry for this interruption. This is not how Fridays are supposed to be. |
|
|
David, Thanks for letting us know. GMT 16:00 to 17:00 on a Friday is a busy time for us issuing end-of-week work, so it caused a few issues – but that’s life, we all lived through it! I’m not a technical expert but I remember downtime a couple of months ago (when a truck took out some power supply?). At the time you were talking about securing your disaster planning etc. – can you update us on this now, please? Thanks |
|
|
Another lesson: Make sure your service provider doesn’t have any other single points of failure waiting in the wings. Go over their architecture with a fine-toothed comb. These types of hard outages are really detrimental to 37signals becoming an enterprise-class service provider. |
|
|
We swear by Rackspace… though this past year, we’ve had two minor to major issues with them falling a bit short on service. Are they slipping? I’m a bit shocked to hear that they let a load balancer bring down a behemoth like Basecamp… you’d think they know that they are not only letting down 37, but thousands of premium hosting “decision makers.” Thanks for the explanation, David. If it’s on RS, there’s not much anyone can really say against 37 (in my opinion). (Chris S., are you familiar with RS? They are usually recognized as the leaders in outsourced enterprise hosting.) One tiny suggestion: maybe you should broadcast that status subdomain now and/or put the sales site(s) (the first place I always go in case of outage) on a different colo. |
|
|
http://www.rackspace.com/whyrackspace/network/ ;-) |
|
|
Stuff happens, and it was unfortunate that this occurred at such a busy time. Whatever the truth, it all leads to questions on the reliability of Rackspace and 37signals’ oversight of their vendor. In any regulated industry, like ours, a formal investigation is expected: what happened, what was affected, and, most importantly, what corrective actions have been taken on both Rackspace and 37signals’ part. I’d like to see this published – publicly and soon. Also, for future information, where should Basecamp users go to get system status information? Other users in my company came to me for that information but it would be nice to have it available more readily and obviously. An email to the account owner (maybe administrators, too) at the onset of these service interruptions would be immensely helpful. |
|
|
I think this is important for admins – I too had staff and clients asking what was going on and I could not answer. |
|
|
We redirected every account on our servers to http://status.37signals.com. In any case of downtime, all accounts will be referred to our status pages. We also have a Twitter feed that we updated several times during the downtime: http://twitter.com/37signals |
|
|
Hi Sarah,
That didn’t happen for my Basecamp, Highrise or Backpack accounts. We just got a “page could not be found” error message. |
|
|
Rich, If you weren’t getting redirected, that means that either your browser, your computer, or your upstream DNS provider was caching DNS results longer than we’ve specified for their maximum time to live. We set the time to live values on our records relatively low to give us the flexibility to make these types of changes on the fly in order to keep our customers informed of what’s going on in the event of a problem. You may want to ask your provider about their DNS service. -Mark |
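To illustrate the caching behaviour described above, here is a minimal Python sketch of how a cache that honours a record's time to live differs from one that holds on to it longer. Everything in it is an assumption made for illustration: the `CachedRecord` type, the `is_fresh` helper, the 5-minute TTL, the one-hour floor, and the address (taken from the range reserved for documentation) are hypothetical and do not describe the actual 37signals DNS settings or any particular resolver.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    address: str       # address the name resolved to when it was cached
    fetched_at: float  # time the answer was cached, in seconds
    ttl: float         # time to live published with the record, in seconds


def is_fresh(record: CachedRecord, now: float, min_ttl: float = 0.0) -> bool:
    """Return True if a cache would still serve this record at time `now`.

    A well-behaved cache uses min_ttl=0 and honours the published TTL.
    A misbehaving one enforces its own floor (min_ttl) and so keeps the
    record longer than the publisher asked for.
    """
    effective_ttl = max(record.ttl, min_ttl)
    return now - record.fetched_at < effective_ttl


# Hypothetical scenario: the answer was cached at t=0 with a 5-minute TTL,
# and the DNS record was repointed to a status page shortly afterwards.
record = CachedRecord(address="192.0.2.10", fetched_at=0.0, ttl=300.0)
ten_minutes_later = 600.0

# Honours the TTL: the entry has expired, so the name is looked up again.
honours_ttl = is_fresh(record, ten_minutes_later)

# Enforces a one-hour floor: the stale address is still served.
clamps_to_one_hour = is_fresh(record, ten_minutes_later, min_ttl=3600.0)
```

In this sketch the first cache looks the name up again and would reach the redirected address, while the second keeps sending requests to the old, unreachable one until its own floor expires.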
|
|
Hi Mark, Thanks for your reply. I don’t really understand that but that’s not your fault. I know where the status page is now so it won’t be an issue in the future. I do think that you may wish to consider an email to admins with a link to the status page if it happens again. |
|
|
Hello 37S, Things are running smoothly at the moment and no data was lost, but 37S owes it to their customers to explain what corrections have been put into place if something like this were to happen again. Does 37S employ a formal disaster recovery process? Notification of the service interruption needs to be pushed to account owners and admins when these things occur. Do the folks at 37S disagree? -Jeff |
|
|
Along these same lines: we are creating an email distribution list for all our Basecamp users and customers in the event this happens again. It’s just unfortunate that Basecamp does not have an export feature or integration with Highrise so we can get this done quicker. |
|
|
I agree that a notice needs to go out to account holders or admins. We are new adopters and some people were getting error messages while some were getting your update page. It seemed to depend on what their bookmark was to get in to our account. |
|
|
I think the offer to compensate was a good move (obviously not for business lost or missed, but you can only do so much). I still think 37signals are rockstars… |
|
|
Is it just me, or is there a problem with accessing any of the Basecamp sites? I’m getting a “Proxy Error”. I see there was some downtime recently; is this related? _04:32am CDT (09:32 GMT) March 19, 2008 All 37signals sites were unreachable for approximately 8 minutes this morning. We were experiencing problems with the redundant firewall configuration that was recently deployed and had to perform an emergency swap of them. At this time all systems should be functioning normally._ |
|
|
We’re getting the same error too, though at times it works. I am accessing Basecamp from India. |
|
|
We get the same error message here. I’m afraid our full team is at a standstill until this is fixed. |
|
|
We’re experiencing the same error. |
|
|
It’s down for us too. Nothing for the past 15 minutes. Any update on this from 37Signals? |
|
|
Likewise here. Same Get/clients error. |
|
|
I’m getting proxy errors too (“Proxy Error”). http://status.37signals.com/ is showing all services are up and running. |
|
|
Same problem here from London. |
|
|
Status (http://status.37signals.com/) very helpfully shows that all is online… |
|
|
You aren’t alone – see this thread in Troubleshooting and Bug Reports |
