I’m emailing to update you on our service outages on Monday and Tuesday this week.
Service was fully restored at about 11:00am PST Tuesday and all systems are still stable as of this morning.
I know this has been a very frustrating and trying time for you as an Olark customer, and for that I apologize. Please know that, since Monday, our team has been working through the night to resolve two different incidents. (The post mortems on these incidents are here and here.)
This has been a tough two days knowing that we’ve let you down, and we want to make amends.
We failed to provide you with the service you deserve. I wish I could tell you this outage was unpredictable, or it was all an external party’s fault, but it wasn’t.
On Monday night, our upstream service provider experienced an unexpected outage caused by maintenance of its entire data center, which lasted for hours. By 9:06pm PST, the Olark team had identified the network outage. At 10:44pm PST, the service provider acknowledged that its routine maintenance had run into problems and was affecting its customers, including Olark. Once the issue was resolved on their end, we began restarting our servers at around midnight.
We have known for some time that a cascading reboot of Olark’s system could lead to an outage. This is the kind of exceptionally rare event that could only happen during a major data center disruption like the one on Monday night. We have in fact been working for months on hardening our system against this kind of risk.
That’s why we know it was preventable. In the end, we did not execute quickly enough to prevent these two issues from affecting you.
We feel no great irony in the fact that the specific component that led to this outage was scheduled to be replaced this week. The positive news is that we have spent the last few months rewriting how the particular servers affected in this outage are set up. Had the servers been using this new setup, it would have helped avoid this issue. These updates are still due to be released imminently, as they were scheduled regardless of this particular outage.
You can rest assured, we are taking this seriously.
I realize that doesn’t make up for the business you lost on Monday and Tuesday, though.
As a mea culpa, we are issuing you two days’ worth of credit on your account. You should see that reflected in the next few days.
If you feel this isn’t sufficient, please let me know and we can discuss further – email@example.com
Please let me know if there’s anything else I can do to help,
Chief Executive Olarker, Olark
PS: You can subscribe to up-to-the-minute service status updates at http://status.olark.com