A note from Olark CEO and co-founder Ben Congleton regarding the service issues that affected customers in the past few weeks:
Before I address the technical specifics, I want to acknowledge that we've heard and share your frustration. We know that you depend on Olark to be available and working properly every day. I'm sorry we let you down, and we are doing our very best to make it up to you and ensure we live up to the quality of service you expect.
In the spirit of transparency, I'd like to recap the incidents that occurred:
- About a month ago, one of our messaging clusters exhibited a behavior we had never seen before. It caused two distinct periods of slow message delivery, dropped messages, and inaccurate agent availability.
- To mitigate these impacts, we scheduled an emergency maintenance window for the messaging system.
- On Monday (2/13), some customers experienced a brief issue that affected logging in to the chat console. From a technical perspective, this was unrelated to the messaging cluster issues, but it prevented some agents from logging in for a period of time.
- For approximately 2 hours on Tuesday (2/27) and 2 hours on Wednesday (2/28), we saw the original issue recur in part of our messaging cluster. In both cases, our engineers gathered more data to determine the root cause and then restarted the cluster. As part of the second restart, we also expanded the capacity of the messaging cluster.
What we're doing to mitigate these issues:
- We are treating this issue as our highest priority. Our engineering team is actively working to monitor, identify, and resolve the root cause of these issues, and we are optimistic that our mitigation efforts will limit future service impacts. We will continue to proactively communicate with you about any expected maintenance windows or other issues.
- We are improving our technical response process. Even before we address the root cause, a simple restart of the messaging cluster should let us dramatically shorten any similar outage in the future.
- We've made some adjustments to the way we handle incident alerts that should add clarity and minimize disruption.
I know these assurances don't make up for lost business, and I want to make this right. If you were affected by these issues and would like a credit for the time you were offline, please contact our team at email@example.com and we will credit your account.
Do not hesitate to let me know if there's anything else I can do.
Ben Congleton, Chief Executive Olarker