Incident Alerted
Incident Report for Olark Live Chat
Postmortem

A note from Olark CEO and co-founder Ben Congleton regarding the service issues that affected customers in the past few weeks:

Before I address the technical specifics, I want to acknowledge that we've heard and share your frustration. We know that you depend on Olark to be available and working properly every day. I'm sorry we let you down, and we are doing our very best to make it up to you and ensure we live up to the quality of service you expect.

In the spirit of transparency, I'd like to recap the incidents that occurred:

  • About a month ago, one of our messaging clusters exhibited a behavior we had not seen before, which caused two distinct periods of slow message delivery, dropped messages, and inaccurate agent availability.
  • To mitigate these impacts, we scheduled an emergency maintenance window for the messaging system.
  • On Monday, 2/13, some customers experienced a brief issue that affected logging in to the chat console. From a technical perspective this was unrelated to message cluster issues, but ultimately prevented some agents from logging in for a period of time.
  • For approximately 2 hours on Tuesday (2/27) and 2 hours on Wednesday (2/28), we saw the original issue again with part of our messaging cluster. In both cases, our engineers gathered more data to determine root cause, and then restarted the cluster. As part of the second restart, we also expanded the capacity of the messaging cluster.

What we're doing to mitigate these issues:

  • We are treating this issue as our highest priority. Our engineering team is actively working to monitor, identify, and resolve the root cause of these issues, and we are optimistic that our mitigation efforts have limited future service impacts. We will continue to proactively communicate with you about any expected maintenance windows or other issues.
  • We are improving our technical response process. Even before we address the root cause, we should be able to dramatically reduce the length of any similar outage in the future with a simple restart of the messaging cluster.
  • We've made some adjustments to the way we handle incident alerts that should add clarity and minimize disruption.

I know these assurances still don't rectify lost business, and I want to make this right. If you were affected by these issues and would like a credit for the time you were offline, please contact our team at support@olark.com and we will credit your account.

Do not hesitate to let me know if there's anything else I can do.

Thank you,

Ben Congleton, Chief Executive Olarker

Posted Mar 02, 2018 - 13:52 EST

Resolved
This incident has been resolved.
Posted Feb 28, 2018 - 18:14 EST
Update
We believe delivery of chat and email messages is working normally again, and you should be able to use the service at this time. We are monitoring the cause of these recent issues closely and are still in the process of implementing a solution to the root cause. It’s possible that there could be some residual stability issues.

Addressing the root cause of this issue is the top priority for our engineering team right now. We aim to resolve it as soon as possible and will stay in touch with you proactively. A postmortem will be posted here once final resolution has been implemented.
Posted Feb 28, 2018 - 18:14 EST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 28, 2018 - 13:46 EST
Investigating
We're receiving reports of message and email delivery delays and are investigating. We'll update as soon as we know more.
Posted Feb 28, 2018 - 13:05 EST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 27, 2018 - 16:52 EST
Investigating
We've detected an issue and are working to resolve this quickly. We'll have an update within the hour.
Posted Feb 27, 2018 - 13:28 EST