Chat.olark.com experiencing some issues
Incident Report for Olark Live Chat
Postmortem

As you are aware, we experienced three interruptions to Olark service over the last week. Here is a short overview of what happened and why:

  • On Wednesday, November 16, we received reports that users were being kicked out of their Olark chat platform and could not log back in. This was a cascading failure: one third of our chat infrastructure failed at once, and the resulting load overwhelmed our remaining infrastructure.

  • On Thursday, November 17, we began receiving reports that users were being kicked out of chat.olark.com and could not log back in. Other users were unable to log in to start a new session. In this case the issue was caused by a defective piece of our chat infrastructure. When this defect was triggered, it took our other related servers offline. We were able to rebalance load and correct the issue. We then began the process of replacing the older infrastructure with newer and better-supported versions.

  • On Monday, November 21, users were once again being removed from their chat dashboards and could not log in. We identified that the previous defect, which hadn’t yet been fully mitigated, was once again taking our servers offline. We deployed our newer infrastructure to production immediately, which stabilized the service.

What's been done to prevent this from happening again?

For the incident on November 16, we've increased our capacity by more than 30% in response to the cascading failure.
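To illustrate why extra capacity helps against this kind of cascading failure, here is a rough back-of-the-envelope sketch. It is not a description of Olark's actual fleet: the baseline of 30 servers is a hypothetical number, the function name is made up for this example, and it assumes traffic stays constant and is spread evenly across whichever servers survive. The 30% figure is used as a lower bound on the stated capacity increase.

```python
def relative_load(baseline_servers: int, current_servers: int, failed_servers: int) -> float:
    """Load on each surviving server, relative to the normal per-server load
    of the baseline fleet (1.0 means "same as a normal day before the change").

    Assumes constant total traffic, spread evenly across surviving servers.
    """
    surviving = current_servers - failed_servers
    if surviving <= 0:
        raise ValueError("no servers left to carry the load")
    return baseline_servers / surviving


BASELINE = 30                 # hypothetical fleet size, for illustration only
EXPANDED = BASELINE * 13 // 10  # 30% more capacity -> 39 servers

# Before the change: one third of the fleet (10 servers) fails at once.
before = relative_load(BASELINE, BASELINE, 10)

# After the change: the same 10 servers fail.
after_same_loss = relative_load(BASELINE, EXPANDED, 10)

# After the change: one third of the larger fleet (13 servers) fails.
after_third_loss = relative_load(BASELINE, EXPANDED, 13)

print(f"before:            {before:.2f}x normal load")            # 1.50x
print(f"after, same loss:  {after_same_loss:.2f}x normal load")   # 1.03x
print(f"after, third lost: {after_third_loss:.2f}x normal load")  # 1.15x
```

Under these assumptions, losing a third of the original fleet pushes every surviving server to 1.5 times its normal load, while the same loss after a 30% capacity increase leaves survivors close to their normal load.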

Yesterday (Monday, November 21) we scheduled and completed maintenance that rolled out a new version of our chat infrastructure, which directly addresses the identified defect.

A message from our COO, Matt Pizzimenti:

We realize these downtimes raise a lot of questions about our system stability as we move into a busy shopping season. Rest assured we have corrected the issues that caused these recent outages, and our engineering team's highest priority in the coming weeks is monitoring and maintaining system stability.

Our team appreciates your patience and understanding; we do our best to be transparent and active when it comes to issues that may affect service. As always, you can chat with us on Olark.com, and feel free to ask for me if you need to discuss further!

Posted Nov 22, 2016 - 16:19 EST

Resolved
The issue with chat.olark.com is now resolved. Thanks again for holding in there with us!
Posted Nov 21, 2016 - 16:09 EST
Monitoring
Chat.olark.com seems to be stabilizing. We are monitoring our systems until we are confident we are in the clear. Thank you all for your patience.
Posted Nov 21, 2016 - 15:52 EST
Update
We're still investigating the issue with chat.olark.com. Some users may be able to log in, but we are not confident everything will be 100% stable. Thanks for your patience while we sort this out.
Posted Nov 21, 2016 - 14:57 EST
Investigating
Chat.olark.com is still experiencing some odd behaviour. We've moved back to a partial outage until we can be certain service is reliable.
Posted Nov 21, 2016 - 13:58 EST
Monitoring
Chat.olark.com should be back up and running. We are currently monitoring the situation. If you notice any presence issues, please come chat with us. Sorry for the strange behaviour.
Posted Nov 21, 2016 - 13:16 EST
Investigating
Chat.olark.com has forced some users offline. We are currently investigating the issue. Some users have been able to log off and log back on successfully.
Posted Nov 21, 2016 - 12:56 EST