Summary of the Incident:
On January 29th, around 10:30 am PST, we began receiving reports that delivery of messages from agents to visitors was experiencing significant delays. We immediately investigated and determined that a component core to message delivery had become unstable. The affected components were restarted, and chat was operating at full capacity again at approximately 2:30 pm PST.
On the evening of January 29th and through January 30th, some customers experienced residual issues with visitors appearing "stuck" in their visitor lists. Our developers manually cleared these for customers as reports came in while also working on a more permanent fix.
On February 2nd, around 10 am PST, we began receiving new reports that message delivery was delayed in a fashion similar to January 29th. Investigation revealed that the same component was experiencing issues again. The component was restarted, and additional capacity was added to provide redundancy during restarts and to help clear message delivery queue backlogs quickly. Chat was operating at full capacity again at approximately 3 pm PST on February 2nd.
The problem:
Components of our system that are normally very stable suffered a cascading failure for which we are still investigating the root cause. We believe the impact was compounded by a relatively unsophisticated failover process, which we are now in the process of improving.
How we’re preventing this from happening in the future:
We're adding additional capacity headroom and upgrading the cluster to use a more sophisticated failover mechanism that will make it more resilient in these types of circumstances.