Incident Alerted
Incident Report for Olark Live Chat
Postmortem

As a communications provider serving businesses globally, we at Olark know that our system performance and stability are of utmost concern to our customers. We strive to provide a high standard of reliability in our service, and when we fail we want to be transparent about what happened and how we will work to prevent it from happening again. On Friday, July 17, 2020, Olark experienced a system outage beginning at approximately 17:00 UTC and continuing through 18:00 UTC. During this period chat messages were largely undeliverable and some customers experienced agent presence issues.

On Thursday, July 16, 2020, Olark engineering migrated a key caching system to a new platform. This migration was undertaken after extensive preparation and testing, including scalability and load testing, which left us with a high degree of confidence that the new system would handle the operational loads placed upon it.

The following day, July 17, our engineering group monitored the system as traffic load increased and saw no performance issues until approximately 16:50 UTC, when CPU load on the system rose to levels that began to prevent normal functioning. Our monitoring systems detected the condition immediately, and we activated our internal response team. The root cause was identified quickly, and we began rolling back to the previous caching platform, which had been retained in full operational condition as a fallback option.

The process of updating configuration and normalizing the state of the system consumed the bulk of the approximately 1 hour during which Olark systems were affected by this outage. By 18:00 UTC all systems were fully operational; however, a very limited number of customers saw continued agent presence issues, which the engineering team tracked down and corrected.

Olark is working with our platform provider to understand the failure mode that occurred, and we will not attempt the migration again until we understand what happened and how to prevent it from recurring. In addition, during the response to this incident we identified ways in which we can speed the recovery of our system from a failure of this kind. We will be working to implement those changes in the near future. We apologize to Olark customers affected by this outage.

Posted Jul 20, 2020 - 16:26 EDT

Resolved
This incident has been resolved.
Posted Jul 17, 2020 - 17:49 EDT
Monitoring
We have fixed the issue and are monitoring the situation for continued problems.
Posted Jul 17, 2020 - 14:12 EDT
Identified
We have identified the issue and are working on a fix for all customers.
Posted Jul 17, 2020 - 13:44 EDT
Update
We have detected an issue with core chat. All chatting customers may be affected. Our engineering team is investigating and we will continue to provide updates as we learn more.
Posted Jul 17, 2020 - 13:33 EDT
Investigating
We've detected an issue and are working to resolve this quickly. We'll have an update within the hour.
Posted Jul 17, 2020 - 13:19 EDT