As a communications provider serving businesses globally, we at Olark know that our system performance and stability are of utmost concern to our customers. We strive to provide a high standard of reliability in our service, and when we fail, we want to be transparent about what happened and how we will work to prevent it from happening again. On Friday, July 17, 2020, Olark experienced a system outage beginning at approximately 17:00 UTC and continuing through 18:00 UTC. During this period, chat messages were largely undeliverable and some customers experienced agent presence issues.
On Thursday, July 16, 2020, Olark engineering migrated a key caching system to a new platform. This migration was undertaken after extensive preparation and testing, including scalability and load testing, which left us with a high degree of confidence that the new system would handle the operational loads placed upon it.
The following day, July 17, our engineering group monitored the system as traffic load increased and saw no performance issues until approximately 16:50 UTC, when CPU load on the system rose to levels that began to prevent normal functioning. Our monitoring systems detected the condition immediately, and we activated our internal response team. The root cause was identified quickly, and we began rolling back to the previous caching platform, which had been retained in full operational condition as a fallback option.
Updating configuration and normalizing the state of the system consumed the bulk of the approximately one hour during which Olark systems were affected by this outage. By 18:00 UTC all systems were fully operational; however, a very limited number of customers continued to see agent presence issues, which the engineering team tracked down and corrected.
Olark is working with our platform provider to understand the failure mode that occurred, and we will not attempt the migration again until we understand what happened and how to prevent it from recurring. In addition, during the response to this incident we identified ways in which we can speed the recovery of our system from a failure of this kind. We will be working to implement those changes in the near future. We apologize to Olark customers affected by this outage.