During the first week of May, we had 3 incidents over the course of 3 days, which left users unable to access Olark. We take pride in keeping Olark stable and accessible, and we know that your business relies on us to keep chats flowing. On May 2nd, 3rd, and 4th we failed to do that, and we apologize.
We want to take a moment to explain both what happened and the measures we are putting in place to prevent something similar from happening again.
On Wednesday, May 2nd at 8:50 PM UTC, we were alerted to network packets being dropped by our cloud hosting service. Approximately 15 minutes later, our hosting provider updated their status page acknowledging a networking issue on their end. At 9:15 PM UTC the issue was resolved by our hosting provider. Unfortunately, due to the way our authentication infrastructure is set up, the outage persisted while our system processed the increased load of agents logging back in to chat.olark.com. Our servers recovered and we entered monitoring at 9:30 PM UTC.
On Thursday, May 3rd at 6:49 PM UTC, we were alerted to olark.com being unavailable. We immediately began investigating. While we were looking into the root cause, olark.com once again became available. Login, however, remained down as our servers worked through the increased load of agents logging back in. Around this time, some agents were logged out of chat.olark.com as a result of our rolling authentication process and were unable to re-authenticate due to the increased load.
This combination of factors exacerbated the authentication issue and extended the time it took for our systems to recover. At 7:40 PM UTC we started rolling out additional capacity to handle the increased load. Around 7:48 PM UTC, our services began recovering on their own, before the additional capacity had finished deploying. While our servers were recovering, we narrowed down the cause to a failing health check in one of the services responsible for authentication and serving our website. Once we found it, we deployed a fix at 7:55 PM UTC.
On Friday, May 4th at 2:48 PM UTC, we were alerted to our dashboard and login being down. We immediately began investigating and found increased load on the service that provides subscription information for banners in the dashboard section of the website. By 3:00 PM UTC, services had recovered and agents were able to log in again. We began rolling out temporary fixes to reduce the load on the impacted systems; however, the deploy introduced a subsequent issue that prevented dashboard pages from loading. We had a fix for that issue in production at 3:20 PM UTC, and our system returned to normal. Additional capacity for the impacted services was deployed by 3:32 PM UTC, at which point we entered monitoring.
Our systems have remained stable since these fixes were implemented.
Thank you for your understanding and patience during these outages. Every time something like this happens, we conduct internal post-mortems and review our processes to prevent anything similar from happening again. As a result, we have implemented a number of improvements:
Updates have been made to our authentication systems to prevent minor disruptions from affecting agents' ability to chat.
We have ensured that agent login is the highest priority by isolating our authentication services and increasing capacity.
Our service providing subscription data has been decoupled to ensure it doesn't impact agents' ability to log in.
Once again, we apologize for the impact this had on your day-to-day work last week. I hope our actions and communication have helped in some way to demonstrate how much our team values your trust.
Nick Crohn, Director of Engineering