On Sunday, October 20th at approximately 4:10pm PDT, Olark experienced a significant system outage that was not fully resolved until approximately 10:15am PDT on Monday morning. During this time, the majority of chats were affected and messages were not delivered or stored. We know Olark plays a critical role for your team and business, and we are deeply sorry for the impact this had on your operations earlier this week. We want to provide you with an explanation of what happened, why it happened, and the steps our team is taking to ensure this doesn't happen again.
What happened: Our team was first alerted to potential issues late Sunday afternoon, October 20th (4:10pm PDT), after one of our event queuing nodes at our service provider was rebooted. Our engineers performed an initial investigation: the node appeared to have rebooted successfully, and we did not find any other signals in our systems indicating service impact. Early Monday morning, at approximately 7:00am PDT, our team was alerted again, began investigating further with the help of customer-reported symptoms, and identified a drop in event throughput. We could then see that specific events were being queued but not processed; further investigation revealed that our message queuing cluster reported a "running" state but was not actually storing or processing the events it received. After restarting the cluster, we determined that its internal state was corrupted and it still could not store or process events, due to a rare bug in the cluster software in which a single node reboot can corrupt the cluster state on disk. Once we identified this root cause, we were finally able to resolve it by recreating all of our event queues and restarting the entire cluster and all clients. By prioritizing real-time systems first, we were able to restore chat functionality around 9:30am, with the rest of the system recovering over the following half hour.
Why this happened: Because weekend event levels are low, our engineers unfortunately did not detect the abnormal drop in event throughput when the alert fired on Sunday afternoon. This was an oversight on our part, and we take full responsibility. Additionally, root-causing and resolution were slow because this was a rare bug in the event queuing cluster that we had not observed in any prior reboots. When this bug occurs, the cluster reports itself as fully functional, and clients see their events accepted as if queuing were working normally even though the events are never stored, so root-causing from symptoms took additional time. In the past we have relied on queue backup alerts; in this case the cluster was accepting these events but not populating the queues correctly, so our typical monitoring was never triggered.
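To illustrate the gap: a queue backup alert only fires when messages accumulate in a queue, so a cluster that silently drops the events it accepts never trips it, while a simple comparison of events in versus events out does. The sketch below is illustrative only; the metric names and thresholds are hypothetical and are not our production configuration:

```python
# Minimal sketch (not our production code) of why a queue-depth ("backup")
# alert misses this failure mode. Thresholds and names are hypothetical.

def queue_backup_alert(queue_depth: int, max_depth: int = 10_000) -> bool:
    """Fires only when messages pile up in the queue."""
    return queue_depth > max_depth

def throughput_alert(enqueued_per_min: int, processed_per_min: int,
                     min_ratio: float = 0.9) -> bool:
    """Fires when far fewer events come out of the pipeline than go in."""
    if enqueued_per_min == 0:
        return False  # idle period, nothing to compare
    return processed_per_min / enqueued_per_min < min_ratio

# During this incident the cluster accepted events but never stored them,
# so the queue never "backed up":
print(queue_backup_alert(queue_depth=0))             # False -> no page
print(throughput_alert(enqueued_per_min=5_000,
                       processed_per_min=0))         # True  -> would have paged
```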
How we are preventing this in the future: Our highest priority is ensuring that our systems remain as stable as possible for our customers, and to that end, ensuring that our systems alert engineers to severe symptoms like this without relying solely on the cluster's own health reporting and queue backup alerts. Our engineering team has identified better event-throughput metrics and end-to-end checks that will allow us to detect a situation like this immediately in the future. We have also taken steps to improve our internal procedures for escalating weekend alerts through to full resolution.
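As an illustration of what we mean by an end-to-end check: a synthetic "canary" event is published through the real pipeline and verified on the consuming side, so a cluster that merely reports itself healthy cannot pass. This is a simplified sketch; the helper names, queue name, and timeout are hypothetical, not our actual implementation:

```python
# Minimal sketch of an end-to-end "canary" check, assuming hypothetical
# publish/wait_for helpers; all names and values are illustrative only.
import time
import uuid

CANARY_QUEUE = "monitoring.canary"   # hypothetical queue name
DEADLINE_SECONDS = 30

def run_canary_check(publish, wait_for) -> bool:
    """Publish a unique event and confirm it comes out the other end."""
    token = str(uuid.uuid4())
    publish(CANARY_QUEUE, {"type": "canary", "token": token,
                           "sent_at": time.time()})
    # wait_for should block until a consumer reports seeing the token,
    # or return None after the timeout.
    received = wait_for(CANARY_QUEUE, token, timeout=DEADLINE_SECONDS)
    return received is not None   # False -> page the on-call engineer

# Demo against a trivially healthy in-memory pipeline:
if __name__ == "__main__":
    seen = {}
    demo_publish = lambda q, event: seen.setdefault(event["token"], event)
    demo_wait_for = lambda q, token, timeout: seen.get(token)
    print("pipeline healthy:", run_canary_check(demo_publish, demo_wait_for))
```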