On Sunday, November 24 from approximately 19:12 PST until 19:38 PST, and again on Monday, November 25 from 08:23 PST until 09:57 PST, Olark experienced a partial outage that affected some agents’ online/offline status. During these windows, some agents’ status was set to offline involuntarily, and some agents received incoming chats while their status was set to away.
What happened: Our internal monitoring systems first alerted our team to an issue with one of our event queuing nodes on Sunday around 19:00 PST. Shortly thereafter, customers began reporting issues with agent status. After an initial investigation, our team restarted the queuing node, and systems began to recover quickly. By 19:38 PST systems had fully returned to normal operation.
On Monday, November 25, around 08:20 PST, we received new reports of agent status problems for some agents and began additional investigation. Initially the issue looked like a state mismatch caused by caching in internal systems, and we took steps to resolve that mismatch. However, internal tests showed that the presence issue persisted for some agents after those steps. We then restarted the same queuing node that had caused the issues the previous day, which resolved the status issue and returned systems to normal operation by 09:57 PST.
Why this happened: Our event queuing system is integral to the proper functioning of Olark. Our initial investigation found that a single node in our event queuing cluster lost network connectivity and required a hard reboot on Sunday night. Although the queuing system is designed for redundancy, this node did not disconnect cleanly from the cluster. When the node came back up after the first restart, our testing and monitoring showed all systems working correctly. Once it became clear on Monday morning that there was still some remaining cluster state to resolve, we performed a managed restart of the node, which forced traffic to be redistributed properly to the other nodes. We concluded that the original restart left some cluster state unresolved, which caused the issues on Monday morning.
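To make the failure mode concrete, the sketch below shows the kind of cluster-state consistency check that can catch an unclean disconnect: each node reports which peers it believes are connected, and a node whose view disagrees with the rest of the cluster indicates unresolved state. The node names, status endpoint, and response format are simplified placeholders, not our actual internal APIs.

```python
import json
import urllib.request

# Placeholder node status endpoints; the real cluster topology and API
# are internal and simplified here for illustration.
NODES = {
    "queue-1": "http://queue-1.internal:15000/cluster-status",
    "queue-2": "http://queue-2.internal:15000/cluster-status",
    "queue-3": "http://queue-3.internal:15000/cluster-status",
}

def fetch_peer_view(url):
    """Return the set of peers this node believes are connected."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return set(json.load(resp)["connected_peers"])

def check_cluster_state(nodes):
    """Flag nodes whose view of the cluster disagrees with the full node set.

    A node that was hard-rebooted without disconnecting cleanly can come
    back "up" while stale membership state lingers; comparing each node's
    peer list against the expected set surfaces that mismatch.
    """
    expected = set(nodes)
    problems = []
    for name, url in nodes.items():
        try:
            view = fetch_peer_view(url)
        except OSError as err:
            problems.append((name, f"unreachable: {err}"))
            continue
        missing = expected - view - {name}
        if missing:
            problems.append((name, f"does not see peers: {sorted(missing)}"))
    return problems

if __name__ == "__main__":
    for node, detail in check_cluster_state(NODES):
        print(f"ALERT {node}: {detail}")
```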
How we are preventing this in the future: Our goal is to detect this kind of cluster state issue and correct it immediately. We already have significant monitoring at the infrastructure, queue, and application levels. However, as a result of this incident, this week we are immediately: (1) adding new end-to-end regression tests for this specific application failure, and (2) asking our on-call engineers and support team to be extra vigilant over this Thanksgiving holiday week and to increase the frequency of manual supervision of our systems. We have already increased the alert level for existing automated monitors, and we have added new diagnostic tools that help our on-call engineers detect this specific issue.
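To illustrate what the new end-to-end regression test looks like in practice, the sketch below flips a test agent's status and polls the read path until the change has propagated through the event pipeline, failing loudly if it never does. The URLs, payloads, and agent identifier are placeholders standing in for our internal presence API.

```python
import json
import time
import urllib.request

# Placeholder presence endpoint; the real API is internal and not
# documented here, so treat this URL and payload as illustrative.
STATUS_URL = "https://presence.internal/api/agents/{agent_id}/status"

def set_status(agent_id, status):
    """Write an agent's status through the normal event path."""
    req = urllib.request.Request(
        STATUS_URL.format(agent_id=agent_id),
        data=json.dumps({"status": status}).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req, timeout=5)

def get_status(agent_id):
    """Read the agent's status as downstream systems see it."""
    url = STATUS_URL.format(agent_id=agent_id)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)["status"]

def test_presence_round_trip(agent_id="regression-test-agent", timeout=30):
    """End-to-end check: a status change must become visible downstream."""
    set_status(agent_id, "away")
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_status(agent_id) == "away":
            return
        time.sleep(1)
    raise AssertionError(
        f"status change for {agent_id} did not propagate within {timeout}s"
    )
```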
Beyond those immediate monitoring and correction steps, we aim to prevent these issues from happening in the first place. If one node exits our event queuing cluster, the remaining nodes should "heal" the cluster and immediately pick up its work without customer impact. Since healing did not occur correctly in this case, our engineering team's top priority is identifying and implementing a fix that ensures all disconnects and reconnects are clean, so that this core component of our infrastructure remains stable.
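As a simplified model of what that healing should look like, the toy sketch below reassigns only the work owned by a departed node to the surviving nodes, leaving everything else in place; it illustrates the intended behavior rather than our actual queuing implementation.

```python
def assign_partitions(partitions, nodes):
    """Spread queue partitions evenly across the live nodes."""
    live = sorted(nodes)
    return {p: live[i % len(live)] for i, p in enumerate(sorted(partitions))}

def heal(assignment, failed_node, nodes):
    """Reassign only the partitions owned by a node that left the cluster.

    The intended behavior after a disconnect: the remaining nodes pick up
    the departed node's work immediately, and all other work stays put.
    """
    survivors = sorted(n for n in nodes if n != failed_node)
    healed = dict(assignment)
    orphaned = sorted(p for p, owner in assignment.items() if owner == failed_node)
    for i, p in enumerate(orphaned):
        healed[p] = survivors[i % len(survivors)]
    return healed

if __name__ == "__main__":
    nodes = ["queue-1", "queue-2", "queue-3"]
    before = assign_partitions(range(9), nodes)
    after = heal(before, "queue-2", nodes)
    moved = {p for p in before if before[p] != after[p]}
    print(f"partitions reassigned after queue-2 left: {sorted(moved)}")
```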