Beginning Friday, May 9th, we detected elevated disconnect rates in our Classic (legacy) agent console (chat.olark.com). Agents reported being unexpectedly disconnected from the console while logged in; they could reconnect by refreshing the page. Only the Classic app was affected; our modern app was not affected by this issue.
After an initial investigation, our engineering team changed the disconnection retry logic in the Classic agent console in an attempt to improve it. Unfortunately, this exacerbated the problem for affected agents, so we rolled those changes back.
To avoid taking Olark Classic completely offline for a full maintenance window, our engineering team began incrementally rebooting individual systems and clearing their state. This helped, but disconnect rates remained elevated.
Ultimately, we re-provisioned and upgraded the cluster where our Classic console runs, both to forcibly clear all possible network state and to rule out any underlying Google Cloud network issues that could cause these interruptions. After the cluster upgrade completed, we saw network interruptions slowly return to normal levels.
Final mitigations were in place by the afternoon of Friday, May 23rd, and we monitored the rate of disconnects closely over the holiday weekend. By the morning of Tuesday, May 27th, we were confident that systems were fully stable, and we marked the incident resolved.
Our engineering team has added new monitoring to detect similar issues in the future, and we now have a runbook for re-provisioning the relevant systems if this occurs again.
Additionally, our customer service team is actively working with legacy customers (those still using our Classic agent console) to help them migrate smoothly to our new app for an improved agent experience.