This morning, a memory leak in our operator chat console left our cluster in a state that required a restart.
Restarting these servers inadvertently triggered a separate code path, which disabled logins and authentication on our main chat server and effectively prevented operators from chatting. Customers started reporting issues at 10:03am EST.
After investigating, we reverted the errant code on our chat servers, which allowed authentication to work normally again. We gave the all-clear at 1:23pm EST after a period of monitoring.
Since this incident, we have taken measures to keep the memory leak from recurring, including improved logging and error handling. We are also putting a team together to add safeguards so that similar issues do not occur in the future.