At 9:06pm PST on 25th January, we identified a network outage caused by maintenance performed by an upstream service provider. This was unrelated to the outage affecting logins to the Olark chat service yesterday. At 10:44pm PST, the service provider acknowledged that their routine maintenance had encountered problems and was affecting customers, including Olark. Once the issue was resolved on their end, we began restarting servers at around midnight.
Over the next few hours we saw an increasing number of errors and reports of missed messages between visitors and operators. By 8am PST on 26th January, we had identified an incorrect server configuration on a portion of our backend cluster that was producing the errors. We are continuing to investigate the root cause, but the misconfiguration was triggered by the server restarts that followed the upstream provider's maintenance.
By 9:25am PST we had developed a strategy to update and restart the affected servers with the correct configuration. This process finished at 10:12am PST, and after a period of monitoring, we resolved the incident at 10:55am PST.
Looking ahead, the positive news is that we have spent the last few months rewriting how the servers affected today are set up. Had the servers been running this new setup, this issue would likely have been avoided. These updates are due to be released imminently and were scheduled regardless of this particular outage.