At 9:06pm PST on 25th January, we identified a network outage caused by maintenance performed by an upstream service provider. This was unrelated to the outage affecting logins to the Olark chat service yesterday. At 10:44pm PST, the service provider acknowledged that their routine maintenance had encountered problems and was affecting customers, including Olark. Once the issue was resolved on their end, we began restarting servers at around midnight.
Over the next few hours we saw an increasing number of errors and reports of missed messages between visitors and operators. By 8am PST on 26th January, we had identified an incorrect server configuration on a portion of our backend cluster that was producing the errors. We are continuing to investigate the root cause, but the misconfiguration was triggered by the server restarts that followed the upstream provider's maintenance.
By 9:25am PST we had developed a strategy to update and restart the affected servers with the correct configuration. This process finished at 10:12am PST, and after a period of monitoring, we resolved the incident at 10:55am PST.
Looking ahead, the positive news is that we have spent the last few months rewriting how the servers affected today are set up. Had the servers been running this new setup, this issue would likely have been avoided. These updates are due to be released imminently and were scheduled regardless of this particular outage.