Incident Alerted
Incident Report for Olark Live Chat

First and foremost, our engineering team and I are sincerely sorry for preventing your teams from accessing chat on Thursday, November 8th. We understand that you rely on Olark to communicate with your customers and that you put your trust in us to provide a stable service. We didn’t hold up our end of that on the 8th, and we hope to rebuild that trust.

Our team takes any outage very seriously. After any outage we perform a detailed postmortem to understand what went wrong, what we can learn, and most importantly, what we can do to ensure it doesn’t happen again.

Summary of incident

On Thursday, November 8th at 4:00 AM PST the SSL certificate being served on expired. This certificate was valid and owned by Olark, but had not been included in our most recent update; all other systems were serving our new SSL certificate. Unfortunately, our routine SSL certificate update process required human intervention for verification, and as a result of human error we failed to update the certificate on the domain.

The expired certificate caused browsers to display an insecure site error and prevented agents from logging in to chat. The issue was resolved at 8:07 AM PST.

The expired certificate did not affect the security of your data or website in any way.

What happened

October 22, 2018

  • New SSL certificates are configured and rolled out to Olark's infrastructure and load balancers. Manual verification begins to ensure all public endpoints are configured properly and serving the new certificate.
  • Due to human error, the endpoint was overlooked in verification. As a result, this portion of our system continues to serve the old SSL certificate. All other systems are correctly serving the new SSL certificate.

November 8, 2018

  • 4:00am PST: Our old SSL certificate expires. Since the endpoint was mistakenly serving the old certificate, agents were not able to log in to chat.
  • 5:20am PST: Customer service flags our incident response team to investigate and resolve the issue.
  • 6:03am PST: Incident response team identifies the misconfigured load balancer. However, due to gaps in shared knowledge, the incident response team was unable to quickly identify the complete set of load balancers that needed the new SSL certificate. The incident response team was also unaware of important log messages that pointed to the specific issues preventing the load balancers from serving the new certificate.
  • 7:55am PST: After consulting with engineers more familiar with the system, the incident response team was finally able to view the correct logs, identify the root cause, and trigger proper loading of the new certificate on the correct load balancers.
  • 8:07am PST: Load balancers finish restarting and begin serving the new SSL certificate. Agents are able to log in and service is restored.

Fixing this problem took longer than is acceptable. Our incident response team lacked critical information. The engineers who had the specific knowledge necessary to isolate the issue were out of the office, so identifying the source of the problem took longer than it should have.

What we are doing to improve

We are taking several steps to follow up on this issue, both to make sure that this particular problem doesn't resurface, and to improve our team's ability to handle outages in general:

  1. Performing immediate re-validation of all public endpoints for Olark to ensure they are serving the correct up-to-date SSL certificates.
  2. Implementing automated SSL validation to reduce the risk and impact of human error.
  3. Improving our internal documentation for critical parts of our infrastructure.
  4. Implementing regular review of incident procedures, as well as running mock incidents for our team to practice with new systems and stay sharp.
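
As an illustration of what the automated SSL validation in step 2 could look like, here is a minimal sketch in Python. It is not Olark's actual tooling. The function names, the 30-day warning threshold, and the hostnames in the usage note are assumptions made for the example. The check connects to each public endpoint, reads the expiry date of the certificate that endpoint is actually serving, and reports any endpoint that is close to expiry or that fails verification.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp."""
    # notAfter uses the form "Jan  5 09:34:43 2018 GMT".
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # An already-expired certificate fails verification here and raises.
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_endpoints(hosts, warn_days=30, now=None, fetch=fetch_not_after):
    """Return (host, message) pairs for endpoints that need attention."""
    now = now or datetime.now(timezone.utc)
    problems = []
    for host in hosts:
        try:
            remaining = days_until_expiry(fetch(host), now)
        except (OSError, ssl.SSLError) as exc:
            # Covers connection failures and certificate verification errors.
            problems.append((host, f"check failed: {exc}"))
            continue
        if remaining < warn_days:
            problems.append((host, f"expires in {remaining:.1f} days"))
    return problems
```

Run on a schedule against the full list of public endpoints, for example `check_endpoints(["chat.example.com", "api.example.com"])` with placeholder hostnames, a check like this flags an endpoint still serving an old certificate well before that certificate expires, without relying on manual verification.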

Our commitment to you is that we can and will do better, and that we will continue to improve so that we can serve you better in the future. My sincere hope is we’ll regain the trust we lost on the 8th by showing you that we deliver on our commitments.

If you have any additional concerns or questions, don’t hesitate to reach out to us through

Nick Crohn
Director of Engineering

Posted 6 months ago. Nov 09, 2018 - 18:11 EST

This incident has been resolved.
Posted 6 months ago. Nov 08, 2018 - 13:12 EST
We've implemented a fix and things should be headed back to normal. Please give your browser a refresh before logging in. Our apologies for the interruption.
Posted 6 months ago. Nov 08, 2018 - 11:14 EST
We're continuing to work on a fix for this issue. We're sorry for the interruption in service and we'll be sending an email to all of our customers as soon as this issue is resolved. We'll update with more info as we have it.
Posted 6 months ago. Nov 08, 2018 - 10:34 EST
We've received reports of a certificate issue making inaccessible. We're working to resolve this ASAP.
Posted 6 months ago. Nov 08, 2018 - 08:54 EST
We've detected an issue and are working to resolve this quickly. We'll have an update within the hour.
Posted 6 months ago. Nov 08, 2018 - 08:21 EST