Downtime incident, April 4, 2021
We appreciate your patience with our team last week. In short: on April 4th, an issue affecting some customers persisted for about 18 hours, and we did not detect it or alert those customers early. Our team sets an expectation of identifying and resolving issues much more quickly than we did on April 4th; I'm sorry we failed to meet that expectation this time. In the spirit of transparency and trust, our engineering team put together a summary of what happened, its impact, and how we are preventing it from happening again. Please read on for more details.
We're an open book, and we welcome any further questions you have for our team at firstname.lastname@example.org. I'll be available there for deeper discussion if needed. Thanks again for your patience.
On Sunday April 4th around 02:00 UTC, our engineering team made a series of upgrades to our inbound HTTP networking during a maintenance window. After initial monitoring, the upgrades appeared to be successful, and we closed the maintenance window.
What was the impact?
For about 18 hours (02:00 UTC - 19:50 UTC) on Sunday April 4th, a portion of our customers had issues loading the chat box on their website. This would have resulted in some of your visitors not seeing the option to chat with you during that time period. Once the issue was resolved, the chat box appeared immediately for your visitors again.
Why did it happen this way?
For any of our customers to experience an issue like this for almost 18 hours, and moreover for us to lack early alerting to it, is unacceptable in our view. Our engineering team spent time assessing the root causes so we could share them transparently:
- Our system configuration was missing one server. During prior system migrations, our team applied some manual fixes to smooth those migrations. These fixes were never folded back into our canonical configuration, so our cluster management was unaware of one asset server when we upgraded our HTTP networking. As a result, the April 4th upgrade blocked traffic from one of our external load balancers to that asset server, leaving the CDN unable to update from it.
- Slow-building expiration hid the extent of the issue. The CDN expired these assets slowly, so there was initially no strong signal of any problem. Although engineering investigated a few early signals, it took a number of hours for enough customers to be affected before we recognized the issue.
- Upgrading systems during the low-traffic weekend hid the extent of the issue. We usually try to do upgrades near normal business hours so that we can immediately see any big differences in traffic patterns. However, we felt this upgrade might be more disruptive, so we made the tradeoff and scheduled it for the weekend. The weekend's low traffic meant our alert thresholds didn't catch anything immediately out of the ordinary.
- Our monitoring lacked immediate visibility into this situation. While our system produced some signals, our alerting thresholds were not set properly to warn engineers early enough to resolve the issue quickly. In particular, we had no alerts for the errors the CDN saw when trying to reach the asset backend, and no monitoring for out-of-normal traffic patterns on weekends.
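To illustrate the threshold gap described in the last two points (all numbers here are hypothetical, not our production values): an alert that fires on a fixed absolute error count, tuned for weekday volume, can stay silent during low-traffic weekend hours even when the underlying error rate is just as bad.

```python
def breaches_fixed_threshold(error_count: int, threshold: int = 500) -> bool:
    """Static alert: fires only when absolute errors/hour exceed a fixed count."""
    return error_count > threshold

# Hypothetical volumes with the same 10% error rate on both days.
weekday_requests, weekend_requests = 50_000, 3_000
error_rate = 0.10

weekday_errors = int(weekday_requests * error_rate)  # 5000 errors/hour
weekend_errors = int(weekend_requests * error_rate)  # 300 errors/hour

print(breaches_fixed_threshold(weekday_errors))  # True: alert fires
print(breaches_fixed_threshold(weekend_errors))  # False: same failure, no alert
```

This is why a rate- or deviation-based signal, rather than a fixed count, matters for off-peak windows.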
What are we doing to prevent this in the future?
Our goal is to prevent issues, and where we can't prevent them we want to detect them early and resolve them quickly. To that end, we have immediately implemented changes to catch this issue, and issues like it, in the future:
- Continue to reduce reliance on manual system configuration. Our engineering team has been continuously migrating old infrastructure to a more automated system that presents less chance for human error (like the missing configuration we saw in this incident). As our team continues these migrations, we expect to eliminate this kind of issue.
- Improved monitoring for "out of normal" traffic from the chat box. We added monitoring that tells us when chat box load rates and chat volumes deviate from historical norms. This gives us a better signal of whether there are issues, even on low-traffic weekends. We evaluated it against the April 4th downtime: this new monitoring would have caught a similar slow-CDN-expiration issue more than 10 hours earlier.
- Added monitoring for missing backends. Our system will now alert us immediately in the kind of situation we saw during this incident, where our CDN cannot properly reach an internal service. We anticipate this will also catch similar issues much earlier than our existing alerting.
- Added the customer support inbox to the engineering checklist for off-hours maintenance. When we do maintenance off-hours (e.g. on weekends like this one), our customer service team may not be available to immediately flag issues. In these situations, our engineers now check the inbox directly to ensure no customers have reached out with potential issues to investigate. We also plan to add automatic inbox checks to remind our team and catch hidden issues much earlier.
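The deviation-from-historical-norms monitoring in the list above can be sketched roughly as follows. The function name, window, and thresholds are illustrative, not our production code: compare the current chat box load rate against the mean and standard deviation of the same hour in recent weeks, and flag anything that lands several standard deviations away.

```python
from statistics import mean, stdev

def deviates_from_norm(current: float, history: list[float], n_sigma: float = 3.0) -> bool:
    """Flag when the current load rate is more than n_sigma standard
    deviations from the historical norm for the same hour/day-of-week."""
    mu, sigma = mean(history), stdev(history)
    return abs(current - mu) > n_sigma * sigma

# Hypothetical Sunday-02:00 chat box loads per minute from recent weeks.
history = [120.0, 115.0, 130.0, 125.0, 118.0, 122.0]
print(deviates_from_norm(121.0, history))  # False: normal weekend traffic
print(deviates_from_norm(40.0, history))   # True: sudden drop triggers an alert
```

Because the baseline is per-hour and per-day-of-week, this kind of check stays sensitive even when absolute weekend volumes are low.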
We know that when our system fails, it affects your ability to run your business and serve your own customers. We apologize for the impact of the incident described here, and we hope this explanation reassures you that we take our commitment to your business seriously. Again, if you have any further questions, please reach out!