Over the past few weeks, you have likely noticed some repeated issues with Olark uptime and performance. Our engineering team and I hold our product to a high standard of stability and reliability, and I don't believe we have met that standard over the past 3 weeks. We're sorry for that. I'd like to take some time here to briefly summarize the impact, what happened behind the scenes, and what we have done to fix the root cause of these issues.
Our team takes any outage or performance issue very seriously. After repeated production issues like this, we perform a detailed postmortem to understand what went wrong, what we can learn, and (most importantly) what we can do to ensure it doesn’t happen again.
Summary of recent incidents
Here is a brief timeline of customer impact and actions taken by our team:
- Sept 10: For a period of 1 hour, chat login was intermittent, some chat boxes were not loading on customer websites, and chat performance was slow. Our engineering team implemented mitigation measures while working to identify the root cause.
- Sept 11: For a period of 20 minutes, similar behavior was observed; our engineering team continued to use mitigation measures and to work on identifying the root cause.
- Sept 16: For a period of 5 hours, similar behavior was observed. The duration was exacerbated by an unexpected surge of chat traffic and new agent activity from a single customer. Our engineering team continued to use mitigation measures and to work on identifying the root cause. Additional early-warning monitoring was added to our system to catch symptoms much more quickly.
- Oct 2: For a period of 20 minutes, similar behavior was observed; our engineering team continued to use mitigation measures and to work on identifying the root cause. We implemented and deployed fixes for a potential root cause.
- Oct 8: For a period of 2 hours, similar behavior was observed; our engineering team stabilized the system with mitigation measures. During this time, we stopped showing non-chatting visitors to agents for a few hours; this ensured chatting visitors could continue without interruption. After this incident, we finally had clear evidence of the root cause. Fixes were implemented and deployed.
We believe the primary impact during each incident was chat being unavailable to some customers and some agents struggling to log in. No messages were lost, although delivery was delayed in many cases; in some cases the delivery delay may have been severe enough to prevent normal conversation.
There is a lot of repetition in this timeline, which is not what we want to see. Although our team was able to use mitigation measures to stabilize the system each time, the root cause was complex and difficult to identify. We dedicated a 3-engineer team to finding the root cause and implementing fixes from Sept 16 until the final (successful) fix on Oct 8.
In brief: our engineering team embarked on upgrades of two major chat-related systems (core user accounts & core message delivery) during August and September. These newly upgraded systems had specific issues that interacted to create a mode where the system would build a "wave" of traffic and become stuck in an overwhelmed state.
Here is a more technical overview of what happened behind the scenes:
- End of July: we made a significant upgrade and migration of our user account service, which underlies most systems at Olark from login to chat routing. This was done after significant QA and preparation in June, and appeared to be a smooth migration.
- During August: we monitored newly migrated user account service for performance issues and bugs. Weekly improvements and fixes were released.
- Early September: we identified and fixed a case where some (non-critical) timestamp metadata was not being updated properly on user accounts. During this time, we also began upgrades and simplifications to a core message delivery system.
- Mid-September: incidents (described in the timeline above) began occurring. Our team implemented key mitigation measures for recovery while we investigated the root cause.
- Early October: the root cause was identified. The non-critical timestamp fix in the user account service caused severe performance degradation for accounts with high numbers of agents and groups, because it saved data and invalidated caches far more often than necessary. This slowdown would then interact with the upgrades to core message delivery, which had a case where slowdowns could eventually build a very large "wave" of traffic that overwhelmed the rest of the system. Once the system was overwhelmed, it would take 10-20 minutes to recover.
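To illustrate how a short slowdown can turn into a long-lasting "wave", here is a toy queue model. It is only a sketch under stated assumptions, not a model of our actual systems: the function name `simulate_backlog`, the tick-based timing, and every number below are hypothetical and chosen purely for illustration.

```python
def simulate_backlog(ticks, arrivals_per_tick, normal_capacity,
                     slow_capacity, slow_from, slow_until):
    """Return the number of undelivered messages at the end of each tick.

    Hypothetical toy model: messages arrive at a constant rate, and the
    system can process `normal_capacity` messages per tick, except during
    the slowdown window [slow_from, slow_until), when it can only process
    `slow_capacity`.
    """
    backlog = 0
    history = []
    for tick in range(ticks):
        in_slowdown = slow_from <= tick < slow_until
        capacity = slow_capacity if in_slowdown else normal_capacity
        backlog += arrivals_per_tick          # new traffic joins the queue
        backlog -= min(backlog, capacity)     # process as much as we can
        history.append(backlog)
    return history


# Illustrative numbers only: the system normally has a little headroom
# (100 processed vs. 90 arriving per tick), then slows down for 10 ticks.
history = simulate_backlog(
    ticks=80, arrivals_per_tick=90, normal_capacity=100,
    slow_capacity=40, slow_from=5, slow_until=15,
)
print("peak backlog:", max(history))                   # peak backlog: 500
print("first empty tick after slowdown:",
      next(t for t in range(15, 80) if history[t] == 0))
```

In this toy setup, a 10-tick slowdown builds a backlog of 500 messages, and because the system has only a small amount of spare capacity, that backlog takes about 50 further ticks to drain. The general point is that recovery time depends on spare capacity, not on how long the original slowdown lasted.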
What we are doing to handle this better in the future
Identifying and fixing the root of these repeated issues took far longer than we consider normal. Although we dedicated 3 engineers to this, the root cause was not identified, and the final fix was not deployed, until more than 3 weeks after the initial incident.
We are taking several steps to make sure that this particular problem doesn't resurface, and to improve our team's ability to handle future issues quickly:
- Improved performance for core user account service. We began batching updates for agents and groups within accounts. This dramatically improved the performance for large accounts that triggered and exacerbated the issues we saw during this time.
- Protected core message delivery from future slowdowns in the system. We added a status that allows the messaging system to recognize immediately when certain resources are unavailable. This prevents the buildup of "waves" of traffic and ensures performance degradation elsewhere cannot overwhelm the system.
- Added a new mitigation tool allowing better chat service during recovery. Previously, when the system became stuck we would have to stop the flow of both chatting and non-chatting visitors during a period of recovery. Our new tooling allows chatting visitors to continue interacting with agents during recovery.
- Improved "early warning" monitoring for performance degradation. In this process of upgrading systems and resolving issues, we identified new key indicators of system performance and stability. We now have these graphs in front of the on-call engineer at all times, with thresholds for alerting to catch issues before customer impact.
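For readers curious about what an availability status like the one mentioned above might look like, here is a minimal sketch in the style of a circuit breaker. It is not our production code: the names `DependencyStatus`, `ResourceUnavailable`, and `deliver`, as well as the threshold and cooldown values, are hypothetical and exist only to show the general idea of failing fast instead of queueing work behind a slow resource.

```python
import time


class ResourceUnavailable(Exception):
    """Raised immediately when a dependency is marked unavailable."""


class DependencyStatus:
    """Hypothetical availability flag for one downstream resource."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._unavailable_since = None

    def record_success(self):
        # Any success clears the failure count and the unavailable mark.
        self._failures = 0
        self._unavailable_since = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._failure_threshold:
            # Mark (or re-mark) the resource unavailable from this moment.
            self._unavailable_since = self._clock()

    def is_available(self):
        if self._unavailable_since is None:
            return True
        # After the cooldown, allow a trial request through.
        elapsed = self._clock() - self._unavailable_since
        return elapsed >= self._cooldown_seconds


def deliver(message, status, send):
    """Send a message, failing fast if the dependency looks unhealthy."""
    if not status.is_available():
        # Reject right away rather than adding to a growing backlog.
        raise ResourceUnavailable("dependency marked unavailable")
    try:
        result = send(message)
    except Exception:
        status.record_failure()
        raise
    status.record_success()
    return result
```

In this sketch, once a resource has failed several times in a row, callers get an immediate error for the length of the cooldown instead of waiting on it, which keeps slow requests from piling up. A caller would pass its own `send` function, for example `deliver(msg, status, send=my_transport.send)`, where `my_transport` is likewise a stand-in name.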
We're always committed to improving transparently, so we can serve you better in the future. My hope is this helps restore and maintain the trust we've built with you over the years.
If you have any additional concerns or questions, please don’t hesitate to reach out to us through firstname.lastname@example.org.