For several hours on Tuesday, January 15th, Olark customers experienced intermittent issues toggling their chat availability status on chat.olark.com. We traced these issues to a recent code update, and ultimately reversed the update to restore reliable service.
If your business was affected by these issues, we sincerely apologize. We know that live chat is important to you and your customers, and that any interruption in our service has an impact on you. In this case, we didn’t provide the service you deserve, and that's not something we take lightly.
After any outage, we perform a detailed postmortem to understand what went wrong, what we can learn, and most importantly, what we can do to ensure it doesn’t happen again. We share summaries of every postmortem so that you can hold us accountable for improving our service. You'll find a detailed summary of last Tuesday's incident, and our plans for improvement, below. If you have any additional concerns or questions, don’t hesitate to reach out to us through firstname.lastname@example.org.
Director of Engineering
On Monday, January 14th, 2019 at 8:00 PM UTC we shipped an improvement to our caching architecture tied to our internal services. Following the deploy, our monitoring indicated that the improvement was performing as expected.
On January 15th, 2019 at 2:41 PM UTC we received an automated alert that queues were backed up. We began investigating, and around 3:35 PM UTC, we identified the cause and began working on a fix. At 3:45 PM UTC we called a downtime, and shortly after, at 4:01 PM UTC, we had a fix merged and started rolling it out to production. Queues began dropping, and since we believed we had solved the issue, we moved to monitoring.
Several hours later, we received new reports of customers being unable to change their status on chat.olark.com. At 7:21 PM UTC, we called another downtime. Our engineering team started investigating the root cause, and around 7:30 PM UTC, we made the decision to deploy additional capacity to handle the increased load we were seeing on our internal services. However, the process of deploying additional capacity was slower than we'd anticipated, and ultimately introduced additional problems.
At 10:00 PM UTC we made the call to revert the caching architecture improvements to a known working version. Shortly after the changes were reverted, our internal services started to recover and began working through the backup on our queues.
As part of our ongoing efforts to improve our services, we implemented an improved caching architecture. The improvements involved an older part of our system, and the structure of that older code made implementing and testing the changes unusually complex.
While working towards releasing the change, we tested in our staging environment and ran load tests against the changes. During that testing everything appeared to be working. After we deployed the improvement and had real traffic at peak load, however, we realized that the change was causing some of our internal services machines to serve more requests than usual. As a result, those services were unable to keep up with the load. Our queues began backing up, which prevented users from performing certain tasks, including changing their chat availability status.
During the downtime, we were focused on fixing the new, improved caching architecture, rather than on rolling back the change. The engineering team also had concerns that rolling back under peak load would make the situation worse.
After every downtime, our engineering team meets to discuss what happened, identify the root cause(s), and generate a list of action items that will help us prevent the issue from happening again. Some of the things that came out of our internal postmortem for this incident were:
– Rolling back to a known working version is always better than trying to fix an issue while we are in an active downtime. We took too long to come to that conclusion this time, and you, our customers, felt the impact. We can do better, and have implemented some checks to remove any subjectivity when we are initially investigating a downtime.
– Any changes we roll out need to have a well-documented rollback plan, so we can quickly get back to the known working version.
– Our engineering team is prioritizing work to improve our internal services, which will make rolling out additional capacity much faster and more reliable.
Our commitment to you is that we can and will do better, and that we will continue to improve our service for you.