Summary of the Incident:
On January 29th, around 10:30 am PST, we began receiving reports that delivery of messages from agents to visitors was experiencing significant delays. We immediately investigated and determined that a component core to message delivery had become unstable. The affected components were restarted, and chat was operating at full capacity again at approximately 2:30 pm PST.
On the evening of January 29th and through January 30th, some customers experienced residual issues with visitors appearing "stuck" in their visitor lists. Our developers manually cleared these for customers as reports came in while also working on a more permanent fix.
On February 2nd, around 10 am PST, we began receiving new reports that message delivery was delayed in a fashion similar to January 29th. Investigation revealed that the same component was experiencing issues again. The component was restarted, and additional capacity was added to provide redundancy during restarts and to help clear message delivery queue backlogs quickly. Chat was operating at full capacity again at approximately 3 pm PST on February 2nd.
The problem:
Components of our system that are normally very stable suffered a cascading failure for which we are still investigating the root cause. We believe the impact was compounded by a relatively unsophisticated failover process, which we are now in the process of improving.
How we’re preventing this from happening in the future:
We're adding additional capacity headroom and upgrading the cluster to use a more sophisticated failover mechanism that will make it more resilient in these types of circumstances.