Chatbox conflict with Underscore library
Incident Report for Olark Live Chat
Postmortem

Summary of the incident.

On May 16th at 11:58 AM PT we began gradually rolling out an update to our chatbox client. By 12:49 PM PT the new code was fully rolled out. Our support team then started receiving a few reports that Underscore had stopped working on customers' sites because of our chatbox code. We investigated immediately, determined that the problem was related to our deploy, and rolled back the code. By 1:12 PM PT the code had been restored to its previous state. After the rollback we began investigating why the deploy had caused issues.

The problem.

We found that a dependency we had added to our code base was using the lodash library in a way that overwrote the global variable _. The _ variable is also used by Underscore, another common JavaScript library. On sites where the _ variable already existed, it no longer referred to Underscore and instead referred to lodash, and the two libraries are not interchangeable. Because of this incompatibility, websites using Underscore and Olark during this period may have experienced issues with any portions of their website that relied on Underscore-specific behavior.
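As a rough illustration of this failure mode, the sketch below uses small stand-in objects rather than the real libraries. The names page, customerNames, and loadLeakyBundle are hypothetical. It relies on one real difference: Underscore provides _.pluck, which lodash 4 removed, so host code calling it breaks once the global is replaced.

```javascript
// Stand-in for the host page's global object (window in a browser).
const page = {};

// Hypothetical stand-in for Underscore: only the method this example needs.
page._ = {
  label: 'underscore-stand-in',
  pluck: (list, key) => list.map((item) => item[key]),
};

// Host-page code written against the Underscore API.
function customerNames(customers) {
  return page._.pluck(customers, 'name');
}

// Hypothetical stand-in for a bundle that assigns a different library
// to the same global name when it runs.
function loadLeakyBundle(globalObject) {
  globalObject._ = {
    label: 'lodash-stand-in',
    map: (list, key) => list.map((item) => item[key]),
  };
}

const customers = [{ name: 'Ada' }, { name: 'Grace' }];
const before = customerNames(customers); // works while _ is the first stand-in

loadLeakyBundle(page);

let failure = null;
try {
  customerNames(customers);
} catch (err) {
  failure = err; // page._.pluck no longer exists, so the call throws
}
```

Both real libraries expose a noConflict method that restores the previous value of _, but that only helps when the code doing the overwriting calls it.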

What went wrong?

  1. Our previous build system safeguards were not adequate to catch the leaked global variable before it was deployed. The build system already has safeguards designed to catch these types of issues at build time, using a combination of shadowing common globals and running JSHint on our final build. During Monday's build, the lodash dependency overwrote the global in a way that could not be detected at build time, only at runtime.

  2. Recovery took too long. Although we were able to roll back immediately on detecting the issue, our build and redeployment were too slow (approximately 10 minutes), and our 3-hour cache times meant that some browsers continued to run the affected build longer than necessary.

  3. Customer communication wasn’t transparent enough. The customers we affected were not notified immediately that something might be wrong with Olark, and spent significant time investigating the issue from their side. Our communication wasn’t as proactive as it should have been. While we did respond to those customers via Twitter and email, we were reluctant to update our status page with an issue that affected a small number of customers, as the update would have been sent to all customers who subscribe to notifications. In hindsight, we believe we should have been more transparent up front and posted an update indicating we were investigating the issue.

How are we going to prevent this in the future?

We spent last week working on new processes and tooling to prevent this in the future, as well as improve our customer communication during and after production issues.

Completed changes
  1. New build testing, which adds runtime checking for global variable leaks. This method boots Olark in a real browser environment on a real HTML page, and is far more accurate at identifying potential runtime issues that we previously could not catch at build time. We also improved build times by 6x, which will allow us to recover and redeploy more quickly if issues occur in the future.

  2. Reduced cache times for loaders, which cuts the amount of time that browsers retain cached copies of a bad deployment to 45 minutes (rather than 3 hours). We would like to reduce this time further (see below for upcoming changes).

  3. Locking down dependencies. For the time being we are locking all versions and not allowing new dependencies to be added to the code base. This will remain in effect until some of the upcoming changes are in place.

  4. Better customer communication with immediate and transparent updates via status.olark.com. In the future, we will be updating status.olark.com more proactively, particularly in cases where we are still investigating, to ensure our customers are aware of anything we think may be amiss.
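One way to enforce a dependency lock like the one in item 3 is a build-time check that every declared version is exact. The sketch below assumes an npm-style manifest with a dependencies map; the function name findUnpinned and the package names and versions are made up for illustration and do not describe Olark's actual tooling.

```javascript
// Matches an exact semver version such as "1.2.3" or "1.2.3-beta.1".
// Ranges like "^1.2.3", "~1.2.3", "1.x", or "*" do not match.
const EXACT_VERSION = /^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?$/;

// Returns the names of dependencies whose version is not pinned exactly.
function findUnpinned(dependencies) {
  return Object.entries(dependencies)
    .filter(([, range]) => !EXACT_VERSION.test(range))
    .map(([name]) => name);
}

// Illustrative manifest fragment; names and versions are hypothetical.
const exampleDependencies = {
  'example-store': '^1.4.0',
  'example-utils': '2.0.3',
  'example-widgets': '~0.9.1',
};

const unpinned = findUnpinned(exampleDependencies);
```

Exact versions in a manifest only pin direct dependencies. A transitive dependency, such as a library pulled in by another package, would need a lockfile or shrinkwrap file to stay fixed.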

Upcoming changes
  1. Firewalled chatbox code. This more closely approximates a sandboxed environment for our code, which will better prevent changes to our code base from affecting your site's code. We are investigating options for accomplishing this using friendly iframes.

  2. Improved loader caching. We are investigating reducing the overall loader size so we can further reduce our loader cache time from 45 minutes to under 5 minutes, allowing quicker production rollouts.
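To make the cache durations above concrete, the sketch below converts them to the seconds used in a Cache-Control max-age directive. That the loader's caching is controlled this way is an assumption, and loaderCacheControl is a hypothetical helper.

```javascript
// Builds a Cache-Control value allowing a response to be cached for
// the given number of minutes. How Olark's loader is actually served
// is not stated in this report; this is only an illustration.
function loaderCacheControl(minutes) {
  if (!Number.isFinite(minutes) || minutes < 0) {
    throw new RangeError('minutes must be a non-negative number');
  }
  const maxAgeSeconds = Math.round(minutes * 60);
  return `public, max-age=${maxAgeSeconds}`;
}

const previousPolicy = loaderCacheControl(180); // 3 hours
const currentPolicy = loaderCacheControl(45);   // 45 minutes
const targetPolicy = loaderCacheControl(5);     // upper bound of the stated goal
```

Under this model, the cache time is roughly the longest a browser could keep running a bad loader after a rollback, which is why shortening it shortens recovery.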

Engineering details.

Our chatbox code is bundled using the browserify library, along with some additional custom shadowing of globals to further isolate Olark from the host page. This does a good job of encapsulating our code and safeguarding the host page; however, it isn’t perfect. Unfortunately, with JavaScript there are still some ways to break this encapsulation.
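As a minimal sketch of one way such encapsulation can be bypassed, the example below wraps code in a function whose parameters shadow common globals, then shows code inside the wrapper still reaching the real global object through the Function constructor. The wrapper runShadowed and the source string are hypothetical and are not Olark's build output. lodash does use Function('return this')() as a fallback for locating the global object, but this report does not say whether that was the exact path involved here.

```javascript
// Stand-in for a dependency that writes to the global _ at runtime.
// It obtains the real global object from the Function constructor, so
// window, self, and _ never appear as free variables in the code.
const leakySource = `
  var root = Function('return this')();
  root._ = { label: 'leaked' };
`;

// Hypothetical wrapper in the spirit of "shadowing common globals":
// the wrapped code sees local parameters instead of the real globals.
function runShadowed(source) {
  const wrapped = new Function('window', 'self', '_', source);
  wrapped({}, {}, undefined);
}

runShadowed(leakySource);

// The shadowing did not help: the real global now has a _ property.
const leaked = '_' in globalThis && globalThis._.label === 'leaked';

// Clean up so the example leaves the global object as it found it.
delete globalThis._;
```

A static check for undeclared global assignments sees only a write to a property of the local variable root, so a leak of this shape shows up only when the code runs.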

The original problem during Monday's deploy stemmed from the lodash library included by one of our other dependencies, redux-persist. You can see the affected version of lodash with this issue here: https://github.com/lodash/lodash/issues/1852. Due to how lodash was being used, the global itself would not leak until runtime, which made it particularly difficult to detect with our current build safeguards before deployment.

We first addressed the build detection issue by adding new runtime safeguards to our build system, using PhantomJS to load the Olark chatbox on a clean page and running global variable detection both before and after the code is loaded and running. Before allowing a production deploy, we use this tool to verify our build so we can ensure our chatbox code isn’t polluting the page in which it’s loaded.
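The before-and-after comparison can be sketched as follows. This is not the PhantomJS tool described above: it substitutes Node's built-in vm module for a real browser page, and findGlobalChanges and the two sample bundles are hypothetical.

```javascript
const vm = require('vm');

// Runs source with sandbox as its global object and returns the names
// of globals that were added or replaced. The sandbox stands in for the
// page's global object; a real check would use an actual browser page.
function findGlobalChanges(source, sandbox = {}) {
  // Snapshot the globals before the code under test runs.
  const before = new Map(Object.keys(sandbox).map((key) => [key, sandbox[key]]));
  const context = vm.createContext(sandbox);
  vm.runInContext(source, context);
  // Compare against the globals present after the code has run.
  return Object.keys(sandbox).filter(
    (key) => !before.has(key) || before.get(key) !== sandbox[key]
  );
}

// A bundle that keeps everything local, and one that replaces _ at runtime.
const cleanBundle = '(function () { var local = 1; })();';
const leakyBundle = "Function('return this')()._ = { label: 'leaked' };";

// Simulate a host page that already defines _ before the chatbox loads.
const cleanResult = findGlobalChanges(cleanBundle, { _: 'host-underscore' });
const leakyResult = findGlobalChanges(leakyBundle, { _: 'host-underscore' });
```

A deploy gate built on this idea would refuse to ship whenever the returned list contains anything outside an explicit allowlist of globals the chatbox is expected to define.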

Then, after confirming that our new verification system caught this issue proactively, we were able to resolve the actual global leak by using a new version of redux-persist.

Posted May 26, 2016 - 10:45 EDT

Resolved
For a period of about one hour starting around 12:20pm PST, the deployed chatbox JavaScript code contained an external dependency overriding the "_" variable in the global namespace, affecting some websites using the Underscore library. We rolled back the changes immediately, and as of the rollback all new downloads are no longer affected by this issue. Previously loaded chatboxes may have contained a cached copy for up to 2 hours. Although we have safeguards in place to prevent this, today's issue was caused by a gap in automated testing of final builds. Future builds will now run an additional simulated runtime check for all globals before allowing a deploy.
Posted May 16, 2016 - 21:53 EDT