How One Missing `var` Ruined our Launch

  Well, that was a veritable shitstorm (sorry for the language).  Long story short, Safe Shepherd was featured today on TechCrunch (along with other 500Startups companies, also on VentureBeat, Forbes, …) and everything broke all at once.  Every. little. thing.  We had rolled out a huge change to MelonCard over the last few days to make our site a seamless “everything just updates” look-good / feel-good product using NodeJS long-polling with a slick KnockoutJS dynamic jQuery Templates front end.  We did our due diligence of manual and unit testing, mixed with a full suite of Vows for Node.  All systems check, full steam ahead, right?  Not so fast.

  Our system in NodeJS takes input from a user describing his state, say “I am waiting on these two records to be updated,” and the server (based on a timestamp check) returns back either “Your records are up to date” or “Record xxx has changed to yyy.”  (It’s actually a bit more complicated with Redis shared variables, sessions and other security checks to interface between Rails, mySQL, Redis, and Node).  It’s beautiful simple, but even simple code in NodeJS can be hell when things don’t go quite as planned.  That happened today.

  After the articles dropped today, we faced a moderate load of excited users (let’s say 50-100 new users over an hour period).  All of a sudden, everything went haywire.  No pages worked any more; our inboxes started to fill up with “Your product just hangs.”  I grabbed a coffee and readied for battle.

  My first thoughts: NodeJS handles load well; it’s what it’s known for.  Fifty or one hundred users couldn’t possibly take down my system.  And in support of Ryan Dahl, it wasn’t Node’s fault, per se (more on that later).  The server started returning ridiculous results.  The user would say “I have records a, b and c” and the server would say “You idiot, delete records x, y and z, and here’s records a, b and c.”  Even getting this issue specifically scoped out and reproducible was nigh on impossible with Node’s poor error handling and debugging.  The following unix command came up a lot (and yeah, I did it on production):

NODE_ENV=’production’ node/privacy.js | grep “Returned results”

You can imagine how horrendous trying to sort out those results had been.  That said, everything on Staging and unit testing was still working perfectly, and I had not a leg to stand on.  On top of that, again, our system does some extensive session checking (for security), and so as upset users logged in and out on different tabs, there were a ton of “Unauthorized” errors bubbling up (that muddled any attempt to see if real errors were being hit).  When I did get errors listed, they looked like this:

Trace:   at EventEmitter.<anonymous> (/—/node/privacy.js:118:11)  at EventEmitter.emit (events.js:81:20)

The offending line (or the only one Node would report back to me):

process.on(‘uncaughtException’, function (err) {   console.log([‘Caught exception’, err]);   console.trace(); });

  At least the app wouldn’t crash, but again: not a single damn leg to stand on.  I would posit here that nothing I could do in best practice (manual front-end testing, unit testing, error handling, etc.) would have caught the offending line.  Yes, I should have load testing, but I wouldn’t feel secure that it would catch the race condition that had ensued.
  After digging around about 4 hours (and putting my page into 503 – Temporarily Unavailable, and while my co-founder responded personally to every upset or inquisitive e-mail and reached out to all of our new users with an apology), I realized the server was confusing my request’s input (request params) with another random users request params.  I’ll say right now to be clear, the server was designed to and did only return YOUR information for YOUR request, but it got confused what YOU had asked for.  That is, you said “I like apples and melons” and it said “No silly, you’re a mango kind of guy.”  So, everything was safe (security-wise), but still plain wrong.  Why would my ExpressJS server get confused about what I had asked.  I started digging and found this:

app.all(‘/apps/:user_id/status’, function(req, res, next) {

   // …

  initial = extractVariables(req.body);

});

Seem kinda bad?  It’s a damn tragedy.  I’m not an expert in JavaScript, but let me explain it as best I can (or maybe see here).  In JavaScript, you declare variables either in the function scope or the global scope (with some intricacies on walking up the scope chain for declaration / reference until global scope).  When I created initial without using var, it walked up the scope chain until it quickly reached global scope and created a global variable initial.  When the next request came in, it walked up the same chain and wrote over the variable (the one the other request was still hoping to use).  And so on and so on for each request.  When the server tried to respond to each request after some later processing, it read from that constantly-being-truncated variable, and returned cruddy results.  Sheer crap.  I had needed to write this:

var initial = extractVariables(req.body);

  That would have created the variable in the scope of my anonymous function, and thus there would be no chance of later requests truncating that value.  It was an amateur mistake, but one that was completely invisible to any debugging or testing I could have done that didn’t explicitly look for race / concurrency issues.

  Now is the point I’d like you to say: you should have used CoffeeScript, and hey TameJS while you’re at it.  You’d be right.  I wanted to know the exact methods of NodeJS for my first run, but this was an unnecessary loss for my company.  In other circumstances, it could have been much worse (what if I had misscoped the session variable?).  On top of that, the lack of really good error handling (in Rails, we track errors by backtrace and e-mail the team unique errors) and any real debugging (relying on grep and less) made me feel like I was in the land before development being nice.  Or maybe, I should just have been more careful.

  After 4 hours of down time and several hundred upset users, I had found the issue and quickly pushed our fix to production.  The clouds cleared up, birds flew in the sky and the sun came out.  We began to reach back out to users with an apology, counted the loss, and moved on.  But, it still makes me feel uneasy what damage was caused by missing a single keyword.  Am I a bad person for missing one var?

-Geoff