Now that it's been a few months since we built JavaScript Battle, I thought it might be interesting to list out some of JS Battle's major bugs/challenges: the causes, the effects, and how we handled each.
Timeline
To start things off, I'll review the JS Battle timeline:
In August 2014, three other software engineers and I built JavaScript Battle (read about the process here).
In September 2014, I realized we wouldn't be able to scale to a large number of users on our existing database (MongoLab's "Sandbox"), so I deployed a large Mongo replica set on Azure to handle more battles and more users. (Note: it would have been far easier to simply upgrade our MongoLab subscription, but also far more expensive. I decided that deploying my own Mongo replica set on Azure would be an educational and MUCH more economical solution.)
A larger database turned out to be a good idea (though the decision to deploy my own, perhaps not, as you'll see): We posted to Hacker News in October 2014 and were lucky enough to end up on the front page.
After ending up on the HN front page, we got a lot of new users (800+ as of this writing), and got to test our app at a much larger scale. Things went well for the most part, but, needless to say, there were a few big challenges along the way:
Bug/Challenge #1: The daily battles and the leaderboard occasionally wouldn't load
The problem:
Shortly after our successful "release" on Hacker News, we started to notice that our site would (very occasionally) stop loading the "battle replay" and "leaderboard" sections.
The temporary solution:
At the time, we solved this temporarily by restarting the app whenever this happened. Obviously, this wasn't a good long-term solution, but it kept things running smoothly until we could find some free time to investigate.
The cause:
After some testing, I realized that the ultimate cause was that our site's connection to our Mongo Replica Set on Azure was "going down" and NOT automatically reconnecting (even with the appropriate options set for the MongoDB package used in our application).
The long-term solution:
After a large number of attempted solutions (most related to setting different timeout and connection options in our app), I finally concluded that there was no practical way to prevent the connection from occasionally dropping. Instead, I wrote a "safer", promise-based database connection wrapper. This didn't fix the dropped connections per se, but it DID handle them gracefully, which meant we no longer had to monitor the application manually.
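To give a feel for the approach, here's a simplified sketch (not the actual JS Battle code; names like safeQuery and MONGO_URL are illustrative): every query goes through a small module that caches the connection and, if a call fails because the connection has dropped, reconnects and retries once.

```javascript
// Simplified sketch of a promise-based "safe" connection wrapper
// (illustrative names; not the actual JS Battle implementation)
var MongoClient = require('mongodb').MongoClient;

var MONGO_URL = process.env.MONGO_URL; // replica set connection string
var cachedDb = null;                   // most recent (hopefully healthy) connection

// Open a new connection and cache it
function connect() {
  return new Promise(function (resolve, reject) {
    MongoClient.connect(MONGO_URL, function (err, db) {
      if (err) { return reject(err); }
      cachedDb = db;
      resolve(db);
    });
  });
}

// Hand back the cached connection, or open a fresh one if we don't have one
function getDb() {
  return cachedDb ? Promise.resolve(cachedDb) : connect();
}

// Run a query through the wrapper; if it fails (e.g. the connection dropped),
// throw away the cached connection, reconnect, and retry once
function safeQuery(runQuery) {
  return getDb()
    .then(runQuery)
    .catch(function (err) {
      cachedDb = null;
      return connect().then(runQuery);
    });
}

module.exports = { safeQuery: safeQuery };
```

Any route that needs the database calls safeQuery() with a function that receives the db handle and returns a promise for the result.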
Not perfect, but: PROBLEM SOLVED!
Bug/Challenge #2: Very occasionally, our battles would end in errors instead of finishing correctly
The problem:
This one popped up as we got more users. It wasn't an issue immediately, even at relatively high user counts, but eventually we were running enough daily battles that at least one battle per day wouldn't finish correctly.
The cause:
It turned out that our "gameRunner" (what we call the script that resolves the daily battles) was (you guessed it!) losing connection to our Mongo Replica Set on Azure.
The solution:
Given that I already knew this could happen, it was fairly easy to fix: I simply added the "safer" promise-based database connection wrapper to our gameRunner.
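For illustration, a call inside the gameRunner might then look roughly like this (the collection name, the query, and the runBattles function are hypothetical, as is the ./safeDb module path):

```javascript
// Hypothetical gameRunner query routed through the safe wrapper
var safeQuery = require('./safeDb').safeQuery;

safeQuery(function (db) {
  return new Promise(function (resolve, reject) {
    db.collection('battles')
      .find({ finished: false })
      .toArray(function (err, battles) {
        return err ? reject(err) : resolve(battles);
      });
  });
}).then(function (battles) {
  // resolve each of today's unfinished battles
  return runBattles(battles);
});
```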
PROBLEM SOLVED!
Bug/Challenge #3: As our user-base grew, our last few daily battles never finished running
The problem:
Each day, our "gameRunner" script randomly assigns each user to a long list of battles that will be happening that day (the architecture of the app is discussed in more detail here).
I noticed that each day there were a few games at the very end of the list that never seemed to finish.
The cause:
As our database grew, one particular query in our gameRunner was becoming more and more expensive, to the point that the battles wouldn't finish in a timely manner.
The solution:
Once the issue was identified, this one was easy: I simply added the appropriate index to our Mongo database, and the query started to run lightning-fast.
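For context, the fix amounted to something like the following in the mongo shell (the collection and field names here are guesses, not the actual JS Battle schema):

```javascript
// Add an index on the field the slow gameRunner query filters by
db.games.ensureIndex({ userId: 1 });

// Sanity check: the query plan should now show an index scan
// rather than a full collection scan
db.games.find({ userId: "<some user id>" }).explain();
```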
PROBLEM SOLVED!
Bug/Challenge #4: Users were occasionally unable to log in
The problem:
Occasionally, users are unable to log in to our site. Strangely, it doesn't seem to happen as often as we might expect (given the cause), which is good, obviously, but it's also the reason this bug took a back seat to some of the others.
The cause:
Drum roll please...it was the database again. When you log into our site, we pull your user info using Mongoose (as opposed to the node MongoDB client, which we use for virtually every other database interaction).
For whatever reason, dropped connections seemed to be less of a problem for Mongoose (I imagine it does something clever that the MongoDB client doesn't), but they WERE still an issue.
The solution:
Unfortunately, this particular problem was NOT immediately solvable with the "safe MongoDB wrapper" code, since that wrapper only covers connections made through the node MongoDB client, not through Mongoose.
So this one, currently, has NOT been resolved. Likely the easiest solution (and the one I plan to implement as soon as I find the time) is to switch everything on our site over to the node MongoDB client, which would solve the issue because I could then easily use the "safe" wrapper for the login queries as well. This may be better anyway, as it will make our site more consistent and potentially easier to follow.
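Roughly, the plan is to route the login lookup through the same wrapper, something like this (the collection name, query field, and findUser function are all hypothetical):

```javascript
// Hypothetical sketch: the login lookup, moved off Mongoose and onto
// the node MongoDB client via the safe wrapper
var safeQuery = require('./safeDb').safeQuery;

function findUser(userId) {
  return safeQuery(function (db) {
    return new Promise(function (resolve, reject) {
      db.collection('users').findOne({ _id: userId }, function (err, user) {
        return err ? reject(err) : resolve(user);
      });
    });
  });
}
```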
PROBLEM (not yet) SOLVED!
Bug/Challenge #5: "THE BIG ONE"...Microsoft Azure "lost" the VMs that held our Mongo Replica Set. (This one was a HUGE issue).
The problem:
If you've tried to get on our website in the last 2ish weeks, you may have noticed that there was no website. If you looked closer, the only explanation was a cryptic "500" server error in the console.
The cause:
Due to a switch in the billing set-up on Azure, our Mongo VMs were automatically shut down for a short period of time. No problem.
However, when I attempted to restart the VMs on Azure, there was an issue: I couldn't start, shut down, restart, or interact in any way with my VMs (including SSHing in, connecting via Mongo, etc.). Azure returned an error along the lines of "we couldn't find anything there". For all intents and purposes, my VMs had completely disappeared; as far as I could tell, Azure had "lost" them.
The result of all this was that, with no VMs to run our Mongo Replica Set, we had no database, which in turn meant that our app couldn't connect, and, therefore, couldn't load.
To make matters worse, this happened right in the middle of both my finals week at grad school AND the end/delivery period for a "side project" consulting gig. So deleting and recreating the VMs was out of the question time-wise (not to mention it wasn't something I wanted to do, given that the problem would fix itself if Azure "found" the VMs again).
The solution:
The good news is that this is now fixed. I still don't know why the VMs went MIA after the billing switch (presumably, someone at Azure has some idea), but after ~2 weeks, the tech support team at Azure was able to resolve the issue and bring the VMs back online. I started them up again yesterday, everything worked (no "VM not found" error messages), and I'm VERY pleased that JS Battle is officially back in action.
It's a big relief to get this, by far JS Battle's most significant issue to date, behind me.
Conclusions
As might be expected with any new application, we encountered our share of issues. It just so happened that almost all of ours were caused by, or related to, the database. And, sadly, given the timing and the lack of control we had over the last issue, our site was down for QUITE a while, which has significantly impacted the number of active users on JS Battle.
In the future (and certainly for any "official" applications, consulting projects, etc.), I plan to stick with managed database hosting (I used Compose with Heroku for a consulting gig late this fall, and that combo is REALLY easy to set up and very affordable).
At any rate, despite the bugs, this was a really fun application to build, and (partly because of the bugs) a very good learning experience. Check it out here.