Hammers and nails: selecting a datastore

For many years, particularly for web applications, the only real option was the ol’ RDBMS. Their primary query language, SQL, is possibly taught more than any other language. Recently, a number of alternatives to the traditional RDBMS, collectively termed NoSQL, have become rather popular. Boasting horizontal scalability and incredible speed, open-source databases like MongoDB, CouchDB, Cassandra, Riak, etc, have attracted much curiosity within the development community, and more projects are started with a NoSQL backend rather than a traditional RDBMS like MySQL.

I wish I could remember the exact quote, but my friend Kevin once said something along the lines of, “every time I go to a hammer conference, everything looks like a nail project.” With all the neat new datastores out there, it’s easy to find a reason to use one. Indeed, some NoSQL fanboys take it a bit too far. So, what datastore should you use for your project? Below is a series of musings I’ve gleaned from a number of conversations in the past few months. As a disclaimer, most of these musings are anecdotal.

RDBMSes are a huge pain to scale. One of the core bits of a relational model is a join, and scaling a join beyond a single machine just doesn’t work. NoSQL databases, by definition, don’t do joins, so they don’t have this problem. Scaling a NoSQL database is just a matter of sharding the keyspace, which is relatively-natural given the structure of NoSQL databases, and some already manage the keyspace for you (e.g. Cassandra).
RDBMSes are incredibly stable. These things have been around for many years, and consequently, their performance has been well-studied and optimized. For the most part, assuming you’re good about defining your data models and not trying to do too much (like a complicated replication scheme), they just work. A lot of NoSQL projects, on the other hand, particularly newer ones, can be unstable or at least inconsistent. Why is it that 999 writes happen in <= 20ms, but the 1000th takes 4 seconds? Another potential issue is the fact these projects aren’t as mature as many SQL-based databases. Bugs exist, and there’s nothing more frustrating than realizing that a particular problem is not your fault but your underlying datastore’s.
NoSQL databases are fantastic for new projects. My favorite bit about CouchDB (my NoSQL database of choice) is that you don’t have to define a schema before you can start adding data. One of the most annoying things about a SQL database is setting up the tables you’d like to use and the relationships among them. I don’t care if the identification numbers are BIGINTs or INTs. I don’t care if my unique key is less than a certain number of bytes. I’m just making a silly little webapp. I want to figure that stuff out later, if at all! CouchDB just stores arbitrary JSON, so I can just add little blobs of data and push off deciding on a schema. Such a model allows incredibly fast iteration, particularly on new projects where you haven’t quite figured out what you want yet.
NoSQL databases require an expert. As I mentioned earlier, many NoSQL projects are not very mature. Consequently, if you’re going to deploy one in production, you need someone who knows the project and its history top to bottom. With patches and new features often come breaking changes and potential data corruption, and you need someone full-time thinking about your datastores and how to maintain them. Typically, these experts are few and far between, and if they leave, you’re hosed. In contrast, operations folks who can manage MySQL databases are plentiful, and if one quits, you can find another without much trouble. There are also only so many unique RDBMS deployments, whereas a NoSQL deployment is naturally going to be unique to your particular situation.

Obviously, there are caveats and counterexamples to everything I’ve said, and I’m fairly certain that given the right audience, I will have sparked a flamewar and offended everyone. Again, these are musings, not proofs, and this is a blogpost, not a research paper. Long story short, and this is going to sound like a cop out, but if your project is a screw, use a screwdriver. If it’s a nail, use a hammer. If convenience and ease-of-use is a priority (at the potential cost of consistency and/or data loss), then go ahead and use a NoSQL datastore. If you don’t want to think too hard about maintaining your datastore or worry about data corruption, go with an RDBMS. In general, I’d say that unless your data model really, really, really fits one of the NoSQL models, you won’t be utilizing the efficiency benefits of such a database, and you’re better of just going with a SQL database and avoiding the extra pain of maintaining NoSQL.

Companies like Google, Facebook, and Amazon are big enough that they require particular datastores to meet their scalability and capacity needs. They’ve invested a lot of time and money into building and maintaining databases that work. Many of these data models have been open-sourced, but just because you have the code doesn’t mean you’re capable of maintaining it in a 99% uptime production environment. As these new datastores mature, so will the discussion about which datastore to use, but for now, I’ll likely err on the side of project maturity for anything where long-term reliability is critical.

Flamewar, commence!