So, you have moved past the question of whether to use a traditional RDBMS (SQL) store and have decided NoSQL is the way to go. You may even have read my previous post, To Relate Or Not, when making this decision. Now what do you do?
First Of All, Why NoSQL?
Much of the chatter I hear these days is around NoSQL. “My boss/architect says we should be using NoSQL for this project. I don’t understand why.” Or, “I want to use NoSQL for xyz, but I don’t even know where to start!” This is primarily because, although this is changing rapidly, NoSQL is still a fairly foreign concept to most developers and architects. I am fortunate enough to work for a company, and on a team, that is constantly looking ahead to advancements that can improve our systems and make our product more effective, so to me it seems like NoSQL has been around long enough that most people should be very familiar with it. But, honestly, while the concept has been around more or less since the dawn of computing, the NoSQL buzzword came along and lit up the minds of developers and technologists just a few short years ago.
“So, if the RDBMSs that we have all grown to know so well work, why do we need to introduce something new?” Well, first let’s address that question. Yes, we have all become very familiar, if not intimate, with an RDBMS over the years. We are very comfortable with them and know how to make them store data and how to get that data back out when we need to. However, I do take issue with the “work” part of that question. Over the years we have learned all kinds of tricks to make an RDBMS fit our needs, but the result is often convoluted and complicated, and it comes at some expense. That expense can be literal, due to the cost of scaling up hardware to meet performance demands, or mental and emotional, as in the mental gymnastics you often need to perform to understand the system and implement a solution against it. Often it is both.
RDBMSs come from a time when the thought of storing terabytes of data was unheard of. Today, that is often just the entry point for many data-driven applications. Layer on top of that the fact that we now need to develop systems with an eye toward a global audience, meaning global distribution, replication and reliability, and we are well outside the original purview of the RDBMS.
Now, don’t get me wrong, I definitely feel that relational data has great value in the business world. And, on my last point about global distribution, great strides have been made to make that less of an issue for RDBMSs. (See MariaDB for a nice example.) However, the hoops we have been jumping through to make them work for us in every situation are simply no longer necessary. We live in a world of persistence choices. Choose the one that fits your needs best and run with it.
So Many Choices
The NoSQL world has exploded in recent years, and you have many, many choices. There are options geared toward gargantuan write speeds, lightning-fast reads, scalability, reliability, just about anything. And that, in my humble opinion, is both the bane and the beauty of the NoSQL world.
Which One Should I Choose?
As I mentioned previously, you should evaluate your needs and choose the solution that fits best. Easier said than done, right? Yeah, well, you’re right, especially if you are new to the arena. So, let me share a bit of my experience and hopefully that will help.
First, let me say that this post is already going to be too long. So, I am going to narrow the scope to the two front-runners of the NoSQL world at the time of this writing: Cassandra and MongoDB. Between them, they can fit most business needs. (They also happen to be the two on my company’s “approved technologies” list!)
The first question you need to ask yourself is: what does my data look like? Or, if you are working on a greenfield project: what, at the very minimum, do you expect your data to look like? Here are the questions I like to start with:
- What is the “shape” of the data? (Contact info? Sales transactions? User activity?)
- How many different types of data? (See first bullet-point.)
- What volume do you anticipate? (It is usually best to overestimate here. You’ll be surprised.)
- Do you anticipate the load to be read-intensive, write-intensive or mixed?
- What size is the data you are storing? (By this I mean the individual bits of data.)
Then ask, “What kind of data are you storing?” It will usually fall into one of these categories:
- Reference data: Contact information, billing information, etc.
- Transactional data: Banking, sales, tests, etc.
- Activity: User behavior
|Characteristic|Applies? (yes/no)|
|High write volume| |
|Large number of reads| |
|Large data objects| |
Go ahead and fill out the above spreadsheet to the best of your ability. Combined with your earlier analysis of the type of data you expect, it will get you most of the way to your decision.
Evaluating The Checklist
If you are expecting your system to handle a large number of writes, then you would likely be steered toward Cassandra. (This is obviously relative, but I like to first ask whether I expect the system to be primarily recording data and reading it infrequently.) This is really Cassandra’s historical “sweet spot”. You probably already know this.
On the flip side, if you are expecting to write infrequently but read a lot, as is the case for contact information, MongoDB does have an out-of-the-box advantage. However, as read load increases, so does read latency in MongoDB.
MongoDB will also give you an advantage when it comes to complex, dynamic queries on existing datasets. Mongo allows you to think less about the structure of your data up-front and decide how you want to retrieve that data later.
Large data objects are not really the forte of either of these databases. However, both have options that allow for chunking of large objects. With Cassandra you have Astyanax. With MongoDB you have the option of going with GridFS. I have not personally used either, but I have heard and read good things about both.
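If “chunking” is a new idea to you, here is a minimal sketch of it in plain Python. It does not use Astyanax or GridFS, and it is not how either of them is implemented. The names split_into_chunks and reassemble, and the (sequence number, payload) layout, are my own assumptions for illustration. The point is only that a large object gets stored as many small, numbered pieces that can be put back together in order.

```python
from typing import Iterable, List, Tuple

# Hypothetical layout: each chunk is (sequence number, payload).
Chunk = Tuple[int, bytes]


def split_into_chunks(data: bytes, chunk_size: int) -> List[Chunk]:
    """Split one large object into numbered pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (offset // chunk_size, data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]


def reassemble(chunks: Iterable[Chunk]) -> bytes:
    """Rebuild the original object, even if the chunks arrive out of order."""
    ordered = sorted(chunks, key=lambda chunk: chunk[0])
    return b"".join(payload for _, payload in ordered)
```

In a real store each chunk would be its own row or document, keyed by the object’s id plus the sequence number, so no single write or read has to carry the whole object.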
Lastly, if true ACID-compliant transactions are what you are looking for, you probably don’t want a NoSQL solution to begin with and should probably go back and read my post To Relate Or Not. That said, if you are willing to loosen the reins a bit on strict ACIDity, either of these solutions can provide you with a pretty high level of data consistency. And MongoDB does provide atomic operations at the document level. [See here]
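The rules of thumb in this section can be written down as a rough sketch in Python. The Workload class, its field names and the suggest_store function are all hypothetical names I made up for this post, and the ordering of the checks is my own simplification. Treat it as a conversation starter, not a decision engine.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical checklist answers; the field names are illustrative only."""
    high_write_volume: bool = False
    large_number_of_reads: bool = False
    large_data_objects: bool = False
    complex_dynamic_queries: bool = False
    needs_acid_transactions: bool = False


def suggest_store(workload: Workload) -> str:
    """Return a starting suggestion based on the rules of thumb in this post."""
    # Strict ACID requirements point back to a relational store.
    if workload.needs_acid_transactions:
        return "RDBMS"
    # Write-heavy workloads are Cassandra's historical sweet spot.
    if workload.high_write_volume:
        return "Cassandra"
    # Read-mostly data and ad hoc queries favor MongoDB out of the box.
    if workload.large_number_of_reads or workload.complex_dynamic_queries:
        return "MongoDB"
    # large_data_objects does not tip the scale: neither store favors them,
    # and both rely on chunking.
    return "either"
```

For example, suggest_store(Workload(high_write_volume=True)) lands on Cassandra, while a read-mostly contact-information store lands on MongoDB. The considerations that follow (availability, replication, maintenance) are deliberately left out of this sketch.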
As I mentioned previously, while the type of data you are storing and the patterns of usage will and should be your first consideration when choosing a NoSQL solution, there really are other considerations you need to account for. Just a few examples:
- What are your requirements for availability?
- Do you anticipate requiring multi-datacenter or multi-region replication?
- What is your plan for maintaining your data solution(s)?
I can say from my own experience and that of my close colleagues that when it comes to high availability, nothing currently beats Cassandra. It is also the only solution I have come across that allows for relatively seamless cross-datacenter replication of clusters. (Other solutions, like Riak, provide this at a cost.)
One often overlooked aspect of this whole picture is the cost of maintaining your NoSQL solution. If you are just looking at a few servers in one data center or availability zone (AZ), this may not be much of an issue. As you begin scaling out, you will find it becoming more and more of a burden on your team. In my experience, the maintenance costs of a MongoDB cluster are likely to escalate at a much greater pace than those of a Cassandra cluster. And, if you decide that you need to scale to multiple datacenters or regions, this cost can become fairly astronomical. In our case we needed to hire a dedicated team of experts as well as consultants from 10gen. As for Cassandra, we are currently running several clusters across regions and zones, and these are fairly easily maintained by the development teams. These clusters are closely monitored via various tools and we rarely, very rarely, have any issues that require manual intervention.
My Obvious Bias
By now, I’m sure you can tell that I feel Cassandra is the superior solution for almost any application that requires the benefits of a NoSQL database. That said, I don’t want to discount how great I think MongoDB can be. I use it frequently for quick proofs of concept and small internal applications that will never require the kind of scalability the majority of my work demands. Not surprisingly, I particularly enjoy working with MongoDB when writing in Node.js. They go together like peanut butter and jelly, and they make the creation of full-stack applications quick and painless. But watch out if that application turns out to be a big hit!
To quickly summarize, both MongoDB and Cassandra offer excellent out-of-the-box solutions to different problems. However, I believe that given the demands of today’s globally distributed world of applications, the best solution for most applications is going to be Cassandra. Yes, there is going to be a bit more up-front work required, particularly if you are writing a system that is more read-intensive than write-intensive. Out of the box, this is not what Cassandra is designed for. However, with a little thought as to how you design your data, thinking first of how it will be accessed and queried, you can achieve great performance on both reads and writes.
Again, all of this is based entirely on my own personal experience. I work in an arena where availability, scalability and global distribution are paramount. This may not be the case for you. Use the evaluation tools above fairly and choose what fits your needs best. However, I can say that you are unlikely to ever be sorry you chose Cassandra. And you may very well be a hero for doing so.
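To show what I mean by designing your data around how it will be queried, here is a small in-memory sketch in Python. It does not talk to Cassandra at all. The ActivityStore class, its method names and the two dictionaries are hypothetical stand-ins for tables. The idea is that each write is copied into one structure per query you plan to run, so every read becomes a single lookup by key.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class ActivityStore:
    """In-memory stand-in for two query-specific tables (illustrative only)."""

    def __init__(self) -> None:
        # One structure per planned query: "by user" and "by day".
        self.by_user: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)
        self.by_day: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def record(self, user_id: str, day: str, action: str) -> None:
        # Pay a little extra at write time: store the event once per query shape.
        self.by_user[user_id].append((day, action))
        self.by_day[day].append((user_id, action))

    def actions_for_user(self, user_id: str) -> List[Tuple[str, str]]:
        # A single key lookup, no scanning or joining.
        return list(self.by_user.get(user_id, []))

    def actions_on_day(self, day: str) -> List[Tuple[str, str]]:
        return list(self.by_day.get(day, []))
```

The trade-off is duplicated data and more writes, which is usually an acceptable price in a store that is built for writes in the first place.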