So, you have surpassed the decision of whether to choose a traditional RDBMS (SQL) store and have decided NoSQL is the way to go. You may have even read my previous post To Relate Or Not when making this decision. Now what do you do?
First Of All, Why NoSQL?
Much of the chatter I hear these days is around NoSQL. “My boss/architect says we should be using NoSQL for this project. I don’t understand why.” Or, “I want to use NoQL for xyz, but I don’t even know where to start!” This is primarily because, although this is changing rapidly, it is still a fairly foreign concept to most developers and architects. To someone like myself that is fortunate (in my opinion) enough to work for a company and on a team that is constantly looking forward to what advancements are around the corner that can improve our systems and increase the efficacy of our product, it seems like it’s been around for long enough that most people should be very familiar. But, honesty, while the concept has been around more-or-less since the dawn of computing, the NoSQL buzzword came-along and lit-up the minds of developers and technologists just a few short years ago.
“So, if the RDBMS’s that we have all grown to know so well work, why do we need to introduce something new?” Well, first let’s address that question. Yes, we have all become very familiar, if not intimate, with a RDBMS over the years. We are very comfortable and know how to make them store data and how to get that data back out when we need to. However, I do take issue with the “work so well” part of that question. We have, over the years learned all kinds of tricks to make a RDBMS fit our needs. But, it often is convoluted, complicated and comes at some expense. Either literal due to the cost of scaling-up hardware to meet performance demands, or mental and emotional as in the mental gymnastics you often need to perform to understand it and implement a solution against it. And it often is both.
RDBMS’s come from a time when the thought of storing terabytes of data was unheard of. Today, that is often just the entry-point for many data-driven applications. Then, you layer on-top of that the fact that we now need to develop of systems with an eye toward a global audience, meaning global distribution, replication and reliability. We are now well out-of bounds of the original purvey of the RDBMS.
Now, don’t get me wrong, I definitely feel that relational data has great value in the business world. And, to my last point of global distribution, there have been great strides made to make that less of an issue for RDBMS’s. (See MariaDb for a nice example) However, the hurdles we have been jumping through to make them work for us in all situations is just no-longer necessary. We live in a world of persistence choices. Choose the one that fits your needs best and run with it.
So Many Choices
The NoSQL world has exploded in recent years. And you have many many choices. There are options geared toward gargantuan write speeds, lighting fast reads, scalability, reliability, just about anything. And that, in my humble opinion, is both the bane and the beauty of the NoSQL world.
Which One Should I Choose?
As I mentioned previously, you should evaluate your needs and choose the solution that fits best. Easier said than done right? Yeah, well, you’re right. Especially if you are new to the arena. So, let me share a bit of my experience and hopefully that will help.
First, let me say that this post is already going to me too long. So, I am going to narrow the scope to the two front-runners of the NoSQL world at the time of this writing. Cassandra and MongoDB. Between them, they can fit most business needs. (They also happen to be the two on my companies “approved technologies” list!)
The first question you need to ask yourself is, what does my data look like? Or, if you are working on green fields then, what, at the very minimum, do you expect your data to look like? The first question that I like to ask is “What kind of data are you storing?”
- What is the “shape” of the data? (Contact info? Sales transactions? User activity?)
- How many different types of data? (See first bullet-point.)
- What volume do you anticipate? (It is usually best to overestimate here. You’ll be surprised.)
- Do you anticipate the load to be read-intensive, write-intensive or mixed?
- What size is the data you are storing? (By this I mean the individual bits of data.)
The answers to the above questions can get you most of the way to your chosen solution. So, let’s examine that more closely.
What kind of data are you storing?
Ok, so there are many, many types of data out there. But they all tend to boil-down to a few types. These are just my own buckets.
- Reference data: Contact information, billing information, etc.
- Transactional data: Banking, sales, tests etc.
- Activity: User behavior
Reference data tends to be written infrequently but read often. It can be fairly complex with many different relationships (borrowing a term from the RDBMS world). You will also often need to look it up by various means. (Say, in the case of contact information, first name or telephone number, or…)
Transactional data, on the other hand, tends to be written frequently, and read less-frequently. It may be complex as in several operations make-up one transaction. But, it is fairly flat data that is, most-often, retrieved via some key like a transition number or order id.
Activity data is the new kid on the block. This type of data is often what constitutes everybody’s favorite buzzword “Big Data”. You are collecting massive amounts of data to attempt to mine it for trends. Trends that may help you present a better user experience. Or, to be completely honest, you hope that data will ultimately make you more money. this data is, almost without exception in my experience, unstructured. High volume and velocity. So, write-intensive. And read, for the most part (I’ll mention exceptions later) read in batches.
Now I Know My Data. Can I Just Pick A Solution Already?
Once you have identified the type or types of data you expect to be working with, you can begin to understand what kind of NoSQL solution will best fit your needs.
Below is a checklist that I often use to ell narrow this even further.
|High write volume
|Large number of reads
|Large data objects
Go ahead and fill-out the above spreadsheet to the best of your ability. The combination of this and your previous analysis of the type of data you expect will get you most of the way to your decision.
Evaluating The Checklist
If you are expecting your system to have a large number of writes, (this is obviously relative but I like to think first whether I expect it to be primarily recording data and reading infrequently) then you would likely be steered to cassandra. This is really cassandra’s historical “sweet spot”. You probably already know this.
On the flip-side, if you are expecting to write infrequently but read a lot, as in the case for contact information, MongoDB does have an out-of-thebox advantage here. However, as read load increases, so does read latency in MongoDB.
MongoDB will also give you an advantage when it comes to complex, dynamic queries on existing datasets. Mongo allows you to think less about the structure of your data up-front and decide how you want to retrieve that data later.
Large data objects are not really the forte of either of these databases. However, they do both have options that allow for chunking of large objects. With cassandra you have astyanax. In Mongo you have the option of going with GridFS. I have not personally used either. However, I have heard and read good things about both.
Lastly, if true ACID-compliant transactions are what you are looking for, you probably don’t want aNoSQL solution to begin with and should probably go back and read my post To Relate or Not. That said, if you are willing to loosen the reigns a bit on scrict ACIDity, either of these soltuions can provide you with a pretty high level of data consistency. And MongoDB does provide atomic transaction at a document level. [See here]
As I mentioned previously, while the type of data you are storing and the patterns of usage will and should be your first consideration when choosing a NoSQL solution, there really are other considerations you need to account for. Just a few examples:
- What are your requirements for availability?
- Do you anticipate requiring multi-dc or region replication?
- What is your plan for maintaining your data solution(s)?
I can say from experience and the experience of my close colleagues that when it comes to high-availability, nothing currently beats cassandra. And it is the only solution that I have come-across that allows for relatively seamless cross data center replication of clusters. (Other solutions, like Riak, provide this at a cost)
One often overlooked aspect of this whole picture is the cost of maintaining your NoSQL solution. If you are just looking at a few servers in one data center or AZ, this may not be much of an issue. As you begin scaling-out, you will find this becoming more-and-more of a burden on your team. I can say that the maintenance costs of a MongoDB cluster are likely to escalate at a much greater pace. And, if you decide that you need to scale to multiple datacenters/regions, this cost can become fairly astronomical. In our case we needed to hire a dedicated team of experts as well as consultants form 10gen. As for cassandra, we are currently running several instances cross-region and zone and these are fairly easily maintained by the development teams. These clusters are closely monitored via various tools and we rarely, very rarely, have any issues that require manual intervention.
My obvious bias
By now, I’m sure you can tell that I feel that cassandra is the superior solution for most any application you plan to implement that requires the benefits of a NoSQL database. That said, I don’t want to discount how great I think that MongoDB can be. I use it frequently for quick proof-of-concepts and small internal applications that will never require the kind of scalability the majority of my work demands. Not surprisingly, I particularly enjoy working with MongoDB when writing in Node.js. They are like peanut butter and jelly. And make the creation of full-stack applications quick and painless. But, watch-out if that application turns-out to be a big hit!
To quickly summarize, both MongoDB and cassandra offer excellent solutions to different problems out-of-the box. However, I believe that given the demands of todays globally distributed world of applications, the best solution for most applications is going to be cassandra. Yes, there is going to be a bit more up-front work required. Particularly if you are writing a system that is more read-intensive than write intensive. Out-of-the-box, this is not what cassandra is designed for. However, with a little thought as to how you design you data, thinking first of how it will be accessed/queried, you can achieve great performance on both reads and writes.
Again, all of this is based entirely on my own personal experience. I work in an arena where availability, scalability and global distribution is tantamount. This may not be the case for you. Use my above evaluation tools fairly and choose what fits your needs best. However, I can say that you are unlikely to ever be sorry you chose cassandra. And you very well may be a hero for doing so.
Some interesting related links