Ok, this might seem like an odd title for a tech blog. But, let me explain. This is the second part of my series on current database choices at my organization and how to pick the “right one” for your application. (If you didn’t read the introduction, check-out Tackling a new world order for some background.) And in my mind, the decision for most developers today starts with whether they should stick with what they are familiar with. Which is in most cases a RDBMS like Oracle or SQL Server. Or, go with the new cool kid on the block. (i.e. NoSQL flavor of your choice.) To further explain the title; I tend to define the choice as between traditional RDBMSs and NoSQL. Rather than SQL or NoSQL. Mainly because I view SQL as a language to retrieve data, not necessarily how the database operates. (For right or wrong.)
Relational databases have been around for quite a while. As a result, they have evolved over time to solve many traditional organizational data problems very well. There are a plethora of tools to work with different RDBMS solutions that are currently either lacking or foreign in nature in the NoSQL realm. In addition, most relational databases that have survived, are very mature and well-vetted. You can count on them to “just work” in the way they are designed to be used. As a result, many developers choose to use relational stores because they are comfortable and reliably predictable. There is certainly some validity to this reasoning. However, this absolutely cannot be your only reasoning for making this choice in the world we (meaning software architects and developers) live in today.
The number one reason cited, and rightfully so, to choose an RDBMS over a NoSQL option is an absolute requirement for full ACID compliance. (if you need a crash course on ACID. See this wikipedia write-up.) Essentially, if you need your data to always be consistent across all requests, you are currently best served by an RDBMS solution. For example, an online purchase must debit the purchaser’s account prior to debiting the seller’s. Otherwise, you can create chaos.
Relational databases are also a great fit for small datasets that are likely to remain fairly stable. HR data is a commonly supplied example. No matter how fast your future start-up is likely to grow, it is not likely to grow at such a rate that you need the high-velocity read and write capabilities of a NoSQL option. In addition, the type or shape of HR data is highly unlikely to vary much over time.
In addition, relational databases allow you to decompose data into distinct parts to reduce duplication. (Best known as data normalization) This has several advantages beyond the simple fact that you can store your data in less space. Decomposing and normalizing data allows the developer to retrieve parts of data easily. And perform various bits of complex analysis on this parts. One increasingly popular use-case is to pump subsets of data extrapolated from other stores such as Hadoop into RDBMSs for slicing and dicing that data in ways made simple with the power of SQL.
Why not RDBMS?
The biggest driver for companies to move away from traditional RDBMS solutions is pure scale. In particular, with the rapid adoption of cloud technologies to provide elastic computing environments. Scaling relational data stores has traditionally meant vertically scaling. That is, adding more computing power to existing machines. In the world of the cloud, scaling is most-efficiently done by scaling-out. Adding more nodes to a cluster to distribute load. This has many advantages over traditional scale-up options. Most notably, that it is very difficult to add computing power, be it CPU cores, memory, faster drives, etc. to a server. It is very easy to add another node to a cluster. Particularly in the world of cloud computing where it can be as simple as a few mouse clicks. However, even in a datacenter, it is easier to add another, less expensive server and add it to an existing cluster.
This is not to say that scaling-out a relational database is impossible. It can and has been done. However, it is historically VERY difficult and prone to many issues.
The typical solution to scaling an RDBMS is called sharding. This is where you distribute the data across nodes based on a pre-determined key. The often-cited simplistic example is sharding users alphabetically by last name. So, given two nodes, you would have A-M on one node and N-Z on the other. However, this is really a bad example as names are not evenly distributed across the alphabet and likely will cause hot-spots. And, ironically, this common example illustrates how difficult it can be to choose the correct key to shard a database with. If you do find yourself in the situation where you need to shard your RDBMS, I hope you can find a decent key to shard on or, alternatively, can use an auto-shard solution that shards on a generated key. Otherwise, good luck! That said, I don’t want to make this blog post a dissertation on sharding so, I will leave that subject here. There are many, many articles available for you to research sharding further if you so choose.
The second reason that you might want to choose a NoSQL solution is the pure size of the data that you expect to be working with in your application. One of the greatest strengths of NoSQL databases are their ability to handle very high volumes of data. RDBMSs are bound by their dependence on ACID compliance. The I in ACID stands for Isolation. Which means that a transaction against a relational database requires a lock on the data which causes all other operations on that same piece of data to queue until the operation completes. In contrast, most NoQL databases adhere to the BASE model which favors availability, “soft state” and eventual consistency over the strict model of Consistency and isolation enforced by the ACID model.
NoSQL databases achieve this through highly-distributed models that allow reads and writes to succeed even in the face of high concurrency. The trade-off is that data consistency is relatively lax in that if one application is performing a modification to a record that another application is reading, there is a chance that the application reading the data will not get the most-current state of that data as defined by the state set by the application modifying the data. To keep the reading application from being blocked by the writing application, it will be directed to a different node. That node will be eventually updated with the modification made by the first application. But, there is still a period of time when the nodes state is out-of-sync. This would not occur in an ACID compliant system. However, in most use cases, this is perfectly fine.
One more common use-case for NoSQL data stores is where the schema of the data to be used is either not known in advance or expected to be variable. A major impetus behind the explosion of NoSQL databases, particularly among developers, is the fact that they tend to be “schemaless”. This means that a developer can begin developing a system without knowing the exact data requirements. This can greatly increase time-to-market as an application can be deployed with a subset of data and fields can be added as they are determined to be required. This flexibility is tremendously powerful and would be very painful if not impossible in the relational world.
Basically, any application could make either a relational or NoQL database work for them. However, you really should use the right tool for the job. Just because you may be comfortable with, or even an expert at, SQL and RDBMS data stores you should not choose it for your application if it doesn’t fit. Ultimately, if you make the wrong choice, you will pay for it. Now or later.
So, in summary, here is my basic advice for when to choose a relational database or NoSQL.
- Small, well-defined, dataset.
- Relatively small data velocity.
- Data absolutely must for consistent across all reads.
- No way around joins.
- Global scalability can be sacrificed or you have significant resources to manage a scaling option.
- Large volume of data.
- High velocity data ingestion or consumption.
- Multi-zone, multi-region scalability is required.
- Data can be modeled in such a way that joins are not required.
Examples: Activity streams, monitoring data, any data requiring global availability
Hopefully this helps add some clarity to the decision to be made between a traditional RDBMS or a NoSQL database. Obviously, there is not enough room for a comprehensive study of the subject. But this should give most users a good starting-point.
In the next part of this series I will compare Cassandra to MongoDB and explain use-cases for each of them.