So, you chose to go NoSQL?

So, you have surpassed the decision of whether to choose a traditional RDBMS (SQL) store and have decided NoSQL is the way to go. You may have even read my previous post To Relate Or Not when making this decision. Now what do you do?

First Of All, Why NoSQL?

Much of the chatter I hear these days is around NoSQL. “My boss/architect says we should be using NoSQL for this project. I don’t understand why.” Or, “I want to use NoQL for xyz, but I don’t even know where to start!” This is primarily because, although this is changing rapidly, it is still a fairly foreign concept to most developers and architects. To someone like myself that is fortunate (in my opinion) enough to work for a company and on a team that is constantly looking forward to what advancements are around the corner that can improve our systems and increase the efficacy of our product, it seems like it’s been around for long enough that most people should be very familiar. But, honesty, while the concept has been around more-or-less since the dawn of computing, the NoSQL buzzword came-along and lit-up the minds of developers and technologists just a few short years ago.

“So, if the RDBMS’s that we have all grown to know so well work, why do we need to introduce something new?” Well, first let’s address that question. Yes, we have all become very familiar, if not intimate, with a RDBMS over the years. We are very comfortable and know how to make them store data and how to get that data back out when we need to. However, I do take issue with the “work so well” part of that question. We have, over the years learned all kinds of tricks to make a RDBMS fit our needs. But, it often is convoluted, complicated and comes at some expense. Either literal due to the cost of scaling-up hardware to meet performance demands, or mental and emotional as in the mental gymnastics you often need to perform to understand it and implement a solution against it. And it often is both.

RDBMS’s come from a time when the thought of storing terabytes of data was unheard of. Today, that is often just the entry-point for many data-driven applications. Then, you layer on-top of that the fact that we now need to develop of systems with an eye toward a global audience, meaning global distribution, replication and reliability. We are now well out-of bounds of the original purvey of the RDBMS.

Now, don’t get me wrong, I definitely feel that relational data has great value in the business world. And, to my last point of global distribution, there have been great strides made to make that less of an issue for RDBMS’s. (See MariaDb for a nice example) However, the hurdles we have been jumping through to make them work for us in all situations is just no-longer necessary. We live in a world of persistence choices. Choose the one that fits your needs best and run with it.

So Many Choices

The NoSQL world has exploded in recent years. And you have many many choices. There are options geared toward gargantuan write speeds, lighting fast reads, scalability, reliability, just about anything. And that, in my humble opinion, is both the bane and the beauty of the NoSQL world.

Which One Should I Choose?

As I mentioned previously, you should evaluate your needs and choose the solution that fits best. Easier said than done right? Yeah, well, you’re right. Especially if you are new to the arena. So, let me share a bit of my experience and hopefully that will help.

First, let me say that this post is already going to me too long. So, I am going to narrow the scope to the two front-runners of the NoSQL world at the time of this writing. Cassandra and MongoDB. Between them, they can fit most business needs. (They also happen to be the two on my companies “approved technologies” list!)

The first question you need to ask yourself is, what does my data look like? Or, if you are working on green fields then, what, at the very minimum, do you expect your data to look like? The first question that I like to ask is “What kind of data are you storing?”

  • What is the “shape” of the data? (Contact info? Sales transactions? User activity?)
  • How many different types of data? (See first bullet-point.)
  • What volume do you anticipate? (It is usually best to overestimate here. You’ll be surprised.)
  • Do you anticipate the load to be read-intensive, write-intensive or mixed?
  • What size is the data you are storing? (By this I mean the individual bits of data.)
The answers to the above questions can get you most of the way to your chosen solution. So, let’s examine that more closely.
What kind of data are you storing?
 
Ok, so there are many, many types of data out there. But they all tend to boil-down to a few types. These are just my own buckets.
  • Reference data: Contact information, billing information, etc.
  • Transactional data: Banking, sales, tests etc.
  • Activity: User behavior
Reference data tends to be written infrequently but read often. It can be fairly complex with many different relationships (borrowing a term from the RDBMS world). You will also often need to look it up by various means. (Say, in the case of contact information, first name or telephone number, or…)
Transactional data, on the other hand, tends to be written frequently, and read less-frequently. It may be complex as in several operations make-up one transaction. But, it is fairly flat data that is, most-often, retrieved via some key like a transition number or order id.
Activity data is the new kid on the block. This type of data is often what constitutes everybody’s favorite buzzword “Big Data”. You are collecting massive amounts of data to attempt to mine it for trends. Trends that may help you present a better user experience. Or, to be completely honest, you hope that data will ultimately make you more money. this data is, almost without exception in my experience, unstructured. High volume and velocity. So, write-intensive. And read, for the most part (I’ll mention exceptions later) read in batches.
Now I Know My Data. Can I Just Pick A Solution Already?
 
Once you have identified the type or types of data you expect to be working with, you can begin to understand what kind of NoSQL solution will best fit your needs.
Below is a checklist that I often use to ell narrow this even further.
Feature Yes No
High write volume
Large number of reads
Complex queries
Large data objects
ACID transactions

Go ahead and fill-out the above spreadsheet to the best of your ability. The combination of this and your previous analysis of the type of data you expect will get you most of the way to your decision.

Evaluating The Checklist

If you are expecting your system to have a large number of writes, (this is obviously relative but I like to think first whether I expect it to be primarily recording data and reading infrequently) then you would likely be steered to cassandra. This is really cassandra’s historical “sweet spot”. You probably already know this.

On the flip-side, if you are expecting to write infrequently but read a lot, as in the case for contact information, MongoDB does have an out-of-thebox advantage here. However, as read load increases, so does read latency in MongoDB.

MongoDB will also give you an advantage when it comes to complex, dynamic queries on existing datasets. Mongo allows you to think less about the structure of your data up-front and decide how you want to retrieve that data later.

Large data objects are not really the forte of either of these databases. However, they do both have options that allow for chunking of large objects. With cassandra you have astyanax. In Mongo you have the option of going with GridFS. I have not personally used either. However, I have heard and read good things about both.

Lastly, if true ACID-compliant transactions are what you are looking for, you probably don’t want aNoSQL solution to begin with and should probably go back and read my post To Relate or Not. That said, if you are willing to loosen the reigns a bit on scrict ACIDity, either of these soltuions can provide you with a pretty high level of data consistency. And MongoDB does provide atomic transaction at a document level. [See here]

Other considerations

As I mentioned previously, while the type of data you are storing and the patterns of usage will and should be your first consideration when choosing a NoSQL solution, there really are other considerations you need to account for. Just a few examples:

  • What are your requirements for availability?
  • Do you anticipate requiring multi-dc or region replication?
  • What is your plan for maintaining your data solution(s)?

I can say from experience and the experience of my close colleagues that when it comes to high-availability, nothing currently beats cassandra. And it is the only solution that I have come-across that allows for relatively seamless cross data center replication of clusters. (Other solutions, like Riak, provide this at a cost)

One often overlooked aspect of this whole picture is the cost of maintaining your NoSQL solution. If you are just looking at a few servers in one data center or AZ, this may not be much of an issue. As you begin scaling-out, you will find this becoming more-and-more of a burden on your team. I can say that the maintenance costs of a MongoDB cluster are likely to escalate at a much greater pace. And, if you decide that you need to scale to multiple datacenters/regions, this cost can become fairly astronomical. In our case we needed to hire a dedicated team of experts as well as consultants form 10gen. As for cassandra, we are currently running several instances cross-region and zone and these are fairly easily maintained by the development teams. These clusters are closely monitored via various tools and we rarely, very rarely, have any issues that require manual intervention.

My obvious bias

By now, I’m sure you can tell that I feel that cassandra is the superior solution for most any application you plan to implement that requires the benefits of a NoSQL database. That said, I don’t want to discount how great I think that MongoDB can be. I use it frequently for quick proof-of-concepts and small internal applications that will never require the kind of scalability the majority of my work demands. Not surprisingly, I particularly enjoy working with MongoDB when writing in Node.js. They are like peanut butter and jelly. And make the creation of full-stack applications quick and painless. But, watch-out if that application turns-out to be a big hit!

Summary

To quickly summarize, both MongoDB and cassandra offer excellent solutions to different problems out-of-the box. However, I believe that given the demands of todays globally distributed world of applications, the best solution for most applications is going to be cassandra. Yes, there is going to be a bit more up-front work required. Particularly if you are writing a system that is more read-intensive than write intensive. Out-of-the-box, this is not what cassandra is designed for. However, with a little thought as to how you design you data, thinking first of how it will be accessed/queried, you can achieve great performance on both reads and writes.

Again, all of this is based entirely on my own personal experience. I work in an arena where availability, scalability and global distribution is tantamount. This may not be the case for you. Use my above evaluation tools fairly and choose what fits your needs best. However, I can say that you are unlikely to ever be sorry you chose cassandra. And you very well may be a hero for doing so.

 Some interesting related links

 

 

To Relate or Not

Ok, this might seem like an odd title for a tech blog. But, let me explain. This is the second part of my series on current database choices at my organization and how to pick the “right one” for your application. (If you didn’t read the introduction, check-out Tackling a new world order for some background.) And in my mind, the decision for most developers today starts with whether they should stick with what they are familiar with. Which is in most cases a RDBMS like Oracle or SQL Server. Or, go with the new cool kid on the block. (i.e. NoSQL flavor of your choice.) To further explain the title; I tend to define the choice as between traditional RDBMSs and NoSQL. Rather than SQL or NoSQL. Mainly because I view SQL as a language to retrieve data, not necessarily how the database operates. (For right or wrong.)

Why RDBMS?

Relational databases have been around for quite a while. As a result, they have evolved over time to solve many traditional organizational data problems very well. There are a plethora of tools to work with different RDBMS solutions that are currently either lacking or foreign in nature in the NoSQL realm. In addition, most relational databases that have survived, are very mature and well-vetted. You can count on them to “just work” in the way they are designed to be used. As a result, many developers choose to use relational stores because they are comfortable and reliably predictable. There is certainly some validity to this reasoning. However, this absolutely cannot be your only reasoning for making this choice in the world we (meaning software architects and developers) live in today.

The number one reason cited, and rightfully so, to choose an RDBMS over a NoSQL option is an absolute requirement for full ACID compliance. (if you need a crash course on ACID. See this wikipedia write-up.) Essentially, if you need your data to always be consistent across all requests, you are currently best served by an RDBMS solution. For example, an online purchase must debit the purchaser’s account prior to debiting the seller’s. Otherwise, you can create chaos.

Relational databases are also a great fit for small datasets that are likely to remain fairly stable. HR data is a commonly supplied example. No matter how fast your future start-up is likely to grow, it is not likely to grow at such a rate that you need the high-velocity read and write capabilities of a NoSQL option. In addition, the type or shape of HR data is highly unlikely to vary much over time.

In addition, relational databases allow you to decompose data into distinct parts to reduce duplication. (Best known as data normalization) This has several advantages beyond the simple fact that you can store your data in less space. Decomposing and normalizing data allows the developer to retrieve parts of data easily. And perform various bits of complex analysis on this parts. One increasingly popular use-case is to pump subsets of data extrapolated from other stores such as Hadoop into RDBMSs for slicing and dicing that data in ways made simple with the power of SQL.

Why not RDBMS?

The biggest driver for companies to move away from traditional RDBMS solutions is pure scale. In particular, with the rapid adoption of cloud technologies to provide elastic computing environments. Scaling relational data stores has traditionally meant vertically scaling. That is, adding more computing power to existing machines. In the world of the cloud, scaling is most-efficiently done by scaling-out. Adding more nodes to a cluster to distribute load. This has many advantages over traditional scale-up options. Most notably, that it is very difficult to add computing power, be it CPU cores, memory, faster drives, etc. to a server. It is very easy to add another node to a cluster. Particularly in the world of cloud computing where it can be as simple as a few mouse clicks. However, even in a datacenter, it is easier to add another, less expensive server and add it to an existing cluster.

This is not to say that scaling-out a relational database is impossible. It can and has been done. However, it is historically VERY difficult and prone to many issues.

The typical solution to scaling an RDBMS is called sharding. This is where you distribute the data across nodes based on a pre-determined key. The often-cited simplistic example is sharding users alphabetically by last name. So, given two nodes, you would have A-M on one node and N-Z on the other. However, this is really a bad example as names are not evenly distributed across the alphabet and likely will cause hot-spots. And, ironically, this common example illustrates how difficult it can be to choose the correct key to shard a database with. If you do find yourself in the situation where you need to shard your RDBMS, I hope you can find a decent key to shard on or, alternatively, can use an auto-shard solution that shards on a generated key. Otherwise, good luck! That said, I don’t want to make this blog post a dissertation on sharding so, I will leave that subject here. There are many, many articles available for you to research sharding further if you so choose.

The second reason that you might want to choose a NoSQL solution is the pure size of the data that you expect to be working with in your application. One of the greatest strengths of NoSQL databases are their ability to handle very high volumes of data. RDBMSs are bound by their dependence on ACID compliance. The I in ACID stands for Isolation. Which means that a transaction against a relational database requires a lock on the data which causes all other operations on that same piece of data to queue until the operation completes. In contrast, most NoQL databases adhere to the BASE model which favors availability, “soft state” and eventual consistency over the strict model of Consistency and isolation enforced by the ACID model.

NoSQL databases achieve this through highly-distributed models that allow reads and writes to succeed even in the face of high concurrency. The trade-off is that data consistency is relatively lax in that if one application is performing a modification to a record that another application is reading, there is a chance that the application reading the data will not get the most-current state of that data as defined by the state set by the application modifying the data. To keep the reading application from being blocked by the writing application, it will be directed to a different node. That node will be eventually updated with the modification made by the first application. But, there is still a period of time when the nodes state is out-of-sync. This would not occur in an ACID compliant system. However, in most use cases, this is perfectly fine.

One more common use-case for NoSQL data stores is where the schema of the data to be used is either not known in advance or expected to be variable. A major impetus behind the explosion of NoSQL databases, particularly among developers, is the fact that they tend to be “schemaless”. This means that a developer can begin developing a system without knowing the exact data requirements. This can greatly increase time-to-market as an application can be deployed with a subset of data and fields can be added as they are determined to be required. This flexibility is tremendously powerful and would be very painful if not impossible in the relational world.

Summary

Basically, any application could make either a relational or NoQL database work for them. However, you really should use the right tool for the job. Just because you may be comfortable with, or even an expert at, SQL and RDBMS data stores you should not choose it for your application if it doesn’t fit. Ultimately, if you make the wrong choice, you will pay for it. Now or later.

So, in summary, here is my basic advice for when to choose a relational database or NoSQL.

RDBMS

  • Small, well-defined, dataset.
  • Relatively small data velocity.
  • Data absolutely must for consistent across all reads.
  • No way around joins.
  • Global scalability can be sacrificed or you have significant resources to manage a scaling option.
Examples: HR data, financial transactions, highly-complex data analysis

NoSQL

  • Large volume of data.
  • High velocity data ingestion or consumption.
  • Multi-zone, multi-region scalability is required.
  • Data can be modeled in such a way that joins are not required.

Examples: Activity streams, monitoring data, any data requiring global availability

Hopefully this helps add some clarity to the decision to be made between a traditional RDBMS or a NoSQL database. Obviously, there is not enough room for a comprehensive study of the subject. But this should give most users a good starting-point.

In the next part of this series I will compare Cassandra to MongoDB and explain use-cases for each of them.

Tackling a new world order

Sun-Rise-On-Earth-From-Space-Wallpaper-600x300So, my company recently (within the past couple years) made a major tactical shift from the world of Microsoft-centered development to a very strict adherence to open-source technologies. Driven primarily by our rapid growth and the inability/difficulty for much of our existing MS infrastructure to grow with us. It has served us well thus far but, we have grown beyond the point where it made sense to keep paying the massive licensing fees and developing creative workarounds for scalability. (I have no doubt many would disagree with this rationale. Just my opinion.) This has caused a lot of confusion and consternation throughout the organization.

For the most part, the development and architecture groups embraced this change. Like myself, we have a crew that is interested in the best solution for the job. Regardless of the fact we have been working, almost exclusively, in the MS .NET world for years and this change means a major commitment and investment by all parties. (Those not interested have moved-on without any hard feelings, hopefully, on either side.) However, this is still a fairly steep hill to climb. It’s not that we just need to switch languages from say, C# to java. That is the easy part. There is a whole new ecosystem to adjust (or re-adjust) to. In addition, we are changing our entire system architecture from traditional, datacenter-deployed, applications to cloud-based, globally-distributed applications. This involves all of us and takes significant learning and knowledge sharing across groups as well as new development practices, QA processes and deployment mechanisms and processes.

However, all of this is far beyond the scope of this blog post. Today I just want to begin by kicking off a series regarding one subject that has come-up a lot and is becoming a pain point for teams across the organization. That is that, now that we are cutting ties with our historical datastore (MS SQL Server), what open source database should I choose for my application?

Part of this decision has been helped by the fact that we have had various members of our development and architecture teams do formal reviews of various options. Currently, we have three solutions that are accepted as zero-barrier options. (By this, I just mean that the use of them has been approved and does not require additional justification. Not that there are not barriers such as a learning-curve.)

The current options are as follows:

  • Cassandra
  • MongoDb
  • MySQL

Great, polyglot persistence. No trouble there. Just use the right store for the job. Right? Well, remember, this is a whole new world for most of us here. (And, arguably, for everybody that might read this post.) We are used to one option, MS SQL Server. We would model our data using RDBMS standards. Normalizing data as best we could while balancing relative performance. Then model our application domain objects to stored procedures that did various joins, subqueries, etc. to get the data in the exact “shape” we needed to work with.

So, now the directive comes along to change your data store with a general preference being the use of the approved NoSQL options. So, the question, predictably, arises “Why do I need to change the way I have always worked?” Well, this is a complicated question to answer. And, really, the answer is you don’t. However, as with every decision you will ever make, you have to be willing to make trade-offs if you want to stay the course. The trade-offs in the new “cloudified” world of sticking with a traditional RDBMS like MySQL vs making the shift to something like Cassandra or MongoDb, are pretty steep. But there are still use cases where it would make sense.

So, in the next few posts I plan on tackling this question as well as some others to the best of my ability. In addition, I would like to lay-out what I feel are good guidelines for choosing among the various types of databases available given a knowledge of the following:

  • The type or shape of the data you plan to store or use.
  • The volume of the data. (How big is the data? Either individual items or number of items.)
  • How you plan to use the data.

Other ideas will likely creep-in as I tend to stray off-topic from time-to-time. But, I will attempt to stay focused and do my best to add value to this overall discussion both for my organization and to those out there that might be going through similar transitions at their work.

I hope this series is informative. Please add responses or ask questions as I go. I have a thick skin and can handle criticism. I tend to learn the most from those that don’t agree with me.

Thoughts on Gluecon 2014

Just trying to digest all of the great conversation from this years Gluecon in Denver. Overall, I’d say it was a great success. Lots of interesting topics covered, not surprisingly, centered on the cloud, big data, DevOps and APIs. However discussions went well-beyond the standard discussions around high-level concepts, or “Use this cool new tool. It will make you the hit of the party!”

There were certainly a broad range of tools and products on display but, what I found (maybe naively?) was that, for the most part the talks were very well-vetted by the hosts to limit marketing spiel and offer really pertinent content to help us practitioners do our jobs in the best, most open, manner possible.

Some particularly strong takeaways for me were the following:

API design is not something you can fudge any longer. It takes serious thought. Real top-down thought. Ahead of time. You must think of how you want your API to be shaped. Meaning, start with your clients. Architects and developers have to start by thinking “If I were a client, with no knowledge of the implementation, how would I expect to interact with it?”

For example, let’s say we have a some content, say books, that we want to offer an API for. The first thing you should think is “Who are the clients of my API likely to be?” Depending on your product, your clients may be web and app developers in your company or some other company that wants to use our API to offer content to their customers or both. Either way, your client is likely to be a web or app developer. So, now that you know this, start designing your API to their needs. If you are lucky enough to know your potential clients, USE THAT ADVANTAGE!

More on this topic in future posts…

Building for scale has never been easier. Or more challenging. Sound like I’m confused? Well, maybe. But, what I mean is that the tools to build highly scalable systems have never been so available to us developers and architects. There was a time, not too long ago, that to build an application to handle the type of load that APIs like those from Twitter, Facebook, and many others are seeing  these days, you had to, typically, overbuild up-front. Over-provision hardware (even if just reserving rackspace and creating standard hardware specs to hasten hardware delivery time), shard your database of choice from the get-go (or at least think logically how you might), build complicated threading and synchronization logic into your code, etc. etc.

Now, while you still need to consider these things up-front, you have choices to ease the burden. Obviously, choosing a hosted cloud solution like aws, rackspace, azure, etc. is, at least in my humble opinion, a no-brainer. At least for most organizations that don’t have the resources of a google or Microsoft. With this decision made, you can start focusing on your app. And in that realm there are more choices than ever as well. From brilliant, auto-scaling, sharding, replicating databases like cassandra or riak (and others), right down to the languages you use. Java 8 comes with new, improved features like completable futures and the new, improved stream class. Then you have options like scala, node.js, etc. Take your pick.

But, this plethora of options also leads to the second part of my statement that this has also never been a more challenging time to build scalable apps. First, you have more to evaluate, and thus learn. Don’t get me wrong. The constant change of this field is the reason I got into it in the first place. I thrive on change. But not everybody does. Even on a given, hand-selected, team you are likely to have dissenters and individuals digging their heels in the dirt. Image how THIS concept scales to large teams and organizations.

That said, I see this as an exciting time of change and progress for our industry. And I/we can’t convince everybody of this. So, get on-board or get out of the way!

Deploying applications to the cloud must be quick, repeatable and predictable. Containers are the future. Learn the concepts. Pick a tool or tools and learn that/them. Then, when (not if) things change. You’ll be better prepared for it. That’s it. (Partially because this is an area I’m admittedly weak in myself.)

API SDKs suck! Ok, so I actually do not buy into this, but it was a very common theme (both sides well-represented) at this years Gluecon. Thanks mostly to an excellent day one keynote by John Sheehan at Runscope “API SDKs Will Ruin Your Life”.

Like I said, I don’t completely agree with this assertion. But, to be honest, I don’t think John does either. However, he and others made some good points. One that hit particularly close-to-home for me was the double-edged sword of how the SDK abstracts developer’s from the actual interface of an API. This abstraction eases adoption by you APIs clients. That is a VERY GOOD thing! However, as John stated, the vast majority of issues that occur with API integrations are “on the wire”. Meaning, more or less, something is wrong with the request or response. However, if you abstract this interaction from your clients, all they know is “my request did not succeed”. More API savvy developers may take the next step of inspecting the request/response before contacting you. But, if they do, barring an obvious issue like malformed requests or forgetting to pass auth, they will likely jus be faced with an unintelligible error message of some sort.

So, my counter to this argument is three-fold. Document you APIs well. Be it the old-fashioned way by manually producing help docs or with something, in my opinion, infinitely better like swagger. Just do it! It will save you many headaches in the future. Secondly, back to my first point, design your APIs intelligently with your clients in mind first. If your API is easy to navigate for an average person (test it out on somebody!), it will make the interaction less painful to begin with so your API may, potentially, need less abstraction by the SDK. Lastly I say you must strive to make the errors your API returns just as comprehensible as the non-errors. By this I mean things like returning proper error codes and human-readable descriptions. Not just a generic 400 with “Bad Request” or what have you. I know all-too-well this is hard to do up-front. You can’t predict all the ways requests may fail. But, if you try, you can think of the more common ones and handle them more elegantly. You are likely coding defensively against them to prevent failures on your end anyway. For those that arise after the fact, adapt. That is why you have that rapid, repeatable, deploy process mentioned above.

Summary

Ok, so I have rambled-on waaaayyyy too long and have not even come close to covering the above topics let alone concepts that piqued my interest but need more research on my part to speak to like cluster management with yarn or mesos. But suffice it to say, this is one of the most relevant, content-packed conferences going for the more technical audience. If you missed it this year, I highly recommend searching for the content and discussions sure to be posted in the coming days. And, see if you can make it next year. It will pay-off in spades.

Links

Excellent list of links to this years notes and presentations online provided by James Higginbotham.

http://launchany.com/gluecon-2014/