Relational vs. non-relational databases: Which one is right for you?
- select the contributor at the end of the page -
OK, so we know that you're here to weigh the technical differences between relational and non-relational databases. But before diving right in, it's essential to whiteboard your data model and determine the priority that scalability poses for that particular model. Draw it out on paper if you must; don’t be afraid to use low-tech tools like different colored pens. The point is to start with a clear idea, and a picture can certainly help you accomplish that.
In order to whiteboard your data model as accurately as possible, you'll need to ask the right questions. Vague or inadequate questions, not surprisingly, result in bad or insufficient data. And that's not what we want when programmatically modeling and solving someone’s real life problem—regardless of whether you select a relational or a non-relational database. Now that we’ve got that out of the way, let’s move on.
So what is a relational database exactly? Relational databases like MySQL, PostgreSQL and SQLite3 represent and store data in tables and rows. They're based on a branch of algebraic set theory known as relational algebra. Meanwhile, non-relational databases like MongoDB represent data in collections of JSON documents. The Mongo import utility can import JSON, CSV and TSV file formats. Mongo query targets of data are technically represented as BSON (binary JASON).
Relational databases use Structured Querying Language (SQL), making them a good choice for applications that involve the management of several transactions. The structure of a relational database allows you to link information from different tables through the use of foreign keys (or indexes), which are used to uniquely identify any atomic piece of data within that table. Other tables may refer to that foreign key, so as to create a link between their data pieces and the piece pointed to by the foreign key. This comes in handy for applications that are heavy into data analysis.
If you want your application to handle a lot of complicated querying, database transactions and routine analysis of data, you’ll probably want to stick with a relational database. And if your application is going to focus on doing many database transactions, it’s important that those transactions are processed reliably. This is where ACID (the set of properties that guarantee database transactions are processed reliably) really matters, and where referential integrity comes into play.
Referential integrity (and minimizing ORM Impedance Mismatch)
Referential integrity is the concept in which multiple database tables share a relationship based on the data stored in the tables, and that relationship must remain consistent. This is usually enforced with cascading actions of adding, deleting and updating. To illustrate an example of enforcing referential integrity, let’s consider an app that helps victims of human trafficking locate a safe house and access victim services in real time.
Suppose city or county X has two tables; a Trafficking Victim Shelter table and a Trafficking Shelter Funding table. In the Trafficking Shelter table we have two columns; the Shelter ID (which could be its EIN/FEIN number) and the name of the shelter. In the Trafficking Shelter Funding table, we also have two columns; the Shelter ID and the amount of funding received for that given Shelter ID. Now, suppose a dearth in funding forced Shelter A in city/county X to close its doors. We would need to remove that shelter from locale X since it’s no longer in existence. And since Shelter A also exists in the Shelter Funding table, we need to remove it from there as well. By enforcing referential integrity, we can make this accurate -- and with minimal headaches.
First, define the Shelter ID column in the Shelter table to be our primary key. Then, define the Shelter ID column in the Shelter Funding table to be a foreign key that points to a primary key (that is the Shelter ID column in the Shelter table). Once we define our foreign-to-primary key relationship, we need to add constraints. One constraint in particular is known as cascading delete. This means that anytime a shelter is deleted from the Shelter table in our database, all entries for that same shelter would be automatically removed from the Shelter Funding table.
Now, take note of what was designated as the primary key, and why. In our little example of anti-trafficking charities, every non-profit NGO with 501(3)c status is issued an EIN, much like an individual’s social security number. So, in tables where other data is linked to any particular trafficking victim’s shelter in the shelter table, it makes sense to have that unique identifier serve as the primary key and to have the foreign keys point to it.
Keep in mind, there are three rules that referential integrity enforces:
- We may not add a record to the Shelter Funding table unless the foreign key for that record points to an existing shelter in the Shelter table. You can think of this as a “No Unattended Child” rule or a “No Orphans” rule.
- If a record in the shelter table is deleted, all corresponding records in the Shelter Funding table must also be deleted. The best way to handle this is by using cascade delete.
- If the primary key for a record in the Shelter table changes, all corresponding records in the Shelter Funding (and other possible future tables with data relating to the Shelter table) must also be modified using something called a cascade update.
The burden of instilling and maintaining referential integrity rests on the person who designs the database schema. If designing a database schema seems like a daunting task, consider this: Prior to 1970 (when the relational database was born) all databases were flat; data was stored in a long text file called a tab delimited file where each entry was separated by the pipe character (“|”). Searching for specific information to compare and analyze was a difficult, tedious, time-consuming endeavor. With relational databases you can easily search, sort and analyze (usually for comparison to other data purposes) specific pieces of data without having to search sequentially through an entire file (or database), including all the data pieces you’re not interested in.
In the previous example of a relational database (Postgresql), we don’t need to search through an entire database worth of information just to find the information on a shelter that either had its funding slashed or that was forced to close for lack of funds. We can use a simple SQL query to find which shelters closed in a particular region or locale without having to traverse all of the data, including shelters not in that specific area, by using a an SQL SELECT * FROM statement.
Object Relational Mapping (ORM) refers to the programmatic process of converting data between incompatible type systems in object-oriented programming languages (like Ruby). In the context of a Ruby program (a Rails app in particular), the concept of ORM libraries was briefly discussed in our tutorial on Getting started with Rails.
When to non-relate
While relational databases are great, they do come with trade-offs. One of those is ORM Impedence Mismatching, because relational databases were not initially created with the OOP languages in mind. The best way to avoid this issue is to create your database schema with referential integrity at its core. So, when using a relational database with an OOP (like Ruby), you have to think about how to set up your primary and foreign keys, the use of constraints (including the cascade delete and update), and how you write your migrations.
But, if you’re dealing with a phenomenally huge amount of data, it can be way too tedious, and the probability of error (in the form of an ORM Impedance Mismatch issue) increases. In that situation you may need to consider going with a non-relational database. A non-relational database just stores data without explicit and structured mechanisms to link data from different tables (or buckets) to one another.
- The need to store serialized arrays in JSON objects
- Storing records in the same collection that have different fields or attributes
- Finding yourself de-normalizing your database schema or coding around performance and horizontal scalability issues
- Problems easily pre-defining your schema because of the nature of your data model
Suppose we were developing an app, and our example for the trafficking victim safe houses was part of a data model that was too complex and had too many tables, making referential integrity extremely difficult. We might handle the representation of our trafficking victim service-providing NGO’s like this instead, using Mongo:
But it isn’t only MEAN stack developers who decided that a non-relational database was the best way to go. Steve Klabnik (a well-known member of the Ruby/Ruby on Rails community and the maintainer of the open source project Hackety-Hack) also chose MongoDB. Of course, he had to make trade-offs in taking this route. This included difficulty in getting Hackety-Hack refactored to be set up for user authentication with Facebook, Twitter, Linkedin and Github accounts. But other Rails developers also like Mongo for its superior horizontal scalability.
One of the biggest advantages in going with a non-relational database is that your database is not at risk for SQL injection attacks, because non-relational databases don’t use SQL and are, for the most part, schema-less. Another major advantage, at least with Mongo, is that you can theoretically shard it forever (although that does bring up replication issues). Sharding distributes the data across partitions to overcome hardware limitations.
Non-relational database disadvantages
In non-relational databases like Mongo, there are no joins like there would be in relational databases. This means you need to perform multiple queries and join the data manually within your code -- and that can get very ugly, very fast.
Since Mongo doesn’t automatically treat operations as transactions the way a relational database does, you must manually choose to create a transaction and then manually verify it, manually commit it or roll it back. Even the documentation on the MongoDB site warns you that without taking some potentially time-consuming precautions, and since documents can be fairly complex and nested, the success or failure of a database operation cannot be all or nothing. To put it simply, some operations will succeed while others fail.
Of course, this all brings us back to the beginning; knowing how to ask exactly the right questions in order to effectively whiteboard your data model. It's this key step that will allow you to determine the best route for you regarding your application’s flow. Taking the time to pinpoint the right questions will serve as a solid guide when choosing the programming language to write your application in, and the use of one particular database over another.