The acronym NoSQL was coined in 1998. Many people think NoSQL is a derogatory term created to poke at SQL. In reality, the term means Not Only SQL. The idea is that both technologies can coexist and each has its place. The NoSQL movement has been in the news in the past few years as many of the Web 2.0 leaders have adopted a NoSQL technology. Companies like Facebook, Twitter, Digg, Amazon, LinkedIn and Google all use NoSQL in one way or another. Let's break down NoSQL so you can explain it to your CIO or even your co-workers.
NoSQL Emerged From a Need
Data Storage: The world's stored digital data is measured in exabytes. An exabyte is equal to one billion gigabytes (GB) of data. According to Internet.com, the amount of stored data added in 2006 was 161 exabytes. Just 4 years later in 2010, the amount of data stored will be almost 1,000 ExaBytes which is an increase of over 500%. In other words, there is a lot of data being stored in the world and its just going to continue growing.
Interconnected Data: Data continues to become more connected. The creation of the web fostered in hyperlinks, blogs have pingbacks and every major social network system has tags that tie things together. Major systems are built to be interconnected.
Complex Data Structure: NoSQL can handle hierarchical nested data structures easily. To accomplish the same thing in SQL, you would need multiple relational tables with all kinds of keys. In addition, there is a relationship between performance and data complexity. Performance can degrade in a traditional RDBMS as we store the massive amounts of data required in social networking applications and the semantic web.
What is NoSQL?
I guess one way to define NoSQL is to consider what its not. It's not SQL and it's not relational. Like the name suggests, it's not a replacement for a RDBMS but compliments it. NoSQL is designed for distributed data stores for very large scale data needs. Think about Facebook with its 500,000,000 users or Twitter which accumulates Terabits of data every single day.
In a NoSQL database, there is no fixed schema and no joins. A RDBMS "scales up" by getting faster and faster hardware and adding memory. NoSQL, on the other hand, can take advantage of "scaling out". Scaling out refers to spreading the load over many commodity systems. This is the component of NoSQL that makes it an inexpensive solution for large datasets.
The current NoSQL world fits into 4 basic categories.
- Key-values Stores are based primarily on Amazon's Dynamo Paper which was written in 2007. The main idea is the existence of a hash table where there is a unique key and a pointer to a particular item of data. These mappings are usually accompanied by cache mechanisms to maximize performance.
- Column Family Stores were created to store and process very large amounts of data distributed over many machines. There are still keys but they point to multiple columns. In the case of BigTable (Google's Column Family NoSQL model), rows are identified by a row key with the data sorted and stored by this key. The columns are arranged by column family.
- Document Databases were inspired by Lotus Notes and are similar to key-value stores. The model is basically versioned documents that are collections of other key-value collections. The semi-structured documents are stored in formats like JSON.
- Graph Databases are built with nodes, relationships between notes and the properties of nodes. Instead of tables of rows and columns and the rigid structure of SQL, a flexible graph model is used which can scale across many machines.
Major NoSQL Players
The major players in NoSQL have emerged primarily because of the organizations that have adopted them. Some of the largest NoSQL technologies include:
- Dynamo: Dynamo was created by Amazon.com and is the most prominent Key-Value NoSQL database. Amazon was in need of a highly scalable distributed platform for their e-commerce businesses so they developed Dynamo. Amazon S3 uses Dynamo as the storage mechanism.
- Cassandra: Cassandra was open sourced by Facebook and is a column oriented NoSQL database.
- BigTable: BigTable is Google's proprietary column oriented database. Google allows the use of BigTable but only for the Google App Engine.
- SimpleDB: SimpleDB is another Amazon database. Used for Amazon EC2 and S3, it is part of Amazon Web Services that charges fees depending on usage.
- CouchDB: CouchDB along with MongoDB are open source document oriented NoSQL databases.
- Neo4J: Neo4j is an open source graph database.
The question of how to query a NoSQL database is what most developers are interested in. After all, data stored in a huge database doesn't do anyone any good if you can't retrieve and show it to end users or web services. NoSQL databases do not provide a high level declarative query language like SQL. Instead, querying these databases is data-model specific.
Many of the NoSQL platforms allow for RESTful interfaces to the data. Other offer query APIs. There are a couple of query tools that have been developed that attempt to query multiple NoSQL databases. These tools typically work accross a single NoSQL category. One example is SPARQL. SPARQL is a declarative query specification designed for graph databases. Here is an example of a SPARQL query that retrieves the URL of a particular blogger (courtesy of IBM):
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
?contributor foaf:name "Jon Foobar" .
?contributor foaf:weblog ?url .
Future of NoSQL
Organizations that have massive data storage needs are looking seriously at NoSQL. Apparently, the concept isn't getting as much traction in smaller organizations. In a survey conducted by Information Week, 44% of business IT professionals haven't heard of NoSQL. Further, only 1% of the respondents reported that NoSQL is a part of their strategic direction. Clearly, NoSQL has its place in our connected world but will need to continue to evolve to get the mass appeal that many think it could have.