Unravelling NoSQL and trying to explain what it is, and whether or not you should be interested in it, is difficult. The term covers a wide range of technologies, data architectures and priorities; it represents as much a movement or a school of thought as it does any particular technology. Even the name is confusing: for some it means literally any data storage that does not use SQL, but thus far the industry seems to have settled on "Not Only SQL". As time goes on, the scope of the term is likely to grow until it becomes meaningless by itself and sub-divisions are needed to clarify what is meant.
Technical leaders have an important role in understanding the available options and adapting the software, products and services most applicable to their own domain. Having a logical and localised strategy for adopting the best of NoSQL is going to be what differentiates success from failure in adoption.
Just as NoSQL presents new challenges, it also offers significant rewards to those who can successfully incorporate it into their solution portfolio. The key benefits are going to emerge around improved data comprehension, flexible scaling solutions and productivity. The rich variety of new business models needs data storage that supports it, and the decades of coercing data into relational forms lie behind us.
NoSQL is a large and expanding field. For the purposes of this paper the common features of NoSQL data stores are:
Not every product in this paper has every one of these properties but the majority of the stores we are going to talk about support most of them.
"Web scale", as it is commonly referred to, is a capacity planning, scale and provisioning issue that has become pressing for many web businesses over the last five years. As the world becomes more connected it is possible for sites to experience massive variations of traffic. Some of these are related to predictable events: the World Cup or Christmas; others are unpredictable and global, for example September 11th posed massive challenges for news sites. Sites like Facebook have made it easy for sites to experience massive upswings of popularity as items "go viral" and are distributed by global world of mouth.
User-generated content causes particular headaches as the issues of scaling for "read-heavy" websites is well understood with the use of static content and Content Distribution Networks (CDNs). User-generated content means that sites become more "read-write" balanced. Sites like Twitter experience massive surges in write traffic in very narrow time frames (a goal scored or denied, an election declaration or TV finale), their infrastructure needs to adapt rapidly and not be stuck in the wrong mode at the wrong time. The normal approach to scaling has been to add webservers, which works until the traffic through the database (which has historically been a single instance) becomes the bottleneck. The answer then has been to buy progressively more powerful hardware until the database can serve all the traffic. Web scale invalidates this model as you face the dilemma of having to purchase hardware to meet your peak demand (Christmas, the World Cup) but which is operating very far below capacity day to day. For some businesses it is simply impossible to purchase the hardware and licenses to meet their peak demand solely through a single server. These businesses have been seeking a scalable data solution that mirrors their web architecture.
The second driver is the fact that data changes over time. As the business model evolves concepts and data models often struggle to evolve and keep pace with changes. The result is often a data structure that is filled with archaic language and patched and adapted data. As anyone who has had to explain that the value in a column has a different meaning depending on whether it is less than or greater than 100 or that "bakeries" are actually "warehouses" due to historical accident knows that the weight of history in the data model can be a serious drag in maintaining a system or incorporating new business ideas.
The final factor is that the NoSQL technology is now starting to become a commodity. Once an Amazon or Google had no choice but to create a bespoke solution that answered their problems of scale. The cost of writing such a solution prevented enterprises that did not have these issues at the heart of their business model from exploiting this new technology. Recently a series of donations of code to bodies such as the Apache Foundation or other open source groups which provide community-driven support and development, has lead to the possibility of using extremely sophisticated code at little cost in upkeep. Such code puts NoSQL firmly in the reach of smaller companies. Instead of being an esoteric subject, now NoSQL data stores can be downloaded and made part of an enterprise architecture in weeks.
Even the vendors themselves recognise the problem. If they are unable to find a common set of data manipulation operations themselves, then it is likely that one or another implementation will become popular, and either users will migrate to the product that solves their problem or all vendors will have to implement the market leader's command set to be competitive.
There are some standards already available, such as SPARQL, a standard for querying RDF or tuple-data. This could be adapted to both document and graph databases, but currently there is nothing that provides a genuinely modular query syntax that could be compared to SQL.
It is an irony that NoSQL products more complex than the Key-Value stores are likely to have to implement something very similar to SQL if they want to achieve the same broad usage as Relational data products do today. In some ways this fact lies behind the "Not only SQL" slogan: truly doing away with SQL would just be too painful.
From a solution point of view there needs to be a clear analysis of what data is genuinely relational and what is currently stored in relational stores only because of a lack of alternatives. It is also important to review historic decisions to see if they were made with historical constraints in mind. A particular example is the use of a graph database instead of very complex relational tables. It is entirely possible to create sets of many-to-many relations in relational data and then query the intersections of these relationships, but expressing just the relationships may result in a much simpler solution.
There are some obvious areas where NoSQL can be applied immediately. Website content can generally be expressed in terms of document and key-value datastores. Particular examples of suitable situations are forms and wizard-style metaphors. Any web form can find ready expression in a document form. Lookup data is another example: lots of reference data consists of maps, lists and sets (referrers, countries, reasons for cancellation, counties, provinces and states). Looking for these patterns in data should allow identification of opportunities.
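As a rough illustration of these patterns, the sketch below expresses a wizard-style form as a single nested document and some lookup data as a map, a list and a set. All field names and values are invented for the example, and JSON is only assumed as the serialised form a document store might hold.

```python
import json

# A multi-step wizard captured as one nested document (field names are made up).
signup_form = {
    "step_1": {"name": "A. Customer", "email": "customer@example.com"},
    "step_2": {"country": "GB", "newsletter": True},
    "completed_steps": ["step_1", "step_2"],
}

# Reference data expressed as a map, a list and a set (illustrative values).
countries = {"GB": "United Kingdom", "FR": "France"}
cancellation_reasons = ["Too expensive", "No longer needed"]
referrers = {"search", "email", "partner"}

# Round-trip through JSON, the kind of value a document store might hold.
document = json.dumps(signup_form)
restored = json.loads(document)

# Resolve a code from the form against the lookup map.
print(countries[restored["step_2"]["country"]])
```

The whole form travels as one unit, so adding a third step means adding a key to the document rather than altering a table.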
Looking more strategically, systems that need to evolve and change their data frequently offer a chance to use a schema-less data store. If being able to migrate data structures without taking the data store offline would be advantageous, you have a strong indicator that looking for a NoSQL solution would be valuable.
The following section describes the different types of NoSQL datastores.
Examples: Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB
Typical applications: Content caching
Strengths: Fast lookups
Weaknesses: Stored data has no schema
Example application: You are writing forum software where you have a home profile page that gives the user's statistics (messages posted, etc) and the last ten messages by them. The page reads from a key that is based on the user's id and retrieves a string of JSON that represents all the relevant information. A background process recalculates the information every 15 minutes and writes to the store independently.
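A minimal sketch of this pattern follows, using a plain Python dictionary as a stand-in for the key-value store. The key format `profile:<user id>`, the function names and the field names are assumptions made for illustration; a real store would replace the dictionary with its own get and set operations.

```python
import json

# Stand-in for the key-value store: opaque string keys mapped to opaque string values.
store = {}


def profile_key(user_id):
    """Build the lookup key from the user's id (hypothetical key format)."""
    return f"profile:{user_id}"


def recalculate_profile(user_id, messages):
    """Background job: summarise the user's activity and write it as one JSON string."""
    summary = {
        "user_id": user_id,
        "messages_posted": len(messages),
        "last_messages": messages[-10:],  # the ten most recent messages
    }
    store[profile_key(user_id)] = json.dumps(summary)


def render_profile(user_id):
    """Page request: a single key lookup, with no joins or queries."""
    raw = store.get(profile_key(user_id))
    return json.loads(raw) if raw is not None else None


# The background process would run this every 15 minutes.
recalculate_profile(42, [f"message {n}" for n in range(1, 26)])
profile = render_profile(42)
print(profile["messages_posted"], profile["last_messages"][0])
```

The page and the background job share nothing but the key, which is why the two can run independently. The store itself knows nothing about the shape of the JSON, which is the "no schema" weakness noted above.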
Examples: CouchDB, MongoDB
Typical applications: Web applications
Strengths: Tolerant of incomplete data
Weaknesses: Query performance, no standard query syntax
Example application: You are creating software that creates profiles of refugee children with the aim of reuniting them with their families. The details you need to record for each child vary tremendously with the circumstances of the event and they are built up piecemeal: for example, a young child may know their first name and you can take a picture of them, but they may not know their parents' first names. Later a local may claim to recognise the child and provide you with additional information that you definitely want to record, but until you can verify the information you have to treat it sceptically.
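The following sketch shows one way such piecemeal, partly unverified records could be held as free-form documents. A dictionary stands in for the document store, and the `source` and `verified` fields, the function names and the sample values are all assumptions made for this example.

```python
# Stand-in for a document collection: each document is a free-form dictionary.
children = {}


def record(child_id, field, value, source, verified=False):
    """Add or overwrite one field, keeping its source and verification status."""
    doc = children.setdefault(child_id, {})
    doc[field] = {"value": value, "source": source, "verified": verified}


def verified_view(child_id):
    """Return only the details that have been confirmed."""
    doc = children.get(child_id, {})
    return {name: entry["value"] for name, entry in doc.items() if entry["verified"]}


# First contact: the child knows only their first name (sample value).
record("child-001", "first_name", "Amina", source="child", verified=True)
# Later: a local resident offers unconfirmed information (sample value).
record("child-001", "mother_first_name", "Fatima", source="local resident")

print(verified_view("child-001"))
print(sorted(children["child-001"]))
```

No field is mandatory and new kinds of detail need no schema change, so the unverified claim is kept alongside the confirmed data without being treated as fact.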
Examples: Neo4j, InfoGrid, InfiniteGraph
Typical applications: Social networking, Recommendations
Strengths: Graph algorithms, e.g. shortest path, connectedness, n-degree relationships, etc.
Weaknesses: Has to traverse the entire graph to achieve a definitive answer. Not easy to cluster.
Example application: Any application that requires social networking is best suited to a graph database. These same principles can be extended to any application where you need to understand what people are doing, buying or enjoying so that you can recommend further things for them to do, buy or like. Any time you need to answer a question along the lines of "What restaurants do the sisters of people who are over 40, enjoy skiing and have visited Kenya dislike?" a graph database will usually help.
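To make the restaurant question concrete, here is a small sketch that answers it over an in-memory edge list. The people, relationship names and restaurants are invented sample data, and the linear scan in `targets` is a simplification; a graph database would follow indexed relationships instead.

```python
# Nodes carry properties; edges are (source, relationship, target) triples.
# All names below are made-up sample data.
people = {
    "ann": {"age": 45},
    "bob": {"age": 38},
    "cat": {"age": 41},
    "dee": {"age": 30},
}
edges = [
    ("ann", "ENJOYS", "skiing"),
    ("ann", "VISITED", "Kenya"),
    ("ann", "SISTER", "cat"),
    ("bob", "ENJOYS", "skiing"),
    ("bob", "VISITED", "Kenya"),
    ("bob", "SISTER", "dee"),
    ("cat", "DISLIKES", "Luigi's"),
    ("dee", "DISLIKES", "Noodle Bar"),
]


def targets(source, relationship):
    """Follow one kind of edge outwards from a node."""
    return {t for s, r, t in edges if s == source and r == relationship}


def disliked_by_sisters():
    """Restaurants disliked by sisters of over-40 skiers who have visited Kenya."""
    result = set()
    for name, props in people.items():
        if (props["age"] > 40
                and "skiing" in targets(name, "ENJOYS")
                and "Kenya" in targets(name, "VISITED")):
            for sister in targets(name, "SISTER"):
                result |= targets(sister, "DISLIKES")
    return result


print(disliked_by_sisters())
```

Each clause of the question becomes one hop along a named relationship, which is why the query stays short compared with joining several many-to-many tables.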
Examples: eXist, Oracle, MarkLogic
Typical applications: Publishing
Strengths: Mature search technologies, Schema validation
Weaknesses: No real binary solution, easier to re-write documents than update them
Example application: A publishing company that uses bespoke XML formats to produce web, print and eBook versions of their articles. Editors need to quickly search either text or semantic sections of the markup (e.g. articles whose summary contains diabetes, where the author's institution is Liverpool University and Stephen was a revising editor at some point in the document history). They store the XML of finished articles in the XML database and wrap it in a readable-URL web service for the document production systems. Workflow metadata (which stage a manuscript is in) is held in a separate RDBMS. When system-wide changes are required, XQuery updates are used to bulk-update all the documents to match the new format.
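The editors' search can be approximated with Python's standard `xml.etree.ElementTree` module, as sketched below. An XML database would express this as a query in its own language (typically XQuery) rather than application code; the element and attribute names and the second article are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; element and attribute names are invented for this sketch.
ARTICLES = """
<articles>
  <article id="a1">
    <summary>New findings on diabetes management.</summary>
    <author><institution>Liverpool University</institution></author>
    <history><revision editor="Stephen"/><revision editor="Maria"/></history>
  </article>
  <article id="a2">
    <summary>A survey of diabetes screening.</summary>
    <author><institution>Leeds University</institution></author>
    <history><revision editor="Stephen"/></history>
  </article>
</articles>
"""


def matching_articles(xml_text):
    """Return ids of articles that satisfy all three conditions from the example."""
    root = ET.fromstring(xml_text)
    matches = []
    for article in root.findall("article"):
        summary = article.findtext("summary", default="")
        institution = article.findtext("author/institution", default="")
        editors = {rev.get("editor") for rev in article.findall("history/revision")}
        if ("diabetes" in summary.lower()
                and institution == "Liverpool University"
                and "Stephen" in editors):
            matches.append(article.get("id"))
    return matches


print(matching_articles(ARTICLES))
```

The search mixes free text (the summary) with structure (the institution and the revision history), which is the combination the markup makes possible.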
Distributed Peer Stores
Examples: Cassandra, HBase, Riak
Typical applications: Distributed file systems
Strengths: Fast lookups, good distributed storage of data
Weaknesses: Very low-level API
Example application: You have a news site where any piece of content (articles, comments, author profiles) can be voted on and an optional comment supplied on the vote. You create one store per user and one store per piece of content, using a UUID as the key (generating one for each piece of content and user). The user's store holds every vote they have ever made, while the content "bucket" contains a copy of every vote that has been made on the piece of content. Overnight you run a batch job to identify content that users have voted on, and you generate a list of content for each user that has high votes but which they have not voted on. You then push this list of recommended articles into the user's "bucket".
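The bucket layout and the overnight job might look roughly like the sketch below. Dictionaries stand in for the distributed stores, and the function names, the vote structure and the `min_total` threshold for "high votes" are assumptions chosen for illustration.

```python
import uuid
from collections import defaultdict

# Stand-ins for the per-user and per-content stores ("buckets"), keyed by UUID.
user_votes = defaultdict(dict)      # user id -> {content id: vote}
content_votes = defaultdict(dict)   # content id -> {user id: vote}
recommendations = {}                # user id -> list of recommended content ids


def cast_vote(user_id, content_id, score, comment=None):
    """Write the same vote to both buckets so each can be read on its own."""
    vote = {"score": score, "comment": comment}
    user_votes[user_id][content_id] = vote
    content_votes[content_id][user_id] = vote


def nightly_batch(min_total=2):
    """For each user, recommend highly voted content they have not voted on."""
    totals = {cid: sum(v["score"] for v in votes.values())
              for cid, votes in content_votes.items()}
    popular = [cid for cid, total in totals.items() if total >= min_total]
    for uid, votes in user_votes.items():
        recommendations[uid] = [cid for cid in popular if cid not in votes]


alice, bob, carol = (str(uuid.uuid4()) for _ in range(3))
article, profile = (str(uuid.uuid4()) for _ in range(2))

cast_vote(alice, article, 1, comment="Well argued")
cast_vote(bob, article, 1)
cast_vote(carol, profile, 1)
nightly_batch()
print(recommendations[carol] == [article])
```

Storing each vote twice trades extra writes for reads that never need a join: the user's history and the content's tally are each a single lookup.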
Examples: Oracle Coherence, db4o, ObjectStore, GemStone, Polar
Typical applications: Finance systems
Strengths: Matches OO development paradigm, low-latency ACID, mature technology
Weaknesses: Limited querying or batch-update options
Example application: A global trading company has a monoculture of development and wants trades done on desks in Japan and New York to pass through a risk checking process in London. An object representing the trade is pushed into the object store, and the risk checker is listening for the appearance or modification of trade objects. When the object is replicated into the local European space the risk checker reads the trade and assesses the risk. It then rewrites the object to indicate that the trade is approved and generates an actual trade fulfilment request. The trader's client is listening for changes to objects that contain the trader's id and updates the local detail of the trade in the client, indicating to the trader that the trade has been approved. The trading system will consume the trade fulfilment request and, when the trade elapses or is fulfilled, feeds back the information to the risk assessor.
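The listener-driven flow can be sketched with a toy in-process object space, shown below. It ignores replication between sites and the fulfilment request entirely; the class, the predicates, the trade fields and the risk limit are all hypothetical and are not the API of any product listed above.

```python
from collections import deque


class ObjectSpace:
    """Toy stand-in for an object store that notifies listeners on writes."""

    def __init__(self):
        self.objects = {}
        self.listeners = []
        self._pending = deque()
        self._dispatching = False

    def subscribe(self, predicate, callback):
        """Call `callback` whenever a written object satisfies `predicate`."""
        self.listeners.append((predicate, callback))

    def put(self, key, obj):
        """Store the object, then deliver notifications in write order."""
        self.objects[key] = obj
        self._pending.append((key, obj))
        if self._dispatching:
            return  # a write made inside a callback is delivered by the outer loop
        self._dispatching = True
        try:
            while self._pending:
                k, o = self._pending.popleft()
                for predicate, callback in list(self.listeners):
                    if predicate(o):
                        callback(k, o)
        finally:
            self._dispatching = False


space = ObjectSpace()
trader_screen = []


def risk_checker(key, trade):
    """Assess a pending trade and rewrite it with the decision (hypothetical limit)."""
    status = "approved" if trade["amount"] <= 1_000_000 else "rejected"
    space.put(key, {**trade, "status": status})


def trader_client(key, trade):
    """Update the trader's local view when one of their trades changes."""
    trader_screen.append((key, trade["status"]))


space.subscribe(lambda t: t["status"] == "pending", risk_checker)
space.subscribe(lambda t: t["trader_id"] == "T1", trader_client)

space.put("trade-1", {"trader_id": "T1", "amount": 250_000, "status": "pending"})
print(trader_screen)
```

Neither component calls the other directly: the risk checker and the trader's client only read and write objects, and the store carries the changes between them.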