The definitive guide to NoSQL databases
Today, where the ability to handle large volumes of diverse types of data is a given, NoSQL databases have become business-critical. This prescient book is packed with ideas and explanations that are as valuable today as when it was first published.
[Podcast] The rise of NoSQL
In the past decade, NoSQL has gone from being an interesting experiment to becoming business critical. We catch up with Martin Fowler and Pramod Sadalage, co-authors of NoSQL Distilled, to understand why the database technology took off and where it’s proven its capabilities in the enterprise and how thinking around issues such as persistence models has evolved.
Rebecca Parsons: Hello, everyone. Welcome to the Thoughtworks Technology Podcast. My name is Rebecca Parsons. I'm one of your co-hosts, and I'm here with Zhamak Dehghani. Hello, Zhamak.
Zhamak Dehghani: Hi, Rebecca. Hello, everyone.
Rebecca: We are joined today by two guests, Martin Fowler and Pramod Sadalage, both of whom have been with us before. Today, they're going to be talking about a book that was published just under 10 years ago called NoSQL Distilled. Martin, Pramod, thank you for being here and welcome.
Martin Fowler: Happy to be here.
Pramod Sadalage: Thank you, Rebecca and Zhamak.
Rebecca: Let's start with something simple. What does NoSQL really mean? What is the scope of what you're talking about in this NoSQL Distilled book?
Martin: Well, starting with what NoSQL means. Originally, it was no more than a hashtag for a meetup. A bunch of people wanting to get together and set up a meetup wanted some little short hashtag for the meeting, and they picked #NoSQL, but rapidly, that turned into a whole movement. This was around the late 2000s, with people exploring storing data in something other than a relational database.
I can't remember who was at the very first meetup, but it was a number of people, including people from databases whose names we're still familiar with, and some that died a death. Then other people got in on the scene as well and said, "Hey, we're interested in non-relational databases and stuff." That was the whole NoSQL thing. There was controversy at the time about whether it meant "no SQL," as in negative to SQL, or N-O SQL, meaning "not only SQL," which doesn't really matter because most people who talked about a NoSQL database were explicitly saying something other than relational.
What we were interested in doing with the book was just trying to explain what we saw going on in that space, particularly from the background of people who were actually quite comfortable with relational technology. Pramod knows more about databases than most people who've ever lived, I think. I've always been comfortable with relational databases, but we were also aware of the limitations, and the fact that sometimes there was an alternative that we could consider. What we wanted to do was provide a brief guide to what that space looked like at that time.
Pramod: That's a good summary, Martin. There were also a bunch of projects we were doing at that time, if I remember correctly. We were exploring; MongoDB had just come around, there was a graph DB, Neo4j, and we had some projects and we were thinking, "What are the design trade-offs of using one or the other? What situations would drive you to use something like a document database or something like a graph database, and how would you make those choices?" Would you just give up on relational databases and pick up, say, some document database, or is there some use case where I would use both?
One would probably be for storing financial transactions, and the other probably to store some content that just shows up on the webpage and things like that. That's where it started. Then the whole notion of NoSQL is just half of the title. The other half is the Polyglot Persistence angle, which is, "Should I just stick with one or use more than one type of database in a given enterprise or in a given application?" That's what we were trying to explore and show. At that time, there were four major ones: key-value, document, column-family, and graph databases. We explored those four different types of databases, how you would use them, when you are able to use them, and things like that.
Martin: I would say the Polyglot Persistence is really the key point. The view that was pretty commonly held before then was that whatever data you have, you stick it in a relational database. Your company's bought Oracle or DB2 or whatever, stick everything in there. Rather, you should think about what would be the best data store for the problem. What's the right data model? The relational model is a great data model for lots of data, but not for others. What is the right fit of data model? Then what's my best access pattern?
Particularly, at that time, as we were shifting into a world where instead of a single big server handling all our data, we were having this idea of lots and lots of servers often in unreliable situations, a much more distributed situation. Relational databases certainly at that time were not built to go across a distributed unreliable network. Many of these NoSQL databases were explicitly designed for that kind of situation.
It requires people to think much more about what we want to do with data and also the fact that even within a single project scope, you might have some data you might want to store some way and some data you might want to store another way, because of how you want to model it and how you want to access it, how available it has to be, what your consistency requirements are, et cetera.
Rebecca: Well, that's one of the things that I always felt when you look at the way persistence, in general, has matured or evolved. There were object databases, but instead of object databases actually becoming legitimate contenders in their own right, basically, everybody just said, "Okay, let's write an object-relational mapper because, of course, our data still has to end up in a relational database." It felt like something shifted that thinking away from, "Okay, I'll just put a layer between how I want to represent my data and the relational model to allow me to continue to use that."
What changed was that need for massively distributed databases and the credibility, if you will, that came from many of the internet giants saying, "Actually, no. I'm not going to use the relational model to store my stuff because I can't get the level of scale." How much do you think it was the credibility that came along with the Googles and the Amazons looking at a different model? Do you think that was why that came to be, or was it really just the technological limitations of not being able to get the scale across the networks, et cetera?
Pramod: I would say maybe two different things came around at the same time. One is the Googles and the Amazons and Yahoo and related companies putting out papers showing how these things can be done, like the whole Hadoop and MapReduce and all the other related stuff, the file systems and things like that, about which papers were written. There is the CAP theorem that was famously written about at that time, where you could also argue about which two of consistency, availability, and partition tolerance you care about and things like that.
The other, I would say, a subtle design revolution in some ways, was the notion of don't reach into other people's databases. That was one of the reasons why object databases and things didn't really take off: they were all talking to the same thing and people were trying to see how other objects could access those databases and things like that. One of the patterns, like enterprise database-based integration, that Gregor Hohpe talks about in his book was the concept that everything can have its own data store and integrate via APIs and not via talking to a database. Once you take on that philosophy, it liberates you from the concept of having to make this database readable for everyone.
Instead, it could just be the API that is readable for everyone, and I can use whatever I want behind the scenes. It could be an object database, or a graph database, or something else. As long as I can provide the data using an API, then I'm free to manage that data however I want to. I think big companies, internet giants, actually exploited that concept itself to get to a point where they can give out data using an API instead of other services reaching into the databases directly.
Because I think one of the powers of relational databases is that if anybody actually wants to explore your database, they can, without knowing the metadata, without knowing your schema, without knowing-- They can literally go explore and find out things on their own, and that's a power of relational databases, but at the same time, it also stops you from doing other things at the data layer. Martin?
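As an illustration of the point above about integrating through an API rather than through a shared database, here is a minimal Python sketch. Every name in it (`CatalogStore`, `InMemoryKeyValueStore`, `InMemoryTableStore`, `CatalogApi`, `demo`) is hypothetical, and the in-memory classes only stand in for real databases; the speakers do not describe any particular implementation.

```python
from typing import Optional, Protocol


class CatalogStore(Protocol):
    """Hypothetical storage contract that only the owning service sees."""

    def get(self, item_id: str) -> Optional[dict]: ...

    def put(self, item_id: str, item: dict) -> None: ...


class InMemoryKeyValueStore:
    """Stand-in for a key-value database: one lookup per key."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, item_id: str) -> Optional[dict]:
        item = self._data.get(item_id)
        return dict(item) if item is not None else None

    def put(self, item_id: str, item: dict) -> None:
        self._data[item_id] = dict(item)


class InMemoryTableStore:
    """Stand-in for a relational table: rows with an id column."""

    def __init__(self) -> None:
        self._rows: list = []

    def get(self, item_id: str) -> Optional[dict]:
        for row in self._rows:
            if row["id"] == item_id:
                return {k: v for k, v in row.items() if k != "id"}
        return None

    def put(self, item_id: str, item: dict) -> None:
        # Replace any existing row with the same id, then append the new one.
        self._rows = [r for r in self._rows if r["id"] != item_id]
        self._rows.append({"id": item_id, **item})


class CatalogApi:
    """The only thing other teams call; the store is an internal choice."""

    def __init__(self, store: CatalogStore) -> None:
        self._store = store

    def add_item(self, item_id: str, name: str, price: float) -> None:
        self._store.put(item_id, {"name": name, "price": price})

    def describe(self, item_id: str) -> str:
        item = self._store.get(item_id)
        if item is None:
            return f"{item_id}: not found"
        return f"{item_id}: {item['name']} at {item['price']:.2f}"


def demo() -> list:
    """Run the same calls against both stores and collect the answers."""
    answers = []
    for store in (InMemoryKeyValueStore(), InMemoryTableStore()):
        api = CatalogApi(store)
        api.add_item("sku-1", "kettle", 24.5)
        answers.append((api.describe("sku-1"), api.describe("sku-2")))
    return answers
```

Callers of `CatalogApi` get identical answers from either store, which is the freedom described above: the storage model can change without consumers noticing.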
Martin: Yes, when you say that, it reminds me I was at a Foo Camp workshop in the mid-2000s and Jeff Bezos was there. He was less godlike then than he is now, of course, but I remember him distinctly saying something like, "Oh, for 80%, 90% of what we do at Amazon, DBM is fine, we don't need Oracle." DBM, for those who don't know, was a really basic key-value store that came with Unix operating systems since about the year dot.
If you think about it, for a lot of what organizations like that do, a key-value store is actually, most of the time, what you want. If you want to look up an item in the catalog, with your key-value store you get the key and go to the item in the catalog. Want to find out about an order? Get the order number, look it up in a key-value store. In fact, we've seen that one of our best examples of using NoSQL databases is with a big Amazon-like retailer in Europe, where we are using Mongo and it was very effective in that situation. Because, again, a key-value store in that situation works really well for the problem you have to hand.
I also think it's no coincidence, from the outside, that one of the things that Amazon is well known for doing was, of course, hiding its data storage behind services and going with APIs and service-level integration rather than using the database as an integration mechanism. In fact, one of the things I think at Thoughtworks that pretty much all of our senior technologists have been battling against the whole time I've been here is the problems that occur when people use a database as an integration mechanism. Shared databases are really a terrible integration route, but unfortunately, a lot of organizations went down that path during the '90s and '00s.
Zhamak: Perhaps another underlying trend that hasn't stopped or slowed down has been digitalization. Every touchpoint, every process, everything that we do is turning those interactions into data, and data of a very diverse nature as well. The scale and diversity of the data has evolved in a way that we can't just put everything in a relational structure, and we started exploring nested trees and documents and graphs and relationships and time series.
It's just, this has grown into so many different fragments of non-relational expression of data. I really liked, Martin, what you said there, which was that you separated the two concerns. One was the storage, how we are storing and modeling the information in different modes, and then the Polyglot nature of storage. Then also the multimodal access, how we are actually accessing the data.
If I may reflect back on the title that you chose, NoSQL, I think we continue to grow different modes or modalities of storage and modeling of the data, but it seems like we're also at the same time converging back on SQL as an interface. I wonder what your observations are on whether SQL is just this evergreen way of accessing data, and what does that mean? What is that telling us?
Pramod: At least the way I think about it, it's telling us that SQL is this ubiquitous language. Other than the whole notion of storing data in a relational model, SQL was always a query language that was very ubiquitous and that anybody could learn. In one of the talks, I showed my daughter typing on a keyboard and I said she's typing select stuff from some table because, if nothing else, people can at least do a select stuff from a table without training and without anything.
There are so many tools that support the concept of SQL, like for reporting, analytics, and a bunch of things. What's happening is a lot of database providers, either relational or non-relational, are trying to provide that interface so that it's easy to use. Underneath, the storage may be something else. CQL, the Cassandra Query Language, is not necessarily standard SQL, but it's trying to mimic SQL so that the usage is easy for people to pick up and that kind of stuff, while the storage is still the column-family storage that it has.
Similarly, there are lots of other types of products that have come about that are trying to mimic the ease of use of SQL while still providing the flexibility of storage and the flexibility of distributing your data at the same time. I think they're trying to ease the transition for a developer or an analytics person or a data analyst, to make it easier for them to move to a different product or a different technology while at the same time giving them options on storage and things like that.
Zhamak: SQL is simple, but how far will it take you? At the end of the day, it is an algebraic language. We're just running algebra, and you see this transition from mathematics to computation, from writing this kind of statement that can get so nested and hairy and hard to understand, to writing algorithms and programs that process the data. SQL just seems to be a nice tool to use, but it won't be the only tool to use because with that simplicity comes a certain level of limitation as well.
Martin: I have mixed feelings about this actually though. I think SQL is very good at handling a certain shape of query but it breaks down really badly as you begin to move outside its area, particularly when you start having nested queries and things like that. SQL can get horrendously complicated. People who are good programmers struggle with complicated SQL expressions. One of the things that really reminds me of this is I've been doing a bit of data analytics using the R programming system. R has this library, very vital library to using R properly called dplyr, D-P-L-Y-R, which basically allows you to build pipelines of operations on tables.
Some of those operations are familiar to any programmer who does list processing with filters and maps and reduces; it has those, but it also allows you to do joins. You can construct really powerful expressions by using pipelines in this way. I find it way easier to work with than SQL for more complicated cases. If I'm doing a simple filter-project operation, yes, SQL can work quite nicely, and probably with maybe one level of grouping. That's not too bad, but when you do anything more complicated, then things start breaking apart and the pipeline approach begins to be a lot more attractive.
What's true in both those cases is you're operating on relational data. If you're operating on what are effectively tables, then the mechanisms and the way that you can combine from different places by using joins is a really nice mental model to operate with. Of course, when you're not working with something that goes nicely with tables, like hierarchic structures, then suddenly things, again, start flying apart. One thing that relational databases have never been very good at is dealing with hierarchies, like parts breakdowns and things like that. They struggle because the model doesn't fit the data.
I think a large part of this is recognizing where the model fits the data, as well as whether you're using SQL or some other mechanism in order to assemble that data. Of course, we have other query languages now like GraphQL, things of that kind. And how much data access these days is key-value lookup? Because that's still one of our best ways to get at a piece of data.
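To make Martin's comparison of a single SQL statement with a pipeline of table operations concrete, here is a small sketch in Python rather than R. It is not dplyr: the helpers `pipe`, `where`, `total_by` and `order_by`, along with the sample `ORDERS` data, are hypothetical stand-ins for the kind of steps he describes, and joins are left out for brevity. The SQL side uses the standard-library `sqlite3` module.

```python
import sqlite3

# Hypothetical sample rows: (customer, status, amount).
ORDERS = [
    ("ann", "shipped", 80.0),
    ("ann", "shipped", 40.0),
    ("bob", "shipped", 50.0),
    ("bob", "cancelled", 500.0),
    ("cy", "shipped", 200.0),
]


def totals_with_sql(orders):
    """Filter, group and filter again in a single SQL statement."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, status TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
        query = """
            SELECT customer, SUM(amount)
            FROM orders
            WHERE status = 'shipped'
            GROUP BY customer
            HAVING SUM(amount) > 100
            ORDER BY SUM(amount) DESC
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


def pipe(value, *steps):
    """Feed a value through each step in order."""
    for step in steps:
        value = step(value)
    return value


def where(predicate):
    """Step that keeps only the rows matching the predicate."""
    return lambda rows: [row for row in rows if predicate(row)]


def total_by(key, amount):
    """Step that groups rows by key and sums the given amount."""
    def step(rows):
        totals = {}
        for row in rows:
            totals[key(row)] = totals.get(key(row), 0.0) + amount(row)
        return list(totals.items())
    return step


def order_by(key, descending=False):
    """Step that sorts rows by the given key."""
    return lambda rows: sorted(rows, key=key, reverse=descending)


def totals_with_pipeline(orders):
    """The same query as a sequence of small table-to-table steps."""
    return pipe(
        orders,
        where(lambda o: o[1] == "shipped"),
        total_by(key=lambda o: o[0], amount=lambda o: o[2]),
        where(lambda t: t[1] > 100),
        order_by(lambda t: t[1], descending=True),
    )
```

Both functions return the same rows for this data. At this size the SQL is arguably just as readable; the pipeline form tends to pay off when more steps are chained, since each step can be named, tested and reordered on its own.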
Rebecca: The book was published in late 2012/2013. It's been around for a while and the persistence landscape has changed quite a bit. What do you think is different now and how might that affect the trade-offs that people are making with respect to the choices they're making about where I am going to persist a particular piece of information and how will I go about retrieving it and using it?
Pramod: Back in maybe 2010 to '12, before the book was actually written, we were dealing with this stuff. Cloud providers had relational databases as a service, but not as mature. At the same time, we had the NoSQL database providers, also not in the cloud, I might add. The choices were vast and there were many trade-offs to be talked about, like do I install this myself? Do I run this myself? Is it in my own data center or do I run it on a cloud? Similarly, on the relational side, on the admin side in the cloud, there were also lots of questions about whether this is mature enough and things like that.
Since then, the cloud providers have come up with a bunch of options that you can talk about, either in the relational world or in the non-relational world. If you look at Azure, there's Cosmos, available as a column store or a graph store and that kind of stuff. On the AWS side, we have Aurora and Redshift and Neptune, different types of databases available, while the NoSQL providers have also come up with database as a service. Like if you want a graph, you can do Aura, I think that's what it's called, available now as a database as a service, and the same thing with MongoDB Atlas.
Similarly, for larger data sets nowadays, we have Snowflake available for us and things like that. These choices, like I say, increase the number of options. At the same time, I think people are also thinking about the trade-offs in a much different way. I can get the same data scalability and things like that, and it comes as a service, so the default choice now, if you're doing something new, is to go to a cloud provider. Whether you want SQL or NoSQL or non-relational, that's a little bit of a design choice and things like that.
Like, what do I want to store, like Martin was saying? Is it a key-value store or do I just want to store a document? Or do I just use S3 as a storage layer and put a processing engine on top? All these choices are making it easier to think about what it is that we want to do. The other thing also available nowadays is a much more resilient SQL store; CockroachDB is a very good example of that. Even in AWS, you could use MySQL with hyperscale or, I think, globally available tables and things like that, where you can say, if I use this database, I want this table to be globally available, and they take care of all the distribution and all that kind of stuff.
That gives many more choices, even if you stick with a given storage pattern, or a given relational or non-relational model. I would say the choices have increased enormously, but at the same time, people are thinking within the same cloud provider. Like if I'm using AWS for all of my needs, I'm probably sticking within the AWS ecosystem to get all my work done. That probably reduces the number of choices, but it also makes design choices a little easier.
Zhamak: With that diversity, of course, there's complexity in making decisions. You have more choices, you're paralyzed over which choice to make. I wonder what's going on in the shareability of data and interoperability. There's one decision that we talked about in microservices: you choose the storage of your choice, one that suits the structure and model of your data, to put the data in, and then expose APIs.
The APIs would allow sharing the data, so you don't have to integrate through the database. I wonder, are we getting better at exposing and externalizing data and sharing data across these very diverse models? Or are we actually in trouble and will end up converging on one model because it's just so hard to share different models of data? What has your experience been of the impact of diversity and Polyglot storage on data sharing, and some of the trends perhaps that we're seeing?
Pramod: Sure. Nowadays, whenever you think about platforms, or people that are building platforms and talking about this kind of stuff, there's usually a lot of thought on, what is my data catalog? How do I want to build a data catalog? What are my API standards? How am I going to share the data? Is it JSON, Avro, or some other format? There's a lot of upfront thought being put into how do I want to share data? What are the standards around that?
Like Martin famously says, just because it's a schemaless database doesn't mean there is no schema. The schema is there; you have to figure out how to share that schema and come up with standards for those things. I see in the architects' world a lot of thought being put into how do I share this? How do I create standards? How do I expose this data? Things like that.
I think there is work going on. At the same time, some of it is not mature and it's very easy to fall into the trap of, I build my own stuff, and then other teams have no idea how to figure out what data I'm giving out, what the format is, what the aggregates are, at what level they are aggregated, and things like that. There is still a lot of work to be done.
I think at the same time, people are trying to put in the effort to create data catalogs, make it easier to share data, make it available, or even create events when things happen so that other consumers can consume the events without having to go and ask what has happened.
Zhamak: It looks like there is a trend as we move toward using data beyond operational or transactional applications and using data for analytics and training machine learning models: data sharing becomes more and more important beyond your operational RESTful APIs. As you said, that means enabling the discoverability of the data and also thinking upfront about a standardized way of sharing data. Some of these standards are being adopted, like Parquet and Avro and formats like that, but there is definitely a space where we can do better.
Pramod: Yes. Especially the duality: the operational data store stores the data and does things, but it's like a state machine, and what analytics needs is every change in state. It doesn't need the end state, it needs every change in the state. How do I give that data? Do I keep it inside my API and, when someone asks, give them an object and all its state changes, or do I tell someone, "Hey, this object changed its state and I only keep the latest state"? That's a decision someone has to make, and that decision then leads down the path of, "Does the API keep the history, or does the API event out the history and always keep the current copy?"
That, I think, is a good point to bring out about design trade-offs: where do I make the decision? That decision probably leads you to think about the concept of object history, event history and stuff like that being maintained in the operational store or the operational system, or the operational system offloading that to some other place, and that system keeps track of all the other things associated with that object.
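The two options Pramod describes, keeping history inside the operational system or eventing every change out and keeping only the latest copy, can be sketched as follows. This is a minimal illustration under assumptions: `StateChange`, `LatestStateService` and `HistoryKeepingService` are hypothetical names, and a plain Python callable stands in for whatever event channel a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class StateChange:
    """One transition of one object; analytics wants all of these."""
    object_id: str
    old_state: Optional[str]
    new_state: str


class LatestStateService:
    """Keeps only the current state and events out every change."""

    def __init__(self, publish: Callable[[StateChange], None]) -> None:
        self._current: dict = {}
        self._publish = publish  # hypothetical hook to an event channel

    def set_state(self, object_id: str, new_state: str) -> None:
        old_state = self._current.get(object_id)
        self._current[object_id] = new_state
        self._publish(StateChange(object_id, old_state, new_state))

    def get_state(self, object_id: str) -> Optional[str]:
        return self._current.get(object_id)


class HistoryKeepingService:
    """Alternative: the operational store itself retains every change."""

    def __init__(self) -> None:
        self._history: dict = {}

    def set_state(self, object_id: str, new_state: str) -> None:
        changes = self._history.setdefault(object_id, [])
        old_state = changes[-1].new_state if changes else None
        changes.append(StateChange(object_id, old_state, new_state))

    def get_state(self, object_id: str) -> Optional[str]:
        changes = self._history.get(object_id, [])
        return changes[-1].new_state if changes else None

    def get_history(self, object_id: str) -> list:
        return list(self._history.get(object_id, []))
```

For example, constructing `LatestStateService(analytics_log.append)` with an ordinary list named `analytics_log` leaves the service holding only the latest state while the list accumulates every transition. With `HistoryKeepingService`, the same transitions are served from `get_history` instead.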
Rebecca: One of the other things that became much more of a topic of conversation, at least, as we looked at these different persistence models, was this whole notion of eventual consistency. With relational databases, you didn't have to talk about eventual consistency. It was either there or not. I wonder if you can talk a little bit about how these more nuanced conversations about eventual consistency and in particular, where you have some aspects of your application that, in fact, do have to have the hard transactional boundaries, whereas others can deal more readily with this state of eventual consistency. Have we gotten better at dealing with that, or are people still afraid of this idea of eventual consistency?
Pramod: There are still discussions where, when we talk about eventual consistency, people are afraid of taking the choice back to the product people or back to the business and saying, "Hey, for five minutes this may not be consistent." Some teams are afraid of making that statement, but having said that, nowadays you can see higher usage of search engines, for example, Elasticsearch or Solr-backed search engines, and people are okay that if I create something and immediately search, it may not show up or things like that.
There is eventually a little bit of acceptance, I would say, of the fact that stuff may not show up on time or things like that. One of the good things I heard somewhere is that even humans deal with eventual consistency all the time. Not just computer systems, humans deal with it all the time. It's easier to convince people with human examples instead of just talking about system examples. That's what I think we should be doing as architects, to convince people that humans also are eventually consistent ultimately, and that I think is a good thing.
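The search example, where something you just created may not show up in an immediate search, can be simulated in a few lines. This is a toy sketch under assumptions: `ProductStore` and its methods are hypothetical, and `refresh_index` stands in for whatever background process a real search engine uses to catch up.

```python
from collections import deque


class ProductStore:
    """Hypothetical primary store with an asynchronously updated search index."""

    def __init__(self) -> None:
        self._primary: dict = {}
        self._search_index: dict = {}
        self._pending: deque = deque()

    def create(self, product_id: str, name: str) -> None:
        # The write is acknowledged once the primary copy is updated;
        # the index update is only queued.
        self._primary[product_id] = name
        self._pending.append((product_id, name))

    def get(self, product_id: str):
        """Read by key from the primary copy: always sees the latest write."""
        return self._primary.get(product_id)

    def search(self, word: str) -> list:
        """Read from the index: may lag behind the primary copy."""
        return sorted(
            pid for pid, name in self._search_index.items() if word in name
        )

    def refresh_index(self) -> None:
        """Simulates the background job that brings the index up to date."""
        while self._pending:
            product_id, name = self._pending.popleft()
            self._search_index[product_id] = name
```

Right after `create`, a lookup by key succeeds while `search` still returns nothing; once `refresh_index` has run, the two views agree. How long that gap may last is the kind of question that goes back to the business.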
Martin: Yes. I've always argued the trade-off between consistency and responsiveness, because it's usually that, is a business decision, and there are many situations where you've got to say, "Okay, how does your business want to respond? Do you want to allow a little inconsistency in order that you can take that booking of a hotel room, even though it might be the last hotel room and it gets double booked? Do you want to accept it anyway and assume we'll deal with it later, or do you not?"
These are often business choices, not technical choices, and we have to deal with that as we deal with everything else: better communication with the business side. In many ways, eventual consistency has always been a feature of businesses, it's just not been acknowledged so much because it's been something outside of the realms of the data. Now, as IT spreads itself more deeply into every part of a business operation, we can't ignore that trade-off.
Zhamak: I know in your book, you briefly touched on different modes of storage and different modes of modeling data. One of those is graph databases, which is very close to my heart. I love graph modeling. Can you share what your experience has been with graph databases and their application, the modes of using graphs or articulating graphs, and guide the audience on when to apply them?
Pramod: Sure. Graph databases are, in some ways, a little harder to understand. How do you model, like, is everything a node or is everything an edge or should I put properties on edge or is the property a node? It's a tricky thing to model and it takes some learning, the learning curve to graph modeling is a little difficult. In the beginning, you may say, oh, all the higher-level entities are nodes and all the relationships between those things are edges.
Then at some point later you figure out that, oh, I need properties on these edges. Then people start putting properties on the edges themselves, which is a little cumbersome and doesn't carry enough queryability from the graph side. Then eventually, when you mature, you'll start thinking about, oh, those properties on the edges, do they need to become nodes? Things like that.
I would encourage people to do one or two sample applications or POCs to figure out how much maturity you have in the modeling. Many times, once you get to a mature state, everything starts looking like a graph. That's the trick with graph databases: everything starts looking like a graph. Then at that point, you literally have to stop yourself at some point, like, I don't want to convert my financial application into a graph database.
I think there are many ways. One of the good ways I generally think about it is, do I care about the interactions between the entities? Do I care about how these entities interrelate in their interactions? Customer A made a payment to customer B, and customer B also got a payment from customer C, and they were for products X, Y, Z, whatever they are. If you care about the interactions happening here, like what kind of product did customer A buy, what kind of product did customer B buy, and what are the relationships? Maybe it's timing, maybe it's location. If you are interested in those relationships, it's good to think about that in graph terms.
If you don't care about it, if you only care about how many customers bought product A, then probably you don't care about a graph. You have to think about those relationships between the entities and do you care about those relationships? Then I think graphs become more powerful because you can traverse on a graph, you can query the relationships and things like that.
For relational databases, even though they're called relational databases, and this is one of the things we talk about in the book and even in talks, the relations are not explicit. You cannot traverse relationships in a relational database; they are created and you have to join them to travel on them. You can't make queries on those relationships as such. Think about those when you are trying to think about a graph database.
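The contrast between joining rows and traversing explicit relationships can be sketched with the payment example from above. This is an illustration only: the `PAYMENTS` data and the functions `two_hop_chains_by_join`, `build_graph` and `reachable_from` are hypothetical, and a dictionary of adjacency lists stands in for a graph database.

```python
from collections import deque

# Hypothetical payment records: (payer, payee, product).
PAYMENTS = [
    ("A", "B", "book"),
    ("C", "B", "lamp"),
    ("B", "D", "desk"),
]


def two_hop_chains_by_join(payments):
    """Relational style: a chain only appears once rows are joined on matching values."""
    return [
        (first[0], first[1], second[1])
        for first in payments
        for second in payments
        if first[1] == second[0]
    ]


def build_graph(payments):
    """Graph style: each payment becomes an explicit edge carrying its own property."""
    graph = {}
    for payer, payee, product in payments:
        graph.setdefault(payer, []).append((payee, {"product": product}))
        graph.setdefault(payee, [])
    return graph


def reachable_from(graph, start):
    """Follow payment edges outward from one customer, any number of hops."""
    seen = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour, _props in graph[node]:
            if neighbour not in seen and neighbour != start:
                seen.append(neighbour)
                queue.append(neighbour)
    return seen
```

The join version needs one extra self-join for every additional hop, whereas the traversal follows edges for as many hops as exist. That difference is what makes graph terms attractive once the relationships themselves are what you want to query.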
Rebecca: Since the book was published, we've talked a little bit about how the additional cloud database technologies have come out. For somebody just getting into this, what's your advice on how to make sense of the landscape? Where should people start in thinking about getting a handle on the diversity of approaches for persistence and for data modeling and for querying? Where do you begin?
Pramod: Oh, that's a tough question, Rebecca. I would start at the basics. I think in the book we also mention in the beginning to think about domain design. What are your aggregates? How are you going to read these aggregates? How are you going to write these aggregates, and things like that. Sometimes that will lead you to, do I need a key-value store, do I need a document, do I need a column store, do I need a graph, or do I need a time series and things like that.
Once you answer that question, then you should probably go to, what product do I use for this particular storage model? If you arrive at the answer that I need a document database, then you can say, oh, do I need MongoDB? Or do I need Azure Cosmos? Or do I need ArangoDB, or do I need whatever, right? I would approach it in those steps. Architecturally, what am I saving or what am I trying to process? How am I going to query that persisted object?
Otherwise, sometimes people miss that. They say, oh, I can use the key-value store here, and later on they say, oh, how do I index this value inside the key, so I can query on the value part? Then you probably misread the whole aggregate concept there and you probably need to rethink what kind of aggregate you need. Figure out what kind of model you need first, and then go from there. Many times people are already on some cloud provider, so maybe start your search within that cloud provider and, if it doesn't give you what you want, then expand it further. Martin, do you have any better perspective?
Martin: No, I would say much the same: understand what the shape of the data is and whether that leads you into one of the structures that you mentioned. Understand what the access patterns are, how many people are reading it, how many people are writing it, what kind of demand you need on it; that affects both your choice of the model but also, obviously, the technologies.
Understand the technologies in the platform that you are on. If you're on a particular cloud platform, understand what's there, and make sure that it's the best you have. Again, if you hide your data storage behind APIs, you've got a relatively good range of flexibility that you can use to cope with changes should you need to make changes later on, and also to use different data models for different purposes.
You can store data in, say, an aggregate-style database, like a document database, but then expose some of the data that's in the document as tabular data for when people need to manipulate the tabular side, and then in order to find out which documents they need to look at. There's a lot of power in tables; people like tables. You can see the way people will relentlessly fit into spreadsheets things that should never go inside spreadsheets, just because tables are a natural way that people can see things. That's one of the advantages that the relational model has: the table is a way in which people naturally think of to look at things.
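Two points from this exchange, Pramod's warning about wanting to query on the value in a key-value store and Martin's suggestion of exposing part of an aggregate as tabular data, can be sketched together. The names below (`KeyValueStore`, `save_order`, `load_order`, `orders_for_customer_by_scan`, `tabular_view`) and the order fields are hypothetical, and a dictionary of JSON strings stands in for a real key-value or document store.

```python
import json


class KeyValueStore:
    """Stand-in for a key-value database: values are opaque strings."""

    def __init__(self) -> None:
        self._data: dict = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)

    def values(self) -> list:
        return list(self._data.values())


def save_order(store: KeyValueStore, order: dict) -> None:
    """Store the whole order aggregate under its order number."""
    store.put(order["order_id"], json.dumps(order))


def load_order(store: KeyValueStore, order_id: str):
    """The access pattern the store is good at: one lookup by key."""
    raw = store.get(order_id)
    return json.loads(raw) if raw is not None else None


def orders_for_customer_by_scan(store: KeyValueStore, customer: str) -> list:
    """Querying on the value means decoding every aggregate in the store."""
    orders = (json.loads(raw) for raw in store.values())
    return sorted(o["order_id"] for o in orders if o["customer"] == customer)


def tabular_view(store: KeyValueStore) -> list:
    """Expose a few fields of each aggregate as rows, to find which orders to open."""
    rows = []
    for raw in store.values():
        order = json.loads(raw)
        total = sum(line["qty"] * line["price"] for line in order["lines"])
        rows.append((order["order_id"], order["customer"], total))
    return sorted(rows)
```

If `orders_for_customer_by_scan` turns out to be a frequent query, that is a hint that the aggregate boundary or the storage model deserves a second look. The `tabular_view` rows are the kind of table people can filter before fetching the full document by key.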
Rebecca: Well, thank you both, Pramod, Martin, for joining us again. Thank you, Zhamak, for the insightful questions. I hope you've all enjoyed this discussion as much as we've enjoyed having this discussion. Thank you all.
Martin is an author and a speaker who is particularly interested in how software can be designed so we can easily add useful capabilities for many years. In 2000, he joined Thoughtworks, where his role is to learn about the techniques that we've learned to deliver software for our clients, and pass these techniques on to the wider software industry.
He has written a number of books on software development, including Refactoring and Patterns of Enterprise Application Architecture.