Document-oriented NoSQL Databases: How many Joins will you have to implement?

One of the continuously debated items in context of NoSQL databases is the join operation. Let’s listen in a bit:

and there can be many more variations found on the topic of joins on various levels of technical depth.

So, do we need joins in context of NoSQL databases? Do we do joins implemented by NoSQL databases? Are joins outdated concepts that we can live without in context of NoSQL databases? In this blog I try to rationalize the overarching question in principle. Some fact finding first:

(Database) Data Models and Database Management Systems

Data models, like the relational model, the document-model, the hierarchical model, key-value model, graph model, object-oriented model, XML model, etc., are implementations of data structures in a given database management system. Data models define possible data types and their construction rules for more complex types.

For example, the implementation of a relational model might restrict values in tables to be scalar. Another implementation might allow a table as a value, supporting NF2 relations. One system might support the document-model strictly following the JSON model, while others add additional data types in addition to what JSON defines. Some systems do support the notion of references, other so not. Each database implements a data model in any variation it likes to.

Schemata and Database Management Systems

A schema is a particular extension of a domain model, implemented in context of a data model. For example, a domain model might be suppliers, parts and their relationship. This can be implemented in a relational model, a document model or a graph model or any other supported data model.

There is no ‘best’ way of definition a schema. For the same domain, different schemata can be defined depending on the skill of the creator, the knowledge of query access patterns, the amount of restrictions that should be supervised by the database management system and other factors.

For example, in a document model, suppliers, parts and their relationships can be modeled as three separate documents, or in two documents (suppliers and their relationship to parts), or one document – and there are many more variations possible, of course.

Joins and Database Management Systems

Some database management systems implement the join operation in their query interface, some do not. For example, Oracle, MySQL and FoundationDB implement joins, MongoDB, Oracle NoSQL and Aerospike do not. So joins are not necessarily restricted to the relational data model.

Joins and Data Access Paths

With the fact finding under our belt, how many joins will you have to implement? In principle, this is a function of the required data access based on a specific schema. Different schemata of the same domain will require a different number of joins.

Let’s look at a few examples in the supplier – parts domain.

Example 1: No join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts":[
     {"part_name": "part_lowQual"}, 
     {"part_name": "part_hiQual"}]
}

The query: “find the names of all parts for a supplier” does not require a join as the data is already structured so that each supplier contains the set of all parts it supplies.

Example 2: One join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts": [1, 2]
}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}

The query: “find the identifiers and names of all parts for a supplier” requires a join as a supplier only has the identifiers of the parts it ships, not their names.

Example 3: Two joins required

The documents are structured like this:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}

The query: “find the identifiers and names of all parts for a supplier” requires two joins, one to find the objects for a supplier that relate the part identifier to the supplier identifier, and a second one to find the corresponding parts.

Analysis of Examples

The examples have shown empirically that the need for joins is not a function of the data model (document-oriented in this case), but a function of the data access, aka, the number of required data relationship traversals in context of a given schema. If the relationship to be traversed matches the way the data is structured as in Example 1, no join is necessary. As soon as the data is structured differently from the required traversal by the query, joins are necessary (Example 2 and 3).

So, as summary, it is fairly easy to avoid joins. If, and only if, you can structure your data (aka, build your schema) in such a way that it conforms structurally to the queries then you can avoid joins completely (Example 1). I am certain that there are special cases out there for which you can accomplish that, but in general, this is not possible. And, even if it is possible in production, as soon as analysts start analyzing the data sets, they will most likely query along different access paths.

Joins at Query Time vs. Joins at Insert/Update/Delete Time

Above examples clarified that joins are a function of the data access paths. Can joins at query time be avoided entirely by creating data access paths in a certain way?

Yes, it is possible, however, it is a basic trade-off between data query and data manipulation time: reducing the computational effort at run-time, and instead increasing it during insert / update / delete operations. In principle, joins at query time can be avoided if for each access path there is an equivalent data structure in place.

Example 4: Schema refactoring

The documents in this example look like:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}
{"shipper": "fastShipper", "shipper_id": "SH_01"}
{"part_id": 2, "shipper_id": "SH_01"}

Supplier supply parts, however, shippers ship not any part, but only specific parts (maybe for safety reasons). There can be several queries against this document set:

  • Find all parts supplied by a supplier with a given name
  • Find all parts shipped by a shipper with a given name
  • Find all suppliers and shippers for a part with a given name

Each of these queries requires at least one join. The documents can be restructured easily to avoid joins altogether:

{"supplier": "superQuality", "supplier_id": "S_55",
 "parts": [
     {"part_name": "part_lowQual", "part_id": 1}
]}
{"shipper": "fastShipper", "shipper_id": "SH_01",
 "parts": [
     {"part_name": "part_hiQual", "part_id": 2}
]}
{"part_name": "part_lowQual", "part_id": 1,
 "suppliers": [
     {"supplier": "superQuality", "supplier_id": "S_55"}
 ], 
 "shippers": []}
{"part_name": "part_hiQual", "part_id": 2,
 "suppliers": [],
 "shippers": [
     {"shipper": "fastShipper", "shipper_id": "SH_01"}
]}

The idea is clear: structure the data in such a way that a query can be satisfied with a simple selection. And, the consequence is clear, too: data is duplicated, possibly many times. Which means that an insert, update or delete has to know all the locations where to modify the data and has to modify the data consistently (and ideally within a single transaction).

As a side note, this is the situation that normalization tries to address by ensuring that each data item is only once in the database.

Of course, data duplication will have an impact on the size requirements of main memory an disk space. While there is a change in algorithm complexity, there is also a change in the storage and memory size requirements.

Pre-Joining Data

Pre-joining data allows to avoid joins at query time at the cost of duplicating data at data management time. Alternatively expressed, the implementation of duplication at management time is the cost of avoiding normalization combined with query-time joins.

Is there a way to quantify the effort? In principle, there are as many duplications necessary as joins are to be avoided. This is a rough estimate as many joins are the same except for selection and/or projection specifications. If all joins are abstracted to their join criteria (omitting projection and selection), then this is roughly the amount of duplication required.

The article written by Sarah Mei clearly shows the trade-off between data duplication and joins: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/. She clearly describes many of the issues in context of a specific use case.

“Wait a minute, I don’t have joins and it works anyway!”

But, where are the joins? NoSQL databases that do not implement the join operator in their query interface are in use and production.

If not expressed as query, joins are found either in the application system logic or the interface logic, depending on the design. Most likely these are nested-loop joins or hash-based joins (less likely) or a series of selections with the application logic combining the intermediary query results into the final result data set.

And they are not joins on the complete data set either, but usually have some selection criteria. So the application system logic roughly corresponds to the optimized operator tree of a database query sub-system and in all actuality there might be many joins implemented that way throughout the application logic.

The joins are in fact implemented, just not by using a join operator on the database interface, but inside the application logic. This means that the database cannot optimize the execution, plus there are several queries coming from the application logic putting load on the database system.

And this opens up yet another trade-off: data duplication vs. application logic complexity. If the data is structured in such a way that joins are avoided (at the cost of duplication), then the application logic complexity will be reduced also (from algorithms implementing joins to algorithms issuing queries with selections/projections).

Of course, while the application logic complexity is reduced, the data management logic complexity increased as it has to manage duplicate data consistently across the database.

Summary: Are joins required? Yes. Are joins implemented? Yes.

In my mind there is no question that joins are in general needed and actually implemented today, even if the database does not support a join operator directly and even if there are opinions that joins are not needed. I don’t really understand why there is a discussion about this in the first place as the need for a join is a function of the data schema, not the data model.

The fact that a relational database has the capability of joins does not mean you must use it. And the fact that a NoSQL database does not support joins at their query interface does not mean joins are not needed.

At the heart an architecture and engineering decision has to be made (implicitly or explicitly) of how many joins are implemented through data duplication and how many joins are implemented through algorithms in the application logic layer (if there is not join operator available at the database query interface).

It’s that easy.

 

Document Collections: Size and Processing Complexity

“My collection is 100 GB in size, can you process it?”. What does that mean and what is really significant about it?

Collection Count and Sizes

A collection of 100GB can mean many things. For example, it could be a collection with one document of the size of 100GB. Or a collection of 100 documents, with 1GB in size each.

MongoDB, for example, has a document size upper limit of 16MB (http://www.mongodb.org/display/DOCS/Documents). Documents larger than 16MB have to be split into several documents (or stored outside the document model).

Does this all matter? From a storage perspective probably less as the storage needs to be able to store the data (in this case 100GB). If 100GB is broken down into smaller documents, then this requires more storage due to the additional management data needed, but that is insignificant.

From a processing perspective it definitely matters.

Processing Complexity

One measure of processing complexity is the number of operations executed per document. For example, if a property is read from each document, it takes O(n) operations (one per document), meaning the complexity is linear with the number of documents.

One possible optimization here is if that particular property is indexed. In this case the complexity is sub-linear if the database does not have to access every document for the property value by can use the index instead.

If more complex operations have to be executed for each document (like counting sub-collection documents, or adding up values within a sub-collection), then the complexity increases by a constant factor.

If documents are divided up and have to be joined together before they can be processed (e.g because of size limits or de-normalization), then the processing complexity might increase significantly.

What to ask for?

The storage size matters, but it is not a good measure for the processing effort. What matters for processing effort is the number of documents, existing indexes, as well as any required “combination” due to de-normalization in addition to the actual computation to be performed on documents.

Next time the size is characterized by a storage measure, ask also for the measures that determine the processing complexity.

Relational Data in a Document-oriented NoSQL Database (Part 5): De-normalization

Aren’t data in document-oriented databases de-normalized by definition?

De-Normalization

In a nutshell, de-normalization in a relational database system is the re-structuring of the data in such a way that portions are combined into one relation that would be in separate relations when fully normalized. De-normalization is in some sense the reverse process of normalization.

In relational systems, de-normalization is conceptually unnecessary and actually detrimental to achieving the ideal data model. However, de-normalization comes into the picture because of non-functional reasons like performance or schema change management. Just briefly (and roughly), data stored in one relation are faster retrieved compared to the same data that needs to be joined together first. Changing a schema by adding a column is easier then by creating a new relation, adding foreign key relationships and retrieve the data by join.

Document De-Normalization

Is there actually a concept of de-normalization in document-oriented databases? Yes, there is. Every time a decision is made to store documents into sub-collections of other documents, then a de-normalization decision is made. In principle, the documents could have been stored in a separate collection also (this excludes part-of relationships; see an earlier blog). For example, for each supplier the part it supplies can be in one of its sub-collections. An alternative de-normalization could be the set of suppliers for each part in a sub-collection of the part.

Discussion

So one might ask: ‘So, what? That’s what the document model is made for.’ Yes, the document model inherently allows to build hierarchical structures. As discussed earlier, the part-of relationship is a good example where this applies naturally. However, each de-normalization decision must not be done randomly. In the case of parts and suppliers it might look like that both de-normalized collections as outlined above are fine.

However, that’s not totally true. In the end the document access patterns play an important role. If the suppliers are listed in sub-collections of parts, and a list of suppliers needs to be queried, the de-normalization will cause quite some processing. Not only have all parts documents be searched, but also each supplier has to be included into the result set only once. If suppliers do not have a unique identifier, an index does not help much.

There might be cases where a de-normalization actually does not improve the situation. For example, if parts and suppliers are accessed equally, then only an estimation of query frequency and query result processing will tell if a de-normalization is advantageous at all.

Relational Data in a Document-oriented NoSQL Database (Part 3): Normalization

Relational database systems support the normalization of relational schemas; what about document-oriented databases?

Normalization

Normalization is a data/schema modeling activity that (if fully applied) results in a schema that has no redundancies and the number of dependencies is minimal (see http://en.wikipedia.org/wiki/Database_normalization for a brief discussion).

In very abstract terms, any change in data is only done in one relation and through means of relationships the rest of the schema stays consistent. This means that in order to support a well-normalized schema queries will contain joins to combine data that is spread out over several tables due to the normalization process.

Document Normalization

Given the concept of normalization in relational systems: is is possible or necessary to normalize documents, too?

First of all, it is possible to normalize data that is organized in documents. The naive approach is to build up the documents so that only scalar properties are stored, roughly equating to tables. In addition, in order to establish references between documents, either foreign-key type properties or documents denoting relationships between documents are created. The process of normalization can be followed like in a relational system.

However, in general document-oriented systems lack the join query operator and so combining documents when querying is not possible. Therefore the combination of data has to be done in the application layer (usually resulting in several queries). This is definitely a down-side of trying to normalize data in document-oriented systems.

Furthermore, the idea behind using document-oriented databases is for storing data that ‘naturally’ consists of hierarchical structures that are fetched together (the 80% case) as complete hierarchies and not normalized chunks of data. And example is a blog and its comments. Comments belong to the blog and without the blog they do not make a lot of sense. In general, when a blog is queried, its comments often are required, too, and so having the blog and its comments in one document makes a lot of sense. The blog can be retrieved in one query.

However, there are cases where separating data into different documents might make sense also. For example, a company might build an e-commerce web site and provides a shopping cart. Users add items to shopping carts. A shopping cart might be represented as one document. In this case it does not make a lot of sense to store the complete item description in every shopping cart that has that item in it. It might be better to have the item name and identifier in the cart with a reference to the item description residing in a different document.

Discussion

While in the relational world it is generally advised to normalize a schema, in the document-oriented world the separation is probably driven by the 80/20 rule of data access. Meaning, to structure documents in such a way that 80% of the queries are fast and fetch all required data in one round trip. Over time, more elaborate rules might be established.

In context of MongoDB a very interesting initial discussion can be found in ’50 Tips and Tricks for MongoDB Developers’ (http://shop.oreilly.com/product/0636920019893.do). This book has a few more elaborated rules and it clearly hints that normalization is for sure an important aspect of document-oriented databases.

Relational Data in a Document-oriented NoSQL Database: Overview

This blog starts a series of discussions around the topic of storing relational data in a document-oriented NoSQL database. Each discussion is going to be a separate blog post.

The idea is not to promote storing relational data in a document-oriented database as such. The goal of this series of blogs is to rationalize (to large extent on a blog-level granularity) the relationship between relational and document data and how a relational world can meet a document world (and vice versa).

The starting point of the discussion are the topics in the blog https://realprogrammer.wordpress.com/2012/04/25/relational-database-management-system-rdbms/:

  1. Universal Relation: https://realprogrammer.wordpress.com/2012/05/16/relational-data-in-a-document-oriented-nosql-database-part-1-universal-relation/
  2. Schema: https://realprogrammer.wordpress.com/2012/05/23/relational-data-in-a-document-oriented-nosql-database-part-2-schema/
  3. Normalization: https://realprogrammer.wordpress.com/2012/05/30/relational-data-in-a-document-oriented-nosql-database-part-3-normalization/
  4. Part-Of Relationship: https://realprogrammer.wordpress.com/2012/06/06/relational-data-in-a-document-oriented-nosql-database-part-4-part-of-relationship/
  5. De-Normalization: https://realprogrammer.wordpress.com/2012/06/13/relational-data-in-a-document-oriented-nosql-database-part-5-de-normalization/

More topics might be added as the discussion unfolds.

Based on these discussions I expect that some guidelines will emerge of how to ‘model’ documents in document-oriented databases ‘properly’ and useful criteria for this modeling task. It is also going to be interesting to see if there is a way to determine the relationship between a relational model and a document model in a consistent way.

Modeling documents in a schema-less/dynamic schema world sounds like an oxymoron; however, in the end, transactional applications and analysis software have to access documents and a ‘good’ modeling practice will certainly help those systems in their design and operation.

Relational Database Management System (RDBMS)

Why a blog on Relational Database Management System (RDBMS)? Isn’t there enough said about this type of database? Yes, there is. This post is to call out a few aspects of the relational model that I want to refer to in context of RDBMSs for subsequent discussions. In addition, it re-establishes terminology.

Normalization

Briefly, normalization attempts to remove data duplication and data dependencies so that an update of a data item has to be done only once for the whole data set to be consistent again (http://en.wikipedia.org/wiki/Database_normalization).

Universal Relation

The Universal Relation is mostly an assumption that states: it is possible to store all data in one big, wide table. For a discussion, see Jeffrey D. Ullman: Principles of Database and Knowledge-Base Systems, Volume II. Computer Science Press 1989, ISBN 0-7167-8162-X.

In a way, the universal relation is the ‘opposite’ of a perfectly normalized relational schema. It also assumes that the semantically same attributes are named the same way in different relations. As a consequence, many columns might have null values. But for the thought experiment, this is perfectly valid.

Part-Of Relationship

The part-of relationship consists between an owning object and an owned one whereby the owned object’s life cycle is tightly bound to the owning one: if the owning object is deleted, the owned one will be deleted, too. The owned object exists in context of the owning object and does not have a life on its own.

In the relational model there is a choice how this can be modeled. In principle, the owned objects can be in their own relation, separate from the owning objects in their own relation. In addition, there is a foreign key relationship from the owning to the owned relationship.

Alternatively, the owned objects can be in the same relation as the owning objects. Both would be in the same row. This is basically the case when the owned object is a scalar domain value. In this case it is almost automatically in the same relation as the owning data.

De-Normalization

De-normalization is about combining data that would otherwise be in separate relations when being fully normalized (http://en.wikipedia.org/wiki/Denormalization). This introduces redundancy and dependencies in the model for the sake of access improvement and access speed.

In a sense, the universal relation is the ultimate form of de-normalization.

Schema

In the relational database management system world the data are stored as rows in tables. A table provides a fixed structure whereby every column defines one and only one data type. Any row stored in a table, therefore, must comply to the types of the columns. The only exception is ‘null’ since ‘null’ can be of any data type. The significance is that tables are well-defined in terms of data types and rows have to comply to the table definition. This makes it a strictly typed system: the set of all table definitions constitute the schema.

Schema changes are possible in most RDBMS implementations. It is possible to add, to change or to delete columns. One consequence is that all rows in the table have to change so that their values comply to the new column definitions. Adding and removing tables is possible, too. The ‘hard’ part about changing a table is that all rows have to be changed accordingly; it is not possible to only have the change apply to future rows or when rows are updated.

Summary

This blog recalled a few important concepts from the relational model and relational database management systems. These will be referred to and used in future blogs.