Document-oriented NoSQL Databases: How many Joins will you have to implement?

One of the continuously debated items in context of NoSQL databases is the join operation. Let’s listen in a bit:

and there can be many more variations found on the topic of joins on various levels of technical depth.

So, do we need joins in context of NoSQL databases? Do we do joins implemented by NoSQL databases? Are joins outdated concepts that we can live without in context of NoSQL databases? In this blog I try to rationalize the overarching question in principle. Some fact finding first:

(Database) Data Models and Database Management Systems

Data models, like the relational model, the document-model, the hierarchical model, key-value model, graph model, object-oriented model, XML model, etc., are implementations of data structures in a given database management system. Data models define possible data types and their construction rules for more complex types.

For example, the implementation of a relational model might restrict values in tables to be scalar. Another implementation might allow a table as a value, supporting NF2 relations. One system might support the document-model strictly following the JSON model, while others add additional data types in addition to what JSON defines. Some systems do support the notion of references, other so not. Each database implements a data model in any variation it likes to.

Schemata and Database Management Systems

A schema is a particular extension of a domain model, implemented in context of a data model. For example, a domain model might be suppliers, parts and their relationship. This can be implemented in a relational model, a document model or a graph model or any other supported data model.

There is no ‘best’ way of definition a schema. For the same domain, different schemata can be defined depending on the skill of the creator, the knowledge of query access patterns, the amount of restrictions that should be supervised by the database management system and other factors.

For example, in a document model, suppliers, parts and their relationships can be modeled as three separate documents, or in two documents (suppliers and their relationship to parts), or one document – and there are many more variations possible, of course.

Joins and Database Management Systems

Some database management systems implement the join operation in their query interface, some do not. For example, Oracle, MySQL and FoundationDB implement joins, MongoDB, Oracle NoSQL and Aerospike do not. So joins are not necessarily restricted to the relational data model.

Joins and Data Access Paths

With the fact finding under our belt, how many joins will you have to implement? In principle, this is a function of the required data access based on a specific schema. Different schemata of the same domain will require a different number of joins.

Let’s look at a few examples in the supplier – parts domain.

Example 1: No join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts":[
     {"part_name": "part_lowQual"}, 
     {"part_name": "part_hiQual"}]
}

The query: “find the names of all parts for a supplier” does not require a join as the data is already structured so that each supplier contains the set of all parts it supplies.

Example 2: One join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts": [1, 2]
}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}

The query: “find the identifiers and names of all parts for a supplier” requires a join as a supplier only has the identifiers of the parts it ships, not their names.

Example 3: Two joins required

The documents are structured like this:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}

The query: “find the identifiers and names of all parts for a supplier” requires two joins, one to find the objects for a supplier that relate the part identifier to the supplier identifier, and a second one to find the corresponding parts.

Analysis of Examples

The examples have shown empirically that the need for joins is not a function of the data model (document-oriented in this case), but a function of the data access, aka, the number of required data relationship traversals in context of a given schema. If the relationship to be traversed matches the way the data is structured as in Example 1, no join is necessary. As soon as the data is structured differently from the required traversal by the query, joins are necessary (Example 2 and 3).

So, as summary, it is fairly easy to avoid joins. If, and only if, you can structure your data (aka, build your schema) in such a way that it conforms structurally to the queries then you can avoid joins completely (Example 1). I am certain that there are special cases out there for which you can accomplish that, but in general, this is not possible. And, even if it is possible in production, as soon as analysts start analyzing the data sets, they will most likely query along different access paths.

Joins at Query Time vs. Joins at Insert/Update/Delete Time

Above examples clarified that joins are a function of the data access paths. Can joins at query time be avoided entirely by creating data access paths in a certain way?

Yes, it is possible, however, it is a basic trade-off between data query and data manipulation time: reducing the computational effort at run-time, and instead increasing it during insert / update / delete operations. In principle, joins at query time can be avoided if for each access path there is an equivalent data structure in place.

Example 4: Schema refactoring

The documents in this example look like:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}
{"shipper": "fastShipper", "shipper_id": "SH_01"}
{"part_id": 2, "shipper_id": "SH_01"}

Supplier supply parts, however, shippers ship not any part, but only specific parts (maybe for safety reasons). There can be several queries against this document set:

  • Find all parts supplied by a supplier with a given name
  • Find all parts shipped by a shipper with a given name
  • Find all suppliers and shippers for a part with a given name

Each of these queries requires at least one join. The documents can be restructured easily to avoid joins altogether:

{"supplier": "superQuality", "supplier_id": "S_55",
 "parts": [
     {"part_name": "part_lowQual", "part_id": 1}
]}
{"shipper": "fastShipper", "shipper_id": "SH_01",
 "parts": [
     {"part_name": "part_hiQual", "part_id": 2}
]}
{"part_name": "part_lowQual", "part_id": 1,
 "suppliers": [
     {"supplier": "superQuality", "supplier_id": "S_55"}
 ], 
 "shippers": []}
{"part_name": "part_hiQual", "part_id": 2,
 "suppliers": [],
 "shippers": [
     {"shipper": "fastShipper", "shipper_id": "SH_01"}
]}

The idea is clear: structure the data in such a way that a query can be satisfied with a simple selection. And, the consequence is clear, too: data is duplicated, possibly many times. Which means that an insert, update or delete has to know all the locations where to modify the data and has to modify the data consistently (and ideally within a single transaction).

As a side note, this is the situation that normalization tries to address by ensuring that each data item is only once in the database.

Of course, data duplication will have an impact on the size requirements of main memory an disk space. While there is a change in algorithm complexity, there is also a change in the storage and memory size requirements.

Pre-Joining Data

Pre-joining data allows to avoid joins at query time at the cost of duplicating data at data management time. Alternatively expressed, the implementation of duplication at management time is the cost of avoiding normalization combined with query-time joins.

Is there a way to quantify the effort? In principle, there are as many duplications necessary as joins are to be avoided. This is a rough estimate as many joins are the same except for selection and/or projection specifications. If all joins are abstracted to their join criteria (omitting projection and selection), then this is roughly the amount of duplication required.

The article written by Sarah Mei clearly shows the trade-off between data duplication and joins: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/. She clearly describes many of the issues in context of a specific use case.

“Wait a minute, I don’t have joins and it works anyway!”

But, where are the joins? NoSQL databases that do not implement the join operator in their query interface are in use and production.

If not expressed as query, joins are found either in the application system logic or the interface logic, depending on the design. Most likely these are nested-loop joins or hash-based joins (less likely) or a series of selections with the application logic combining the intermediary query results into the final result data set.

And they are not joins on the complete data set either, but usually have some selection criteria. So the application system logic roughly corresponds to the optimized operator tree of a database query sub-system and in all actuality there might be many joins implemented that way throughout the application logic.

The joins are in fact implemented, just not by using a join operator on the database interface, but inside the application logic. This means that the database cannot optimize the execution, plus there are several queries coming from the application logic putting load on the database system.

And this opens up yet another trade-off: data duplication vs. application logic complexity. If the data is structured in such a way that joins are avoided (at the cost of duplication), then the application logic complexity will be reduced also (from algorithms implementing joins to algorithms issuing queries with selections/projections).

Of course, while the application logic complexity is reduced, the data management logic complexity increased as it has to manage duplicate data consistently across the database.

Summary: Are joins required? Yes. Are joins implemented? Yes.

In my mind there is no question that joins are in general needed and actually implemented today, even if the database does not support a join operator directly and even if there are opinions that joins are not needed. I don’t really understand why there is a discussion about this in the first place as the need for a join is a function of the data schema, not the data model.

The fact that a relational database has the capability of joins does not mean you must use it. And the fact that a NoSQL database does not support joins at their query interface does not mean joins are not needed.

At the heart an architecture and engineering decision has to be made (implicitly or explicitly) of how many joins are implemented through data duplication and how many joins are implemented through algorithms in the application logic layer (if there is not join operator available at the database query interface).

It’s that easy.

 

Advertisement

Document Collections: Size and Processing Complexity

“My collection is 100 GB in size, can you process it?”. What does that mean and what is really significant about it?

Collection Count and Sizes

A collection of 100GB can mean many things. For example, it could be a collection with one document of the size of 100GB. Or a collection of 100 documents, with 1GB in size each.

MongoDB, for example, has a document size upper limit of 16MB (http://www.mongodb.org/display/DOCS/Documents). Documents larger than 16MB have to be split into several documents (or stored outside the document model).

Does this all matter? From a storage perspective probably less as the storage needs to be able to store the data (in this case 100GB). If 100GB is broken down into smaller documents, then this requires more storage due to the additional management data needed, but that is insignificant.

From a processing perspective it definitely matters.

Processing Complexity

One measure of processing complexity is the number of operations executed per document. For example, if a property is read from each document, it takes O(n) operations (one per document), meaning the complexity is linear with the number of documents.

One possible optimization here is if that particular property is indexed. In this case the complexity is sub-linear if the database does not have to access every document for the property value by can use the index instead.

If more complex operations have to be executed for each document (like counting sub-collection documents, or adding up values within a sub-collection), then the complexity increases by a constant factor.

If documents are divided up and have to be joined together before they can be processed (e.g because of size limits or de-normalization), then the processing complexity might increase significantly.

What to ask for?

The storage size matters, but it is not a good measure for the processing effort. What matters for processing effort is the number of documents, existing indexes, as well as any required “combination” due to de-normalization in addition to the actual computation to be performed on documents.

Next time the size is characterized by a storage measure, ask also for the measures that determine the processing complexity.

Joins: (Almost) Impossible to Avoid in Document-oriented Databases

There is a lot of ‘chatter’ about the concept and support of joins in document-oriented databases. So what is the underling issue?

Joins in RDBMS

‘Join’ in the relational world is an operation on two relations that relates the tuples in these relations with each other based on some comparison criteria on the tuples’ attributes. For example, the comparison can be ‘R1.a = R2.b’ and so for each tuple from the first relation R1 all tuples from the second relation R2 are retrieved and combined that match the comparison, meaning, the attribute ‘a’ must match the attribute ‘b’. A detailed discussion can be found here: http://en.wikipedia.org/wiki/Join_%28SQL%29.

Joins allow to relate data from different relations and the join operator is supported by a relational database management system. A typical use case is to find all parts that a supplier supplies. And, for a given part, find all its suppliers. The suppliers and parts are usually stored in different relations and the data have an m:n relationship with each other.

Joins across Documents?

So why the chatter, then? If a document-oriented database stores data in different document collections and if the documents need to be related to each other, then a join is in order. The example of suppliers and parts applies here in the exact same way.

Now, if a document-oriented database does not support joins, what to do? Well, in reality the join will be performed in some layer above the database in a programming language. If all suppliers have to be displayed for a given part, then a program that computes this result effectively implements a join; it is not done in the database, though.

Pre-joined data in Documents?

Some optimization is possible. If the access pattern follows an 80-20 rule, then document-oriented databases allow some hard-coded optimization. If in 80% of the cases the suppliers for a part are requested, and only in 20% the opposite, then the designer of the document layout could create for each part document a sub-collection ‘supplier’ that contains the suppliers of this part. In 80% of the cases no join is necessary any more as the suppliers are ‘pre-joined’ with the parts they supply, only in the 20% of the cases a join is necessary.

However, this causes what in the relational world is called anomalies: If a supplier is removed, then all part documents have to be searched for this supplier. Or if a supplier is added, then all those part documents have to be updated that are supplied by this supplier. Updating supplier data also requires to search the part documents. Pre-joining is effectively a specific de-normalization activity for performance reasons.

Does the type of relationship matter?

Are there relationships that by their nature can be pre-joined without penalty? A very specific relationship, the part-of relationship, falls into this category. It is a ‘clean’ approach since the life time of the part-of objects are the exact same as the containing object.

Another relationship that feels as if pre-joining makes sense is the 1:1 relationship where two objects are exclusively related to each other. However, this is not really the case as one object would be a property of the other and that then could be done the other way around, too. So the 80-20 rule case applies here, too.

In reality, however, relationship between data are usually a lot more complex then just part-of relationships. This in turn means that joins will be necessary. The only real exception is if the 80-20 rule is really a 100-0 rule. This would mean that all access are the exact same and no joins are necessary.

Underlying Conceptual Foundation

Conceptually as soon as independent entities (i.e. objects in their own right) are related to each other, and if their relationship is traversed in both directions at some point in time during the execution of the application, a join is necessary and factually taking place.

Pre-joining is the materialization of the traversal in one direction. So two pre-joins, one for each direction, are possible. If the pre-join in both directions takes place, no join has to be performed upon retrieval; however, the join functionality was applied at time of update or insert in order to accomplish the pre-joins.

As soon as pre-joins exist, possible update, insert and delete anomalies have to be carefully taken care of as pre-joins are the equivalent to de-normalization and therefore data redundancy. At insert, update and delete time all redundant copies of the objects have to be found and the appropriate functionality applied.

Pre-joins are for read-performance reasons only; they are not a conceptual matter and in fact cause additional work at insert, update or delete time instead; so the computational work shifted, but is not avoided.

Note

‘Join’ is a database operator. The same functionality can be implemented in application code outside or ‘on top’ of the database. Most likely the method or function is not called ‘join’ even though it in fact implements that functionality. So be aware of the situation that a document-oriented database does not implement a join and the engineers claim not to need one. The functionality of a join might just be there under a different name.

Relational Data in a Document-oriented NoSQL Database (Part 5): De-normalization

Aren’t data in document-oriented databases de-normalized by definition?

De-Normalization

In a nutshell, de-normalization in a relational database system is the re-structuring of the data in such a way that portions are combined into one relation that would be in separate relations when fully normalized. De-normalization is in some sense the reverse process of normalization.

In relational systems, de-normalization is conceptually unnecessary and actually detrimental to achieving the ideal data model. However, de-normalization comes into the picture because of non-functional reasons like performance or schema change management. Just briefly (and roughly), data stored in one relation are faster retrieved compared to the same data that needs to be joined together first. Changing a schema by adding a column is easier then by creating a new relation, adding foreign key relationships and retrieve the data by join.

Document De-Normalization

Is there actually a concept of de-normalization in document-oriented databases? Yes, there is. Every time a decision is made to store documents into sub-collections of other documents, then a de-normalization decision is made. In principle, the documents could have been stored in a separate collection also (this excludes part-of relationships; see an earlier blog). For example, for each supplier the part it supplies can be in one of its sub-collections. An alternative de-normalization could be the set of suppliers for each part in a sub-collection of the part.

Discussion

So one might ask: ‘So, what? That’s what the document model is made for.’ Yes, the document model inherently allows to build hierarchical structures. As discussed earlier, the part-of relationship is a good example where this applies naturally. However, each de-normalization decision must not be done randomly. In the case of parts and suppliers it might look like that both de-normalized collections as outlined above are fine.

However, that’s not totally true. In the end the document access patterns play an important role. If the suppliers are listed in sub-collections of parts, and a list of suppliers needs to be queried, the de-normalization will cause quite some processing. Not only have all parts documents be searched, but also each supplier has to be included into the result set only once. If suppliers do not have a unique identifier, an index does not help much.

There might be cases where a de-normalization actually does not improve the situation. For example, if parts and suppliers are accessed equally, then only an estimation of query frequency and query result processing will tell if a de-normalization is advantageous at all.

Relational Data in a Document-oriented NoSQL Database: Overview

This blog starts a series of discussions around the topic of storing relational data in a document-oriented NoSQL database. Each discussion is going to be a separate blog post.

The idea is not to promote storing relational data in a document-oriented database as such. The goal of this series of blogs is to rationalize (to large extent on a blog-level granularity) the relationship between relational and document data and how a relational world can meet a document world (and vice versa).

The starting point of the discussion are the topics in the blog https://realprogrammer.wordpress.com/2012/04/25/relational-database-management-system-rdbms/:

  1. Universal Relation: https://realprogrammer.wordpress.com/2012/05/16/relational-data-in-a-document-oriented-nosql-database-part-1-universal-relation/
  2. Schema: https://realprogrammer.wordpress.com/2012/05/23/relational-data-in-a-document-oriented-nosql-database-part-2-schema/
  3. Normalization: https://realprogrammer.wordpress.com/2012/05/30/relational-data-in-a-document-oriented-nosql-database-part-3-normalization/
  4. Part-Of Relationship: https://realprogrammer.wordpress.com/2012/06/06/relational-data-in-a-document-oriented-nosql-database-part-4-part-of-relationship/
  5. De-Normalization: https://realprogrammer.wordpress.com/2012/06/13/relational-data-in-a-document-oriented-nosql-database-part-5-de-normalization/

More topics might be added as the discussion unfolds.

Based on these discussions I expect that some guidelines will emerge of how to ‘model’ documents in document-oriented databases ‘properly’ and useful criteria for this modeling task. It is also going to be interesting to see if there is a way to determine the relationship between a relational model and a document model in a consistent way.

Modeling documents in a schema-less/dynamic schema world sounds like an oxymoron; however, in the end, transactional applications and analysis software have to access documents and a ‘good’ modeling practice will certainly help those systems in their design and operation.

Relational Database Management System (RDBMS)

Why a blog on Relational Database Management System (RDBMS)? Isn’t there enough said about this type of database? Yes, there is. This post is to call out a few aspects of the relational model that I want to refer to in context of RDBMSs for subsequent discussions. In addition, it re-establishes terminology.

Normalization

Briefly, normalization attempts to remove data duplication and data dependencies so that an update of a data item has to be done only once for the whole data set to be consistent again (http://en.wikipedia.org/wiki/Database_normalization).

Universal Relation

The Universal Relation is mostly an assumption that states: it is possible to store all data in one big, wide table. For a discussion, see Jeffrey D. Ullman: Principles of Database and Knowledge-Base Systems, Volume II. Computer Science Press 1989, ISBN 0-7167-8162-X.

In a way, the universal relation is the ‘opposite’ of a perfectly normalized relational schema. It also assumes that the semantically same attributes are named the same way in different relations. As a consequence, many columns might have null values. But for the thought experiment, this is perfectly valid.

Part-Of Relationship

The part-of relationship consists between an owning object and an owned one whereby the owned object’s life cycle is tightly bound to the owning one: if the owning object is deleted, the owned one will be deleted, too. The owned object exists in context of the owning object and does not have a life on its own.

In the relational model there is a choice how this can be modeled. In principle, the owned objects can be in their own relation, separate from the owning objects in their own relation. In addition, there is a foreign key relationship from the owning to the owned relationship.

Alternatively, the owned objects can be in the same relation as the owning objects. Both would be in the same row. This is basically the case when the owned object is a scalar domain value. In this case it is almost automatically in the same relation as the owning data.

De-Normalization

De-normalization is about combining data that would otherwise be in separate relations when being fully normalized (http://en.wikipedia.org/wiki/Denormalization). This introduces redundancy and dependencies in the model for the sake of access improvement and access speed.

In a sense, the universal relation is the ultimate form of de-normalization.

Schema

In the relational database management system world the data are stored as rows in tables. A table provides a fixed structure whereby every column defines one and only one data type. Any row stored in a table, therefore, must comply to the types of the columns. The only exception is ‘null’ since ‘null’ can be of any data type. The significance is that tables are well-defined in terms of data types and rows have to comply to the table definition. This makes it a strictly typed system: the set of all table definitions constitute the schema.

Schema changes are possible in most RDBMS implementations. It is possible to add, to change or to delete columns. One consequence is that all rows in the table have to change so that their values comply to the new column definitions. Adding and removing tables is possible, too. The ‘hard’ part about changing a table is that all rows have to be changed accordingly; it is not possible to only have the change apply to future rows or when rows are updated.

Summary

This blog recalled a few important concepts from the relational model and relational database management systems. These will be referred to and used in future blogs.