Joins: (Almost) Impossible to Avoid in Document-oriented Databases

There is a lot of ‘chatter’ about the concept and support of joins in document-oriented databases. So what is the underling issue?

Joins in RDBMS

‘Join’ in the relational world is an operation on two relations that relates the tuples in these relations with each other based on some comparison criteria on the tuples’ attributes. For example, the comparison can be ‘R1.a = R2.b’ and so for each tuple from the first relation R1 all tuples from the second relation R2 are retrieved and combined that match the comparison, meaning, the attribute ‘a’ must match the attribute ‘b’. A detailed discussion can be found here: http://en.wikipedia.org/wiki/Join_%28SQL%29.

Joins allow to relate data from different relations and the join operator is supported by a relational database management system. A typical use case is to find all parts that a supplier supplies. And, for a given part, find all its suppliers. The suppliers and parts are usually stored in different relations and the data have an m:n relationship with each other.

Joins across Documents?

So why the chatter, then? If a document-oriented database stores data in different document collections and if the documents need to be related to each other, then a join is in order. The example of suppliers and parts applies here in the exact same way.

Now, if a document-oriented database does not support joins, what to do? Well, in reality the join will be performed in some layer above the database in a programming language. If all suppliers have to be displayed for a given part, then a program that computes this result effectively implements a join; it is not done in the database, though.

Pre-joined data in Documents?

Some optimization is possible. If the access pattern follows an 80-20 rule, then document-oriented databases allow some hard-coded optimization. If in 80% of the cases the suppliers for a part are requested, and only in 20% the opposite, then the designer of the document layout could create for each part document a sub-collection ‘supplier’ that contains the suppliers of this part. In 80% of the cases no join is necessary any more as the suppliers are ‘pre-joined’ with the parts they supply, only in the 20% of the cases a join is necessary.

However, this causes what in the relational world is called anomalies: If a supplier is removed, then all part documents have to be searched for this supplier. Or if a supplier is added, then all those part documents have to be updated that are supplied by this supplier. Updating supplier data also requires to search the part documents. Pre-joining is effectively a specific de-normalization activity for performance reasons.

Does the type of relationship matter?

Are there relationships that by their nature can be pre-joined without penalty? A very specific relationship, the part-of relationship, falls into this category. It is a ‘clean’ approach since the life time of the part-of objects are the exact same as the containing object.

Another relationship that feels as if pre-joining makes sense is the 1:1 relationship where two objects are exclusively related to each other. However, this is not really the case as one object would be a property of the other and that then could be done the other way around, too. So the 80-20 rule case applies here, too.

In reality, however, relationship between data are usually a lot more complex then just part-of relationships. This in turn means that joins will be necessary. The only real exception is if the 80-20 rule is really a 100-0 rule. This would mean that all access are the exact same and no joins are necessary.

Underlying Conceptual Foundation

Conceptually as soon as independent entities (i.e. objects in their own right) are related to each other, and if their relationship is traversed in both directions at some point in time during the execution of the application, a join is necessary and factually taking place.

Pre-joining is the materialization of the traversal in one direction. So two pre-joins, one for each direction, are possible. If the pre-join in both directions takes place, no join has to be performed upon retrieval; however, the join functionality was applied at time of update or insert in order to accomplish the pre-joins.

As soon as pre-joins exist, possible update, insert and delete anomalies have to be carefully taken care of as pre-joins are the equivalent to de-normalization and therefore data redundancy. At insert, update and delete time all redundant copies of the objects have to be found and the appropriate functionality applied.

Pre-joins are for read-performance reasons only; they are not a conceptual matter and in fact cause additional work at insert, update or delete time instead; so the computational work shifted, but is not avoided.

Note

‘Join’ is a database operator. The same functionality can be implemented in application code outside or ‘on top’ of the database. Most likely the method or function is not called ‘join’ even though it in fact implements that functionality. So be aware of the situation that a document-oriented database does not implement a join and the engineers claim not to need one. The functionality of a join might just be there under a different name.

Advertisements

Relational Data in a Document-oriented NoSQL Database (Part 4): Part-Of Relationship

The part-of relationship is an interesting relationship in context of document-oriented data models and requires a separate discussion, albeit short.

Part-Of Relationship

The part-of relationship is between two objects. There is the owning object (owner), and the object that is part of the owner (part). A part is tightly linked to the owner, meaning, that the part cannot exist on its own but only when the owner is in place at the same time. An owner can have many parts, including zero. From an object model this means that

  • a part can only be created and related to an object (owner) if that object exists (either already or is coming into existence simultaneously with the part)
  • a part can be deleted without the owner having to be deleted
  • if an owner is deleted, all its parts will be deleted, too (parts cannot exist on their own)

The relationship between a part and its owner is therefore not symmetrical. But a part cannot exist on its own and therefore this asymmetric relationship must be properly enforced in order to ensure data consistency.

(Of course, a variant is possible, where the part can exist on its own. In this case the part has an identity of its own. On top, if the part can only be part of one owner, there then must be a reservation model that reserves/releases the part so that it cannot be reused twice. This variant is often also modeled as a role: ‘object plays the role of a part in another object’. But that variant is non-special in this context, so it will not be discussed further).

Relational Model

If a part is scalar then it is usually represented as a separate column in the row representing the owner. If a part is represented in a separate table (or set of tables) then the row that represents the owner has a foreign key referring to the part. Representing parts in their own tables is usually necessary when the part is complex and cannot be represented in a scalar.

Removing a scalar part is done by setting it to ‘null’ (or some other specifically called out value representing absence). If the owner is deleted, the scalar parts are automatically deleted as part of the same row.

If a part is a relation (or several relations) then the owner refers to the part through a foreign key. However, if the owner is deleted the system has to ensure that its part(s) will be deleted also. If a relational database management system does not have the functionality to supervise and enforce this (cascading deletion) automatically, the application systems have to ensure this data consistency.

Document-oriented Model

In a document-oriented model complex parts can be implemented analogously to the relational approach. A part can be a separate document from the owner document and the owner refers to the part (by document identifier or other unique identifier).

However, in the document-model it is also possible to add the part (no matter how complex it is) as a property or collection inside the document representing the owner. In this case, if the owner is removed, the part will be removed with it automatically. The supervision and enforcement is therefore guaranteed.

Summary

While it is possible to implement the part-of relationship similarly in the relational and document-oriented model, the document-model also provides the option to have complex parts inside the document representing the owner. From a modeling perspective and from a design perspective this is a great option to have.

From a processing perspective this means that accessing the owner automatically provides all the parts. This can be really useful if the parts are needed when the owner is accessed. However, if the parts are always separately needed, then making them part of the owner document might cause too much data movement upon access. So the alternative modeling approach of having the parts and the owners separate is more advantageous in this case.

Relational Data in a Document-oriented NoSQL Database: Overview

This blog starts a series of discussions around the topic of storing relational data in a document-oriented NoSQL database. Each discussion is going to be a separate blog post.

The idea is not to promote storing relational data in a document-oriented database as such. The goal of this series of blogs is to rationalize (to large extent on a blog-level granularity) the relationship between relational and document data and how a relational world can meet a document world (and vice versa).

The starting point of the discussion are the topics in the blog https://realprogrammer.wordpress.com/2012/04/25/relational-database-management-system-rdbms/:

  1. Universal Relation: https://realprogrammer.wordpress.com/2012/05/16/relational-data-in-a-document-oriented-nosql-database-part-1-universal-relation/
  2. Schema: https://realprogrammer.wordpress.com/2012/05/23/relational-data-in-a-document-oriented-nosql-database-part-2-schema/
  3. Normalization: https://realprogrammer.wordpress.com/2012/05/30/relational-data-in-a-document-oriented-nosql-database-part-3-normalization/
  4. Part-Of Relationship: https://realprogrammer.wordpress.com/2012/06/06/relational-data-in-a-document-oriented-nosql-database-part-4-part-of-relationship/
  5. De-Normalization: https://realprogrammer.wordpress.com/2012/06/13/relational-data-in-a-document-oriented-nosql-database-part-5-de-normalization/

More topics might be added as the discussion unfolds.

Based on these discussions I expect that some guidelines will emerge of how to ‘model’ documents in document-oriented databases ‘properly’ and useful criteria for this modeling task. It is also going to be interesting to see if there is a way to determine the relationship between a relational model and a document model in a consistent way.

Modeling documents in a schema-less/dynamic schema world sounds like an oxymoron; however, in the end, transactional applications and analysis software have to access documents and a ‘good’ modeling practice will certainly help those systems in their design and operation.