June | 2012 | r e a l - p r o g r a m m e r

Aren’t data in document-oriented databases de-normalized by definition?

De-Normalization

In a nutshell, de-normalization in a relational database system is the re-structuring of the data in such a way that portions are combined into one relation that would be in separate relations when fully normalized. De-normalization is in some sense the reverse process of normalization.

In relational systems, de-normalization is conceptually unnecessary and actually detrimental to achieving the ideal data model. However, de-normalization comes into the picture because of non-functional reasons like performance or schema change management. Just briefly (and roughly), data stored in one relation are faster retrieved compared to the same data that needs to be joined together first. Changing a schema by adding a column is easier then by creating a new relation, adding foreign key relationships and retrieve the data by join.

Document De-Normalization

Is there actually a concept of de-normalization in document-oriented databases? Yes, there is. Every time a decision is made to store documents into sub-collections of other documents, then a de-normalization decision is made. In principle, the documents could have been stored in a separate collection also (this excludes part-of relationships; see an earlier blog). For example, for each supplier the part it supplies can be in one of its sub-collections. An alternative de-normalization could be the set of suppliers for each part in a sub-collection of the part.

Discussion

So one might ask: ‘So, what? That’s what the document model is made for.’ Yes, the document model inherently allows to build hierarchical structures. As discussed earlier, the part-of relationship is a good example where this applies naturally. However, each de-normalization decision must not be done randomly. In the case of parts and suppliers it might look like that both de-normalized collections as outlined above are fine.

However, that’s not totally true. In the end the document access patterns play an important role. If the suppliers are listed in sub-collections of parts, and a list of suppliers needs to be queried, the de-normalization will cause quite some processing. Not only have all parts documents be searched, but also each supplier has to be included into the result set only once. If suppliers do not have a unique identifier, an index does not help much.

There might be cases where a de-normalization actually does not improve the situation. For example, if parts and suppliers are accessed equally, then only an estimation of query frequency and query result processing will tell if a de-normalization is advantageous at all.

The part-of relationship is an interesting relationship in context of document-oriented data models and requires a separate discussion, albeit short.

Part-Of Relationship

The part-of relationship is between two objects. There is the owning object (owner), and the object that is part of the owner (part). A part is tightly linked to the owner, meaning, that the part cannot exist on its own but only when the owner is in place at the same time. An owner can have many parts, including zero. From an object model this means that

a part can only be created and related to an object (owner) if that object exists (either already or is coming into existence simultaneously with the part)
a part can be deleted without the owner having to be deleted
if an owner is deleted, all its parts will be deleted, too (parts cannot exist on their own)

The relationship between a part and its owner is therefore not symmetrical. But a part cannot exist on its own and therefore this asymmetric relationship must be properly enforced in order to ensure data consistency.

(Of course, a variant is possible, where the part can exist on its own. In this case the part has an identity of its own. On top, if the part can only be part of one owner, there then must be a reservation model that reserves/releases the part so that it cannot be reused twice. This variant is often also modeled as a role: ‘object plays the role of a part in another object’. But that variant is non-special in this context, so it will not be discussed further).

Relational Model

If a part is scalar then it is usually represented as a separate column in the row representing the owner. If a part is represented in a separate table (or set of tables) then the row that represents the owner has a foreign key referring to the part. Representing parts in their own tables is usually necessary when the part is complex and cannot be represented in a scalar.

Removing a scalar part is done by setting it to ‘null’ (or some other specifically called out value representing absence). If the owner is deleted, the scalar parts are automatically deleted as part of the same row.

If a part is a relation (or several relations) then the owner refers to the part through a foreign key. However, if the owner is deleted the system has to ensure that its part(s) will be deleted also. If a relational database management system does not have the functionality to supervise and enforce this (cascading deletion) automatically, the application systems have to ensure this data consistency.

Document-oriented Model

In a document-oriented model complex parts can be implemented analogously to the relational approach. A part can be a separate document from the owner document and the owner refers to the part (by document identifier or other unique identifier).

However, in the document-model it is also possible to add the part (no matter how complex it is) as a property or collection inside the document representing the owner. In this case, if the owner is removed, the part will be removed with it automatically. The supervision and enforcement is therefore guaranteed.

Summary

While it is possible to implement the part-of relationship similarly in the relational and document-oriented model, the document-model also provides the option to have complex parts inside the document representing the owner. From a modeling perspective and from a design perspective this is a great option to have.

From a processing perspective this means that accessing the owner automatically provides all the parts. This can be really useful if the parts are needed when the owner is accessed. However, if the parts are always separately needed, then making them part of the owner document might cause too much data movement upon access. So the alternative modeling approach of having the parts and the owners separate is more advantageous in this case.

r e a l – p r o g r a m m e r

http://www.real-programmer.com

Monthly Archives: June 2012

Relational Data in a Document-oriented NoSQL Database (Part 5): De-normalization