Aren’t data in document-oriented databases de-normalized by definition?
In a nutshell, de-normalization in a relational database system is the re-structuring of the data in such a way that portions are combined into one relation that would be in separate relations when fully normalized. De-normalization is in some sense the reverse process of normalization.
In relational systems, de-normalization is conceptually unnecessary and actually detrimental to achieving the ideal data model. However, de-normalization comes into the picture because of non-functional reasons like performance or schema change management. Just briefly (and roughly), data stored in one relation are faster retrieved compared to the same data that needs to be joined together first. Changing a schema by adding a column is easier then by creating a new relation, adding foreign key relationships and retrieve the data by join.
Is there actually a concept of de-normalization in document-oriented databases? Yes, there is. Every time a decision is made to store documents into sub-collections of other documents, then a de-normalization decision is made. In principle, the documents could have been stored in a separate collection also (this excludes part-of relationships; see an earlier blog). For example, for each supplier the part it supplies can be in one of its sub-collections. An alternative de-normalization could be the set of suppliers for each part in a sub-collection of the part.
So one might ask: ‘So, what? That’s what the document model is made for.’ Yes, the document model inherently allows to build hierarchical structures. As discussed earlier, the part-of relationship is a good example where this applies naturally. However, each de-normalization decision must not be done randomly. In the case of parts and suppliers it might look like that both de-normalized collections as outlined above are fine.
However, that’s not totally true. In the end the document access patterns play an important role. If the suppliers are listed in sub-collections of parts, and a list of suppliers needs to be queried, the de-normalization will cause quite some processing. Not only have all parts documents be searched, but also each supplier has to be included into the result set only once. If suppliers do not have a unique identifier, an index does not help much.
There might be cases where a de-normalization actually does not improve the situation. For example, if parts and suppliers are accessed equally, then only an estimation of query frequency and query result processing will tell if a de-normalization is advantageous at all.