Document Collections: Size and Processing Complexity

“My collection is 100 GB in size, can you process it?” What does that mean, and what is really significant about it?

Collection Count and Sizes

A collection of 100 GB can mean many things. For example, it could be a collection with a single document of 100 GB, or a collection of 100 documents of 1 GB each.

MongoDB, for example, has a document size upper limit of 16 MB (http://www.mongodb.org/display/DOCS/Documents). Documents larger than 16 MB have to be split into several documents (or stored outside the document model).

Does this all matter? From a storage perspective, probably less so: the storage simply has to be able to hold the data (in this case 100 GB). If the 100 GB is broken down into smaller documents, slightly more storage is required for the additional management data, but that overhead is insignificant.

From a processing perspective it definitely matters.

Processing Complexity

One measure of processing complexity is the number of operations executed per document. For example, if a property is read from each document, this takes O(n) operations (one per document), meaning the complexity is linear in the number of documents.

One possible optimization is to index that particular property. In this case the complexity is sub-linear, as the database does not have to access every document for the property value but can use the index instead.
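
As an illustration, here is a minimal sketch using Python and the pymongo driver (a locally running MongoDB instance is assumed, and the collection and property names are hypothetical) contrasting a full scan with an indexed lookup:

    from pymongo import MongoClient, ASCENDING

    client = MongoClient()                # assumes a local MongoDB instance
    people = client["demo"]["people"]     # hypothetical collection

    # Without an index, this query has to touch every document: O(n).
    for doc in people.find({"name": "Alice"}):
        print(doc)

    # With an index on "name", the same query becomes sub-linear,
    # because the database can walk the index instead of the documents.
    people.create_index([("name", ASCENDING)])
    print(people.find({"name": "Alice"}).explain())   # shows the chosen plan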

If more complex operations have to be executed for each document (like counting sub-collection documents, or adding up values within a sub-collection), then the complexity increases by a constant factor.
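
For example, counting the documents of a sub-collection per document could be sketched like this (assuming, hypothetically, that each person document carries an embedded “children” array):

    from pymongo import MongoClient

    people = MongoClient()["demo"]["people"]

    # One extra operation per document: compute the size of the embedded
    # "children" array. Still O(n), but with a larger constant per document.
    pipeline = [
        {"$project": {"name": 1,
                      "numChildren": {"$size": {"$ifNull": ["$children", []]}}}}
    ]
    for doc in people.aggregate(pipeline):
        print(doc)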

If documents are divided up and have to be joined together before they can be processed (e.g. because of size limits or de-normalization), then the processing complexity can increase significantly.
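
A sketch of such a re-assembly, assuming a hypothetical “chunks” collection in which an oversized document was split into parts that share a doc_id and are ordered by seq:

    from pymongo import MongoClient, ASCENDING

    chunks = MongoClient()["demo"]["chunks"]

    # Every read of the logical document pays the cost of collecting
    # and re-joining its parts.
    parts = chunks.find({"doc_id": 42}).sort("seq", ASCENDING)
    payload = "".join(part["data"] for part in parts)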

What to ask for?

The storage size matters, but it is not a good measure of the processing effort. What matters for processing effort is the number of documents and the existing indexes, as well as any “combination” required due to de-normalization, in addition to the actual computation to be performed on the documents.

Next time the size is characterized by a storage measure, also ask for the measures that determine the processing complexity.

JSON Collections: Self-referring Documents (aka Recursive Collection(s))

Recursive collections of documents might be necessary to model a specific domain data set. How can this be done?

Example: Parent-Child Relationship

One of the standard examples of self-referring data structures is the parent-child relationship. People have parents and might have children. Parents are always children themselves. Children might have children as well. This example is recursive, and it has the interesting property that the knowledge about people is never complete: not every ancestor or descendant is known.

Normalized Collection: References Only

The most normalized representation in a document-oriented world is one document for each person. So if a person is known or needs to be represented, a document containing that person's data is added to the system.

This leaves the question of how to model the relationships between people. There are several alternatives (this is not an exhaustive list):

  1. Has-parents relationship as a separate document. In this case the has-parents relationship can be implemented as a document with three properties: person, mother, father. (Possible implementations of references are discussed here: https://realprogrammer.wordpress.com/2012/08/17/json-graph-structures/.) So every person possibly has a corresponding document representing his or her parents. An alternative design splits the relationship into has-mother and has-father (represented as two different documents with two properties each).
  2. Has-children relationship as separate documents. The has-children relationship is more complicated. If a person has children, the question arises whether there is one document per child or one document that references all children. One complication is that a child has two parents, and not all children of a person have to be with the same partner. A document per child is easier to structure than one document for all children if the partner has to be referenced as well. If that information is not necessary, having all children referenced in one document is reasonable, too. (A sketch of both alternatives follows after this list.)
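
A minimal sketch of both alternatives in pymongo (the collection and property names, and the reference values, are hypothetical):

    from pymongo import MongoClient

    rels = MongoClient()["genealogy"]["relationships"]

    # Alternative 1: one has-parents document per person.
    rels.insert_one({"type": "has-parents",
                     "person": "p1", "mother": "p2", "father": "p3"})

    # Alternative 2: one has-child document per child, so that the
    # other parent (the partner) can be referenced per child.
    rels.insert_one({"type": "has-child",
                     "parent": "p1", "partner": "p4", "child": "p5"})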

The alternative representations carry the same semantics from a data modeling viewpoint; however, their computational impact can differ. For example, if the 80% case is finding the children of a person, having all children in one document is in general more efficient.

Even more efficient in this case would be a non-normalized approach where, for example, the references to the children of a person are stored within the document of that person itself.

Non-Normalized Collection: Combination of References and Embedding

The move from a normalized representation to a non-normalized one is in many cases motivated by the computational effort upon retrieval of the documents. There might be other reasons, too. What are the alternatives?

In principle, non-normalization in a document world means representing relationships not as documents of their own, but as sub-collections inside documents. This moves from a by-reference representation to a by-value representation. In the above case, the references to the children of a person are added as a sub-collection “children” inside the document of that person. The upside is that accessing the children of a person is quick; the downside is that finding the parents of a child means searching through the sub-collections of all documents.

This leads to an interesting consequence. If relationships have to be traversed in both directions, explicit representation as documents makes that easy, as each document representing a relationship can be interpreted in both directions. In the non-normalized case this is no longer given, as the entries in sub-collections are uni-directional. Therefore, in order to make the traversal in both directions easier, it might help to add a property “parents” to each child for the opposite direction.
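
A sketch of this non-normalized design (all names are hypothetical): children are referenced by value in the parent's document, and each child redundantly carries a “parents” property for the opposite direction:

    from pymongo import MongoClient

    people = MongoClient()["genealogy"]["people"]

    people.insert_one({"_id": "p1", "name": "Pat",
                       "children": ["p5", "p6"]})     # forward direction
    people.insert_one({"_id": "p5", "name": "Sam",
                       "parents": ["p1", "p4"]})      # reverse direction

    # Children of p1: a single direct document read.
    print(people.find_one({"_id": "p1"})["children"])

    # Without the redundant "parents" property, finding the parents of p5
    # would mean searching through every document's sub-collection:
    print([d["_id"] for d in people.find({"children": "p5"})])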

Furthermore, if in the 80% case not only the children have to be found, but their complete data set has to be fetched every time, it might be even more efficient to embed the whole documents of the children in the sub-collection “children”. This makes read-access fast, but it duplicates the documents of the children. The penalty is the increased effort necessary to update child data in at least two places.
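
A sketch of that update penalty (names hypothetical, continuing the collection from above): the child's data now lives both in its own document and inside every embedding parent:

    from pymongo import MongoClient

    people = MongoClient()["genealogy"]["people"]

    # Embed the full child document for fast reads.
    people.update_one({"_id": "p1"},
                      {"$set": {"children": [{"_id": "p5", "name": "Sam"}]}})

    # An update of the child now has to happen in at least two places.
    people.update_one({"_id": "p5"}, {"$set": {"name": "Samuel"}})
    people.update_many({"children._id": "p5"},
                       {"$set": {"children.$.name": "Samuel"}})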

What follows from here is more of the same. If grand-children always have to be determined, but not children, the transitive closure for that level can be stored explicitly, with all the same pros and cons.

And, of course, the same discussion applies to the parent and grand-parent relationship.

Summary

It is possible, of course, to store and manage recursive relationships in document-oriented databases. The alternatives of normalized and non-normalized representation need to be carefully explored in terms of read-access and update effort and efficiency. Each alternative is valid under the set of access patterns expected by the information system.

In JSON land, recursive collections are a special case of graph structures, as discussed here: https://realprogrammer.wordpress.com/2012/08/17/json-graph-structures/.

Relational Data in a Document-oriented NoSQL Database (Part 2): Schema

The ‘schema-less-ness’ of document-oriented databases is touted as a major plus and advantage for these systems. Why is that?

Relational Table vs. Document Collection

Terminology-wise, relational tables are in the domain of relational database management systems. Tables contain data in the form of rows. Document collections are in the domain of those NoSQL databases that support the document structure (e.g., MongoDB). Document collections contain data in discrete documents. Document collections are used as the basis for the following discussion.

One or Several Schemas?

Upfront, document-oriented databases have some form of schema built in. First, there is the concept of collections: before any document can be stored, a collection must be in place. Second, data to be stored in collections must be documents complying with a document data structure; in many cases this is JSON.

These constraints mean that each document is structured according to JSON. If a given document is seen as an instance of a document schema, then each document actually complies with that schema. That schema is implicit, as it is not externalized and represented separately. If each document in a collection is different, then there are as many (implicit) schemas as there are documents. If all documents have the same internal structure, then all comply with the same implicit schema.

In the general case, for a given collection, there can be as many implicit schemas as there are documents. In the minimal case, there is no schema (if the collection is empty) or one schema (if all documents comply with the same implicit schema). In contrast, in a relational table, all rows always comply with the table definition.

Schema Enforcement

A relational table enforces the structure of its rows. In contrast, a collection does not enforce the structure of its documents. As long as a document conforms to the supported data format (e.g., JSON), it can be stored in any collection.

In a given collection it is possible that all documents have the same internal structure; in that case they all follow the same schema. In the extreme case, each document has its own structure not shared by any other document, which means that each document has its own schema. A collection does not enforce the schema of its documents.
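
A short sketch of this behavior (collection name hypothetical): structurally unrelated documents are accepted into the same collection without complaint:

    from pymongo import MongoClient

    things = MongoClient()["demo"]["things"]

    # Two documents with entirely different implicit schemas,
    # stored side by side in one collection.
    things.insert_one({"name": "Alice", "age": 30})
    things.insert_one({"sensor": 7, "values": [1, 2, 3]})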

If document schemas have to be enforced, this can only be done outside the document database, either by the code that writes documents to collections (database-inbound) or by the code that reads documents from collections (database-outbound). In the inbound case, this ensures that all documents actually have a given structure. In the outbound case, documents that do not comply will never be processed, or they will be changed on the fly in order to comply. Alternatively, the reading code can throw an exception when it finds a non-compliant document, and the document can then be processed in order to make it compliant.
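
A minimal sketch of database-inbound enforcement (the required property set is a hypothetical stand-in for an externalized schema):

    from pymongo import MongoClient

    people = MongoClient()["demo"]["people"]
    REQUIRED = {"name", "age"}          # hypothetical document schema

    def insert_person(doc):
        # The writing code, not the database, rejects non-compliant documents.
        if not REQUIRED.issubset(doc):
            raise ValueError("document does not comply with the schema")
        people.insert_one(doc)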

Querying and Programming Model

In a relational world, database queries can assume that all rows in a table have the same structure and the appropriate types. The programs accessing the database and retrieving tuples can assume the same. From a programming model perspective there is no variation, as the structure is fixed and therefore predictable.

Accessing a document-oriented database is different, as neither the queries nor the accessing programs can assume any particular document structure (the database system does not enforce any structure). Assumptions can only be made if a fixed document schema, or a fixed set of variations, is enforced elsewhere by convention. The programming model therefore has to be acutely aware of the potential variety and has to understand how to deal with it.
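
A sketch of what such variation-aware reading code might look like (property names hypothetical):

    from pymongo import MongoClient

    people = MongoClient()["demo"]["people"]

    for doc in people.find():
        # The reading code cannot assume a fixed structure; every
        # property access has to tolerate a different implicit schema.
        name = doc.get("name") or doc.get("full_name", "<unknown>")
        age = doc.get("age")            # may simply be absent
        print(name, age)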

Discussion

With all these dynamic possibilities, it is no problem to store relational data in a document-oriented database management system. A document-oriented database management system can deal with structured data, even though it cannot enforce the structure. Even relations that change over time can be handled by a document-oriented database system, as it allows documents to change their shape over time.

Coming back to the initial question: the ‘schema-less-ness’ of document-oriented databases is touted as a major plus and advantage for these systems. Why is that? The answer might lie in the ease with which documents with different implicit schemas can be stored in the same collection.

Big caveat: storing documents of different schemas is ‘easy’. However, it pushes the complexity of dealing with documents of different implicit schemas into the programming model: the program has to be able to deal with the variation. And for the most part, I believe this is uncharted territory, as there are no programming language constructs that allow characterizing such variation during processing.