JSON Collections: Self-referring Documents (aka Recursive Collection(s))

Recursive collections of documents might be necessary to model a specific domain data set. How can this be done?

Example: Parent-Child Relationship

One of the standard examples of self-referring data structures is the parent-children relationship. People have parents and might have children. Parents are always children themselves. Children might have children also. This example is recursive, and it has the interesting property that the knowledge about people is not complete.

Normalized Collection: References Only

The most normalized representation in a document-oriented world is one document for each person. So if a person is known or needs to be represented, a document is added to the system for that person with the data of that person.

This leaves the discussion of how to model the relationships between people. There are several alternatives (and this is not an exhaustive list):

  1. Has-parents relationships as separate document. In this case the relationship has-parents can be implemented as a document that has three properties in it: person, mother, father. (Possible implementations of references are discussed here: https://realprogrammer.wordpress.com/2012/08/17/json-graph-structures/). So every person has possibly a corresponding document representing its parents. An alternative design can be that the relationship is split into has-mother and has-father (represented as two different documents with two properties each).
  2. Has-children relationship as separate documents. The has-children relationship is more complicated. If a person has children, then the question arises, is there one document per child or one document that contains the reference to all children? One complication is that a child has two parents and not all children of a person have to be with the same partner. A document per child is easier to structure then a document for all children if the partner has to be referenced also. If that information is not necessary, having all children referenced in one document is reasonable as well.

The alternative representations contain the same semantics from a data modeling viewpoint, however, their computational impact might be different. For example, if the 80% case is finding the children of a person, having all children as one document is more efficient in general.

What would be even more efficient in this case would be a non-normalized approach where e.g. the references to the children of a person are within the document of that person.

Non-Normalized Collection: Combination of References and Embedding

To move from a normalized representation to a non-normalized one is in many cases related to the computational effort upon retrieval of the documents. There might be other reasons, too. What are the alternatives?

In principle, non-normalization in a document world means to represent relationships not as documents themselves, but as sub-collections inside documents. This is moving from a by-reference representation to a by-value representation. In the above case, the references to children of a person are added as a sub-collection “children” inside the document of that person. The upside is that accessing the children of a person is quick, the downside is that finding the parents of a child means to search through sub-collections.

This leads to an interesting consequence. If relationships have to be traversed in both directions, explicit representations as documents make that easy as each document representing a relationship can be interpreted in both directions. In the non-normalized case, this is not given any more as the entries in sub-collections are uni-directional. Therefore, in order to make the traversal in both directions easier, it might help if each child has a property “parent” added for the opposite direction.

Furthermore, if not only the children have to be found in the 80% case, but also their complete data set has to be fetched every time, it might be even more efficient to embed the whole documents of the children in the sub-collection “children”. This makes read-access fast, but duplicates the document of a child. So the penalty is the increased effort necessary for updates of child data in at least two places.

What follows from here is more of the same. If always grand-children have to be determined, but not children, the transitive closure for that level can be explicitly stored. With all the same pros and cons.

And, of course, the same discussion applies to the parent and grand-parent relationship.

Summary

It is possible, of course, to store and manage recursive relationships in document-oriented databases. The alternatives of normalized and non-normalized representation need to be carefully explored in terms of read-access and update effort and efficiency. All alternatives are valid under the set of access patterns that are expected by the information system.

In JSON land, recursive collections are a special case of graph structures, as discussed here: https://realprogrammer.wordpress.com/2012/08/17/json-graph-structures/.

Advertisements

JSON Graph Structures

As discussed in another blog on this site, JSON is by-value only and does not have reference types built in. So how are graph (reference) structures represented?

Graph Structures

In short, a graph (http://en.wikipedia.org/wiki/Graph_(data_structure)) has nodes and edges between nodes. They might be directional or not directional. A graph might have nodes without any edges. A graph might have cycles or might be acyclic. Independent of the type of graph, how is a graph represented in JSON?

Graph Node Identity and Edges

Before discussing the graph representations themselves, a precursor discussion on edges is in order: identity and existence.

  • Identity. Edges point at nodes and originate from nodes. An edge pointing to a node must be able to identify that node uniquely. One option is to add a unique identifier to every node. An edge pointing to that node then refers to this unique identifier. An alternative approach is to determine if one or several existing properties of a node uniquely identify that node (“composite key”). In the case an edge can refer to such a uniquely identifying set of properties and values.
  • Existence. An edge originates at a node and points to a node; so an edge has a relationship to two nodes. One possibility is to represent an edge explicitly as an object, and that object internally has properties that indicates the two nodes. This approach represents the edge explicitly. Alternatively, edges can be implicit by the edge data being part of the nodes themselves. An edge pointing from node A to node B could be implemented by identifying B in A to indicate that there is an edge from A to B and in B identifying A to indicate that an edge is originating from A to B. In this case the edge is not represented explicitly.

The choice of identifier and edge representation depends on the circumstances. If a composite key is available, adding unique identifiers is not necessary. If edges have to carry their own information (e.g. edge weight), their explicit representation might be more appropriate if this would make their processing easier.

Although easily to assume, graph nodes do not have to be top level documents, they can be embedded documents also. In the latter case, the identification might be more difficult as the embedding location might have to be taken into consideration also.

Alternative Graph Representations in JSON

There are different alternatives representing graphs in JSON. Some of those will be briefly discussed next. The list is by no means exhaustive and specific circumstances might warrant different representations.

Normalized Graph Representation

The most normalized representation is to represent graph nodes explicitly as top-level documents and graph edges explicitly as top level documents. In this case there are as many documents as there are nodes and edges (independent of how the identity is achieved).

In this approach, nodes might contain references to their edges or not. Edges might or might not contain references to the nodes. It is possible that nodes refer to their edges and the edges to their nodes. Any duplication makes the access faster, but the update will require more effort to ensure consistency.

Non-Normalized Graph Representation

In the non-normalized case, either nodes or edges are co-located within a documents. Some possibilities are

  • Node contains its outgoing edges. In this case, the edges are not explicit, but within the node from which the edge originates. This makes accessing outgoing edges fast, but in order to determine incoming edges, all nodes have to be inspected.
  • Node has its incoming edges. This is the opposite case; accessing the incoming edges would be fast, the outgoing would require traversal of all nodes.
  • Node has incoming and outgoing edges. In this case the look-up of edges from a node’s viewpoint is straight forward and fast, however, any change in edges has to be applied to two nodes. So the read access is fast, but the update access requires more effort.
  • Node has complete sub-graph reachable from it. This makes the traversal super-fast if started from a given node as the whole sub-graph is in the node. If the edges have no direction, this would mean the whole graph is in every node (unless the graph is partitioned). At the same time, this means that every node has a potentially huge amount of redundant data representing parts of the graph or even the whole graph.

The reverse can be modeled also:

  • Edges contains related nodes. In this case the nodes are sub-documents of edges and are not represented on their own. If there are many edges that share many nodes, this causes quite a duplication of data. If the graph is rather sparse and separated, this might be quite efficient.

A final approach could be to model the graph as a whole:

  • Complete graph in one document. In this case the whole graph is within one document. Any operation on the graph has to access only this one document. A sub-case would be that any non-graph related data on nodes and edges is outside this one document. So it only contains the graph structure, but all the ‘payload’ of nodes and edges would be in separate documents. This nicely separates the processing of the graph from the processing of node payloads.

There are many addition variations possible in addition to this list. The major point is that the access patterns (like payload, graph operations, etc.) play a major role in deciding on the final structure.

Conclusion

Graphs are important data structures and due to their particular complexity and context in which they are being used in it is not possible to derive to the one best graph representation. It is really necessary to contemplate the use of a graph in a specific context and model it in JSON accordingly.