JSON Graph Structures

As discussed in another blog on this site, JSON is by-value only and does not have reference types built in. So how are graph (reference) structures represented?

Graph Structures

In short, a graph (http://en.wikipedia.org/wiki/Graph_(data_structure)) has nodes and edges between nodes. They might be directional or not directional. A graph might have nodes without any edges. A graph might have cycles or might be acyclic. Independent of the type of graph, how is a graph represented in JSON?

Graph Node Identity and Edges

Before discussing the graph representations themselves, a precursor discussion on edges is in order: identity and existence.

  • Identity. Edges point at nodes and originate from nodes. An edge pointing to a node must be able to identify that node uniquely. One option is to add a unique identifier to every node. An edge pointing to that node then refers to this unique identifier. An alternative approach is to determine if one or several existing properties of a node uniquely identify that node (“composite key”). In the case an edge can refer to such a uniquely identifying set of properties and values.
  • Existence. An edge originates at a node and points to a node; so an edge has a relationship to two nodes. One possibility is to represent an edge explicitly as an object, and that object internally has properties that indicates the two nodes. This approach represents the edge explicitly. Alternatively, edges can be implicit by the edge data being part of the nodes themselves. An edge pointing from node A to node B could be implemented by identifying B in A to indicate that there is an edge from A to B and in B identifying A to indicate that an edge is originating from A to B. In this case the edge is not represented explicitly.

The choice of identifier and edge representation depends on the circumstances. If a composite key is available, adding unique identifiers is not necessary. If edges have to carry their own information (e.g. edge weight), their explicit representation might be more appropriate if this would make their processing easier.

Although easily to assume, graph nodes do not have to be top level documents, they can be embedded documents also. In the latter case, the identification might be more difficult as the embedding location might have to be taken into consideration also.

Alternative Graph Representations in JSON

There are different alternatives representing graphs in JSON. Some of those will be briefly discussed next. The list is by no means exhaustive and specific circumstances might warrant different representations.

Normalized Graph Representation

The most normalized representation is to represent graph nodes explicitly as top-level documents and graph edges explicitly as top level documents. In this case there are as many documents as there are nodes and edges (independent of how the identity is achieved).

In this approach, nodes might contain references to their edges or not. Edges might or might not contain references to the nodes. It is possible that nodes refer to their edges and the edges to their nodes. Any duplication makes the access faster, but the update will require more effort to ensure consistency.

Non-Normalized Graph Representation

In the non-normalized case, either nodes or edges are co-located within a documents. Some possibilities are

  • Node contains its outgoing edges. In this case, the edges are not explicit, but within the node from which the edge originates. This makes accessing outgoing edges fast, but in order to determine incoming edges, all nodes have to be inspected.
  • Node has its incoming edges. This is the opposite case; accessing the incoming edges would be fast, the outgoing would require traversal of all nodes.
  • Node has incoming and outgoing edges. In this case the look-up of edges from a node’s viewpoint is straight forward and fast, however, any change in edges has to be applied to two nodes. So the read access is fast, but the update access requires more effort.
  • Node has complete sub-graph reachable from it. This makes the traversal super-fast if started from a given node as the whole sub-graph is in the node. If the edges have no direction, this would mean the whole graph is in every node (unless the graph is partitioned). At the same time, this means that every node has a potentially huge amount of redundant data representing parts of the graph or even the whole graph.

The reverse can be modeled also:

  • Edges contains related nodes. In this case the nodes are sub-documents of edges and are not represented on their own. If there are many edges that share many nodes, this causes quite a duplication of data. If the graph is rather sparse and separated, this might be quite efficient.

A final approach could be to model the graph as a whole:

  • Complete graph in one document. In this case the whole graph is within one document. Any operation on the graph has to access only this one document. A sub-case would be that any non-graph related data on nodes and edges is outside this one document. So it only contains the graph structure, but all the ‘payload’ of nodes and edges would be in separate documents. This nicely separates the processing of the graph from the processing of node payloads.

There are many addition variations possible in addition to this list. The major point is that the access patterns (like payload, graph operations, etc.) play a major role in deciding on the final structure.

Conclusion

Graphs are important data structures and due to their particular complexity and context in which they are being used in it is not possible to derive to the one best graph representation. It is really necessary to contemplate the use of a graph in a specific context and model it in JSON accordingly.

JSON Type Extensions

JSON has only a minimal set of data types. What are possible ways to actually extend it and add data types?

Example: ‘Date’

The basic (scalar) types provided by JSON are: ‘null’, ‘true’, ‘false’, ‘string’ and ‘number’ (http://www.json.org). For example, JSON does not have a basic or scalar type for ‘date’.

For the discussion of data type extensions, as date example the date 7/29/2012 is used. One possibility is simply to represent “7/29/2012” as a string and use the ‘string’ type. If this is the case, consumers have to be aware that dates are actually encoded as strings. And, a consumer now has to inspect every string and try to see if it can be interpreted as a date.

Aside from the processing effort, what if the provider has intended to be a string and not a date? What if the consumer has a different interpretation in mind that would require it to be “29-07-2012”? What if a consumer is not aware that dates are encoded as strings?

JSON’s Data Type System is a Closed System

JSON has a fixed set of scalar data types and no extension mechanism to add additional scalar data types. There are two mechanisms to define complex structures: objects and arrays. However, this mechanism does not allow to define new data types. It only supports setting the value of a property to a complex structure (which does not have a type).

So effectively, JSON’s data type system is closed and cannot be extended by new data types.

The Problem: JSON Data Type Extensions

In the absence of an extensible type system, this bears the question: how is a type like ‘date’ encoded? The goal is that the sender and the receiver of a JSON object containing a date actually both interpret date as date and misinterpretations are avoided.

Or, the more general form of the question is: how are types represented that are not natively represented in JSON?

The blog https://realprogrammer.wordpress.com/2012/07/18/json-is-strictly-by-value/ discussed possible problems and solutions to the by-value and by-reference representation. Here, in contrast, the discussion is about data type representation for those types that do not have a direct equivalent in JSON itself.

Possible Approaches of Data Type Encoding in JSON

There are different approaches of implementing the data type ‘date’ in JSON. A few basic approaches are discussed in the following.

  • Naming Convention. In this approach the property name is augmented with additional characters to indicate that its value contains a date. For example {“BirthDay_D”:”7/29/2012″}. All systems that write/read this property must be aware of the ‘_D’ designator and use it properly. In addition, when property names are used for search or display, the ‘_D’ must be stripped off. For database queries, it must be present, of course.
  • Value Formatting Rules. In this approach the type is ‘hidden’ in the value based on formatting rules. For example, dates could be formatted as “<month>/<day>/<year>”. This rule has to be known by all systems that read or write dates. In addition, every time a system reads a string, it has to check if the string is formatted according to one of the potentially many formatting rules it knows.
  • Value Tagging. This is a variation of the value formatting rules. Instead of formatting the value in a specific way, the value is pre-pended with an encoding: {“BirthDay”:”#date#7/29/2012″}. This separates the designation ‘date’ from the particular formatting. There are problems, though, with this approach, too, as the systems accessing this value must parse it properly. In terms of queries, the proper designator has to be added to e.g. search terms coming from the user interface.
  • Objects With Type Designator. One of the cleanest ways is to separate the property name, the value and it’s type. Since these are two pieces of data, an object is required to properly represent it. For example {“BirthDay”:”7/29/2012″, “#type”:”date”}. In this case the type is denoted separately in a property called ‘#type’. The value of this property states the particular type, here ‘date’. This mechanism allows any number of types whereby the type names are property values of ‘#type’. The downside is that properties with non-JSON types are objects. However, this is well supported by JSON libraries and packages and does not require special processing like naming conventions or value formatting/tagging rules.

There are many more intricate variations possible and discussed. Over time a few might crystallize and picked up by so many systems that a convention establishes itself. At that point the problem would be solved for real.

There is also an approach that I would consider a ‘not-so-good idea’. This approach uses a property name as the type designator. For example:

{"date":"7/29/2012"}

This approach has several severe problems. The first problem is that with this approach, an object can only have at most one date. If two dates are necessary, further properties and objects have to be used to encode that case. Second, code accessing the properties of an object has to understand that there is possibly a large set of property names that are ‘reserved’ or have specific meaning, namely, being type names. If 25 types are introduced, there would be 25 property names representing types and any client or consumer of these JSON documents have to be aware of it. And finally, property names are usually carrying some semantics, like a ‘birth date’. If a property name is used as designator, then the fact that a date is a birth date has to be encoded separately, making the structure a lot more complex, in addition to forcing client implementations to be more complex.

Key: Common Understanding

All approaches rely on a common understanding of how to interpret the type extension that is not encoded in JSON itself. So a specification has to be agreed upon by the producer and consumer of the type extensions in order to ensure a common understanding.

The degree of common understanding, however, varies considerably. Formatting rules are a lot harder to specify and comply to compared to a single property name ‘#type’ and an enumeration of possible type names like ‘date’.

Existing Approaches

What do ‘real’ systems actually do? For example, MongoDB has the following approach: http://www.mongodb.org/display/DOCS/Mongo+Extended+JSON. This system provides three different approaches, with only one catering to pure JSON. In this case, unfortunately, they follow the ‘no-so-good idea’ approach by using property names as type designator. The common understanding of the value representation is externally defined by BSON (http://bsonspec.org/).