JSON Type Extensions

JSON has only a minimal set of data types. What are possible ways to actually extend it and add data types?

Example: ‘Date’

The basic (scalar) types provided by JSON are: ‘null’, ‘true’, ‘false’, ‘string’ and ‘number’ (http://www.json.org). For example, JSON does not have a basic or scalar type for ‘date’.

The date 7/29/2012 serves as the running example for this discussion of data type extensions. One possibility is simply to represent it as the string “7/29/2012” and use the ‘string’ type. In that case, consumers have to be aware that dates are encoded as strings, and a consumer has to inspect every string to see whether it can be interpreted as a date.

Aside from the processing effort, what if the provider intended the value to be a string and not a date? What if the consumer has a different interpretation in mind that would require it to be “29-07-2012”? What if a consumer is not aware that dates are encoded as strings?
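
To make the ambiguity concrete, here is a minimal Python sketch. The helper name guess_dates and the list of candidate formats are assumptions made for illustration; nothing in JSON prescribes them. It shows how a consumer that guesses at date strings can find one reading, two readings, or none for a given string.

```python
from datetime import date, datetime

# Hypothetical formats a consumer might try; JSON does not say which one applies.
CANDIDATE_FORMATS = ["%m/%d/%Y", "%d/%m/%Y"]

def guess_dates(text):
    """Return every date the string could represent under the candidate formats."""
    matches = []
    for fmt in CANDIDATE_FORMATS:
        try:
            matches.append(datetime.strptime(text, fmt).date())
        except ValueError:
            pass  # the string does not fit this format
    return matches

print(guess_dates("7/29/2012"))  # one reading, because there is no month 29
print(guess_dates("7/6/2012"))   # two readings: July 6 or June 7
print(guess_dates("hello"))      # no reading: a plain string after all
```

The second call is the troublesome one: both formats match, so the consumer cannot decide without knowledge that lives outside the JSON document.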

JSON’s Data Type System is a Closed System

JSON has a fixed set of scalar data types and no extension mechanism for adding scalar data types. There are two mechanisms for defining complex structures: objects and arrays. However, these mechanisms do not allow new data types to be defined. They only support setting the value of a property to a complex structure (which does not itself have a type).

So effectively, JSON’s data type system is closed and cannot be extended by new data types.

The Problem: JSON Data Type Extensions

In the absence of an extensible type system, this raises the question: how is a type like ‘date’ encoded? The goal is that the sender and the receiver of a JSON object containing a date both interpret it as a date, so that misinterpretations are avoided.

Or, the more general form of the question is: how are types represented that are not natively represented in JSON?

The blog post https://realprogrammer.wordpress.com/2012/07/18/json-is-strictly-by-value/ discussed possible problems and solutions for by-value and by-reference representation. Here, in contrast, the discussion is about data type representation for those types that do not have a direct equivalent in JSON itself.

Possible Approaches of Data Type Encoding in JSON

There are different approaches to implementing the data type ‘date’ in JSON. A few basic approaches are discussed in the following.

  • Naming Convention. In this approach the property name is augmented with additional characters to indicate that its value contains a date. For example {"BirthDay_D":"7/29/2012"}. All systems that write/read this property must be aware of the ‘_D’ designator and use it properly. In addition, when property names are used for search or display, the ‘_D’ must be stripped off. For database queries, it must be present, of course.
  • Value Formatting Rules. In this approach the type is ‘hidden’ in the value based on formatting rules. For example, dates could be formatted as “<month>/<day>/<year>”. This rule has to be known by all systems that read or write dates. In addition, every time a system reads a string, it has to check if the string is formatted according to one of the potentially many formatting rules it knows.
  • Value Tagging. This is a variation of the value formatting rules. Instead of formatting the value in a specific way, the value is prefixed with a type tag: {"BirthDay":"#date#7/29/2012"}. This separates the designation ‘date’ from the particular formatting. There are problems with this approach, too, as the systems accessing this value must parse it properly. In terms of queries, the proper designator has to be added to, e.g., search terms coming from the user interface.
  • Objects With Type Designator. One of the cleanest ways is to separate the property name, the value and its type. Since the value and its type are two pieces of data, an object is required to properly represent them. For example {"BirthDay":"7/29/2012", "#type":"date"}. In this case the type is denoted separately in a property called ‘#type’. The value of this property states the particular type, here ‘date’. This mechanism allows any number of types, whereby the type names are property values of ‘#type’. The downside is that properties with non-JSON types are objects. However, this is well supported by JSON libraries and packages and does not require special processing like naming conventions or value formatting/tagging rules.
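
The type designator approach can be sketched in Python as follows. This is one possible reading, not a definitive implementation: it assumes the typed value is nested as its own object with a hypothetical ‘value’ property next to ‘#type’, it assumes the month/day/year format used in the examples, and the names DECODERS and decode_typed are made up for illustration. Only json.loads and its object_hook parameter come from the standard library.

```python
import json
from datetime import date, datetime

# Assumed registry: the enumeration of known type names and their decoders.
DECODERS = {
    # Assumed format: <month>/<day>/<year>, as in the examples above.
    "date": lambda raw: datetime.strptime(raw, "%m/%d/%Y").date(),
}

def decode_typed(obj):
    """object_hook: replace {"#type": ..., "value": ...} objects with decoded values."""
    type_name = obj.get("#type")
    if type_name in DECODERS:
        return DECODERS[type_name](obj["value"])
    return obj  # ordinary objects pass through unchanged

text = '{"Note": "7/29/2012", "BirthDay": {"value": "7/29/2012", "#type": "date"}}'
doc = json.loads(text, object_hook=decode_typed)

print(type(doc["Note"]).__name__)      # still a string, no guessing involved
print(type(doc["BirthDay"]).__name__)  # decoded because of its "#type"
```

The string under "Note" looks like a date but stays a string, because only the explicit designator triggers decoding. Adding a type means adding one entry to the registry.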

There are many more intricate variations possible and under discussion. Over time a few might crystallize and be picked up by so many systems that a convention establishes itself. At that point the problem would be solved for real.

There is also an approach that I would consider a ‘not-so-good idea’. This approach uses a property name as the type designator. For example:

{"date":"7/29/2012"}

This approach has several severe problems. The first problem is that an object can have at most one date. If two dates are necessary, further properties and objects have to be used to encode that case. Second, code accessing the properties of an object has to understand that there is possibly a large set of property names that are ‘reserved’ or have specific meaning, namely, being type names. If 25 types are introduced, there would be 25 property names representing types, and any client or consumer of these JSON documents has to be aware of them. And finally, property names usually carry some semantics, like a ‘birth date’. If a property name is used as a type designator, then the fact that a date is a birth date has to be encoded separately, making the structure a lot more complex, in addition to forcing client implementations to be more complex.

Key: Common Understanding

All approaches rely on a common understanding of how to interpret the type extension that is not encoded in JSON itself. So a specification has to be agreed upon by the producer and consumer of the type extensions in order to ensure a common understanding.

The degree of common understanding required, however, varies considerably. Formatting rules are a lot harder to specify and comply with than a single property name ‘#type’ and an enumeration of possible type names like ‘date’.

Existing Approaches

What do ‘real’ systems actually do? For example, MongoDB has the following approach: http://www.mongodb.org/display/DOCS/Mongo+Extended+JSON. This system provides three different approaches, with only one catering to pure JSON. In this case, unfortunately, they follow the ‘not-so-good idea’ approach by using property names as type designators. The common understanding of the value representation is externally defined by BSON (http://bsonspec.org/).

JSON is strictly By-Value

JSON is an externalization format, not a programming language data type implementation. Why is this relevant?

JSON

JSON (http://www.json.org) is a plain-text representation of data (its strings are sequences of Unicode characters, so it is not limited to ASCII). It provides base data type representations and a grammar that defines how JSON structures are formed.

As for terminology, here are some synonyms often used:

  • JSON “object”: document
  • JSON “members”: properties
  • JSON “pair”: property
  • JSON “string” in “pair”: property name
  • JSON “value” in “pair”: property value

JSON is Syntax

It is syntax only. There are no data type semantics attached to it; all programming languages that process JSON define their own interpretation of the meaning of the syntax JSON defines.

JSON Semantics as such is Undefined

The semantics of JSON is not defined by JSON itself. It is undefined by the standard. Programming languages and databases have to define their (!) semantics of JSON.
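
As a small illustration of a language defining its own semantics, consider how Python's standard json module interprets the single JSON type ‘number’. This is a sketch of one language's choices, not a statement about what JSON prescribes; the sample document is made up.

```python
import json
from decimal import Decimal

text = '{"price": 0.1, "count": 1}'

# Default interpretation: Python splits 'number' into int and float.
default = json.loads(text)
print(type(default["price"]).__name__, type(default["count"]).__name__)

# The same text, read with a different interpretation of fractional numbers.
exact = json.loads(text, parse_float=Decimal)
print(type(exact["price"]).__name__)
```

The JSON text is identical in both cases; only the consumer's interpretation changes, which is exactly the part that JSON leaves open.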

For example, the JSON standard does not make a statement about unique property names in a document. In JSON terminology, an object (aka document) can have several pairs (aka properties), each having a string and a value (aka property name and property value). JSON does not constrain the property names to be unique. According to JSON it would be valid for a document to have several properties with the same name.

JSON Semantics Implementation

How is this implemented? Do systems actually allow several properties with the same name in a document? Let’s look at two of those.

MongoDB (version 2.0.2): it is possible to save a document that has the same property twice:

MongoDB shell version: 2.0.2
connecting to: test
> use json
switched to db json
> db.test.save({"a":1, "a":2})
> db.test.findOne()
{ "_id" : ObjectId("5006d58e92cca1a32772df6a"), "a" : 2 }
>

So, clearly MongoDB does not complain about the fact that a property name is stated twice. However, it makes a selection itself: it uses the second of the two and stores only that one. Of course, this raises the question of whether MongoDB applies other modifications on input as well.

JsonLint (http://jsonlint.org/) exhibits the same behavior. When validating {"a":1, "a":2}, two things happen: it deems the input ‘Valid JSON’, but then it modifies the input to only {"a":2}. This is bad in two ways. First, it is not clear which version is ‘valid’, and second, it is not a read-only lint! This means that if you paste in a JSON document, always check whether JsonLint modified it; your document might not be valid, while JsonLint’s modification of it is.

So it seems that some systems actually do not support having two properties with the same name. But then I would suggest they flag this as an error and not modify the document.
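
For comparison, here is a minimal Python sketch of the behavior suggested above: flag the duplicate as an error instead of silently picking one value. The function name reject_duplicates is hypothetical; json.loads and its object_pairs_hook parameter are part of Python's standard library.

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook: build a dict, but fail on a repeated property name."""
    result = {}
    for name, value in pairs:
        if name in result:
            raise ValueError(f"duplicate property name: {name!r}")
        result[name] = value
    return result

text = '{"a":1, "a":2}'

# Default behavior: like the systems above, the last value silently wins.
print(json.loads(text))

# Strict behavior: the document is rejected and left unmodified.
try:
    json.loads(text, object_pairs_hook=reject_duplicates)
except ValueError as err:
    print("rejected:", err)
```

Python's default parser thus makes the same silent selection as MongoDB and JsonLint, and the strict variant is an opt-in interpretation layered on top of the same syntax.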

JSON and References

How does a JSON document refer to a second one? For example, an object representing a user referring to an object representing an address?

JSON does not have the notion of a reference or pointer, nor the notion of an address or unique identifier (both necessary to make referencing work). In order to uniquely identify a document, the author of the document has to add a property that by convention is deemed to be a unique identifier (see e.g. MongoDB above, where the database adds an ‘_id’ property). Secondly, a reference to such a unique identifier is a regular property that by convention has to be understood as being a reference. For example, the value 5006d58e92cca1a32772df6a above could be the value of such a property. By convention, code interpreting this property would know that the value is actually a unique identifier.

Side note: this is the same situation relational databases are in; they don’t have references either. But in their case there is the concept of key and foreign key supervised by the database, both relying on data values. Except for the supervision functionality, JSON could be interpreted this way, too.

Coming back to the semantics for a moment. Assume a property “ref” is used to indicate that its value is a reference to another document. How is the absence of a link established, meaning, the notion that a given object does not refer to another one? One possible way is to set “ref” to the value null. Alternatively, “ref” could be left out. Either way (or both) is possible, but the choice needs to be established.
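
Such a convention could be sketched in Python as follows. Everything here is an assumption made for illustration: the sample documents, the identifier values, the helper name resolve, and the decision to treat both a null “ref” and a missing “ref” as ‘no link’.

```python
import json

text = '''[
  {"_id": "addr-1", "city": "Springfield"},
  {"_id": "user-1", "name": "Ann", "ref": "addr-1"},
  {"_id": "user-2", "name": "Bob", "ref": null},
  {"_id": "user-3", "name": "Cy"}
]'''

# Index the documents by the property that is, by convention, the identifier.
documents = {doc["_id"]: doc for doc in json.loads(text)}

def resolve(doc):
    """Follow "ref"; treat JSON null and a missing "ref" alike as 'no link'."""
    target_id = doc.get("ref")  # None for both null and an absent property
    if target_id is None:
        return None
    return documents[target_id]  # a KeyError here signals a dangling reference

print(resolve(documents["user-1"]))  # the address document
print(resolve(documents["user-2"]))  # None: "ref" is null
print(resolve(documents["user-3"]))  # None: "ref" is left out
```

Nothing in the JSON text marks “_id” or “ref” as special; the meaning exists only in the code that reads the documents.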

JSON and Linked Structures

In summary, in order to establish linked structures using JSON, special properties have to be called out in order to establish addressing and referencing. JSON itself does not provide concepts for this. In addition, the JSON semantics and interpretation have to be established (either explicitly by rules or implicitly by the programming code).

Engineers and architects be aware: JSON is a text syntax for externalization, not a programming language data type system.