JSON is strictly By-Value

JSON is an externalization format, not a programming language data type implementation. Why is this relevant?

JSON

JSON (http://www.json.org) is an ASCII representation of data. It provides base data type representations and a grammar about how to structure JSON structures properly.

As for terminology, here are some synonyms often used:

  • JSON “object”: document
  • JSON “members”: properties
  • JSON “pair”: property
  • JSON “string” in “pair”: property name
  • JSON “value” in “pair”: property value

JSON is Syntax

It is syntax only. There is not data type semantics attached to it; all programming languages that process JSON define their own interpretation of the meaning of the syntax JSON defines.

JSON Semantics as such is Undefined

The semantics of JSON is not defined by JSON itself. It is undefined by the standard. Programming languages and databases have to define their (!) semantics of JSON.

For example, the JSON standard does not make a statement about unique property names in a document. In JSON terminology, an object (aka document) can have several pairs (aka properties), each having a string and a value (aka property name and property value). JSON does not constrain the property names to be unique. According to JSON it would be valid for a document to have several properties with the same name.

JSON Semantics Implementation

How is this implemented? Do systems actually allow several properties with the same name in a document? Let’s look at two of those.

MongoDB (version 2.0.2): it is possible to save a document that has the same property twice:

MongoDB shell version: 2.0.2
connecting to: test
> use json
switched to db json
> db.test.save({"a":1, "a":2})
> db.test.findOne()
{ "_id" : ObjectId("5006d58e92cca1a32772df6a"), "a" : 2 }
>

So, clearly MongoDB does not complain about the fact that a property name is stated twice. However, it opts to make a selection itself and uses the second of the two and actually stores only one. Of course, this bears the question if MongoDB on input applies other modifications as well.

JsonLint (http://jsonlint.org/) exposes the same behavior. When validating {“a”:1, “a”:2} then two things are happening: it deems the input as ‘Valid JSON’, but then it modifies the input to only {“a”:2}.  This is bad in two ways. First, it is not clear which version is ‘valid’, and second, it is not a read-only lint! This means that if you paste in a JSON document, always check if JsonLint modified it; your document might not be valid, but JsonLint’s modification of it.

So it seems like that some systems actually do not support having two properties with the same name. But then I would suggest they flag this as error and not modify the document.

JSON and References

How does a JSON document refer to a second one? For example, an object representing a user referring to an object representing an address?

JSON does not have the notion of reference or pointer nor the notion of address or unique identifier (both necessary to make referencing work). In order to uniquely identify a document, the author of the document has to add a property that by convention is deemed to be a unique identifier (see e.g. MongoDB above, the database adds a ‘_id’ property). Secondly, a reference to such a unique identifier is a regular property that by convention has to be understood as being a reference. For example, the value 5006d58e92cca1a32772df6a above could be a value. By convention the interpretation of this property would know that the number is actually a unique identifier.

Side note: this is the same situation relational databases are in, they don’t have references either. But, in their case there is the concept of key and foreign key supervised by the database, both relying on data values. Except for the supervision functionality, JSON could be interpreted this way, too.

Coming back to the semantics for a moment. Assume a property “ref” is used to indicate that its value is a reference to another document. How is the absence of link established, meaning, the notion that a given object does not refer to another one? On possible way is to set “ref” to the value null. Alternatively “ref” could be left out. Either (or both ways) are possible, but that needs to be established.

JSON and Linked Structures

In summary, in order to establish linked structures using JSON, special properties have to be called out in order to establish addressing and referencing. JSON itself does not provide concepts for this. In addition, the JSON semantics and interpretation has to be established (either explicit by rules or implicit by the programming code).

Engineers and architectus be aware: JSON is an ASCII syntax for externalization, not a programming language data type system.

Advertisement

Joins: (Almost) Impossible to Avoid in Document-oriented Databases

There is a lot of ‘chatter’ about the concept and support of joins in document-oriented databases. So what is the underling issue?

Joins in RDBMS

‘Join’ in the relational world is an operation on two relations that relates the tuples in these relations with each other based on some comparison criteria on the tuples’ attributes. For example, the comparison can be ‘R1.a = R2.b’ and so for each tuple from the first relation R1 all tuples from the second relation R2 are retrieved and combined that match the comparison, meaning, the attribute ‘a’ must match the attribute ‘b’. A detailed discussion can be found here: http://en.wikipedia.org/wiki/Join_%28SQL%29.

Joins allow to relate data from different relations and the join operator is supported by a relational database management system. A typical use case is to find all parts that a supplier supplies. And, for a given part, find all its suppliers. The suppliers and parts are usually stored in different relations and the data have an m:n relationship with each other.

Joins across Documents?

So why the chatter, then? If a document-oriented database stores data in different document collections and if the documents need to be related to each other, then a join is in order. The example of suppliers and parts applies here in the exact same way.

Now, if a document-oriented database does not support joins, what to do? Well, in reality the join will be performed in some layer above the database in a programming language. If all suppliers have to be displayed for a given part, then a program that computes this result effectively implements a join; it is not done in the database, though.

Pre-joined data in Documents?

Some optimization is possible. If the access pattern follows an 80-20 rule, then document-oriented databases allow some hard-coded optimization. If in 80% of the cases the suppliers for a part are requested, and only in 20% the opposite, then the designer of the document layout could create for each part document a sub-collection ‘supplier’ that contains the suppliers of this part. In 80% of the cases no join is necessary any more as the suppliers are ‘pre-joined’ with the parts they supply, only in the 20% of the cases a join is necessary.

However, this causes what in the relational world is called anomalies: If a supplier is removed, then all part documents have to be searched for this supplier. Or if a supplier is added, then all those part documents have to be updated that are supplied by this supplier. Updating supplier data also requires to search the part documents. Pre-joining is effectively a specific de-normalization activity for performance reasons.

Does the type of relationship matter?

Are there relationships that by their nature can be pre-joined without penalty? A very specific relationship, the part-of relationship, falls into this category. It is a ‘clean’ approach since the life time of the part-of objects are the exact same as the containing object.

Another relationship that feels as if pre-joining makes sense is the 1:1 relationship where two objects are exclusively related to each other. However, this is not really the case as one object would be a property of the other and that then could be done the other way around, too. So the 80-20 rule case applies here, too.

In reality, however, relationship between data are usually a lot more complex then just part-of relationships. This in turn means that joins will be necessary. The only real exception is if the 80-20 rule is really a 100-0 rule. This would mean that all access are the exact same and no joins are necessary.

Underlying Conceptual Foundation

Conceptually as soon as independent entities (i.e. objects in their own right) are related to each other, and if their relationship is traversed in both directions at some point in time during the execution of the application, a join is necessary and factually taking place.

Pre-joining is the materialization of the traversal in one direction. So two pre-joins, one for each direction, are possible. If the pre-join in both directions takes place, no join has to be performed upon retrieval; however, the join functionality was applied at time of update or insert in order to accomplish the pre-joins.

As soon as pre-joins exist, possible update, insert and delete anomalies have to be carefully taken care of as pre-joins are the equivalent to de-normalization and therefore data redundancy. At insert, update and delete time all redundant copies of the objects have to be found and the appropriate functionality applied.

Pre-joins are for read-performance reasons only; they are not a conceptual matter and in fact cause additional work at insert, update or delete time instead; so the computational work shifted, but is not avoided.

Note

‘Join’ is a database operator. The same functionality can be implemented in application code outside or ‘on top’ of the database. Most likely the method or function is not called ‘join’ even though it in fact implements that functionality. So be aware of the situation that a document-oriented database does not implement a join and the engineers claim not to need one. The functionality of a join might just be there under a different name.

Schema Derivation: Schema-less vs. Schema-based Documents

In the world of document-oriented databases there is the notion that documents are ‘schema-less’ because no schema has to be defined and registered with the database before documents can be stored. Sometimes this is also referred to as ‘schema-free’. Let’s investigate this further: does ‘schema-free’ really exist?

JSON

In MongoDB, for example, documents use the JSON (http://www.json.org/) structure (with an extended set of types as defined in BSON http://bsonspec.org/). So all document in the database follow the same underlying data type structure. The database will reject a document that is not JSON/BSON compliant.

Schema Enforcement vs. Schema Freedom

In a relational database the notion of schema means that the database knows about the schema (e.g. structure of a table) and all data stored must follow that structure. The database management system is the enforcing entity of the schema. Data not following the schema will be rejected.

In a document-oriented system like MongoDB the notion of schema is not known to the database and therefore not enforced by the database. Any document can have any structure as long as it is valid according to JSON.

Explicit Schema vs. Implicit Schema

In the case of relational databases, the schema is explicit as it is possible to retrieve the schema as a separate data structure from the database. Relational databases have system tables that contain their schema. So the database is aware of the schema in such a way that it can be retrieved.

In a document-oriented database it is not possible to inquire the schema as it has not been defined to the database. However, it could be the case that all documents follow a schema (that is defined outside the database). If this is the case, the schema of the documents can be called an implicit schema. It is not necessary for the database to enforce a schema for the documents to comply to one.

Schema-less, Schema-free

The fact that a document-oriented database does not enforce a schema of its documents does not mean that there is no schema. It only means that the database does not enforce it. However, it could be enforced elsewhere. For example, the software that writes the documents can enforce their schema. Or the software that reads the documents enforces the schema by only accepting those documents that comply to its expectations.

But what if a user manually defines a documents and stores it?

Individual Schema vs. Global Schema

A different viewpoint is possible. If a user stores a manually designed document then it is possible to view this document of having an individual schema. It is the only one with that schema. If the user now adds a second document that has the exact same structure, then it is possible to view these as two documents with the same implicit schema. At this point there are two documents following the same (implicit) schema. It got a bit more global, tough, as now two documents are following it. If all documents follow the same (implicit) schema, then there is an (implicit) global schema.

Different documents can follow different (implicit) schemas. In the extreme case, all documents are different from each other in their structure and each has its own schema.

Schema Derivation

JSON is a construction principle for hierarchical data structures. Each document following the JSON model is hierarchical and finite as JSON does not support (cyclic) references.

Based on the hierarchical and finite structure of JSON it is possible to derive the implicit schema of the JSON documents and make it explicit. On a very high level, the schema derivation algorithm iterates over all documents in the database and determines the schema for each of it. Documents that follow the same schema are grouped together. After all documents have been inspected, there will be one (and possibly more) schema and the associated complying documents.

Details of this algorithm will be discussed in a different blog at some later time.

Summary

Documents have a schema, there are not schema-less or schema-free. The schema of a document is implicit (and can be made explicit). Several documents can follow the same schema. While the database is not enforcing the schema, it could be enforced by a separate layer outside the database.

So, not-enforcement of a schema by the database != schema-less/schema-free.