What’s in a JSON Document Property Name?

What is a good choice of property names in JSON documents? And what is a good (semantic) modeling approach?

The Issue

Property names are not only random names; to a large extent they indicate the semantics to humans of what a document contains and represents. Choosing the right property name, and especially property value, its very important for several reasons, like understanding and processing. Two examples will highlight the issues and trigger the discussion.

BadSupplier Example

This is a possible way to describe a supplier:

{"name":"BadSupplier",
 "rating": 0.1,
 "bolts": [{"steel": 25},
           {"aluminum": 55}],
 "nuts": [{"steel": 24},
          {"aluminum": 56}]
}

This document describes a supplier selling two types of parts, bolts and nuts. And each type that is available is made from two different materials, with the number of the available parts of that material being maintained. A supplier has a rating, too.

This is a perfectly valid document, but it potentially has problems. But before discussing those, let’s look at another example.

GoodSupplier Example

{"name":"GoodSupplier",
 "rating": 1.0,
 "parts": [{"type": "bolt", "material":"steel", "availability":25},
           {"type": "bolt", "material":"aluminum", "availability":55},
           {"type": "nut", "material":"steel", "availability":24},
           {"type": "nut", "material":"aluminum", "availability":56}]
}

This is also a perfectly valid document. One obvious difference is that to describe GoodSupplier a lot more properties are used compared to BadSupplier, almost twice as many. But there are other differences that become clear when looking at how the documents are queried.

BadSupplier vs. GoodSupplier: Discussion

Upfront, in reality, it is not only about choosing good property names, but also it is about data modeling (or information modeling), namely, how is the data structured along the criteria for good structuring and modeling. Of course, ‘good’ is in the eye of the beholder, so the discussion here is as seen from my viewpoint.

The best way to have the discussion in my opinion is looking at queries that might be interesting.

  • What type of parts does a supplier sell?
    • GoodSupplier: find all distinct values of ‘type’ from each document in the sub-collection ‘parts’.
    • BadSupplier: find all distinct property names from each document, except ‘name’ and ‘rating’.

This query highlights the fact that in the GoodSupplier example, the product types are values, whereby in the BadSupplier example the product types are property names. If different suppliers supply different parts, the query for the GoodSupplier example stays the same, plus it also works across all suppliers. The assumption is that there is a sub-collection ‘parts’ and each part has a ‘type’.

In the BadSupplier example case, each supplier might supply different parts, and so all property names have to be collected from each document (except ‘name’ and ‘rating’). However, it can very well be the case that different suppliers have additional properties (like e.g. Address, Customers, etc.), so this means that those have to be known upfront and filtered out document by document.

Dynamic schema changes can also be very difficult. In the GoodSupplier example, the only two assumptions are that there is a property ‘type’ and a sub-collection ‘parts’, other properties are not relevant for this query. In the BadSupplier example, any new property that will be added to a supplier must be made known and categorized as part type (or not) so that queries can exclude them.

A real problem occurs if the same property name is representing a product type for one supplier, but not a product type for another. In this case the property names representing part types would have to be managed per document.

  • How many parts are available for each type?
    • GoodSupplier: find all documents with the same ‘type’ and add up the value of the property ‘availability’.
    • BadSupplier: for all values of all properties (except ‘name’ and ‘rating’), sum up the values of the properties ‘steel’ and ‘aluminum’.

This query highlights that in the GoodSupplier example, the availability for all types can be added up with the same query, no matter what the part types are. In the BadSupplier example, not only is it important to know the properties, but also all materials in order to only add up numbers that represent part availability.

  • Find all suppliers that have more than 1000 parts on hand.
    • GoodSupplier: for each supplier, add up ‘availability’ for each part and if sum is greater than 1000, emit the supplier name.
    • BadSupplier: for each supplier, determine the property names that represent parts, and for each of those, find the properties that represent materials. Add up the availability and determine if the sum is greater than 1000.

This query, like the previous ones, requires in the GoodSupplier example case a single query across all suppliers whereas in the BadSupplier example requires to understand the part names for each supplier.

What’s Going on?

In one sentence and in relational terms: In the BadSupplier example for each part and material there is one column. In the GoodSupplier example, there is one row for each part.

In NoSQL terms: in the BadSupplier example, the query must analyze the schema for each supplier document in order to determine the appropriate property names and their values. In the GoodSupplier example, the query must only access property values to determine the query result since there is a schema assumption in place.

Why did I use the labels ‘Bad’ and ‘Good’? Because both are not equal alternatives from my viewpoint. Here are additional reasons aside from the query formulation.

  • Extensibility. Adding part types does not affect the queries in the GoodSupplier example and neither in the BadSupplier example. Adding more non-part properties will fail the queries for the BadSupplier example. So dynamic schema changes are a problem in the BadSupplier example.
  • Query Language Support: Query languages (such as the one MongoDB provides), use property names as selectors and retrieve property values. Retrieving property names and conditionally excluding property names is not directly supported. So any operation that requires property name management must be processed outside the database in the client (increasing the complexity substantially).

Summary

While there is no hard and fast rule on how to model document properties, following the idea of the relational model and applying it seems more supportive of the notion of dynamic schema changes and the existing query support.

By ‘following the relational model’ I mean storing instance data as property values, not property names (aka, in rows, not in columns).

Advertisements