Negation and Complement in a Multi-Schema World

Negation and complement are not so straight forward in a multi-schema world where different documents in a collection follow different schemas. Let’s start with an example.

Example: Count()

> db.nc.save({"a":"value", "b":"v1"})
> db.nc.save({"a":"value", "b":null})
> db.nc.save({"a":"value"})

Now let’s count how many documents are there with ‘b’ having a value of either ‘null’ or not ‘null’.

> db.nc.find({"$or":[{"b":null}, {"b":{$ne:null}}]}).count()
3

Is 3 the correct result?

Basing the query on $type instead results into the same count:

> db.nc.find({"$or":[{"b":{$type:10}}, {"b":{$not:{$type:10}}}]}).count()
3

Is that a correct count?

Interpretation of ‘null’

The query system of MongoDB interprets a query for a property with value ‘null’ in a specific way: if the property is present, then it is checked if the property has the value ‘null’. If the property is not present, then it is interpreted as being present with the value of ‘null’.

In this interpretation, the above query results are correct.

Alternative Interpretation of ‘null’

If only those documents should be considered that actually have the property present, the queries have to be extended to check for that.

> db.nc.find({"$and":[{"b":{$exists:true}}, {"$or":[{"b":null}, {"b":{$ne:null}}]}]}).count()
2

and

> db.nc.find({"$and":[{"b":{$exists:true}}, {"$or":[{"b":{"$type":10}}, {"b":{$not:{"$type": 10}}}]}]}).count()
2

In order to change this interpretation an additional condition is added that checks for the existence of the property being queried. Luckily, the desired interpretation was possible through changing the query without requiring and code.

Looking Beyond ‘null’

Investigating the behavior around ‘null’ was a relatively small effort as it is quite compact. However, with additional operators like ‘$lt’, ‘$gt’, etc., more situations with specific interpretations can be found and have to be carefully planned through when implementing.

There are three basic rules

  • If every document has the same schema (meaning, the same properties with the same names and types), then queries can be written without caveats wrt. multi-schema collections.
  • If documents have different schemas in the sense the some contain a property and some do not contain that property, then queries should be written in such a way that the absence of a property does not cause a wrong result or a wrong interpretation.
  • The same is true for documents that have the same properties, but different types (like the property is sometimes a string and sometimes a number). In this case the queries should be written in such a way that the different types do not lead to a wrong interpretation of the result.

A combination of three cases is possible, meaning, that, for example, all documents have a property ‘p1’ of the same type, a property ‘p2’ that is sometimes not present and a property ‘p3’ that is sometimes not present and sometimes is of different types.

The only alternative to being really careful with the query construction and interpretation is to transform all documents so that they all follow the same schema (global schema for that collection). The particular context and situation determines what the best approach is.

So what’s the Big Deal?

The issues around queries in a multi-schema collection are probably less of a big deal if you are in complete control of your code. In this case you can make sure yourself that the correct interpretation is followed everywhere (by encapsulation, constraint checking, code review or mandate).

However, if you are using libraries as-is and you do not have the ability to change them, you need to make sure that their internal interpretation matches yours. If not, errors can sneak easily into your code base.

On top, sometimes the result determined by code are checked by running queries on the command line or shell. It is not always obvious if e.g. an ‘$exists’ was applied in the code or not. So the ‘checking’ of results using the command-line has its own challenges.

Advertisements

Null, Undefined, NaN and Missing Property: Goto Considered Harmful (Part 3)

It turns out there is a lot more to discuss about MongoDB and its handling of basic and complex data types. That warrants a separate series of blogs altogether. Here is one more part in this three part series; a future new series will focus exclusively on MongoDB data types.

MongoDB and ‘undefined’

As discussed before, it is possible to store a document in MongoDB that has a property with value ‘undefined’. However, it turns out that MongoDB on inserting is changing ‘undefined’ to ‘null’. So a change in value is taking place implicitly. So while MongoDB accepts the value ‘undefined’ on its interface, storage-wise it is changed to ‘null’.

As a consequence, it is not possible to query for ‘undefined’. When querying for ‘null’ all the documents will be returned that were originally stored with the values ‘null’ as well as ‘undefined’ (as ‘undefined’ was changed to ‘null’).

Implication

The implicit change from ‘undefined’ to ‘null’ has a few important implications:

  • If an application makes a semantic distinction between ‘undefined’ and ‘null’, that distinction will not be kept when storing ‘undefined’ and ‘null’ as MongoDB will not keep those as separate values. Instead, the application must record the fact that a property value is ‘undefined’ in a different way (e.g., with an embedded document or a specific value that it interprets as ‘undefined’).
  • MongoDB applies implicit data value changes and that in specific cases the values being put in are not the values being returned when queried. ‘undefined’ is one of values that will be changed. Applications must be aware of this in order to decide if this is OK with them or not. If not, a different representation must be used by the application that does not fall into this category.

Conclusion

This 3-part mini-series started out looking at aggregation in the presence of various data types and brought up a few examples that show different interpretations when aggregating. These interpretations are essential when using MongoDB.

A side effect of the mini-series is that a larger discussion of data types and MongoDB opened up and is necessary to fully understand MongoDB’s behavior. A future blog series will address data types exclusively.

What’s in a JSON Document Property Name?

What is a good choice of property names in JSON documents? And what is a good (semantic) modeling approach?

The Issue

Property names are not only random names; to a large extent they indicate the semantics to humans of what a document contains and represents. Choosing the right property name, and especially property value, its very important for several reasons, like understanding and processing. Two examples will highlight the issues and trigger the discussion.

BadSupplier Example

This is a possible way to describe a supplier:

{"name":"BadSupplier",
 "rating": 0.1,
 "bolts": [{"steel": 25},
           {"aluminum": 55}],
 "nuts": [{"steel": 24},
          {"aluminum": 56}]
}

This document describes a supplier selling two types of parts, bolts and nuts. And each type that is available is made from two different materials, with the number of the available parts of that material being maintained. A supplier has a rating, too.

This is a perfectly valid document, but it potentially has problems. But before discussing those, let’s look at another example.

GoodSupplier Example

{"name":"GoodSupplier",
 "rating": 1.0,
 "parts": [{"type": "bolt", "material":"steel", "availability":25},
           {"type": "bolt", "material":"aluminum", "availability":55},
           {"type": "nut", "material":"steel", "availability":24},
           {"type": "nut", "material":"aluminum", "availability":56}]
}

This is also a perfectly valid document. One obvious difference is that to describe GoodSupplier a lot more properties are used compared to BadSupplier, almost twice as many. But there are other differences that become clear when looking at how the documents are queried.

BadSupplier vs. GoodSupplier: Discussion

Upfront, in reality, it is not only about choosing good property names, but also it is about data modeling (or information modeling), namely, how is the data structured along the criteria for good structuring and modeling. Of course, ‘good’ is in the eye of the beholder, so the discussion here is as seen from my viewpoint.

The best way to have the discussion in my opinion is looking at queries that might be interesting.

  • What type of parts does a supplier sell?
    • GoodSupplier: find all distinct values of ‘type’ from each document in the sub-collection ‘parts’.
    • BadSupplier: find all distinct property names from each document, except ‘name’ and ‘rating’.

This query highlights the fact that in the GoodSupplier example, the product types are values, whereby in the BadSupplier example the product types are property names. If different suppliers supply different parts, the query for the GoodSupplier example stays the same, plus it also works across all suppliers. The assumption is that there is a sub-collection ‘parts’ and each part has a ‘type’.

In the BadSupplier example case, each supplier might supply different parts, and so all property names have to be collected from each document (except ‘name’ and ‘rating’). However, it can very well be the case that different suppliers have additional properties (like e.g. Address, Customers, etc.), so this means that those have to be known upfront and filtered out document by document.

Dynamic schema changes can also be very difficult. In the GoodSupplier example, the only two assumptions are that there is a property ‘type’ and a sub-collection ‘parts’, other properties are not relevant for this query. In the BadSupplier example, any new property that will be added to a supplier must be made known and categorized as part type (or not) so that queries can exclude them.

A real problem occurs if the same property name is representing a product type for one supplier, but not a product type for another. In this case the property names representing part types would have to be managed per document.

  • How many parts are available for each type?
    • GoodSupplier: find all documents with the same ‘type’ and add up the value of the property ‘availability’.
    • BadSupplier: for all values of all properties (except ‘name’ and ‘rating’), sum up the values of the properties ‘steel’ and ‘aluminum’.

This query highlights that in the GoodSupplier example, the availability for all types can be added up with the same query, no matter what the part types are. In the BadSupplier example, not only is it important to know the properties, but also all materials in order to only add up numbers that represent part availability.

  • Find all suppliers that have more than 1000 parts on hand.
    • GoodSupplier: for each supplier, add up ‘availability’ for each part and if sum is greater than 1000, emit the supplier name.
    • BadSupplier: for each supplier, determine the property names that represent parts, and for each of those, find the properties that represent materials. Add up the availability and determine if the sum is greater than 1000.

This query, like the previous ones, requires in the GoodSupplier example case a single query across all suppliers whereas in the BadSupplier example requires to understand the part names for each supplier.

What’s Going on?

In one sentence and in relational terms: In the BadSupplier example for each part and material there is one column. In the GoodSupplier example, there is one row for each part.

In NoSQL terms: in the BadSupplier example, the query must analyze the schema for each supplier document in order to determine the appropriate property names and their values. In the GoodSupplier example, the query must only access property values to determine the query result since there is a schema assumption in place.

Why did I use the labels ‘Bad’ and ‘Good’? Because both are not equal alternatives from my viewpoint. Here are additional reasons aside from the query formulation.

  • Extensibility. Adding part types does not affect the queries in the GoodSupplier example and neither in the BadSupplier example. Adding more non-part properties will fail the queries for the BadSupplier example. So dynamic schema changes are a problem in the BadSupplier example.
  • Query Language Support: Query languages (such as the one MongoDB provides), use property names as selectors and retrieve property values. Retrieving property names and conditionally excluding property names is not directly supported. So any operation that requires property name management must be processed outside the database in the client (increasing the complexity substantially).

Summary

While there is no hard and fast rule on how to model document properties, following the idea of the relational model and applying it seems more supportive of the notion of dynamic schema changes and the existing query support.

By ‘following the relational model’ I mean storing instance data as property values, not property names (aka, in rows, not in columns).