Negation and complement are not so straight forward in a multi-schema world where different documents in a collection follow different schemas. Let’s start with an example.
Example: Count()
> db.nc.save({"a":"value", "b":"v1"}) > db.nc.save({"a":"value", "b":null}) > db.nc.save({"a":"value"})
Now let’s count how many documents are there with ‘b’ having a value of either ‘null’ or not ‘null’.
> db.nc.find({"$or":[{"b":null}, {"b":{$ne:null}}]}).count() 3
Is 3 the correct result?
Basing the query on $type instead results into the same count:
> db.nc.find({"$or":[{"b":{$type:10}}, {"b":{$not:{$type:10}}}]}).count() 3
Is that a correct count?
Interpretation of ‘null’
The query system of MongoDB interprets a query for a property with value ‘null’ in a specific way: if the property is present, then it is checked if the property has the value ‘null’. If the property is not present, then it is interpreted as being present with the value of ‘null’.
In this interpretation, the above query results are correct.
Alternative Interpretation of ‘null’
If only those documents should be considered that actually have the property present, the queries have to be extended to check for that.
> db.nc.find({"$and":[{"b":{$exists:true}}, {"$or":[{"b":null}, {"b":{$ne:null}}]}]}).count() 2
and
> db.nc.find({"$and":[{"b":{$exists:true}}, {"$or":[{"b":{"$type":10}}, {"b":{$not:{"$type": 10}}}]}]}).count() 2
In order to change this interpretation an additional condition is added that checks for the existence of the property being queried. Luckily, the desired interpretation was possible through changing the query without requiring and code.
Looking Beyond ‘null’
Investigating the behavior around ‘null’ was a relatively small effort as it is quite compact. However, with additional operators like ‘$lt’, ‘$gt’, etc., more situations with specific interpretations can be found and have to be carefully planned through when implementing.
There are three basic rules
- If every document has the same schema (meaning, the same properties with the same names and types), then queries can be written without caveats wrt. multi-schema collections.
- If documents have different schemas in the sense the some contain a property and some do not contain that property, then queries should be written in such a way that the absence of a property does not cause a wrong result or a wrong interpretation.
- The same is true for documents that have the same properties, but different types (like the property is sometimes a string and sometimes a number). In this case the queries should be written in such a way that the different types do not lead to a wrong interpretation of the result.
A combination of three cases is possible, meaning, that, for example, all documents have a property ‘p1’ of the same type, a property ‘p2’ that is sometimes not present and a property ‘p3’ that is sometimes not present and sometimes is of different types.
The only alternative to being really careful with the query construction and interpretation is to transform all documents so that they all follow the same schema (global schema for that collection). The particular context and situation determines what the best approach is.
So what’s the Big Deal?
The issues around queries in a multi-schema collection are probably less of a big deal if you are in complete control of your code. In this case you can make sure yourself that the correct interpretation is followed everywhere (by encapsulation, constraint checking, code review or mandate).
However, if you are using libraries as-is and you do not have the ability to change them, you need to make sure that their internal interpretation matches yours. If not, errors can sneak easily into your code base.
On top, sometimes the result determined by code are checked by running queries on the command line or shell. It is not always obvious if e.g. an ‘$exists’ was applied in the code or not. So the ‘checking’ of results using the command-line has its own challenges.