Document Projection (Part 1): MongoDB

This post reviews some of the projection functionality that the MongoDB query interface provides. The emphasis is on comparing the projection of embedded object properties with the projection of embedded array elements. As the examples will show, the two are not symmetric, which is surprising.

Test Data Set

The initial test data set contains three documents:

{}
{"a": 1, "b": 2}
{"c": {"d": 3, "e": 4, "f": {"g": 5, "h": 6}}}

The empty document is the control document. The other two documents have properties, and one of them has several levels of embedding. The test data set, contained in a file “td.txt”, is loaded into MongoDB as follows:

mongoimport -d projection -c proj td.txt

Projecting Document Properties

Let’s observe a few projection queries:

> db.proj.find({}, {_id: 0, a: 1})
{  }
{ "a" : 1 }
{  }

To keep the results less verbose, the property “_id” is suppressed in all queries. The query asks for property “a”, and for each document that property is returned. Two documents do not contain “a”, so their result documents are empty.

> db.proj.find({}, {_id: 0, "c.e": 1})
{  }
{  }
{ "c" : { "e" : 4 } }

Reaching into documents is done by using the dot notation. “c.e” asks for the property “e” in “c”. The result contains the whole document structure starting at “c” all the way to “e”. Alternatively, MongoDB could have returned {"e": 4} only, but then the result would not correspond to the dot notation in the query.

> db.proj.find({}, {_id: 0, "c.f": 1})
{  }
{  }
{ "c" : { "f" : { "g" : 5, "h" : 6 } } }

No surprise here as the result contains the property “c.f”.

> db.proj.find({}, {_id: 0, "c.f.h": 1})
{  }
{  }
{ "c" : { "f" : { "h" : 6 } } }

Reaching further into the document also works as expected.

> db.proj.find({}, {_id: 0, a: 1, c: 1})
{  }
{ "a" : 1 }
{ "c" : { "d" : 3, "e" : 4, "f" : { "g" : 5, "h" : 6 } } }

Asking for different properties returns all of those for each document.

> db.proj.find({}, {_id:0, "c.e":1, "c.f.h":1})
{  }
{  }
{ "c" : { "e" : 4, "f" : { "h" : 6 } } }

Asking for different properties in the same embedded document returns a combination of the properties, not separate properties. This makes sense as in each document a property with a given name can only appear once.

> db.proj.find({}, {_id: 0, "c.g": 1})
{  }
{  }
{ "c" : {  } }

This query asks for an embedded property that does not exist. Nevertheless, the properties on the path to the missing property are included in the result. This is surprising to me: I would expect that if a property does not exist, nothing is included in the result at all, and certainly not a partial path. Here, the path “c.g” in the query does not match any path in the result.
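The results in this section can be mimicked with a few lines of plain JavaScript. This is only a sketch of the behavior observed above, not MongoDB's implementation. The helper name projectInclude is hypothetical, and the sketch handles only embedded documents (no arrays, no overlapping paths).

```javascript
// Hypothetical helper: emulates the inclusion projections observed above
// for embedded documents only (arrays are not handled).
function projectInclude(doc, paths) {
  const result = {};
  for (const path of paths) {
    let source = doc;
    let target = result;
    const keys = path.split(".");
    for (let i = 0; i < keys.length; i++) {
      const key = keys[i];
      // Stop if the current level is not an object or lacks the key.
      if (source === null || typeof source !== "object" || !(key in source)) break;
      const value = source[key];
      if (i === keys.length - 1) {
        // End of the path: copy the requested value.
        target[key] = value;
      } else if (value !== null && typeof value === "object") {
        // Intermediate level: the container is created even if nothing
        // below it matches, which is what produces { "c" : { } }.
        if (!(key in target)) target[key] = {};
        source = value;
        target = target[key];
      } else {
        break;
      }
    }
  }
  return result;
}

const sample = { c: { d: 3, e: 4, f: { g: 5, h: 6 } } };
console.log(JSON.stringify(projectInclude(sample, ["c.e", "c.f.h"]))); // {"c":{"e":4,"f":{"h":6}}}
console.log(JSON.stringify(projectInclude(sample, ["c.g"])));          // {"c":{}}
```

The second call shows where the partial result comes from in this sketch: the container for “c” is created before it is known whether “g” exists below it.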

Projecting Array Element Properties

First, another document is added to the test data set. It contains an array whose elements are nested documents, one of which in turn contains an array:

{"x": [{"y": 10}, {"w": [{"z": 11}, {"v": 12}]}]}

So a total of 4 documents are in the test data set now.

Projection of arrays is done using the “$slice” operator described here: http://docs.mongodb.org/manual/reference/operator/projection/slice/ (in addition to predicate-driven selection/projection using “$” or “$elemMatch”). Let’s try.

> db.proj.find({}, {_id: 0, "x": {$slice: 1}})
{  }
{ "a" : 1, "b" : 2 }
{ "c" : { "d" : 3, "e" : 4, "f" : { "g" : 5, "h" : 6 } } }
{ "x" : [ { "y" : 10 } ] }

This asks for one element from the array “x”, and indeed, the first element is returned. However, all properties of every document not containing “x” are returned also. This is surprising and in contrast to the behavior when projecting properties of embedded documents.
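One way to picture this result: a projection that consists only of “$slice” (plus the “_id” suppression) behaves like "return everything, but trim this one array". The following plain JavaScript sketch expresses that reading. The helper sliceOnly is hypothetical, covers only a positive element count, and describes the observed output rather than MongoDB internals.

```javascript
// Hypothetical helper: mimics {_id: 0, <field>: {$slice: n}} for n > 0.
// All other properties are kept; only the named array is trimmed.
function sliceOnly(doc, field, n) {
  const { _id, ...rest } = doc; // "_id: 0" drops the identifier
  if (Array.isArray(rest[field])) {
    rest[field] = rest[field].slice(0, n);
  }
  return rest;
}

console.log(JSON.stringify(sliceOnly({ _id: 1, a: 1, b: 2 }, "x", 1)));
// {"a":1,"b":2}
console.log(JSON.stringify(sliceOnly({ _id: 2, x: [{ y: 10 }, { w: [{ z: 11 }, { v: 12 }] }] }, "x", 1)));
// {"x":[{"y":10}]}
```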

> db.proj.find({x: {$exists: true}}, {_id: 0, "x": {$slice: 1}})
{ "x" : [ { "y" : 10 } ] }

The “$exists” operator limits the result to documents that contain “x”. This is still not the same semantics as in the document property projection, but closer.

> db.proj.find({x: {$exists: true}}, {_id: 0, "x": {$slice: [1, 1]}})
{ "x" : [ { "w" : [ { "z" : 11 }, { "v" : 12 } ] } ] }

This selects the second array element (the first “1” indicates the number of elements to skip, and the second “1” indicates how many elements should be selected).
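For readers who think in JavaScript, “$slice: [skip, limit]” corresponds roughly to Array.prototype.slice(skip, skip + limit). The sketch below makes that mapping explicit. The helper sliceSpec is hypothetical, and the handling of negative values follows my reading of the “$slice” documentation rather than anything tested in this post.

```javascript
// Hypothetical helper: applies a $slice-style specification to a plain array.
//   n >= 0         -> first n elements
//   n < 0          -> last |n| elements
//   [skip, limit]  -> skip elements (counted from the end if skip < 0),
//                     then take up to limit elements
function sliceSpec(array, spec) {
  if (Array.isArray(spec)) {
    const [skip, limit] = spec;
    const start = skip < 0 ? Math.max(array.length + skip, 0) : skip;
    return array.slice(start, start + limit);
  }
  return spec >= 0 ? array.slice(0, spec) : array.slice(spec);
}

const xs = [{ y: 10 }, { w: [{ z: 11 }, { v: 12 }] }];
console.log(JSON.stringify(sliceSpec(xs, 1)));      // [{"y":10}]
console.log(JSON.stringify(sliceSpec(xs, [1, 1]))); // [{"w":[{"z":11},{"v":12}]}]
```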

This is different from selecting the second property, in my opinion: when projecting the second property, it would be important to see in the result that it was the second property that was projected. This is definitely debatable, but in analogy to projecting embedded document properties, the result would have to reflect the query.

Let’s try to select the second element of “w”. This requires reaching into the array on the first level.

> db.proj.find({x: {$exists: true}}, {_id: 0, "x.1.w": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] } ] }

The approach using the dot notation fails. The query does not honor “x.1.w”, which is meant to specify: project from “w” in the second array element “x.1”. However, the interface does not report an error either, as it probably should.

> db.proj.find({x: {$exists: true}}, {_id: 0, "x.w": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "v" : 12 } ] } ] }

This works. MongoDB seems to interpret this correctly on its own. However, the first array element of “x” is returned as well, again in contrast to the approach for document properties, where properties that are not on the path are not returned.

But what if “x” contained an additional array element with property “w”? Let’s add this document:

{"x": [{"y": 10}, {"w": [{"z": 11}, {"v": 12}]}, {"w": [{"z": 13}, {"v": 14}]}]}

Now 5 documents are in the test data set.

> db.proj.find({x:{$exists: true}}, {_id: 0, "x.w": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "v" : 12 } ] } ] }
{ "x" : [ { "y" : 10 }, { "w" : [ { "v" : 12 } ] }, { "w" : [ { "v" : 14 } ] } ] }

It turns out that MongoDB applies the slice to every element of “x” that contains a property “w” and returns the second element of each “w”. This looks reasonable.
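The observed behavior of the “x.w” projection can be reproduced with a short plain JavaScript sketch. The helper sliceNested is hypothetical and only mirrors the results shown above for a non-negative skip: the slice is applied to every element of the outer array that has the inner array, and all other elements pass through untouched.

```javascript
// Hypothetical helper: emulates {"<outer>.<inner>": {$slice: [skip, limit]}}
// as observed above (non-negative skip only).
function sliceNested(doc, outer, inner, skip, limit) {
  if (!Array.isArray(doc[outer])) return doc;
  const elements = doc[outer].map((element) => {
    if (element !== null && typeof element === "object" && Array.isArray(element[inner])) {
      // Element has the inner array: trim it, keep its other properties.
      return { ...element, [inner]: element[inner].slice(skip, skip + limit) };
    }
    return element; // elements without the inner array pass through untouched
  });
  return { ...doc, [outer]: elements };
}

const fifth = {
  x: [{ y: 10 }, { w: [{ z: 11 }, { v: 12 }] }, { w: [{ z: 13 }, { v: 14 }] }],
};
console.log(JSON.stringify(sliceNested(fifth, "x", "w", 1, 1)));
// {"x":[{"y":10},{"w":[{"v":12}]},{"w":[{"v":14}]}]}
```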

But how is the second element of only the first “w” selected? I think multi-level projection in the context of arrays is not possible at this point: https://jira.mongodb.org/browse/SERVER-831.
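As long as the server cannot express this, one option is to do the last step in the client after the document has been fetched. The following is a minimal sketch under that assumption. The helper pickNested is hypothetical and works on a plain JavaScript object, not on a cursor.

```javascript
// Hypothetical helper: returns one element of the inner array found in the
// FIRST outer element that has such an array, or undefined if there is none.
function pickNested(doc, outer, inner, index) {
  const elements = Array.isArray(doc[outer]) ? doc[outer] : [];
  const holder = elements.find(
    (e) => e !== null && typeof e === "object" && Array.isArray(e[inner])
  );
  return holder ? holder[inner][index] : undefined;
}

const fetched = {
  x: [{ y: 10 }, { w: [{ z: 11 }, { v: 12 }] }, { w: [{ z: 13 }, { v: 14 }] }],
};
console.log(JSON.stringify(pickNested(fetched, "x", "w", 1))); // {"v":12}
```

The obvious drawback is that the complete array still travels over the wire, so this is a workaround rather than a projection.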

> db.proj.find({x:{$exists: true}}, {_id: 0, "x.q": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] } ] }
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] }, { "w" : [ { "z" : 13}, { "v" : 14 } ] } ] }
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] }, { "a" : 20 } ] }

Asking for a non-existing property returns the complete array, again in contrast to the analogous query on embedded documents.

I’ll stop examining the projection of array elements here, as there is clearly limited support for it at this point in time.

Summary

Surprisingly, at least for me, MongoDB does not follow the same design rules when projecting properties from embedded documents compared with projecting elements from embedded arrays. Not only is the behavior different

  1. Properties of documents not containing the requested array projection are returned
  2. Properties not on the path to the projected element are returned

but also multi-level projection does not work in a straightforward way: it is not possible to use dot notation to reach into nested arrays for projection.

Querying for non-existing properties results in partial or incorrect results, in my opinion. Of course, there are different viewpoints possible on this behavior, and for sure it warrants further discussion.