NoSQL Databases: Data First, Schema Second? Or Vice Versa?

When using NoSQL databases, the notion of ‘schema’ enters the picture sooner or later. But when is the best time? And what to do about it?

Global vs. Local vs. Mixed Document Schema

One of the first distinctions is between a global, a local, and a mixed schema. A global schema is a schema that is defined for a given set of documents, e.g., a collection or table of documents. Every document in the collection must comply with the schema defined for the collection.

A local schema is a schema for a single document. Every document can have its own schema. It is possible that several documents follow the same schema. However, those are in general not grouped based on their schema.

A mixed schema is in part a global schema, and in part a local schema. This means that a document must contain certain properties as defined by the global schema, and the local schema allows additional schema elements on a per-document basis.
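
For illustration, here are two documents under a hypothetical mixed schema whose global part mandates only a property “id”; everything else is local per document:

{"id": 1, "name": "pump", "voltage": 110}
{"id": 2, "name": "hose", "lengths": [5, 10]}

Both documents comply with the global part (they contain “id”), while each adds its own local properties.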

In terms of system examples:

  • Oracle NoSQL [http://www.oracle.com/us/products/database/nosql/] follows the local schema approach.
    • Documents in Oracle NoSQL are grouped by keys. Each document can have its own schema, aka, a local schema.
  • MongoDB [http://www.mongodb.org/] follows the mixed schema approach.
    • There is one property that must be present and unique in all documents across all MongoDB collections: ‘_id’. In this sense, MongoDB does not follow a pure local schema approach, as one mandatory property is specified globally.
  • FoundationDB [https://foundationdb.com/] follows the global schema approach.
    • FoundationDB implements the concept of ‘table groups’ and supports serializing query results as JSON objects. However, from the viewpoint of the data model, it is relational, and the hierarchical structure (aka, sub-documents) comes into play through foreign keys and SQL extensions that were explored a long time ago in the context of NF2 relations.
  • Oracle 12c [http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246] follows the local schema approach.
    • A JSON document is stored in a column of a table. That column is completely schema-free so that JSON documents of any schema can be stored and hence the schema is local.

Explicit vs. Implicit Document Schema

A second important distinction is between an explicit and an implicit schema (or extensional vs. intensional schema). An extensional schema is defined through a schema representation format (e.g., Avro [http://avro.apache.org/]) or SQL-style DDL statements. An intensional schema does not have a separate representation, but it can be derived from the structure of a document instance (not always unambiguously).

In terms of system examples:

  • Oracle NoSQL: supports both explicit and implicit schemas
  • MongoDB: implicit schema
  • FoundationDB: explicit schema
  • Oracle 12c: implicit schema

From a different perspective, an implicit schema means that a document can be stored as-is, without having to define a schema for it and without checking that it conforms to a schema. An explicit schema requires the schema to be defined and documents to be compliant with it before they can be stored successfully.

The ‘edge case’ is MongoDB, which enforces the property ‘_id’ in every document: if it is not present, it is added automatically. So while MongoDB mainly supports an implicit schema, ‘_id’ is the exception.
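
This behavior is easy to observe in the MongoDB shell; a quick sketch (collection name chosen for illustration):

> db.edge.insert({"a": 1})
> db.edge.findOne()
{ "_id" : ObjectId("..."), "a" : 1 }

The document was inserted without ‘_id’, and the stored document contains an automatically generated one (ObjectId value elided).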

Mandatory Document Schema

A third important distinction is the requirement for a mandatory schema before documents can be stored. If the database requires a schema, documents cannot be inserted before the schema is specified. If a schema is not mandatory, documents can be stored without having a schema in place.

In terms of system examples:

  • Oracle NoSQL: no mandatory schema
  • MongoDB: no mandatory schema
  • FoundationDB: requires a mandatory schema
  • Oracle 12c: no mandatory schema

Roles of Database Users

Despite the impression one could get, not everybody using a database necessarily likes a document database where every document can have its own schema (a schema-less, or better, schema-varying database). Some users do, some don’t. For the sake of discussion, let’s distinguish two roles in this blog:

  • Data Collector. A data collector is a role for collecting data initially. A data collector determines interesting data to store and that data might or might not be used downstream for further processing. However, it is important for some data to be collected in case it becomes important down the road. And it is important to store data unmodified as the rules of modification (e.g., cleansing, transformation) might not be known at the time of the data collection.
  • Data User. A data user fundamentally applies all CRUD (create, read, update, delete) operations on a data set and in most cases through an application system that implements the business logic as well as the business rules. The data user is familiar with the business logic and the business rules in context of the application domain, like a financial application or a forecasting tool.

A data user can at the same time be a data collector, since the C (create) operation of CRUD creates data.

Schema First or Schema Second?

From the viewpoint of the two roles, schema management plays an important role. In a black-and-white categorization, the two roles have the following desires:

  • Data Collector. A data collector’s goal is to collect relevant or potentially relevant data. Depending on the data sources, there might not be time or opportunity to define a schema first, and it might not be feasible to maintain a schema in the long run (including schema migration). A data collector therefore likes
    • local schema – implicit schema – not mandatory schema (‘schema second’)
  • Data User. A data user has to accomplish work and prefers clear-cut business rules and business functionality so that the application semantics is clear. A data user therefore likes
    • global schema – explicit schema – mandatory schema (‘schema first’)

The data collector stores data into a data collector database; the database of the data user is called the data user database. Both can be the same database, or two different databases, as discussed later.

A data collector, after having collected data, might want to query it for statistical and analytical purposes (e.g., how much data was collected, how many documents have a reference to a product description, etc.). So a data collector would prefer to have a schema after the collection of data, in support of query formulation: schema second.

A data user rarely operates on data directly; instead, software engineers implement the business logic and business rules. Software engineers, for sure, enjoy certainty when it comes to a schema, as variations cause significant code complexity in the general case: schema first.

Co-existence of Schema First and Schema Second?

In the ideal case, both approaches, schema first and schema second, are supported at the same time. This would make both roles happy and support their particular use cases. Let’s explore a few options:

  • ETL (extract – transform – load) from data collector database to data user database
    • This approach suggests an explicit transformation step that extracts data from the data collector database and adds it to the data user database. Along the way data type transformations can take place as well as handling of null values, absent properties, and other data modeling specifics. The extraction can be partial so that only relevant data are extracted. However, data duplication (at least partial) is one downside, amongst others.
  • View on data collector database
    • This approach creates a view through which access is provided on a single database (data collector and data user combined). A view could deal with the various transformation tasks; however, updates and deletions might be difficult or even impossible. So this is a potential solution only for the case of read access.
  • Automatic schema extraction
    • A schema can be derived from a document. It is therefore possible to make the set of all schemas occurring in the data collector database available to the data user. If the data user creates a super-set of those, then a global schema is available. Of course, for a given document the global schema only partially applies, and the business functionality and business rules have to be aware of this. In this case, too, the data collector and data user database are the same.
  • Intelligent ORM Layer
    • An ORM layer could provide the impression of a fixed document structure towards the data user, while dealing with the heterogeneous document schemas internally (see the sketch after this list). If the ORM layer is flexible enough, it can provide update as well as delete functionality and, if necessary, an extension mechanism to add custom code in order to tailor the delete or update functionality to the given document set. In this case, too, the data collector and data user database are the same.
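
As a minimal sketch of such an ORM-style layer (all names are hypothetical; it assumes documents that may or may not carry “name” and “price” properties and presents a fixed structure to the application):

// Hypothetical mapping layer: presents a fixed document structure to the
// application while tolerating heterogeneous schemas underneath.
function toCanonicalProduct(doc) {
    "use strict";
    return {
        id: doc._id,
        name: doc.name !== undefined ? doc.name : null,          // absent -> null
        price: typeof doc.price === "number" ? doc.price : null  // absent or wrong type -> null
    };
}

// Both shapes map to the same fixed structure:
// toCanonicalProduct({_id: 1, name: "pump", price: 150}) -> {id: 1, name: "pump", price: 150}
// toCanonicalProduct({_id: 2, name: "hose"})             -> {id: 2, name: "hose", price: null}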

From an implementation perspective an ORM layer seems to be a practical approach, as it separates the transformation and update/delete logic from the application logic while operating on a single database. However, in general every data access has to execute some transformation logic.

If space is of secondary concern or if the data set for the data user is a lot smaller than that of the data collector, the ETL approach might be preferable as the transformation logic is separate from the data access logic of the application systems.

Automatic schema extraction is certainly helpful in all cases as the schema has to be known in order to implement the ORM layer or the ETL component. If NoSQL databases start implementing a view mechanism then this might be preferable for read-only access situations.
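
Such an automatic schema extraction can be sketched in a few lines of JavaScript (a deliberate simplification: it records property names and types and ignores corner cases such as arrays with mixed element types):

function extractSchema(doc) {
    "use strict";
    var schema = {};
    Object.keys(doc).forEach(function (key) {
        var value = doc[key];
        if (value !== null && typeof value === "object" && !Array.isArray(value)) {
            schema[key] = extractSchema(value);   // recurse into sub-documents
        } else if (Array.isArray(value)) {
            schema[key] = "array";
        } else {
            schema[key] = value === null ? "null" : typeof value;
        }
    });
    return schema;
}

// extractSchema({"a": {"b": {"c": 3}}}) yields {"a": {"b": {"c": "number"}}}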

Schema-Varying Languages?

It is, of course, tempting to ask if there is an easy and elegant way to deal with local schemas in application systems. Is it possible to write an application system that does not require a schema in the first place?

While this is a huge topic on its own, server-side JavaScript might be a good place to start, as the language is not based on a class/instance paradigm but on a prototype approach. Its type system corresponds almost exactly to JSON. The language, therefore, can represent documents with local schemas easily and effortlessly. Since JavaScript can introspect objects and implements the prototype mechanism, it is possible to represent local functionality for documents with a local schema. One of the bigger questions is how to present this flexibility to the end user in an ergonomic way on user interfaces.
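
As a small sketch of this style (a hypothetical example, not tied to any particular framework), behavior can be chosen based on the properties a document actually carries rather than on a statically declared class:

function describe(doc) {
    "use strict";
    // introspect the document's actual shape instead of relying on a class
    if (Object.prototype.hasOwnProperty.call(doc, "price")) {
        return doc.name + " costs " + doc.price;
    }
    return doc.name + " (price unknown)";
}

describe({name: "pump", price: 150});   // "pump costs 150"
describe({name: "hose"});               // "hose (price unknown)"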

However, this is a discussion on its own and I’ll save it for a later blog.

Summary

When is a good time to deal with the notion of ‘schema’ in NoSQL database projects? As the discussion has shown, different databases provide different schema support, and different users look for schema support at different points in the data life cycle, if at all.

So unless schema is completely irrelevant in your project (and will stay irrelevant for sure), the discussion cannot start early enough because depending on its importance it might influence the database selection as well as the overall product architecture and implementation effort around schema maintenance and enforcement.

Trending: Multi-Interface and Multi-Data-Model Databases

An interesting development, especially in the NoSQL database space, is the trend towards multi-interface and multi-data-model databases, sometimes both at the same time. While this provides flexibility, it also brings challenges.

Multi-Data-Model Support

In the relational database space, supporting different data models concurrently is not a novelty. Relational databases started off with the relational data model implementation, and later on some of the systems extended the relational model mainly by objects, XML or JSON.

Some databases in the NoSQL space are starting to evolve, too, in this manner and are providing more than one data model concurrently. Some interesting examples are discussed next.

One example in the NoSQL space is Oracle NoSQL [http://www.oracle.com/technetwork/database/database-technologies/nosqldb/]. This system supports a key/value data model whereby the key is used to identify values that are not interpreted by the database itself. In addition, values can be of complex types that are actually interpreted by the database, e.g., in secondary indexes.

Aerospike is another example in the NoSQL space [http://www.aerospike.com/]. Aerospike provides a data model consisting of basic and complex types. In addition, it supports language-native formats as well as large data types that have specific operational characteristics and data type operations tuned for scale.

Just as some relational databases extended their data models over time to support specific use cases more directly, some NoSQL databases are going down the same path to more directly support specific application developer needs.

Multi-Interface Support

From an application development perspective, a single query API that provides the complete required query expressiveness is certainly preferable. However, especially in the still-new area of NoSQL databases, it is not clear yet what a good query API actually looks like. This can be observed in different systems providing different query API alternatives.

MongoDB [http://www.mongodb.org/] has a document query interface based on query patterns in form of JSON documents (“Query Documents”). In addition, it provides a map/reduce interface and aggregation pipelines. These are three different APIs that an application developer can choose from, and their functionality overlaps: depending on the query, it can be expressed in all three of them.
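
For illustration, the same aggregate (count of documents per “status” value; collection and property names are hypothetical) expressed once as an aggregation pipeline and once via map/reduce:

> db.orders.aggregate({$group: {_id: "$status", n: {$sum: 1}}})
> db.orders.mapReduce(
...     function () { emit(this.status, 1); },
...     function (key, values) { return Array.sum(values); },
...     {out: {inline: 1}})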

Aerospike [http://www.aerospike.com/] provides different language drivers in addition to an Aerospike Query Language.

Cloudant [http://www.cloudant.com], in contrast, supports a REST API as well as a query interface based on query documents (similar to MongoDB).

Not strictly an external interface, but very important for specific use cases, is the ability to add functionality dynamically to the database in order to move some processing from the application systems into the database itself: user-defined functions. For example, MongoDB allows adding functionality through JavaScript functions, whereas Aerospike supports two different types of Lua functions: record user-defined functions operate on single records, while stream user-defined functions support distributed processing.
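
As a sketch of the MongoDB variant (assuming a deployment where stored server-side JavaScript is enabled; collection and property names are hypothetical), a function saved in “system.js” becomes available on the server, e.g., inside a $where clause:

> db.system.js.save({_id: "isLarge", value: function (x) { return x > 1000; }})
> db.orders.find({$where: "isLarge(this.amount)"})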

The Good

Unquestionably, the good part about multi-interface and multi-data-model databases is that an application developer can choose the best combination of data model and access interface for a particular development task. The impedance mismatch between the problem and the solution can be minimized with an appropriate choice.

This also means that developers need to understand the pros and cons of every combination, and that requires going through a learning curve. Going through that learning curve might pay off big time.

In addition, application development teams will have to manage a wider range of implementation variations in terms of application design and engineering, but also in terms of bug fixing and application code maintenance.

The Tricky

The tricky part of multi-interface and multi-data-model databases is that all combinations can be used concurrently, in production as well as post-production (e.g. analytics). Unit and functional tests as well as performance and scale tests become a lot more complicated as they have to test the concurrent execution of various combinations.

Furthermore, many queries can be expressed in different interfaces as those tend to overlap in query expressiveness. So an application developer needs to clearly understand the pros and cons of the query execution that underlies a specific query interface.

Hopefully the query semantics is the same for all combinations (meaning, for example, that predicates are evaluated the same way) and that concurrent use of data models and interfaces does not negatively impact the various clients in terms of concurrency, scale and performance. Any bug introduced through a discrepancy might be very difficult to reproduce and fix.

Summary

While multi-interface and multi-data-model databases are a powerful technology, there is considerable impact to the application system engineering activities in terms of knowledge acquisition, development, test and maintenance.

While database vendors certainly strive to have all combinations work in harmony, there might be edge cases where one combination does not give the same result as a different one. From an application development perspective test coverage should ensure semantic equivalence of the used combinations so that misinterpretations or wrong results are avoided.

Document Projection (Part 2): Definition

What does projection in the context of JSON structures or documents actually mean? What should the definition of “projection” be? Several possibilities are discussed next.

Document Projection: Complete Branch

Projection in JSON means projecting a branch of the JSON data structure tree, not a single object or value. To request a projection, a property (projection) path in dot notation could be used (and actually is in many systems). The result of a projection is a valid JSON document containing the specified branch(es).

An example document is

{"a": {"b": {"c": 3, "d": 4, "e": 5}}}

Let’s go through a series of projections in the following.

  • Projection path: “a.b.c”
  • Result: {“a”: {“b”: {“c”: 3}}}
  • Projection path: “a.b”
  • Result: {“a”: {“b”: {“c”: 3, “d”: 4, “e”: 5}}}
  • Projection path: “a.e”
  • Result: {}

The result contains the full path of the projection (or more, but not less). If the requested projection path does not exist, the result is the empty document as none of its properties matches the projection path. The empty projection path “” is not defined, meaning, a projection must name at least one property, and that will be a top-level property in a document.

Several branches can be projected concurrently.

  • Projection paths: “a.b.c”, “a.b.d”
  • Result: {“a”: {“b”: {“c”: 3, “d”: 4}}}

The resulting document contains the combination of all branches that result in a valid projection. Redundant projection path specification is possible if one of the projection paths is a sub-path of another one. However, the result document is the same whether redundancy is present or absent.
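
To make these semantics concrete, here is a minimal JavaScript sketch of complete-branch projection over objects (arrays are treated in later sections); a path contributes to the result only if it exists in full:

function project(doc, paths) {
    "use strict";
    var result = {};
    paths.forEach(function (path) {
        var keys = path.split("."),
            source = doc,
            target = result,
            i;
        // first pass: the full path must exist, otherwise it contributes nothing
        for (i = 0; i < keys.length; i = i + 1) {
            if (source === null || typeof source !== "object" || !(keys[i] in source)) {
                return;
            }
            source = source[keys[i]];
        }
        // second pass: copy the branch into the result
        source = doc;
        for (i = 0; i < keys.length - 1; i = i + 1) {
            source = source[keys[i]];
            target = target[keys[i]] = target[keys[i]] || {};
        }
        target[keys[keys.length - 1]] = source[keys[keys.length - 1]];
    });
    return result;
}

// project({"a": {"b": {"c": 3, "d": 4, "e": 5}}}, ["a.b.c", "a.b.d"])
//   yields {"a": {"b": {"c": 3, "d": 4}}}
// project({"a": {"b": {"c": 3}}}, ["a.e"]) yields {}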

Document Projection: Partial Branch

It is possible that the whole projection path does not exist, but a part of it does. In this case one possibility is to include the existing partial path in the result (MongoDB follows this approach). This results in partial paths whereby the value of the last existing property is the empty document.

For example, “a.b.f” would result in {“a”: {“b”: {}}}. “a” and “b” exist in the example document, “f”, however, does not.

In my opinion, while possibly useful in some cases, this should not be the default or standard definition, as the returned result is incomplete; one could even argue that it is incorrect, since the value of “b” is not the empty document. (I could envision a configuration setting that provides these partial branches if needed.)

Document Projection: Value

Wait a minute, why does the result document have to consist of full paths?

The reason is based on the implicit restriction on JSON documents that there can be only one property of a given name on the same level in a document. “Implicit” because the JSON definition (http://json.org/) does not mandate the restriction, but many implementations do: property names on the same level of embedding have to be unique.

For example:

{"x": {"b": {"c": 3, "d": 4}}, 
 "y": {"e": {"c": 3, "d": 4}}}

is a perfectly valid document where the property names are unique on every level. So let’s get back to projection and let’s for a moment assume that projection actually returns the value at the end of the path, omitting the path to the property value itself. So,

  • Projection path: “x.b.c”
  • Result: {“c”: 3}

So far so good.

  • Projection paths: “x.b.c”, “y.e.c”
  • Result: ?

What should the result be? The basic assumption is that a projection on a document returns a document. But “x.b.c” and “y.e.c” each return {“c”: 3} as a separate document, not one document.

  • One possible result could be an array with two documents. But arrays are in general not considered valid top level documents (again, the JSON definition would allow that).
  • To mitigate that, the array could be the value of a property: {“result”: [{“c”: 3}, {“c”: 3}]}. But this would conflict with a document that happens to have a property “result” of type array with two such documents in it.
  • Or, the two documents would have to be embedded in a third one with special names, like {“1”: {“c”: 3}, “2”: {“c”: 3}}. But then, the original document does not have the properties “1” or “2”.

Based on this discussion, having projection results consist of full paths is simpler and more direct.

Projection – Result Correspondence Principle

There is also another argument from the user’s viewpoint. If a user wants to project “x.b.c”, then the user might want to access the result document after the query returns with “x.b.c” as the access path. From this viewpoint, the path in the query and the path in the result document should match and not require access path transformation.

Array Projection: Complete Access Path

Documents may contain arrays, as well as arrays of arrays, arrays of objects containing arrays, etc.; in principle, any of these composite types can appear on any level of the document. Projection therefore has to be defined on arrays also, not just documents.

The principle of projection paths is extended to include array index specification. For example, let’s consider this document:

{"a": [{"a1": 1}, {"a2": 2}], 
 "b": {"c": [{"c1": 3}, {"c2": 4}, {"c3": 5}]}, 
 "d": [6, 7]}

Let’s do a few projections (arrays are 0-index based):

  • Projection path: a[0]
  • Result: {“a”: [{“a1”: 1}]}
  • Projection path: b.c[1]
  • Result: {“b”: {“c”: [{“c2”: 4}]}}
  • Projection paths: a[1], b.c[2].c3
  • Result: {“a”: [{“a2”: 2}], “b”: {“c”: [{“c3”: 5}]}}
  • Projection path: a[7]
  • Result: {}

Like in the case of documents, full paths are requested and full paths are returned, with several paths possible. A projection path referring to a non-existing property will not contribute to the result.

So far, so good, except that the results do not yet conform to the “Projection – Result Correspondence” principle from above: the projection “a[1]” resulted in a correct document, but that result document cannot be accessed with “a[1]” to obtain the value.

Array Projection: Padding

In order to support the “Projection – Result Correspondence” principle, array results can be padded with the value “null”. For example:

  • Projection paths: a[1], b.c[2].c3
  • Result: {“a”: [null, {“a2”: 2}], “b”: {“c”: [null, null, {“c3”: 5}]}}

Now it is possible to access the result with “a[1]” or “b.c[2].c3” in order to obtain the proper results. From a user’s perspective this is great as again the paths used to specify the projection can be used to retrieve the values.

Array Projection: Scalar Values

Scalar values in arrays do not pose a specific problem:

  • Projection paths: a[1], d[1], d[2]
  • Result: {“a”: [null, {“a2”: 2}], “d”: [null, 7]}

And their access can be accomplished using the projection paths.

Summary

Initially I thought projection was a straightforward function and not worth a discussion in the context of document-oriented databases; but it turned out to be not that clear-cut. Nevertheless, the above is a starting point for a strict rationalization of projection in document-oriented databases based on the JSON data model.

Document Projection (Part 1): MongoDB

This blog reviews some of the projection functionality that the MongoDB query interface provides. The emphasis is on comparing projection of embedded object properties with projection of embedded array elements. The two are not symmetric, as the examples will show, which is rather unexpected.

Test Data Set

The initial test data set contains three documents:

{}
{"a": 1, "b": 2}
{"c": {"d": 3, "e": 4, "f": {"g": 5, "h": 6}}}

The empty document is the control document; the other two documents have properties, one of them with several levels of embedding. The test data set, contained in a file “td.txt”, is loaded into MongoDB as follows:

mongoimport -d projection -c proj td.txt

Projecting Document Properties

Let’s observe a few projection queries:

> db.proj.find({}, {_id: 0, a: 1})
{  }
{ "a" : 1 }
{  }

In general, to make the result less verbose, the property “_id” is suppressed. The query asked for property “a” and for each document the property “a” is returned (whereby two documents do not contain “a”, so the resulting documents are empty).

> db.proj.find({}, {_id: 0, "c.e": 1})
{  }
{  }
{ "c" : { "e" : 4 } }

Reaching into documents is done by using the dot notation. “c.e” asks for the property “e” in “c”. The result contains the whole document structure starting at “c” all the way to “e”. Alternatively MongoDB could have returned {“e”: 4} only, but then the result would not correspond to the dot notation in the query.

> db.proj.find({}, {_id: 0, "c.f": 1})
{  }
{  }
{ "c" : { "f" : { "g" : 5, "h" : 6 } } }

No surprise here as the result contains the property “c.f”.

> db.proj.find({}, {_id: 0, "c.f.h": 1})
{  }
{  }
{ "c" : { "f" : { "h" : 6 } } }

Reaching further into the document works as expected also.

> db.proj.find({}, {_id: 0, a: 1, c: 1})
{  }
{ "a" : 1 }
{ "c" : { "d" : 3, "e" : 4, "f" : { "g" : 5, "h" : 6 } } }

Asking for different properties returns all of those for each document.

> db.proj.find({}, {_id:0, "c.e":1, "c.f.h":1})
{  }
{  }
{ "c" : { "e" : 4, "f" : { "h" : 6 } } }

Asking for different properties in the same embedded document returns a combination of the properties, not separate properties. This makes sense as in each document a property with a given name can only appear once.

> db.proj.find({}, {_id: 0, "c.g": 1})
{  }
{  }
{ "c" : {  } }

This query asks for an embedded property that does not exist. However, the properties on the path to that missing property are actually included in the result. This is surprising to me as I would expect that if a property does not exist, no property is included in the result at all, especially not a partial property, so to say. In this case, the path in the query “c.g” does not match any path in the result.

Projecting Array Element Properties

First, another document is added to the test data set; it contains arrays as well as a nested document that in turn contains an array:

{"x": [{"y": 10}, {"w": [{"z": 11}, {"v": 12}]}]}

So a total of 4 documents are in the test data set now.

Projection of arrays is done using the “$slice” operator described here: http://docs.mongodb.org/manual/reference/operator/projection/slice/ (in addition to predicate-driven selection/projection using “$” or “$elemMatch”). Let’s try.

> db.proj.find({}, {_id: 0, "x": {$slice: 1}})
{  }
{ "a" : 1, "b" : 2 }
{ "c" : { "d" : 3, "e" : 4, "f" : { "g" : 5, "h" : 6 } } }
{ "x" : [ { "y" : 10 } ] }

This asks for one element from the array “x”, and indeed, the first element is returned. However, all properties of every document not containing “x” are returned also. This is surprising and in contrast to the behavior when projecting properties of embedded documents.

> db.proj.find({x: {$exists: true}}, {_id: 0, "x": {$slice: 1}})
{ "x" : [ { "y" : 10 } ] }

The “$exists” operator limits the documents to those that contain “x” only. This is still not the same semantics as in the document property projection, but closer.

> db.proj.find({x: {$exists: true}}, {_id: 0, "x": {$slice: [1, 1]}})
{ "x" : [ { "w" : [ { "z" : 11 }, { "v" : 12 } ] } ] }

This selects the second array element (the first “1” indicates the number of elements to skip, and the second “1” indicates how many elements to select).

This is different from selecting the second property, in my opinion, as in the case of projecting the second property it would be important to see in the result that the second property was projected. This is definitely debatable, but in analogy to projecting embedded document properties, the result would have to reflect the query.

Let’s try to select the second element of “w”. This requires reaching into the array on the first level.

> db.proj.find({x: {$exists: true}}, {_id: 0, "x.1.w": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] } ] }

The approach using the dot notation fails. The query does not honor “x.1.w”, which specifies: project from “w” inside the second array element “x.1”. However, the interface does not give an error either, as it probably should?

> db.proj.find({x: {$exists: true}}, {_id: 0, "x.w": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "v" : 12 } ] } ] }

This works; MongoDB seems to interpret it automatically in the intended way. However, the first array element of “x” is returned also, again in contrast to the approach for document properties, where properties that are not on the path are not returned.

But what if “x” contained an additional array element with property “w”? Let’s add this document:

{"x": [{"y": 10}, {"w": [{"z": 11}, {"v": 12}]}, {"w": [{"z": 13}, {"v": 14}]}]}

Now 5 documents are in the test data set.

> db.proj.find({x:{$exists: true}}, {_id: 0, "x.w": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "v" : 12 } ] } ] }
{ "x" : [ { "y" : 10 }, { "w" : [ { "v" : 12 } ] }, { "w" : [ { "v" : 14 } ] } ] }

Turns out, MongoDB selects all documents that contain a property “w”, and from those the second element. This looks reasonable.

But, how is the second element of the first “w” selected? I think multi-level projection in context of arrays is not possible at this point: https://jira.mongodb.org/browse/SERVER-831.

> db.proj.find({x:{$exists: true}}, {_id: 0, "x.q": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] } ] }
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] }, { "w" : [ { "z" : 13}, { "v" : 14 } ] } ] }
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] }, { "a" : 20 } ] }

Asking for a non-existing property returns the complete array, again in contrast to the analogous query on embedded documents.

I’ll stop examining the projection of array elements here, as there is clearly limited support for it at this point in time.

Summary

Surprisingly, at least for me, MongoDB does not follow the same design rules when projecting properties from embedded documents as when projecting elements from embedded arrays. Not only is the behavior different

  1. Properties of documents not containing the requested array projection are returned
  2. Properties not on the path to the projected element are returned

but also multi-level projection does not work in a straightforward way: it is not possible to use dot notation to reach into nested arrays for projection.

Querying for non-existing properties results in partial or, in my opinion, incorrect results. Of course, different viewpoints on this behavior are possible, and it certainly warrants further discussion.

Schema-free Database (Part 1): An Oxymoron

The notion of a ‘schema-free database’ keeps coming up, most recently in a meetup I attended a few days ago. Some rationalization follows, divided into the categories of ‘document’ and ‘database’.

While a generalization is easily possible, the context here is JSON and MongoDB, as these are two practical implementations that are available and often cited as examples of a ‘schema-free database’. They provide a nicely constrained set of technologies for this discussion, while the principles apply to a whole range of other technologies, of course.

Document

A JSON document, in short: document, follows a set of construction principles outlined here: http://www.json.org/. This is a rather informal grammar that defines how a valid JSON document is constructed. There are no data type generators, so new data types cannot be introduced; therefore, every document is constructed from the fixed set of types enumerated on that web page.

Document Schema

An attempt has been made to create a more formal mechanism to define a schema for JSON documents: http://json-schema.org/. This approach provides a formal language to describe the schema of a JSON document explicitly.

The json-schema approach, combined with the fixed set of types available to construct a JSON document, means that every JSON document can be described explicitly using json-schema, without exceptions. This in turn means that every JSON document has at least an implicit schema, which can additionally be made explicit with e.g. json-schema.
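
As a small illustration, a json-schema description of the document {"a": {"b": 3}} could look like this (a sketch; the exact keywords depend on the json-schema draft used):

{
  "type": "object",
  "properties": {
    "a": {
      "type": "object",
      "properties": {
        "b": {"type": "number"}
      }
    }
  }
}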

Therefore, JSON documents have a schema, an implicit one and optionally an explicit one. Depending on the particular schema definition language approach itself, a document might match more than one schema, but that is left for a separate discussion.

Set of Document Schemas

Given a set of JSON documents it is now possible to characterize their relationship to schemas. In the ‘best’ case, all documents follow the same schema; in the ‘worst’ case, each document follows its own schema. And there are cases in-between where a subset of the documents validates against a schema, and another subset against another schema. Depending on the design, one JSON document might validate against different schemas.

The relationship between documents and schemas is n:m in general.

Database

In the context of a database, there are a few interesting questions:

  • Does the database understand a document representation (e.g. JSON)?
  • Does the database enforce a document representation?

And:

  • Does the database enforce a schema?
  • Does the database understand a schema?

Let’s answer these questions for MongoDB specifically:

  • MongoDB understands a document representation: documents are stored and interpreted as BSON, a binary representation of JSON.
  • MongoDB enforces that document representation, as only valid documents are accepted on its interface.

And:

  • MongoDB enforces a partial schema. Each document must have a property called “_id”; if the document being inserted does not have such a property, one is automatically added.
  • MongoDB does not understand an explicit schema as it does not provide for a mechanism to load a schema definition language.

MongoDB, however, understands implicit schemas, as it allows, e.g., creating an index on any property of documents. So MongoDB recognizes properties.

Furthermore, MongoDB supports aggregation functions, e.g., the sum of properties across documents (https://realprogrammer.wordpress.com/2012/11/04/null-undefined-nan-and-missing-property-goto-considered-harmful-part-2/). So it is data-type aware and implements operators (e.g., sum) on those types.
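
Both observations can be reproduced in the shell (collection and property names are hypothetical):

> db.orders.ensureIndex({"amount": 1})
> db.orders.aggregate({$group: {_id: null, total: {$sum: "$amount"}}})

The index addresses a property inside the stored documents, and $sum interprets that property’s values as numbers; both operate on the implicit schema of the documents.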

Conclusion: The Notion of ‘Schema’ is Changing

This rather brief discussion clearly rationalizes that the label ‘schema-free database’ is not applicable to technologies such as those discussed in this blog (JSON/BSON, MongoDB).

Hence these technologies are not an example of a ‘schema-free database’; to the contrary, they demonstrate that the notion of ‘schema’ can have a wider and more flexible interpretation than what relational databases bring forward.

MongoDB and node.js (Part 1): List of Documents for all Types

This blog contains the document set that will be used throughout the blog series.

Document Set Characteristics

The data set is a collection of documents, whereby each document contains three properties:

  • “x”: This property contains a different data value in each document
  • “comment”: this property describes “x” as originally specified. When the document is queried, it is clear what the original specification of “x” was. This is necessary as the node.js driver modifies documents on insert.
  • “btype”: this property contains the BSON type designator. When the document is queried, the type is clear without having to query it explicitly.

Since the blog series is evolving, the set of documents might change as needed in order to point out more specifics; the current set is a good starting point, however.

If there is more than one way to insert a specific type, several documents are included, one document for each possibility. This is to show the possible alternatives exhaustively.

JavaScript, BSON, JSON

Node.js implements server-side JavaScript (based on Google’s V8 engine). So the data types that are passed back and forth to the MongoDB node.js driver are JavaScript types.

MongoDB internally operates on BSON. This means that on the way from the Node.js driver to MongoDB JavaScript types are transformed into BSON (and vice versa on the way back). For some BSON types MongoDB provides constructors.

While many equate JavaScript data type structures with JSON, JSON is actually not an equivalent serialization of JavaScript types (more on that issue during the blog series).

Based on this discussion, the test data document set tries to cover all JavaScript types and their variations, plus the JavaScript implementation of BSON types. This will become clearer over the course of the blog series.

Document Set

The following code is complete in the sense that it runs and inserts the documents into a MongoDB database called “nodeTest”, using all the defaults.

/*global require*/

var MongoDB = require('mongodb');

/*
 Type codes
 ==========
 1 "\x01" e_name double             Floating point
 2 "\x02" e_name string             UTF-8 string
 3 "\x03" e_name document           Embedded document
 4 "\x04" e_name document           Array
 5 "\x05" e_name binary             Binary data

 7 "\x07" e_name (byte*12)          ObjectId
 8 "\x08" e_name "\x00"             Boolean "false"
 8 "\x08" e_name "\x01"             Boolean "true"
 9 "\x09" e_name int64              UTC datetime
 10 "\x0A" e_name Null value
 11 "\x0B" e_name cstring cstring   Regular expression

 13 "\x0D" e_name string            JavaScript code

 15 "\x0F" e_name code_w_s          JavaScript code w/ scope
 16 "\x10" e_name int32             32-bit Integer
 17 "\x11" e_name int64             Timestamp
 18 "\x12" e_name int64             64-bit integer
 255 "\xFF" e_name Min key
 127 "\x7F" e_name Max key

 Deprecated type codes
 =====================
 6 "\x06" e_name                    Undefined — Deprecated
 12 "\x0C" e_name string (byte*12)  DBPointer — Deprecated
 14 "\x0E" e_name string            Symbol — Deprecated

 */

var typeDocuments;

typeDocuments = [
    {"x": new MongoDB.Double(123.123),
        "comment": "new MongoDB.Double(123.123)",
        "btype": 1},
    {"x": 456.456,
        "comment": "456.456",
        "btype": 1},
    {"x": "abc",
        "comment": "abc",
        "btype": 2},
    {"x": {"z": 5},
        "comment": "{\"z\": 5}",
        "btype": 3},
    // this is not type:4
    {"x": [9, 8, 7],
        "comment": "[9, 8, 7]",
        "btype": 16},
    {"x": [
        {"y": 4},
        {"z": 5}
    ], "comment": "[{\"y\": 4}, {\"z\": 5}]",
        "btype": 3},
    {"x": new MongoDB.Binary("binary"),
        "comment": "new MongoDB.Binary(\"binary\")",
        "btype": 5},
    // t:6 deprecated (was 'undefined') - not implemented
    {"x": new MongoDB.ObjectID("5040dc5d40b67c681d000001"),
        "comment": "new MongoDB.ObjectID(\"5040dc5d40b67c681d000001\")",
        "btype": 7},
    {"x": false,
        "comment": "false",
        "btype": 8},
    {"x": true,
        "comment": "true",
        "btype": 8},
    {"x": new Date("2012-08-31 12:13:14:156 UTC"),
        "comment": "new Date(\"2012-08-31 12:13:14:156 UTC\")",
        "btype": 9},
    {"x": null,
        "comment": "null",
        "btype": 10},
    {"x": new RegExp("abc"),
        "comment": "new RegExp(\"abc\")",
        "btype": 11},
    {"x": new RegExp("abc", "i"),
        "comment": "new RegExp(\"abc\", \"i\")",
        "btype": 11},
    // t:12 DBRef deprecated - still implemented
    // this is not type:12
    {"x": new MongoDB.DBRef("types_node", "5040dc5d40b67c681d000001", "types"),
        "comment": "new MongoDB.DBRef(\"types_node\", \"5040dc5d40b67c681d000001\", \"types\")",
        "btype": 3},
    // this is not type:12
    {"x": new MongoDB.DBRef("types_node", "5040dc5d40b67c681d000001"),
        "comment": "new MongoDB.DBRef(\"types_node\", \"5040dc5d40b67c681d000001\")",
        "btype": 3},
    // MongoDB defined JSON serialization (http://docs.mongodb.org/manual/reference/mongodb-extended-json/)
    // this is not type:12
    {"x": {"$ref": "types_node", "$id": "5040dc5d40b67c681d000001", "$db": "types"},
        "comment": "{\"$ref\": \"types_node\", \"$id\": \"5040dc5d40b67c681d000001\", \"$db\": \"types\"}",
        "btype": 3},
    // this is not type:12
    {"x": {"$ref": "types_node", "$id": "5040dc5d40b67c681d000001"},
        "comment": "{\"$ref\": \"types_node\", \"$id\": \"5040dc5d40b67c681d000001\"}",
        "btype": 3},
    {"x": new MongoDB.Code("function () {}"),
        "comment": "new MongoDB.Code(\"function () {}\")",
        "btype": 13},
    // t:14 Symbol deprecated - still implemented
    {"x": new MongoDB.Symbol("def15"),
        "comment": "new MongoDB.Symbol(\"def15\")",
        "btype": 14},
    {"x": new MongoDB.Code("function (a) {}", {"a": 4}),
        "comment": " new MongoDB.Code(\"function (a) {}\", {\"a\": 4})",
        "btype": 15},
    {"x": 123456,
        "comment": "123456",
        "btype": 16},
    {"x": new MongoDB.Timestamp(1, 2),
        "comment": "new MongoDB.Timestamp(1, 2)",
        "btype": 17},
    {"x": new MongoDB.Long("987"),
        "comment": "new MongoDB.Long(\"987\")",
        "btype": 18},
    {"x": new MongoDB.MinKey(),
        "comment": "new MongoDB.MinKey()",
        "btype": 255},
    {"x": new MongoDB.MaxKey(),
        "comment": "new MongoDB.MaxKey()",
        "btype": 127},
    // ADDITIONAL POSSIBLE VALUES
    // 'undefined' will be converted to 'null'; type will be 'null' (aka 10) also
    {"x": undefined,
        "comment": "undefined",
        "btype": 10},
    {"x": Number.NaN,
        "comment": "Number.NaN",
        "btype": 1},
    {"x": Infinity,
        "comment": "Infinity",
        "btype": 1},
    {"x": Number.POSITIVE_INFINITY,
        "comment": "Number.POSITIVE_INFINITY",
        "btype": 1},
    {"x": Number.NEGATIVE_INFINITY,
        "comment": "Number.NEGATIVE_INFINITY",
        "btype": 1},
    {"x": Number.MIN_VALUE,
        "comment": "MIN_VALUE",
        "btype": 1},
    {"x": Number.MAX_VALUE,
        "comment": "MAX_VALUE",
        "btype": 1}
];

var Db = MongoDB.Db,
    Server = MongoDB.Server;
var db = new Db('nodeTest', new Server("127.0.0.1", 27017,
    {auto_reconnect: false, poolSize: 4}), {native_parser: false, safe: false});

db.open(function (err, db) {
    "use strict";
    db.dropCollection("types_node", function (err, result) {
        var i,
            printLog = function (err, result) {
                if (err) {
                    console.log(err.toString());
                }
                console.log(result);
            };
        if (err) {
            console.log(err.toString());
        }
        console.log("Drop Collection Result: " + result);
        db.collection("types_node", function (err, collection) {
            for (i = 0; i < typeDocuments.length; i = i + 1) {
                collection.insert(typeDocuments[i], {safe: true}, printLog);
            }
        });
    });
});

MongoDB and node.js (Part 0): Rationalization of Data Types

This is the start of a blog series in context of MongoDB and node.js. It is a rationalization of data type handling and transfer between JavaScript (node.js) and MongoDB. The series is very specific to MongoDB and the 10gen supported MongoDB node.js driver.

How did the opportunity for this blog series arise? Basically, my plan was to do a round-trip from JavaScript to MongoDB and back for each data type and constant in JavaScript.

While an easy plan to come up with, executing it turned out to be a little journey. During that journey a few issues arose that I did not expect and the blog series will report on those. Along the way a few JIRAs and GitHub issues were filed and will be reported on in the appropriate places.

Philosophy

I had and will have a clear philosophy in mind throughout this journey:

  • Treating MongoDB as a database. What I mean by that is that if I can store data into MongoDB, I expect that I can query the data and retrieve it again. Especially, when storing a document, I expect to be able to query this document again.
  • Foreign Database ownership. This means that I do not assume to have control over the database and its contents, but assume that a database is given to me. However a user managed to put documents into the database, I should be able to retrieve them.
  • Conventions are Conventions. Often I hear: ‘don’t do that’; or, ‘this is an internal type, don’t use it’. While I as a human might follow this advice, the software (aka MongoDB) might or might not know about it. So if it does not enforce constraints at its interface, then I assume there is no technical need for these constraints in order for MongoDB to work properly.

This philosophy might sound obvious in a database context, but along the steps of the journey I’ll refer back to it at various places as it makes a difference.

Starting Point: Creating documents for all JavaScript Types

The blog series will be based on a specific collection and its set of documents. Each document represents a BSON type and contains a property with one type and additional ‘helper’ properties. This will serve as the basis for query execution. Each document contains three properties:

  • “x”: This property contains a different data value in each document
  • “comment”: this property describes “x” as originally specified. When the document is queried, it is clear what the original specification of “x” was. This is necessary as the node.js driver modifies documents on insert.
  • “btype”: this property contains the BSON type designator. When the document is queried, the type is clear without having to query it explicitly.

The next blog in this series will discuss this collection of documents in detail.

Where do we go from here?

The various parts of the blog series will lead us into many corners of the JavaScript data types and how they are mapped to MongoDB (and vice versa: how the BSON types are mapped back to JavaScript).

The goal is to discuss all aspects; and the blogs in the series are not static, meaning that if changes or new insights come up, the blogs will be updated.

So, let’s go.

Negation and Complement in a Multi-Schema World

Negation and complement are not so straight forward in a multi-schema world where different documents in a collection follow different schemas. Let’s start with an example.

Example: Count()

> db.nc.save({"a":"value", "b":"v1"})
> db.nc.save({"a":"value", "b":null})
> db.nc.save({"a":"value"})

Now let’s count how many documents are there with ‘b’ having a value of either ‘null’ or not ‘null’.

> db.nc.find({"$or":[{"b":null}, {"b":{$ne:null}}]}).count()
3

Is 3 the correct result?

Basing the query on $type instead results in the same count:

> db.nc.find({"$or":[{"b":{$type:10}}, {"b":{$not:{$type:10}}}]}).count()
3

Is that a correct count?

Interpretation of ‘null’

The query system of MongoDB interprets a query for a property with value ‘null’ in a specific way: if the property is present, then it is checked whether the property has the value ‘null’. If the property is not present, then it is interpreted as being present with the value ‘null’.

In this interpretation, the above query results are correct.

Alternative Interpretation of ‘null’

If only those documents should be considered that actually have the property present, the queries have to be extended to check for that.

> db.nc.find({"$and":[{"b":{$exists:true}}, {"$or":[{"b":null}, {"b":{$ne:null}}]}]}).count()
2

and

> db.nc.find({"$and":[{"b":{$exists:true}}, {"$or":[{"b":{"$type":10}}, {"b":{$not:{"$type": 10}}}]}]}).count()
2

In order to change this interpretation, an additional condition is added that checks for the existence of the property being queried. Luckily, the desired interpretation was possible by changing the query alone, without requiring any code.

Looking Beyond ‘null’

Investigating the behavior around ‘null’ was a relatively small effort as it is quite compact. However, with additional operators like ‘$lt’, ‘$gt’, etc., more situations with specific interpretations can be found, and they have to be carefully thought through when implementing.

There are three basic rules:

  • If every document has the same schema (meaning, the same properties with the same names and types), then queries can be written without caveats wrt. multi-schema collections.
  • If documents have different schemas in the sense that some contain a property and some do not, then queries should be written in such a way that the absence of a property does not cause a wrong result or a wrong interpretation.
  • The same is true for documents that have the same properties but with different types (e.g., a property is sometimes a string and sometimes a number). In this case the queries should be written in such a way that the different types do not lead to a wrong interpretation of the result.

A combination of the three cases is possible, meaning that, for example, all documents have a property ‘p1’ of the same type, a property ‘p2’ that is sometimes absent, and a property ‘p3’ that is sometimes absent and sometimes of different types.

The only alternative to being really careful with the query construction and interpretation is to transform all documents so that they all follow the same schema (global schema for that collection). The particular context and situation determines what the best approach is.

So what’s the Big Deal?

The issues around queries in a multi-schema collection are probably less of a big deal if you are in complete control of your code. In this case you can make sure yourself that the correct interpretation is followed everywhere (by encapsulation, constraint checking, code review or mandate).

However, if you are using libraries as-is and you do not have the ability to change them, you need to make sure that their internal interpretation matches yours. If not, errors can sneak easily into your code base.

On top of that, sometimes results determined by code are checked by running queries on the command line or shell. It is not always obvious whether e.g. an ‘$exists’ was applied in the code or not. So ‘checking’ results using the command line has its own challenges.

Null, Undefined, NaN and Missing Property: Goto Considered Harmful (Part 3)

It turns out there is a lot more to discuss about MongoDB and its handling of basic and complex data types. That warrants a separate series of blogs altogether. Here is one more part in this three part series; a future new series will focus exclusively on MongoDB data types.

MongoDB and ‘undefined’

As discussed before, it is possible to store a document in MongoDB that has a property with the value ‘undefined’. However, it turns out that MongoDB changes ‘undefined’ to ‘null’ on insert, so a change in value takes place implicitly. While MongoDB accepts the value ‘undefined’ on its interface, storage-wise it is changed to ‘null’.

As a consequence, it is not possible to query for ‘undefined’. When querying for ‘null’, all documents will be returned that were originally stored with the value ‘null’, as well as those stored with ‘undefined’ (as ‘undefined’ was changed to ‘null’).
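
A short sketch with the node.js driver illustrates this (using the “collection” variable from the earlier parts of this series):

collection.insert({"x": undefined}, {safe: true}, function (err, result) {
    // the stored document is {"x": null, "_id": ObjectId(...)}:
    // 'undefined' was changed to 'null' on insert
});
// consequently, a query for null, e.g. collection.find({"x": null}),
// matches this document; the original 'undefined' cannot be queried for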

Implication

The implicit change from ‘undefined’ to ‘null’ has a few important implications:

  • If an application makes a semantic distinction between ‘undefined’ and ‘null’, that distinction will not be kept when storing ‘undefined’ and ‘null’ as MongoDB will not keep those as separate values. Instead, the application must record the fact that a property value is ‘undefined’ in a different way (e.g., with an embedded document or a specific value that it interprets as ‘undefined’).
  • MongoDB applies implicit data value changes, and in specific cases the values being put in are not the values being returned when queried; ‘undefined’ is one of the values that will be changed. Applications must be aware of this in order to decide whether this is OK for them or not. If not, a different representation must be used by the application that does not fall into this category.

Conclusion

This 3-part mini-series started out looking at aggregation in the presence of various data types and brought up a few examples that show different interpretations when aggregating. These interpretations are essential when using MongoDB.

A side effect of the mini-series is that a larger discussion of data types and MongoDB opened up and is necessary to fully understand MongoDB’s behavior. A future blog series will address data types exclusively.

MongoDB: … $where a < b

$where expands the expressiveness of MongoDB queries, giving programmers an important choice when querying MongoDB.

Use Case

The use case for this blog is to find all documents where the value of property “b” is larger than the value of property “a”. The data set is

> use where
switched to db where
> db.whereColl.save({"a":3, "b":5})
> db.whereColl.save({"a":4, "b":5})
> db.whereColl.save({"a":5, "b":5})
>

Let’s try a simple query first (ObjectId values replaced by “…” to save space):

> db.whereColl.find({"b": {$gt : 4}})
{ "_id" : ObjectId("..."), "a" : 3, "b" : 5 }
{ "_id" : ObjectId("..."), "a" : 4, "b" : 5 }
{ "_id" : ObjectId("..."), "a" : 5, "b" : 5 }
>

This query works quite nicely when asking for all documents where “b” is greater than 4. Let’s now try to run the query we set out for:

> db.whereColl.find({"b": {$gt : "a"}})
>

This query does not return a result, most likely because property “b” contains numbers while “a” is a string constant. Let’s try a different approach:

> db.whereColl.find({"b": {$gt : a}})
Tue Dec 14 17:59:33 ReferenceError: a is not defined (shell):1
>

This query does not return the desired result, either. The reason is that “a” as a variable is not defined.

$where

Enter $where. $where gives the ability to ask the database to execute a JavaScript function (or expression) during query processing. Let’s try this query:

> db.whereColl.find({$where: "this.b > this.a"})
{ "_id" : ObjectId("..."), "a" : 3, "b" : 5 }
{ "_id" : ObjectId("..."), "a" : 4, "b" : 5 }
>

This query works. “this” points to the current document at evaluation time and supports access to the properties of the document. This example shows that $where allows expressions to be evaluated by the database that cannot be formulated using the regular query approach of MongoDB.

Discussion

Leaving expressiveness aside, the MongoDB documentation gives variations on the $where as well as performance hints: http://www.mongodb.org/display/DOCS/Server-side+Code+Execution#Server-sideCodeExecution-{{%24where}}ClausesandFunctionsinQueries.

Given the strong warning about the use of $where, when does it make sense to use it? First, performance measurement with and without $where is a good strategy, because if the database does not execute the condition in $where, the application code has to do it. There might be cases where the database is faster.

From an application design perspective, it is always possible to use $where in order to get the queries correct. And later, optimize by removing the $where and moving the expression into the application programs if the $where turns out to be slower. So there is an advantage from a design perspective for sure.