NoSQL Databases: Data First, Schema Second? Or Vice Versa?

When using NoSQL databases, the notion of ‘schema’ enters the picture sooner or later. But when is the best time? And what to do about it?

Global vs. Local vs. Mixed Document Schema

One of the first distinctions is between a global, a local, and a mixed schema. A global schema is a schema that is defined for a given set of documents, e.g., a collection or table of documents. Every document in the collection must comply with the schema defined for the collection.

A local schema is a schema for a single document. Every document can have its own schema. It is possible that several documents follow the same schema. However, those are in general not grouped based on their schema.

A mixed schema is in part a global schema, and in part a local schema. This means that a document must contain certain properties as defined by the global schema, and the local schema allows additional schema elements on a per-document basis.

In terms of system examples:

  • Oracle NoSQL [http://www.oracle.com/us/products/database/nosql/] follows the local schema approach.
    • Documents in Oracle NoSQL are grouped by keys. Each document can have its own schema, aka, a local schema.
  • MongoDB [http://www.mongodb.org/] follows the mixed schema approach.
    • One property must be present in all documents across all MongoDB collections: ‘_id’, and its value must be unique within a collection. In this sense, MongoDB does not follow a pure local schema approach, as one mandatory property is specified globally (see the shell example after this list).
  • FoundationDB [https://foundationdb.com/] follows the global schema approach.
    • FoundationDB implements the concept of ‘table groups’ and supports serializing query results as JSON objects. However, from the viewpoint of the data model, it is relational, and the hierarchical structure (aka, sub-documents) comes into play through foreign keys and SQL extensions that were explored a long time ago in the context of NF² (non-first-normal-form) relations.
  • Oracle 12c [http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246] follows the local schema approach.
    • A JSON document is stored in a column of a table. That column is completely schema-free, so JSON documents of any schema can be stored; hence the schema is local.
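
As an illustration of the mixed approach in MongoDB, the following shell session (the collection name ‘mixed’ is chosen purely for illustration) stores two documents with entirely different local structures in one collection, while the globally mandated ‘_id’ is present in both:

> db.mixed.save({"_id": 1, "name": "product", "tags": ["a", "b"]})
> db.mixed.save({"_id": 2, "price": 9.99})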

Explicit vs. Implicit Document Schema

A second important distinction is an explicit vs. an implicit schema (or extensional vs. intensional schema). An extensional schema is defined through a schema representation format (e.g., Avro [http://avro.apache.org/]) or SQL-style DDL statements. An intensional schema does not have a separate representation, but it can be derived from the structure of a document instance (not always unambiguously).

In terms of system examples:

  • Oracle NoSQL: supports both explicit and implicit schemas
  • MongoDB: implicit schema
  • FoundationDB: explicit schema
  • Oracle 12c: implicit schema

From a different perspective, an implicit schema means that a document can be stored as it is, without having to define a schema for it and without checking that it conforms to a schema. An explicit schema requires the schema to be defined, and documents must be compliant with it before they can be stored successfully.

The ‘edge case’ is MongoDB, which enforces the property ‘_id’ in every document. If it is not present, it is added automatically. So while MongoDB supports mainly an implicit schema, ‘_id’ is the exception.
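
This behavior is easy to observe in the shell (the collection name ‘items’ and the ObjectId value below are illustrative; the server generates a fresh id):

> db.items.save({"a": "value"})
> db.items.findOne()
{ "_id" : ObjectId("53d1f9a0e4b0c2a1d4e5f607"), "a" : "value" }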

Mandatory Document Schema

A third important distinction is the requirement for a mandatory schema before documents can be stored. If the database requires a schema, documents cannot be inserted before the schema is specified. If a schema is not mandatory, documents can be stored without having a schema in place.

In terms of system examples:

  • Oracle NoSQL: no mandatory schema
  • MongoDB: no mandatory schema
  • FoundationDB: requires a mandatory schema
  • Oracle 12c: no mandatory schema

Roles of Database Users

Even though one could get that impression, not everybody using a database in his or her role necessarily likes a document database where every document can have its own schema (a schema-less, or better, schema-varying database). Some users do, some users don’t. For the sake of discussion, let’s distinguish two roles in this blog:

  • Data Collector. A data collector is a role for collecting data initially. A data collector determines interesting data to store, and that data might or might not be used downstream for further processing. Still, it is worthwhile to collect some data in case it becomes relevant down the road. And it is important to store data unmodified, as the rules of modification (e.g., cleansing, transformation) might not be known at the time of the data collection.
  • Data User. A data user fundamentally applies all CRUD (create, read, update, delete) operations on a data set and in most cases through an application system that implements the business logic as well as the business rules. The data user is familiar with the business logic and the business rules in context of the application domain, like a financial application or a forecasting tool.

A data user can be a data collector as well, since the C (create) operation of CRUD creates data.

Schema First or Schema Second?

From the viewpoint of the two roles, schema management plays an important part. In a black-and-white categorization, the two roles have the following desires:

  • Data Collector. A data collector’s goal is to collect relevant or potentially relevant data. Depending on the data sources, there might not be time or opportunity to define a schema first, and it might not be feasible to maintain a schema in the long run (including schema migration). A data collector therefore likes
    • local schema – implicit schema – not mandatory schema (‘schema second’)
  • Data User. A data user has to accomplish work and prefers clear-cut business rules and business functionality so that the application semantics is clear. A data user therefore likes
    • global schema – explicit schema – mandatory schema (‘schema first’)

The data collector stores data into a data collector database, and the data user’s database is called the data user database. The two can be the same database, or two different databases, as discussed later.

A data collector, after having collected data, might want to query it, e.g., for statistical and analytical purposes (how much data was collected, how many documents have a reference to a product description, etc.). So a data collector would prefer to have a schema after the collection of data, in support of query formulation: schema second.

A data user rarely operates on data directly; instead, software engineers have to implement the business logic and business rules. Software engineers, for sure, enjoy certainty when it comes to a schema, as variations cause significant code complexity in the general case: schema first.

Co-existence of Schema First and Schema Second?

In the ideal case, both approaches, schema first and schema second, are supported at the same time. This would make both roles happy and support their particular use cases. Let’s explore a few options:

  • ETL (extract – transform – load) from data collector database to data user database
    • This approach suggests an explicit transformation step that extracts data from the data collector database and adds it to the data user database (a minimal shell sketch follows this list). Along the way, data type transformations can take place, as well as the handling of null values, absent properties, and other data modeling specifics. The extraction can be partial so that only relevant data are extracted. However, data duplication (at least partial) is one downside, amongst others.
  • View on data collector database
    • This approach creates a view through which access is provided on a single database (data collector and data user combined). A view could deal with the various transformation tasks; however, updates and deletions might be difficult or even impossible. So this is a potential solution only for the case of read access.
  • Automatic schema extraction
    • A schema can be derived from a document. It is therefore possible that, for the data collector database, the set of all occurring schemas can be made available to the data user. If the data user creates a super-set, then a global schema is available. Of course, for a given document the schema only partially applies, and the business functionality and business rules have to be aware of this. In this case also, the data collector and data user database are the same.
  • Intelligent ORM Layer
    • An ORM layer could provide the impression of a fixed document structure towards the data user, while being able to deal with the heterogeneous document schemas internally. If the ORM layer is flexible enough, it can provide updates as well as delete functionality, and if necessary, an extension mechanism to add custom code in order to make the delete or update functionality specific to the given document set. In this case also, the data collector and data user database are the same.
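
As a sketch of the ETL option above, the following snippet, runnable in the mongo shell, copies documents from a ‘collector’ collection into a ‘user_data’ collection and normalizes one optional property along the way; the collection and property names are assumptions for illustration:

db.collector.find({"product": {$exists: true}}).forEach(function (doc) {
  // transformation step: 'price' is sometimes a string in the collector data
  if (typeof doc.price === "string") { doc.price = parseFloat(doc.price); }
  db.user_data.save(doc);
});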

From an implementation perspective, an ORM layer seems to be a practical approach as it allows separating the transformation and update/delete logic from the application logic, while operating on a single database. However, every data access has to execute some transformation logic in general.

If space is of secondary concern or if the data set for the data user is a lot smaller than that of the data collector, the ETL approach might be preferable as the transformation logic is separate from the data access logic of the application systems.

Automatic schema extraction is certainly helpful in all cases as the schema has to be known in order to implement the ORM layer or the ETL component. If NoSQL databases start implementing a view mechanism then this might be preferable for read-only access situations.

Schema-Varying Languages?

It is, of course, tempting to ask if there is an easy and elegant way to deal with local schemas in application systems. Is it possible to write an application system that does not require a schema in the first place?

While this is a huge topic on its own, server-side JavaScript might be a good place to start, as the language is not based on a class/instance paradigm but on a prototype approach. Its type system is almost equivalent to JSON. The language, therefore, is able to represent documents with a local schema easily and effortlessly. Since JavaScript can introspect objects, and since it implements the prototype mechanism, it is possible to represent local functionality for documents with a local schema. One of the bigger questions is how to represent this flexibility to the end user on user interfaces in an ergonomic way.
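
As a small taste of that introspection, a few lines of JavaScript suffice to compute a document’s (implicit) schema at runtime; this is a minimal sketch that maps top-level properties to their types (nested objects and arrays would need more care):

var doc = {"a": "value", "b": 42};
var schema = {};
Object.keys(doc).forEach(function (key) {
  schema[key] = typeof doc[key]; // results in {"a": "string", "b": "number"}
});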

However, this is a discussion on its own and I’ll save it for a later blog.

Summary

When is a good time to deal with the notion of ‘schema’ in NoSQL database projects? As the discussion has shown, different databases provide different schema support, and different users look for schema support at different points in the data life cycle, if at all.

So unless schema is completely irrelevant in your project (and will stay irrelevant for sure), the discussion cannot start early enough, because, depending on its importance, it might influence the database selection as well as the overall product architecture and the implementation effort around schema maintenance and enforcement.

Negation and Complement in a Multi-Schema World

Negation and complement are not so straightforward in a multi-schema world where different documents in a collection follow different schemas. Let’s start with an example.

Example: Count()

> db.nc.save({"a":"value", "b":"v1"})
> db.nc.save({"a":"value", "b":null})
> db.nc.save({"a":"value"})

Now let’s count how many documents there are with ‘b’ having a value of either ‘null’ or not ‘null’.

> db.nc.find({"$or":[{"b":null}, {"b":{$ne:null}}]}).count()
3

Is 3 the correct result?

Basing the query on $type instead results in the same count:

> db.nc.find({"$or":[{"b":{$type:10}}, {"b":{$not:{$type:10}}}]}).count()
3

Is that a correct count?

Interpretation of ‘null’

The query system of MongoDB interprets a query for a property with value ‘null’ in a specific way: if the property is present, it is checked whether the property has the value ‘null’. If the property is not present, the document is treated as if the property were present with the value ‘null’.
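
This interpretation can be verified directly on the example collection: the ‘null’ query matches both the document with ‘b’ set to ‘null’ and the document without ‘b’, while the complementary query matches the remaining document, so the two counts add up to 3.

> db.nc.find({"b": null}).count()
2
> db.nc.find({"b": {$ne: null}}).count()
1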

In this interpretation, the above query results are correct.

Alternative Interpretation of ‘null’

If only those documents that actually have the property present should be considered, the queries have to be extended to check for that.

> db.nc.find({"$and":[{"b":{$exists:true}}, {"$or":[{"b":null}, {"b":{$ne:null}}]}]}).count()
2

and

> db.nc.find({"$and":[{"b":{$exists:true}}, {"$or":[{"b":{"$type":10}}, {"b":{$not:{"$type": 10}}}]}]}).count()
2

In order to change this interpretation, an additional condition is added that checks for the existence of the property being queried. Luckily, the desired interpretation was possible by changing the query, without requiring any code.

Looking Beyond ‘null’

Investigating the behavior around ‘null’ was a relatively small effort as it is quite compact. However, with additional operators like ‘$lt’, ‘$gt’, etc., more situations with specific interpretations can be found, and they have to be carefully thought through when implementing.
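
For instance, on the same three example documents, ‘$lt’ only matches documents where ‘b’ is present and of a comparable type, while ‘$not’ also matches documents where ‘b’ is absent or has the value ‘null’:

> db.nc.find({"b": {$lt: "v2"}}).count()
1
> db.nc.find({"b": {$not: {$lt: "v2"}}}).count()
2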

There are three basic rules:

  • If every document has the same schema (meaning, the same properties with the same names and types), then queries can be written without caveats with respect to multi-schema collections.
  • If documents have different schemas in the sense that some contain a property and some do not, then queries should be written in such a way that the absence of a property does not cause a wrong result or a wrong interpretation.
  • The same is true for documents that have the same properties but different types (e.g., a property is sometimes a string and sometimes a number). In this case, the queries should be written in such a way that the different types do not lead to a wrong interpretation of the result.

A combination of the three cases is possible, meaning that, for example, all documents have a property ‘p1’ of the same type, a property ‘p2’ that is sometimes not present, and a property ‘p3’ that is sometimes not present and sometimes of different types.

The only alternative to being really careful with query construction and interpretation is to transform all documents so that they all follow the same schema (a global schema for that collection). The particular context and situation determine what the best approach is.

So what’s the Big Deal?

The issues around queries in a multi-schema collection are probably less of a big deal if you are in complete control of your code. In this case you can make sure yourself that the correct interpretation is followed everywhere (by encapsulation, constraint checking, code review or mandate).

However, if you are using libraries as-is and you do not have the ability to change them, you need to make sure that their internal interpretation matches yours. If not, errors can sneak easily into your code base.

On top of that, the results determined by code are sometimes checked by running queries on the command line or in a shell. It is not always obvious whether, e.g., an ‘$exists’ was applied in the code or not. So ‘checking’ results using the command line has its own challenges.

Schema Derivation: Schema-less vs. Schema-based Documents

In the world of document-oriented databases there is the notion that documents are ‘schema-less’ because no schema has to be defined and registered with the database before documents can be stored. Sometimes this is also referred to as ‘schema-free’. Let’s investigate this further: does ‘schema-free’ really exist?

JSON

In MongoDB, for example, documents use the JSON (http://www.json.org/) structure (with an extended set of types as defined in BSON: http://bsonspec.org/). So all documents in the database follow the same underlying data type structure. The database will reject a document that is not JSON/BSON compliant.

Schema Enforcement vs. Schema Freedom

In a relational database the notion of schema means that the database knows about the schema (e.g. structure of a table) and all data stored must follow that structure. The database management system is the enforcing entity of the schema. Data not following the schema will be rejected.

In a document-oriented system like MongoDB the notion of schema is not known to the database and therefore not enforced by the database. Any document can have any structure as long as it is valid according to JSON.

Explicit Schema vs. Implicit Schema

In the case of relational databases, the schema is explicit as it is possible to retrieve the schema as a separate data structure from the database. Relational databases have system tables that contain their schema. So the database is aware of the schema in such a way that it can be retrieved.

In a document-oriented database it is not possible to inquire about the schema, as it has not been defined to the database. However, it could be the case that all documents follow a schema (one that is defined outside the database). If this is the case, the schema of the documents can be called an implicit schema. It is not necessary for the database to enforce a schema for the documents to comply with one.

Schema-less, Schema-free

The fact that a document-oriented database does not enforce a schema on its documents does not mean that there is no schema. It only means that the database does not enforce it. However, it could be enforced elsewhere. For example, the software that writes the documents can enforce their schema. Or the software that reads the documents enforces the schema by only accepting those documents that comply with its expectations.
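
As a sketch of such writer-side enforcement (the collection name ‘orders’ and the expected properties are assumptions for illustration), the writing application can validate a document before handing it to the database:

function saveOrder(doc) {
  // the expected (implicit) schema: 'customer' is a string, 'total' is a number
  if (typeof doc.customer !== "string" || typeof doc.total !== "number") {
    throw new Error("document does not comply with the expected schema");
  }
  db.orders.save(doc);
}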

But what if a user manually defines a document and stores it?

Individual Schema vs. Global Schema

A different viewpoint is possible. If a user stores a manually designed document, then it is possible to view this document as having an individual schema; it is the only one with that schema. If the user now adds a second document that has the exact same structure, then it is possible to view these as two documents with the same implicit schema. At this point there are two documents following the same (implicit) schema. It got a bit more global, though, as now two documents are following it. If all documents follow the same (implicit) schema, then there is an (implicit) global schema.

Different documents can follow different (implicit) schemas. In the extreme case, all documents are different from each other in their structure and each has its own schema.

Schema Derivation

JSON is a construction principle for hierarchical data structures. Each document following the JSON model is hierarchical and finite as JSON does not support (cyclic) references.

Based on the hierarchical and finite structure of JSON, it is possible to derive the implicit schema of the JSON documents and make it explicit. On a very high level, the schema derivation algorithm iterates over all documents in the database and determines the schema for each of them. Documents that follow the same schema are grouped together. After all documents have been inspected, there will be one or more schemas, each with its associated complying documents.
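
A minimal sketch of this high-level idea, runnable in the mongo shell (not the detailed algorithm, which is deferred as noted below): each document’s schema is summarized as a sorted list of property/type pairs, and document ids are grouped by that signature.

function schemaOf(doc) {
  return Object.keys(doc).filter(function (key) {
    return key !== "_id"; // ignore the auto-generated '_id' for grouping
  }).sort().map(function (key) {
    var value = doc[key];
    var type = (value === null) ? "null"
             : (value instanceof Array) ? "array"
             : (typeof value === "object") ? "{" + schemaOf(value) + "}"
             : typeof value;
    return key + ":" + type;
  }).join(",");
}

var groups = {}; // maps schema signature -> ids of complying documents
db.nc.find().forEach(function (doc) {
  var signature = schemaOf(doc);
  (groups[signature] = groups[signature] || []).push(doc._id);
});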

Details of this algorithm will be discussed in a different blog at some later time.

Summary

Documents have a schema; they are not schema-less or schema-free. The schema of a document is implicit (and can be made explicit). Several documents can follow the same schema. While the database does not enforce the schema, it could be enforced by a separate layer outside the database.

So, non-enforcement of a schema by the database != schema-less/schema-free.