When using NoSQL databases, the notion of ‘schema’ enters the picture sooner or later. But when is the best time? And what to do about it?
Global vs. Local vs. Mixed Document Schema
One of the first distinctions is a global vs. local vs. mixed schema. A global schema is a schema that is defined for a given set of documents, e.g., a collection or table of documents. Every document in the collection must comply to the schema defined for the collection.
A local schema is a schema for a single document. Every document can have its own schema. It is possible that several documents follow the same schema. However, those are in general not grouped based on their schema.
A mixed schema is in part a global schema, and in part a local schema. This means that a document must contain certain properties as defined by the global schema, and the local schema allows additional schema elements on a per-document basis.
In terms of system examples:
- Oracle NoSQL [http://www.oracle.com/us/products/database/nosql/] follows the local schema approach.
- Documents in Oracle NoSQL are grouped by keys. Each document can have its own schema, aka, a local schema.
- MongoDB [http://www.mongodb.org/] follows the mixed schema approach.
- There is one property that must be present: ‘_id’ in all documents across all MongoDB collections and it must be unique. In this sense, MongoDB does not follow a pure local schema approach as one mandatory property is specified globally.
- FoundationDB [https://foundationdb.com/] follows the global schema approach.
- FoundationDB implements the concept of ‘table groups’ and supports query results to be serialized as JSON objects. However, from the viewpoint of the data model, it is relational and the hierarchical structure (aka, sub-documents) comes into play through foreign keys and SQL extensions that have been explored a long time ago in context of NF2 relations.
- Oracle 12c [http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246] follows the local schema approach.
- A JSON document is stored in a column of a table. That column is completely schema-free so that JSON documents of any schema can be stored and hence the schema is local.
Explicit vs. Implicit Document Schema
A second important distinction is an explicit vs. an implicit schema (or extensional vs. intentional schema). An extensional schema is defined through a schema representation format (e.g., Avro [http://avro.apache.org/]) or SQL-style DDL statements. An intentional schema does not have a separate representation, but it can be derived from the structure of a document instance (not always unambiguously).
In terms of system examples:
- Oracle NoSQL: supports both, explicit and implicit schemas
- MongoDB: implicit schema
- FoundationDB: explicit schema
- Oracle 12c: implicit schema
From a different perspective an implicit schema means that a document can be stored as it is without having to define a schema for it and without checking that it conforms to a schema. An explicit schema requires the schema to be defined and that documents are compliant before they can be stored successfully.
The ‘edge case’ is MongoDB that enforces the property ‘_id’ in every document. If it is not present, it will be automatically added. While MongoDB supports mainly an implicit schema, ‘_id’ is the exception.
Mandatory Document Schema
A third important distinction is the requirement for a mandatory schema before documents can be stored. If the database requires a schema, documents cannot be inserted before the schema is specified. If a schema is not mandatory, documents can be stored without having a schema in place.
In terms of system examples:
- Oracle NoSQL: no mandatory schema
- MongoDB: no mandatory schema
- FoundationDB: requires a mandatory schema
- Oracle 12c: no mandatory schema
Roles of Database Users
Even though one could get the impression, not everybody using a database in his or her role necessarily likes a document database where every document can have its own schema (schema-less, or better, schema-varying database). Some users do, some users don’t. For the sake of discussion, let’s distinguish two roles in this blog:
- Data Collector. A data collector is a role for collecting data initially. A data collector determines interesting data to store and that data might or might not be used downstream for further processing. However, it is important for some data to be collected in case it becomes important down the road. And it is important to store data unmodified as the rules of modification (e.g., cleansing, transformation) might not be known at the time of the data collection.
- Data User. A data user fundamentally applies all CRUD (create, read, update, delete) operations on a data set and in most cases through an application system that implements the business logic as well as the business rules. The data user is familiar with the business logic and the business rules in context of the application domain, like a financial application or a forecasting tool.
A data user can be a data collector also as the C (create) function creates data. In this case a data user can be a data collector as well.
Schema First or Schema Second?
From the viewpoint of the two roles, schema management plays an important role. In a black-and-white categorization, the two roles have the following desires:
- Data Collector. A data collector’s goal is to collect relevant or potentially relevant data. Depending on the data sources, there might not be time or opportunity to define a schema first, and it might not be feasible to maintain a schema in the long run (including schema migration). A data collector therefore likes
- local schema – implicit schema – not mandatory schema (‘schema second’)
- Data User. A data user has to accomplish work and prefers clear-cut business rules and business functionality so that the application semantics is clear. A data user therefore likes
- global schema – explicit schema – mandatory schema (‘schema first’)
The data collector stores data into a data collector database, and the database of the data user is called data user database. Both databases can be the same, or they can be two different databases, as discussed later.
A data collector, after having collected data, might want to query it for e.g. statistical and analytical purposes (e.g., how much data was collected, how many documents have a reference to a product description, etc.). So a data collector would prefer to have a schema after the collection of data in support of query formulation; so schema second.
A data user rarely operates on data directly, however, software engineers have to implement the business logic and business rules. Software engineers, for sure, enjoy certainty when it comes to a schema as variations cause significant code complexity in the general case; so schema first.
Co-existence of Schema First and Schema Second?
In the ideal case, both approaches, schema first and schema second are supported at the same time. This would make both roles happy and support their particular use cases. Let’s explore a few options:
- ETL (extract – transform – load) from data collector database to data user database
- This approach suggests an explicit transformation step that extracts data from the data collector database and adds it to the data user database. Along the way data type transformations can take place as well as handling of null values, absent properties, and other data modeling specifics. The extraction can be partial so that only relevant data are extracted. However, data duplication (at least partial) is one downside, amongst others.
- View on data collector database
- This approach creates a view through which access is provided on a single database (data collector and data user combined). A view could deal with the various transformation tasks; however, updates and deletions might be difficult or even impossible. So this is a potential solution only for the case of read access.
- Automatic schema extraction
- A schema can be derived from an document. It is therefore possible that for the data collector database the set of all possible schema can be made available to the data user. If the data user creates a super-set then a global schema is available. Of course, for a given document the schema only partially applies and the business functionality and business rules have to be aware of this. In this case also, the data collector and data user database are the same.
- Intelligent ORM Layer
- An ORM layer could provide the impression of a fixed document structure towards the data user, while being able to deal with the heterogeneous document schemas internally. If the ORM layer is flexible enough, it can provide updates as well as delete functionality, and if necessary, an extension mechanism to add custom code in order to make the delete or update functionality specific to the given document set. In this case also, the data collector and data user database are the same.
From an implementation perspective an ORM layer seems to be a practical approach as it allows to separate the transformation and update/delete logic from the application logic, while operating on a single database. However, every data access has to execute some transformation logic in general.
If space is of secondary concern or if the data set for the data user is a lot smaller than that of the data collector, the ETL approach might be preferable as the transformation logic is separate from the data access logic of the application systems.
Automatic schema extraction is certainly helpful in all cases as the schema has to be known in order to implement the ORM layer or the ETL component. If NoSQL databases start implementing a view mechanism then this might be preferable for read-only access situations.
It is, of course, tempting to ask if there is an easy and elegant way to deal with local schemas in application systems? Is is possible to write an application system that does not require a schema in the first place?
However, this is a discussion on its own and I’ll save it for a later blog.
When is a good time to deal with the notion of ‘schema’ in NoSQL database projects? As the discussion has shown, different databases provide different schema support and different users look for schema support at different point in the data life cycle, if at all.
So unless schema is completely irrelevant in your project (and will stay irrelevant for sure), the discussion cannot start early enough because depending on its importance it might influence the database selection as well as the overall product architecture and implementation effort around schema maintenance and enforcement.