SQL for JSON and Schema Support (Part 7): Partial Schemas for JSON Documents

This blog concludes the series on JSON schema support in context of databases with a proposal: partial schemas.

Black and White: Rigorous Schema or No Schema At All

So far the discussion around JSON schema support in databases was a black and white (or a all-or-nothing) discussion, aka, a JSON document can either change at any time or it must comply completely with a rigorous schema (structure, data types and possibly values) at any time.

The any-change-at-any-time situation is the “wild west” in terms of structure and data type variation in the sense that code accessing JSON documents has to “protect” itself by e.g. means of assertions. Code has to assume that any structure can be present and therefore has to introspect each document at every access in order to determine if it can process it. Complex error handling is necessary if the code cannot process a specific JSON document (as discussed in an earlier blog).

More than All-Or-Nothing Approach: Rationale for Reconsideration

The ability to change JSON documents randomly is portrait as a desirable and positive feature by many JSON supporting databases, however, from an engineering perspective this “randomness” is adding significant engineering and coding complexity (and significant effort to quality assurance as well).

Are random (“dynamic”) document structures really the 80% case in real situations? Or is in reality a more nuanced situation at hand?

Many documents implement the content and state of business data (functional data). In most cases there are basic data sets to be supported. For example, purchase orders must have an overall order value field, health records must have a unique record identifier including some basic patient data, user accounts must have a unique account identifier and at least an account login name. There is usually a core set of attributes that represents the commonly agreed set of business data and those are mandatory as well in order for the business data to be meaningful.

Each (business) domain has its own commonly agreed core semantics and therefore commonly agreed data attributes. Code should be able to assume that the core semantics is complied to and that the core attributes are present in every JSON document representing the domain semantics.

So maybe it is not it really all-or-nothing in terms of the dynamic structure of JSON documents all the time. Maybe it would be really useful to be able to specify a schema that applies to (only) a part of a JSON document at any time without enforcing complete compliance: partial schemas.

Partial Schema Rationalization

If a JSON document must be partially compliant, then one part of it must comply to a given partial schema, and it can have any form outside of the part governed by the partial schema.

For example, a partial schema for a purchase order could enforce that a purchase order must have an purchase order identifier, a buyer identifier, a set of line items, and a purchase order amount at any point in time. A purchase order could contain additional supplemental data that are not governed by the schema (the dynamic part).

With a rigorous partial schema in place code is guaranteed that the attributes defined by the schema are present. This makes the executions of assertions unnecessary and code can be sure that all documents are compliant with the partial schema.

Optionally a partial schema could define that some attributes must not always be present, but in case they are present, they have to comply to a given schema.

For example, if the full specification of a buyer is provided, not just the buyer identifier, then it must contain the buyer’s legal name, the legally registered address as well as a phone number that reaches the buyer. Similar for a shipping address: if one is provided, it has to be a full shipping address denoting a physical location.

With such a schema in place, code can check if specific optional items are present, and if they are, their schema is guaranteed, meaning, these items comply with their definitions in the partial schema.

Finally, a partial schema does not govern any structure outside the mandatory or optional items and it is possible for a JSON document to contain additional data that is not enforced in terms of presence or structure.

For example, items not enforced by a schema could be supporting documentation for line items like drawings, notes to shippers asking for specific handling, or a trustworthiness rating of a buyer in a purchase order. A user account could have a history of logins or a health record could contain the list of doctors that were involved in treatments over time.

For data items that are not enforced, code would have to use assertions in order to establish their presence (and their data types).

Partial Schema Specification

How would a partial schema specification look like? In principle, it could be an extension what has been discussed in earlier blogs of this series with a different semantics. First, any schema would be considered partial, meaning, a JSON document can contain data items that are not contained in the schema in addition to those specified by the partial schema.

Second, each specification in the schema has a “mandatory” and an “optional” designation defining the mandatory items, as well as the optional items. The following schema shows the possible extensions. There are two designations, “mandatory” and “optional” that indicate what the partial schema governs.

  "type": "object",
  "mandatory": [
  "optional": [
  "properties": {
    "orderId": {
      "type": "number",
      "description": "Order Identifier: must be of 
                      type number and is mandatory"
    "orderDate": {
      "type": "string",
      "description": "Order Date: must be of 
                      type string and is mandatory"
    "orderLineItems": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "itemId": {
            "type": "number"
          "numberOrdered": {
            "type": "number"
      "description": "Order Line Items: must be of 
                      type array and is mandatory"
    "buyer": {
      "type": "number",
      "description": "Buyer Identifier: must be of 
                      type number and is optional"

The above schema enforces the structure of JSON documents in part and specifies mandatory as well as optional attributes. There are three mandatory attributes and one optional attribute. Such a schema could be added to an identifiable collection of JSON documents ensuring thereby their partial compliance.

A change in a partial schema cannot take place without modifying the JSON documents as well since the JSON documents must be compliant to a partial schema at any time. Again, code is assured compliance of JSON documents with the attributes defined by the partial schema and code therefore does not have to protect itself.


This conclusion of the blog series on JSON schema support in databases provided a proposal for partial schemas. Partial schemas support structure and data type guarantees for code as well as support the freedom for dynamic schema changes outside the part of a JSON document that has to be compliant with a partial schema.

Following this approach

  • Code is guaranteed that JSON documents comply to a partial schema
  • An application has the freedom to add additional data dynamically not governed by the partial schema
  • Over time, as dynamic structures in a JSON document stabilize, those can be moved into the stable partial schema part of a JSON document.

Partial schemas are a win-win for both, databases supporting JSON documents as well as code accessing these databases.

Go [ JSON | Relational ] SQL!


The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.


SQL for JSON and Schema Support (Part 6): Dynamic Schema Change and Symbolic Computation

How does software engineering look like in context of the absence of schema support? What now?

Summary – Where Are We?

The structure of existing JSON objects can change at any time in a NoSQL database (in general). New, possibly by the code so far unseen structures of JSON objects can appear at any time in existing as well as in new documents.

Schema support does not provide any guarantees (see the MongoDB discussion in previous blogs) – neither for current objects’ structure nor for future objects’ structure, unless a NoSQL database enforces the schema as rigorously as a relational database management system (RDBMS).

Most NoSQL database management systems provide zero schema supervision in the strict sense (no guarantee that stored JSON documents are guaranteed to be schema compliant).

A developer cannot make any assumptions whatsoever about the schema of the stored JSON documents (maybe with the exception of the mandatory existence of an identifier property) unless the structure is rigorously enforced by every piece of code inserting or updating documents (in general a hard problem to solve).

Code makes assumptions when accessing JSON documents as it accesses properties at certain locations inside the JSON document and based on a specific data type. Those assumptions about location and type might or might not be true for every single JSON document that the code accesses.

So where are we? Basically, the code is on its own to deal with the structure and data type uncertainty. The following discussion is based on the situation that the underlying NoSQL database does not provide any strict schema guarantees.

Engineering Approaches

If there is no guarantee provided by a database system, the compliance check for each document has to be done by the code via assertions on the required structure and data types of values. Before a document is processed, it has to be asserted that the expected structure and data types exist as otherwise the processing would fail.

The following sections discuss one approach of run-time assertions and the phases of processing:

  • Run-time assertions
  • Assertion execution frequency
  • JSON document non-compliance
  • Non-compliance analysis and mitigation

In addition, alternative approaches to run-time assertions are discussed in order to improve efficiency.

As a side note, if code inserting or updating documents would ensure a specific schema rigorously, then code accessing documents will not have to implement assertions. However, this requires a very strict engineering discipline around current and future code that inserts or updates (including 3rd party tools that support direct database access). In many cases this is impossible, hence the emphasis here on what the accessing code can or has to do. This approach of guarantee by inserting or updating code will not be discussed further here.

Run-Time Assertions

Code accessing a document requires specific properties to be present and to be of a specific data type in order for it to execute its logic.

For example, if code needs to access a property like the value of an order, then the code needs to assert the existence of that property and the acceptable data types. Those might be number or string (as a string value could be converted to a number in the code):

assert(is_of_type(order.value, number) 
       || is_of_type(order.value, string))

Code must assert expectations before access in order to ensure that the document matches those expectations and does not fail because of document non-conformance. If there is no “value” property in an order, the code would not be able to access is.

The blog https://realprogrammer.wordpress.com/2016/11/29/sql-for-json-rationalization-part-11-json-specific-predicates/ introduced the two essential predicates used above (exists_path() and is_of_type()) in context of NoSQL Query.

Assertion Execution Frequency

Checking assertions affects code efficiency due to additional execution, so how often do assertions have to be checked? In principle, before every document access.

There is a variation between NoSQL databases that provide single document transactions and those providing multi document transactions. In the single document transaction case, documents can change at any time in a NoSQL database and so between two separate accesses, the document could have been changed by other code. Assertions have to be checked every time before accessing a document.

some_function() {
// retrieve document
// assert(...)
// execute logic

some_function_2() {
// retrieve document
// assert(...)
// execute logic

In case multi document transactions are supported, all assertions can be checked at once when the transaction is started, and does not have to be checked again within the same transaction.

// start transaction

// assert()

some_function() {
// retrieve document
// execute logic

some_function_2() {
// retrieve document
// execute logic

// end transaction

In general, code can operate on a single document or on a set of documents. A single document scope means that the code reads and possibly changes a single document only. A multi document scope means that the code reads one or more documents and possibly modifies one or more documents. For example, summing up the value of all not yet shipped orders and storing the sum in an analysis document. In either case, assertions have to be checked before every document access.

Executing assertions before every access is quite resource intensive at high processing rates. Later on variations of assertions are discussed that limit the effort and the resource consumption in order to make the use of assertions more practical or more efficient.

Note: MongoDB realized the enormous need for multi document transactions and is planning to implement those: https://www.mongodb.com/transactions.

JSON Document Non-Compliance (Violated Assertions)

If one or more assertions are violated, then the document is non-compliant from the viewpoint of the code accessing it. Assertion violation handling and analysis has to take place and there are different options:

  • The document could be ignored (which might or might not be a good option in context of the code’s functionality) if it is irrelevant or negligible for the code’s computation. The code could record the ignored documents for future off-line analysis.
  • An exception can be thrown and the code aborts its execution. This is a hard failure and ensures that no incorrect result is produced by the code.
  • Error handling can take place like for example the code “parks” the document for future manual analysis and processing. In this case the code produces a result with the disclaimer that there are unprocessed documents that since not processed might affect the code’s accuracy of the result. Compared to the first case the code returns an error or warning highlighting the potential inaccuracy (and in the best case the degree of inaccuracy).

In case of single document processing the handling is easier compared to multi document processing. In the latter case, the code might have updated several documents before reaching a non-compliant one. If the handling is to abort execution, then the already updated documents might have to be reverted back (compensation logic). If there are multi document transactions available the database takes care of reverting the documents as part of rolling back the transaction.

If multi document transactions are not supported alternative approaches have to be found. For example, the code could remember the previous document state itself for each document updated and therefore would be able to revert the state (code based compensation). However, those documents could have been changed in the meanwhile, adding further complexity to the compensation logic.

Non-Compliance Analysis and Mitigation

When code encounters non-compliant documents then mitigation has to be determined so that this case will not happen in the future anymore. There are several possible approaches (not necessarily complete):

  • Based on the non-compliant document analysis, the code is changed or extended to be able to process those cases as well. This would change the assertions so that in the future the documents would be considered compliant.
  • The code that inserts or updates the documents in the first place has to change in order to ensure that the assertions of subsequent accesses will be met. Code accessing the documents will have assurance that this non-compliance will not happen anymore.
  • It is acceptable to have manual post-processing. Instead of changing the code and its assertions, it is accepted that manual work is performed in order to produce an accurate result. This might require updating the result manually (like the total amount of all open orders).
  • It is acceptable to ignore non-compliant documents. This is a valid analysis result as well if the logic of the code allows the result to be based only on a subset of the documents.

Different approaches to mitigation have a different impact on the processing time as well as the availability of the improved code base. Of course, avoiding non-compliance in the first place would be best.

Thought Experiments on Alternative Assertion Implementation

Are there possible ways to reduce the amount of assertion processing? In the following approaches are discussed as thought experiment.

The first approach is shadow schema-per-document management for documents. A schema is not declared upfront and enforced by the database, but derived from a document right after its creation. Each document has its own schema and every time the document is updated, its schema is updated as required in order to remain in sync with the document itself.

Code accessing a document can check the document’s schema first before retrieving the document in order to determine if the assertions are met. The assertions are in this case expressed as schema interrogation instead of document structure access. Every time the schema is adjusted to match a document update, its version is increased.

This by itself does not have a huge saving in terms of access, possibly worse, as the schema has to be retrieved in addition to the document itself. However, in the case where many documents have the same schema, the code accessing documents could know which schemas are satisfying the assertions. So just knowing the schema identifier and its version might be sufficient to establish the compliance check. In addition, if a document contains a reference to its schema and version, then verifying assertions is simply comparing a schema name and version. This is highly efficient as the code can keep a list of compliant schemas and their versions (assuming many documents comply to the same schema).

A second and very different approach is to add assertions to the document itself and an indicator for each assertion if it is satisfied or not within the given document. Code accessing a document the first time checks its assertions, and also adds the assertions to the document. Code updating a document re-evaluates the assertions after the update and sets the indicator. Subsequent accesses do not have to evaluate the assertions anymore every time, but lookup the indicator only. This reduces processing at the time of access. A possible optimization is that code updating the document only has to re-evaluate the assertions if the update changes the structure and/or the data types.

There might be further approaches to reduce and to optimize assertion execution. However, nothing beats a supervised and enforced schema when structure and/or data type compliance are paramount.


When a NoSQL database does not provide rigorous schema enforcement, the code is on its own to establish document structure and data type compliance when accessing documents. The code has to be very defensive in order to avoid execution failure. Due to the execution effort involved, schema support and strict supervision might be really important for the code accessing documents.

In addition, especially when having to abort multi document processing, multi document transactions are essential when several documents are updated during the processing in order to avoid complex compensation logic. Software engineers have to be super careful with compensation logic that cannot rely on multi document transactions provided by the database.

Go [ JSON | Relational ] SQL!


The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.