SQL for JSON and Schema Support (Part 7): Partial Schemas for JSON Documents

This blog concludes the series on JSON schema support in context of databases with a proposal: partial schemas.

Black and White: Rigorous Schema or No Schema At All

So far the discussion around JSON schema support in databases was a black and white (or a all-or-nothing) discussion, aka, a JSON document can either change at any time or it must comply completely with a rigorous schema (structure, data types and possibly values) at any time.

The any-change-at-any-time situation is the “wild west” in terms of structure and data type variation in the sense that code accessing JSON documents has to “protect” itself by e.g. means of assertions. Code has to assume that any structure can be present and therefore has to introspect each document at every access in order to determine if it can process it. Complex error handling is necessary if the code cannot process a specific JSON document (as discussed in an earlier blog).

More than All-Or-Nothing Approach: Rationale for Reconsideration

The ability to change JSON documents randomly is portrait as a desirable and positive feature by many JSON supporting databases, however, from an engineering perspective this “randomness” is adding significant engineering and coding complexity (and significant effort to quality assurance as well).

Are random (“dynamic”) document structures really the 80% case in real situations? Or is in reality a more nuanced situation at hand?

Many documents implement the content and state of business data (functional data). In most cases there are basic data sets to be supported. For example, purchase orders must have an overall order value field, health records must have a unique record identifier including some basic patient data, user accounts must have a unique account identifier and at least an account login name. There is usually a core set of attributes that represents the commonly agreed set of business data and those are mandatory as well in order for the business data to be meaningful.

Each (business) domain has its own commonly agreed core semantics and therefore commonly agreed data attributes. Code should be able to assume that the core semantics is complied to and that the core attributes are present in every JSON document representing the domain semantics.

So maybe it is not it really all-or-nothing in terms of the dynamic structure of JSON documents all the time. Maybe it would be really useful to be able to specify a schema that applies to (only) a part of a JSON document at any time without enforcing complete compliance: partial schemas.

Partial Schema Rationalization

If a JSON document must be partially compliant, then one part of it must comply to a given partial schema, and it can have any form outside of the part governed by the partial schema.

For example, a partial schema for a purchase order could enforce that a purchase order must have an purchase order identifier, a buyer identifier, a set of line items, and a purchase order amount at any point in time. A purchase order could contain additional supplemental data that are not governed by the schema (the dynamic part).

With a rigorous partial schema in place code is guaranteed that the attributes defined by the schema are present. This makes the executions of assertions unnecessary and code can be sure that all documents are compliant with the partial schema.

Optionally a partial schema could define that some attributes must not always be present, but in case they are present, they have to comply to a given schema.

For example, if the full specification of a buyer is provided, not just the buyer identifier, then it must contain the buyer’s legal name, the legally registered address as well as a phone number that reaches the buyer. Similar for a shipping address: if one is provided, it has to be a full shipping address denoting a physical location.

With such a schema in place, code can check if specific optional items are present, and if they are, their schema is guaranteed, meaning, these items comply with their definitions in the partial schema.

Finally, a partial schema does not govern any structure outside the mandatory or optional items and it is possible for a JSON document to contain additional data that is not enforced in terms of presence or structure.

For example, items not enforced by a schema could be supporting documentation for line items like drawings, notes to shippers asking for specific handling, or a trustworthiness rating of a buyer in a purchase order. A user account could have a history of logins or a health record could contain the list of doctors that were involved in treatments over time.

For data items that are not enforced, code would have to use assertions in order to establish their presence (and their data types).

Partial Schema Specification

How would a partial schema specification look like? In principle, it could be an extension what has been discussed in earlier blogs of this series with a different semantics. First, any schema would be considered partial, meaning, a JSON document can contain data items that are not contained in the schema in addition to those specified by the partial schema.

Second, each specification in the schema has a “mandatory” and an “optional” designation defining the mandatory items, as well as the optional items. The following schema shows the possible extensions. There are two designations, “mandatory” and “optional” that indicate what the partial schema governs.

{
  "type": "object",
  "mandatory": [
    "orderId",
    "orderDate",
    "orderLineItems"
  ],
  "optional": [
    "buyerId"
  ],
  "properties": {
    "orderId": {
      "type": "number",
      "description": "Order Identifier: must be of 
                      type number and is mandatory"
    },
    "orderDate": {
      "type": "string",
      "description": "Order Date: must be of 
                      type string and is mandatory"
    },
    "orderLineItems": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "itemId": {
            "type": "number"
          },
          "numberOrdered": {
            "type": "number"
          }
        }
      },
      "description": "Order Line Items: must be of 
                      type array and is mandatory"
    },
    "buyer": {
      "type": "number",
      "description": "Buyer Identifier: must be of 
                      type number and is optional"
    }
  }
}

The above schema enforces the structure of JSON documents in part and specifies mandatory as well as optional attributes. There are three mandatory attributes and one optional attribute. Such a schema could be added to an identifiable collection of JSON documents ensuring thereby their partial compliance.

A change in a partial schema cannot take place without modifying the JSON documents as well since the JSON documents must be compliant to a partial schema at any time. Again, code is assured compliance of JSON documents with the attributes defined by the partial schema and code therefore does not have to protect itself.

Summary

This conclusion of the blog series on JSON schema support in databases provided a proposal for partial schemas. Partial schemas support structure and data type guarantees for code as well as support the freedom for dynamic schema changes outside the part of a JSON document that has to be compliant with a partial schema.

Following this approach

  • Code is guaranteed that JSON documents comply to a partial schema
  • An application has the freedom to add additional data dynamically not governed by the partial schema
  • Over time, as dynamic structures in a JSON document stabilize, those can be moved into the stable partial schema part of a JSON document.

Partial schemas are a win-win for both, databases supporting JSON documents as well as code accessing these databases.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Advertisement

SQL for JSON and Schema Support (Part 6): Dynamic Schema Change and Symbolic Computation

How does software engineering look like in context of the absence of schema support? What now?

Summary – Where Are We?

The structure of existing JSON objects can change at any time in a NoSQL database (in general). New, possibly by the code so far unseen structures of JSON objects can appear at any time in existing as well as in new documents.

Schema support does not provide any guarantees (see the MongoDB discussion in previous blogs) – neither for current objects’ structure nor for future objects’ structure, unless a NoSQL database enforces the schema as rigorously as a relational database management system (RDBMS).

Most NoSQL database management systems provide zero schema supervision in the strict sense (no guarantee that stored JSON documents are guaranteed to be schema compliant).

A developer cannot make any assumptions whatsoever about the schema of the stored JSON documents (maybe with the exception of the mandatory existence of an identifier property) unless the structure is rigorously enforced by every piece of code inserting or updating documents (in general a hard problem to solve).

Code makes assumptions when accessing JSON documents as it accesses properties at certain locations inside the JSON document and based on a specific data type. Those assumptions about location and type might or might not be true for every single JSON document that the code accesses.

So where are we? Basically, the code is on its own to deal with the structure and data type uncertainty. The following discussion is based on the situation that the underlying NoSQL database does not provide any strict schema guarantees.

Engineering Approaches

If there is no guarantee provided by a database system, the compliance check for each document has to be done by the code via assertions on the required structure and data types of values. Before a document is processed, it has to be asserted that the expected structure and data types exist as otherwise the processing would fail.

The following sections discuss one approach of run-time assertions and the phases of processing:

  • Run-time assertions
  • Assertion execution frequency
  • JSON document non-compliance
  • Non-compliance analysis and mitigation

In addition, alternative approaches to run-time assertions are discussed in order to improve efficiency.

As a side note, if code inserting or updating documents would ensure a specific schema rigorously, then code accessing documents will not have to implement assertions. However, this requires a very strict engineering discipline around current and future code that inserts or updates (including 3rd party tools that support direct database access). In many cases this is impossible, hence the emphasis here on what the accessing code can or has to do. This approach of guarantee by inserting or updating code will not be discussed further here.

Run-Time Assertions

Code accessing a document requires specific properties to be present and to be of a specific data type in order for it to execute its logic.

For example, if code needs to access a property like the value of an order, then the code needs to assert the existence of that property and the acceptable data types. Those might be number or string (as a string value could be converted to a number in the code):

assert(exists_path(order.value))
assert(is_of_type(order.value, number) 
       || is_of_type(order.value, string))

Code must assert expectations before access in order to ensure that the document matches those expectations and does not fail because of document non-conformance. If there is no “value” property in an order, the code would not be able to access is.

The blog https://realprogrammer.wordpress.com/2016/11/29/sql-for-json-rationalization-part-11-json-specific-predicates/ introduced the two essential predicates used above (exists_path() and is_of_type()) in context of NoSQL Query.

Assertion Execution Frequency

Checking assertions affects code efficiency due to additional execution, so how often do assertions have to be checked? In principle, before every document access.

There is a variation between NoSQL databases that provide single document transactions and those providing multi document transactions. In the single document transaction case, documents can change at any time in a NoSQL database and so between two separate accesses, the document could have been changed by other code. Assertions have to be checked every time before accessing a document.

some_function() {
// retrieve document
// assert(...)
// execute logic
}

some_function_2() {
// retrieve document
// assert(...)
// execute logic
}

In case multi document transactions are supported, all assertions can be checked at once when the transaction is started, and does not have to be checked again within the same transaction.

// start transaction

// assert()

some_function() {
// retrieve document
// execute logic
}

some_function_2() {
// retrieve document
// execute logic
}

// end transaction

In general, code can operate on a single document or on a set of documents. A single document scope means that the code reads and possibly changes a single document only. A multi document scope means that the code reads one or more documents and possibly modifies one or more documents. For example, summing up the value of all not yet shipped orders and storing the sum in an analysis document. In either case, assertions have to be checked before every document access.

Executing assertions before every access is quite resource intensive at high processing rates. Later on variations of assertions are discussed that limit the effort and the resource consumption in order to make the use of assertions more practical or more efficient.

Note: MongoDB realized the enormous need for multi document transactions and is planning to implement those: https://www.mongodb.com/transactions.

JSON Document Non-Compliance (Violated Assertions)

If one or more assertions are violated, then the document is non-compliant from the viewpoint of the code accessing it. Assertion violation handling and analysis has to take place and there are different options:

  • The document could be ignored (which might or might not be a good option in context of the code’s functionality) if it is irrelevant or negligible for the code’s computation. The code could record the ignored documents for future off-line analysis.
  • An exception can be thrown and the code aborts its execution. This is a hard failure and ensures that no incorrect result is produced by the code.
  • Error handling can take place like for example the code “parks” the document for future manual analysis and processing. In this case the code produces a result with the disclaimer that there are unprocessed documents that since not processed might affect the code’s accuracy of the result. Compared to the first case the code returns an error or warning highlighting the potential inaccuracy (and in the best case the degree of inaccuracy).

In case of single document processing the handling is easier compared to multi document processing. In the latter case, the code might have updated several documents before reaching a non-compliant one. If the handling is to abort execution, then the already updated documents might have to be reverted back (compensation logic). If there are multi document transactions available the database takes care of reverting the documents as part of rolling back the transaction.

If multi document transactions are not supported alternative approaches have to be found. For example, the code could remember the previous document state itself for each document updated and therefore would be able to revert the state (code based compensation). However, those documents could have been changed in the meanwhile, adding further complexity to the compensation logic.

Non-Compliance Analysis and Mitigation

When code encounters non-compliant documents then mitigation has to be determined so that this case will not happen in the future anymore. There are several possible approaches (not necessarily complete):

  • Based on the non-compliant document analysis, the code is changed or extended to be able to process those cases as well. This would change the assertions so that in the future the documents would be considered compliant.
  • The code that inserts or updates the documents in the first place has to change in order to ensure that the assertions of subsequent accesses will be met. Code accessing the documents will have assurance that this non-compliance will not happen anymore.
  • It is acceptable to have manual post-processing. Instead of changing the code and its assertions, it is accepted that manual work is performed in order to produce an accurate result. This might require updating the result manually (like the total amount of all open orders).
  • It is acceptable to ignore non-compliant documents. This is a valid analysis result as well if the logic of the code allows the result to be based only on a subset of the documents.

Different approaches to mitigation have a different impact on the processing time as well as the availability of the improved code base. Of course, avoiding non-compliance in the first place would be best.

Thought Experiments on Alternative Assertion Implementation

Are there possible ways to reduce the amount of assertion processing? In the following approaches are discussed as thought experiment.

The first approach is shadow schema-per-document management for documents. A schema is not declared upfront and enforced by the database, but derived from a document right after its creation. Each document has its own schema and every time the document is updated, its schema is updated as required in order to remain in sync with the document itself.

Code accessing a document can check the document’s schema first before retrieving the document in order to determine if the assertions are met. The assertions are in this case expressed as schema interrogation instead of document structure access. Every time the schema is adjusted to match a document update, its version is increased.

This by itself does not have a huge saving in terms of access, possibly worse, as the schema has to be retrieved in addition to the document itself. However, in the case where many documents have the same schema, the code accessing documents could know which schemas are satisfying the assertions. So just knowing the schema identifier and its version might be sufficient to establish the compliance check. In addition, if a document contains a reference to its schema and version, then verifying assertions is simply comparing a schema name and version. This is highly efficient as the code can keep a list of compliant schemas and their versions (assuming many documents comply to the same schema).

A second and very different approach is to add assertions to the document itself and an indicator for each assertion if it is satisfied or not within the given document. Code accessing a document the first time checks its assertions, and also adds the assertions to the document. Code updating a document re-evaluates the assertions after the update and sets the indicator. Subsequent accesses do not have to evaluate the assertions anymore every time, but lookup the indicator only. This reduces processing at the time of access. A possible optimization is that code updating the document only has to re-evaluate the assertions if the update changes the structure and/or the data types.

There might be further approaches to reduce and to optimize assertion execution. However, nothing beats a supervised and enforced schema when structure and/or data type compliance are paramount.

Summary

When a NoSQL database does not provide rigorous schema enforcement, the code is on its own to establish document structure and data type compliance when accessing documents. The code has to be very defensive in order to avoid execution failure. Due to the execution effort involved, schema support and strict supervision might be really important for the code accessing documents.

In addition, especially when having to abort multi document processing, multi document transactions are essential when several documents are updated during the processing in order to avoid complex compensation logic. Software engineers have to be super careful with compensation logic that cannot rely on multi document transactions provided by the database.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 5): Intermezzo 3 – MongoDB’s $jsonschema

The previous blog discussed MongoDB’s $jsonschema behavior with a strict validation level. Let’s look at the moderate validation level in this blog.

Example

As usual, first, let’s create a collection and add a few JSON documents to it. Afterwards a schema validation is added with the moderate setting (the following is based on MongoDB version 3.6.1).

> mongo
> use moderate_exploration

Initially, before adding a schema, two JSON objects are inserted that are not compliant with the schema that is going to be added afterwards. The reason is that we need non-compliant JSON objects to discuss the moderate level later.

> db.orders.insert({
   "orderId": 1,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 20
    }, {
    "itemId": 56,
    "numberOrdered": 21
   }],
   "specialInstructions": "Drop of in front, 
                           not back of location"
  })
WriteResult({ "nInserted" : 1 })
> db.orders.insert({
   "orderId": 2,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 40
    }, {
    "itemId": 56,
    "numberOrdered": 41
   }],
   "preferredColor": "red"
  })
WriteResult({ "nInserted" : 1 })

Now the schema is added:

> db.runCommand({ 
   "collMod": "orders",
   "validator": {  
    "$jsonSchema": {   
      "bsonType": "object",
       "required": ["orderId", "orderDate", "orderLineItems"],
       "properties": {
        "orderId": { 
         "bsonType": "int",
         "description": "Order Identifier: must be of 
                         type int and is required"
        },
        "orderDate": { 
         "bsonType": "date",
         "description": "Order Date: must be of 
                         type date and is required"
        },
        "orderLineItems": { 
         "bsonType": "array",
         "items": {  
          "bsonType": "object",
          "properties": {   
           "itemId": {    
           "bsonType": "int"   
           },
           "numberOrdered": {    
           "bsonType": "int"   
           }  
          } 
         },
         "description": "Order Line Items: must be of 
                         type array and is required"
      }   
     }  
    } 
   },
   "validationLevel": "moderate",
   "validationAction": "error"
  })
{ "ok" : 1 }

After the schema is added, two more JSON objects are inserted, this time being schema compliant.

> db.orders.insert({
   "orderId": NumberInt(3),
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": NumberInt(55),
    "numberOrdered": NumberInt(60)
    }, {
    "itemId": NumberInt(56),
    "numberOrdered": NumberInt(61)
   }]
  })
WriteResult({ "nInserted" : 1 })
> db.orders.insert({
   "orderId": NumberInt(4),
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": NumberInt(55),
    "numberOrdered": NumberInt(80)
    }, {
    "itemId": NumberInt(56),
    "numberOrdered": NumberInt(81)
   }]
  })
WriteResult({ "nInserted" : 1 })

At this point the created collection is governed by a schema, and contains four JSON documents, two are compliant with the schema (orderId 3 and 4), and two are not compliant (orderId 1 and 2).

Analysis

The MongoDB documentation states for “moderate”: “Apply validation rules to inserts and to updates on existing valid documents. Do not apply rules to updates on existing invalid documents.” (https://docs.mongodb.com/manual/reference/command/collMod/#validationLevel).

Let’s explore now the behavior of the moderate validation level.

First, let’s try to insert a non-compliant JSON document. The insert will fail as expected:

> db.orders.insert({
   "orderId": 5,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 40
    }, {
    "itemId": 56,
    "numberOrdered": 41
   }],
   "preferredColor": "red"
  })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Second, let’s try to update a compliant JSON document that already exists in the collection in a non-compliant way:

> db.orders.update({  
   "orderId": NumberInt(3) 
   }, {  
   "$set": {   
    "orderDate": "2018-01-09"  
   } 
  })

As expected the update fails:

WriteResult({
 "nMatched" : 0,
 "nUpserted" : 0,
 "nModified" : 0,
 "writeError" : {
  "code" : 121,
  "errmsg" : "Document failed validation"
 }
})

Third, let’s try to update a non-compliant JSON document

> db.orders.update({  
   "orderId": NumberInt(1) 
   }, {  
   "$set": {   
    "orderDate": "2018-01-10"  
   } 
  })

As per the above explanation of moderate this should work and indeed it does:

WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

Bypassing Validation

With the correct permission (https://docs.mongodb.com/manual/reference/privilege-actions/#bypassDocumentValidation) it is possible to bypass document validation.

This allows for the situation that e.g. a collection is governed by a new schema, however, existing application code might have to continue to insert or to update documents with a structure that violates the new schema as the logic cannot be adjusted to the new schema quickly enough (including transforming the non-compliant to compliant JSON documents).

Summary

The brief analysis of MongoDB wrt. document validation in context of JSON schemas added to collections in the last three blogs showed that while schema supervision is possible, it is not as strict as in relational database management systems.

Basically, if a schema is present, a user cannot infer that all documents in that collection comply to that schema. A schema related to a collection can be changed, and existing documents that would violate the new schema on insert will not be discarded from the collection. Furthermore, properties that are not covered by the schema can be added and changed freely.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 4): Intermezzo 2 – MongoDB’s $jsonschema

After some initial exploration in the previous blog, more aspects on MongoDB’s $jsonschema are looked at in the following.

Example

First, let’s create a collection as follows. It is governed by a schema, and validation is in the strictest setting (the following is based on MongoDB version 3.6.0).

> mongo
> use more_exploration
> db.createCollection("orders", {
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "object",
       "properties": {
        "itemId": {
         "bsonType": "int"
        },
        "numberOrdered": {
         "bsonType": "int"
        }
       }
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  },
  "validationLevel": "strict",
  "validationAction": "error"
 })
{ "ok" : 1 }

The two documents from the example outlined in the initial blog of series are added next.

> db.orders.insert({
   "orderId": NumberInt(1),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(20)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(21)
    }
   ]
  })
WriteResult({ "nInserted" : 1 })
> db.orders.insert({
   "orderId": NumberInt(2),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(30)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(31)
    }
   ]
  })
WriteResult({ "nInserted" : 1 })

Insert Strictness and Partial Schema Coverage

The validator is in place on the collection “orders”. This can be verified with the command

> db.getCollectionInfos({name: "orders"})

Now let’s try and add a document that has additional properties in addition to those that comply with the schema as follows:

> db.orders.insert({
   "orderId": NumberInt(3),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(40)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(41)
    }
   ],
   "preferredColor": "red"
  })
WriteResult({ "nInserted" : 1 })

It appears that as long as the schema is satisfied, additional properties can be inserted. So the schema is not completely covering the object to be inserted, but only those properties that are defined in the schema (validator). It is a partial schema coverage.

Here is the counter example: the value of the property “orderLineItems” is not in compliance, and so the insertion fails:

> db.orders.insert({
   "orderId": NumberInt(4),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": ["b", "g"],
   "preferredColor": "red"
  })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Update Strictness and Partial Schema Coverage

The following updates an existing document:

> db.orders.update({
   "orderId": NumberInt(2)
  }, {
   "$set": {
    "orderDate": new Date("2017-10-01")
   }
  })
WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

In part 1 of this blog series the order with identifier 1 was updated to add a property “specialInstructions”. This is not schema compliant, however, the update is possible as it does not violate that part of the document that is covered by the schema.

> db.orders.update({
   "orderId": NumberInt(1)
   }, {
   "$set": {
    "specialInstructions": "Drop of in front, 
                           not back of location"
   }
  })
WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

Partial schema coverage applies to update as well, not just to inserts.

An example of a non-compliant update is the following:

> db.orders.update({
   "orderId": NumberInt(2)
  }, {
   "$set": {
    "orderDate": "2017-09-30"
   }
  })
WriteResult({
 "nMatched": 0,
 "nUpserted": 0,
 "nModified": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Summary

MongoDB supports partial schema coverage in strict mode, meaning, properties defined in the schema must match the schema, however, properties not specified in the schema can be added or modified without rejection.

This means (again) that examining the JSON schema validator of a MongoDB collection only indicates properties common to all documents, but not the complete set of properties of all documents.

The next blog examines the non-strict validation setting of a JSON schema validator in MongoDB.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 3): Intermezzo 1 – MongoDB’s $jsonschema

MongoDB introduced support for JSON Schema through $jsonschema. Let’s explore this new functionality a bit in this blog.

$jsonschema

The functionality is introduced here: https://docs.mongodb.com/master/reference/operator/query/jsonSchema/#op._S_jsonSchema It states “$jsonSchema can be used in a document validator, which enforces that inserted or updated documents are valid against the schema.”

A first item to note is that this approach is supporting BSON types (http://bsonspec.org/, https://docs.mongodb.com/master/reference/operator/query/type/), not just JSON structures (https://www.json.org/), using a specific property “bsonType” that is not part of the JSON Schema standard (http://json-schema.org/).

A second observation is that the schema specification is inline with the collection creation and cannot refer to a separate JSON schema file or JSON object representing a JSON schema.

JSON Schema Validator Example

Let’s use the example of the first blog in this series, create a schema for it and use that as a constraint for the “orders” collection. Then documents are added to the collection (and there seem to be errors as well). For reference the version used is: MongoDB server version: 3.6.0.

> mongo
> use schema_exploration
> db.createCollection("orders", {
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "string"
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  }
 })
{ "ok" : 1 }

A quick note: “bsonType” can be used in all levels in order to refer to BSON types, not just on the top level.

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": [{
   "itemId": 55,
   "numberOrdered": 30
  }, {
   "itemId": 56,
   "numberOrdered": 31
  }]
 })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Along the way I ran into a validation issue as I constraint the array elements to strings, rather than objects, as used in the example of the first blog in this series. So I made a schema definition mistake.

To note is that the response on the shell does not indicate what the problem was making debugging hard, especially when large and complex schemas are to be debugged.

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": ["a", "b"]
 })
WriteResult({
 "nInserted": 1
})

Once I realized the mistake I made, I inserted a document complying to the schema in order to make sure I identified the issue correctly.

JSON Schema Validator Update

Obviously, after defining a wrong schema, the correct schema should be used as validator.

This is the correct schema:

{
 "bsonType": "object",
 "required": ["orderId", "orderDate", "orderLineItems"],
 "properties": {
  "orderId": {
   "bsonType": "int",
   "description": "Order Identifier: must be of 
                  type int and is required"
  },
  "orderDate": {
   "bsonType": "date",
   "description": "Order Date: must be of 
                  type date and is required"
  },
  "orderLineItems": {
   "bsonType": "array",
   "items": {
    "bsonType": "object",
    "properties": {
     "itemId": {
      "bsonType": "int"
     },
     "numberOrdered": {
      "bsonType": "int"
     }
    }
   },
   "description": "Order Line Items: must be of 
                  type array and is required"
   }
  }
 }

And this is the command to update the validator:

> db.runCommand({
  "collMod": "orders",
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "object",
       "properties": {
        "itemId": {
         "bsonType": "int"
        },
        "numberOrdered": {
         "bsonType": "int"
        }
       }
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  },
  "validationLevel": "strict"
 })
{ "ok" : 1 }

Some background on the command used is here: https://docs.mongodb.com/master/reference/command/collMod/.

Following is an attempt to add one more of the (now mismatching) documents:

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": ["a", "b"]
 })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

As it should be, the insert fails.

And here the insert of a now correct document:

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": [{
   "itemId": NumberInt(55),
   "numberOrdered": NumberInt(20)
  }, {
   "itemId": NumberInt(56),
   "numberOrdered": NumberInt(21)
  }]
 });
WriteResult({
 "nInserted": 1
})

Collection Inconsistency: Mismatch of Schema and Documents

There is an interesting issue appearing at this point. The new schema does not match all existing documents in the collection. Or the other way around: the collection now contains documents that do not match that schema.

> db.orders.find()
{
 "_id": ObjectId("5a2022c3fb460d15db9ec73e"),
 "orderId": 1,
 "orderDate": ISODate("2017-09-30T00:00:00Z"),
 "orderLineItems": ["a", "b"]
} {
 "_id": ObjectId("5a202322fb460d15db9ec741"),
 "orderId": 1,
 "orderDate": ISODate("2017-09-30T00:00:00Z"),
 "orderLineItems": [{
  "itemId": 55,
  "numberOrdered": 20
 }, {
  "itemId": 56,
  "numberOrdered": 21
 }]
}

MongoDB did not flag that there are documents in the collection that will not match the new schema (even though the validation level strict was used).

Adding a validation action with value of “error” does not change the situation, either.

Implication to Semantics

Given that the schema of a collection can be changed at any time, and given that MongoDB does not fail the schema update based on mismatching documents already in the collection, examining the schema is insufficient to understand the structure of the documents in a collection.

So a collection with a schema does not ensure that all documents in that collection are schema compliant. It rather insures that from the point in time the schema was added or updated documents will have to comply. Previous documents in the collections are not affected.

Summary

The notion of “schema” in context of MongoDB is very different from the notion of “schema” in context of relational database management systems. In MongoDB the documents in the collection do not have to comply to the schema; they only do have to comply at time of insertion.

There will be more exploration coming up in the next blog on this topic in order to further understand the semantics of “schema” in context of MongoDB.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 2): Where does the “Interesting” Code go?

The previous blog found that the “generic” indirect representation of JSON data is one way of supporting “schema-free” JSON objects or documents. Where does the “interesting” functional code live?

Indirect Representation

To recap, the indirect representation is a set of classes, functions, etc. (depending on programming language) that can manage JSON objects or JSON documents. All or most languages have libraries supporting JSON manipulation. For example, Jackson is such a library for Java.

These JSON libraries can manage any valid JSON structure, and they do not require a schema or the JSON objects being homogeneous. Two JSON objects representing the same concept like an order with different attributes (as shown in the previous blog) can be managed by such JSON libraries.

Structural Manipulation

Structural manipulation of JSON objects supports the addition, update or deletion of properties (members) as well as JSON array elements. Property values can be replaced, for example, a JSON string with a JSON object.

Through structural manipulation it is possible to change a JSON object as needed, when e.g. new details appear in form of additional properties.

Structural manipulation was demonstrated in a database context in the last blog: properties were added through the update statement. The same is possible in the indirect representation libraries in the various programming languages.

Computation

Structural manipulation is not the only code that is required as structural manipulation does not allow to compute any specific application semantics. For example, in context of orders, the total value of not yet shipped orders might be a value that needs to be computed.

In a database context this would be an aggregation query that sums up the amount of all orders that do not have the status of shipped.

In context of a programming language it would require a function that iterates through all orders and, like in the database aggregation approach, adds up the sum of those orders that have not shipped yet.

It probably would be implemented as a set of cooperating functions, like

DollarAmount getValueOfOrdersNotShipped(JSONArray orders)
boolean hasOrderShipped(JSONObject order)
DollarAmount getValueOfOrder(JSONObject order)

JSONArray as well as JSONObject are an example of an indirect representation holding order data as a JSON structure.

Note: of course, in the absence of a schema (which is assumed here), there is no assurance that the JSONArray or the JSONObject contain only orders or that the orders are homogeneous in structure. There has to be “trust” that this is indeed the case.

If validation is desired, and if no schema is available, then the only alternative is validating values in one or more JSON object properties. For example, order identifiers might be of a specific structure that uniquely identifies an identifier being an order identifier. This would require trust that the algorithms creating identifiers are correct.

Separation of Manipulation and Computation

The JSON libraries supporting the indirect representation are separate from the functional code (like the summing up of order values). The software architecture and design has to structure this separation and ideally ensures that all functions concerned with orders are “close” from a code structure or software architecture perspective.

There might be functions that can be reused across different concepts (like orders, returns, shipments, etc.), and they can be refactored out, of course, as in “normal” functional code.

Given the above rationalization, how does the absence of a schema come into the picture?

Implication of Schema Free JSON Objects

Since there is no schema, JSON objects can have a different structure even though they represent the same concepts. In context of orders,  let’s look at two use cases:

  • An order does not have a shipping status
  • An order does have a value but in a variety of data types

In a world without schema these are possible use cases and the functional code needs to check for those.

Addressing the first use case can be accomplished by checking for existence. Code can check if a property is present and react accordingly. In the above example, the code designer can choose to have hasOrderShipped() return false or throw an error in case there is no shipping status.

The second use case can be addressed by checking for the type of the value of the order. If possible, value transformations can be implemented in getValueOfOrder(), e.g., string to number; if it is not possible to transform, an error can be thrown.

Summary

In a schema free JSON context there are several aspects from a code perspective: functional code implementing application semantics is separate from the code that manages the structure of JSON objects. That separation must be carefully managed from an architectural perspective.

The functional code must anticipate non-homogeneous JSON objects and check for variation in order to be able to implement the functionality accurately.

But wait, there is more:-) The next blog will venture into more nuances.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 1): Preliminaries

Missing schema support and schema enforcement is touted as a good thing: is it really?

“Schema Free”, “Flexible Schema”, “Schema Per Document”

What is meant by a database supporting the JSON data structure without providing support and enforcement for schemas? Such a database is often characterized as “schema free”, or supporting a “flexible schema” or “schema per document”. What does it mean?

No matter how such a database supporting JSON it is labelled, it does not provide an interface to define, to manage or to enforce schema(s) for the data, aka JSON documents, it is managing (“enforcing” is used in the semantics a relational database enforces a schema). This means that a client (e.g., application code) can store JSON documents that have any form as long as those comply to the JSON (syntax) standard (and possibly proprietary extensions by the database system).

It furthermore means that JSON documents representing instances of the same concepts (like e.g. orders or games or employees) do not have to have the same structure. Those JSON documents can be different from each other, not only in values, but also in structure.

An example follows of a possible scenario (using MongoDB).

Example

The example stores initially two documents that have the same structure, and subsequently their structure diverges through updates. No schema enforcement prevents the changes.

use blog;
db.blogColl.insert({
  "orderId": 1,
  "orderDate": "9/30/2017",
  orderLineItems: [{
      "itemId": 55,
      "numberOrdered": 20
    },
    {
      "itemId": 56,
      "numberOrdered": 21
  }]
});
db.blogColl.insert({
  "orderId": 2,
  "orderDate": "9/30/2017",
  orderLineItems: [{
      "itemId": 55,
      "numberOrdered": 30
    }, 
    {
      "itemId": 56,
      "numberOrdered": 31
  }]
});
db.blogColl.update({
    "orderId": 1
  }, {
    "$set": {
    "specialInstructions": 
      "Drop of in front, not back of location"
  }
});
db.blogColl.update({
    "orderId": 2,
    "orderLineItems.itemId": 55
  }, {
    $set: {
      "orderLineItems.$.color": "transparent"
  }
});

The ability to store different JSON documents with different structures, even if they represent (instances of) the same concept, can be seen as a powerful feature. It allows modifying the data as needed to represent changing requirements or specific representation needs. Data migration is easier, too, as data can be changed in place.

This flexibility also has downsides and (engineering) cost that need to be considered and dealt with in a concrete implementation.

Application Implementation

Application code accessing a database has a full or partial representation of the data it queries (or in general manages) in the type system of the deployed programming language.

There are basically two choices an application (short for application code) has to represent data:

  • Direct representation
  • Indirect representation

In a direct representation the concept as stored in the database is defined as data structure in the programming language. For example, using Java as the programming language example, an order is represented as a Java class “Order”. This class has all the methods required to access the various elements of an order (that might be implemented as Java classes themselves). In this approach an order stored in the database, when queried, will be managed as an instance of the Java class Order in the application code. Methods support access or modification to the instance of order, and the methods are order semantics specific, like getOrderDate() or updateOrderLineItem() or totalNumberInidividualItems().

In the indirect representation, an order would be represented not as instance of a Java class that reflects the concept, but an instance of a “meta” Java class. This “meta” class is able to store all data from the database, not just orders. Such a class would have methods like createInstance(), setIdentifier(), setType(), addAttribute(), etc. A type would be “Order”, an attribute would be “lineItem”, etc.

Consequences Of Choice

Given the two representations discussed earlier (direct and indirect), the choice seems to be clear. While the direct representation can capture the semantics of a concept directly (aka, a Java class “Order” can implement order specific methods), this approach would not be able to easily (or at all) deal with changes in the database representation of orders. For example, if an additional attribute is added to the JSON document representing an order (as shown above), the Java class would not be able to change dynamically and capture it.

The indirect representation, however, would not have any problems representing order JSON objects with different structure as e.g. attributes can be dynamically added. This means that instances of the “meta” Java class can represent any JSON document as stored in the database.

These “meta” Java classes are actually already available in form of JSON processing libraries. Such libraries support the creation of JSON structure representations and they can represent any JSON object (or JSON array) that implements a correct JSON syntax.

Summary

At a first glance it looks like databases that support JSON without enforcing a schema at the same time are a good choice for ease of data management. And, using the indirect representation approach applications can deal with dynamically changing JSON objects or JSON objects of different structure representing the same concept.

However, as always, there are more details to discuss and additional aspects are going to be examined in the next blog.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

NoSQL Databases: Data First, Schema Second? Or Vice Versa?

When using NoSQL databases, the notion of ‘schema’ enters the picture sooner or later. But when is the best time? And what to do about it?

Global vs. Local vs. Mixed Document Schema

One of the first distinctions is a global vs. local vs. mixed schema. A global schema is a schema that is defined for a given set of documents, e.g., a collection or table of documents. Every document in the collection must comply to the schema defined for the collection.

A local schema is a schema for a single document. Every document can have its own schema. It is possible that several documents follow the same schema. However, those are in general not grouped based on their schema.

A mixed schema is in part a global schema, and in part a local schema. This means that a document must contain certain properties as defined by the global schema, and the local schema allows additional schema elements on a per-document basis.

In terms of system examples:

  • Oracle NoSQL [http://www.oracle.com/us/products/database/nosql/] follows the local schema approach.
    • Documents in Oracle NoSQL are grouped by keys. Each document can have its own schema, aka, a local schema.
  • MongoDB [http://www.mongodb.org/] follows the mixed schema approach.
    • There is one property that must be present: ‘_id’ in all documents across all MongoDB collections and it must be unique. In this sense, MongoDB does not follow a pure local schema approach as one mandatory property is specified globally.
  • FoundationDB [https://foundationdb.com/] follows the global schema approach.
    • FoundationDB implements the concept of ‘table groups’ and supports query results to be serialized as JSON objects. However, from the viewpoint of the data model, it is relational and the hierarchical structure (aka, sub-documents) comes into play through foreign keys and SQL extensions that have been explored a long time ago in context of NF2 relations.
  • Oracle 12c [http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246] follows the local schema approach.
    • A JSON document is stored in a column of a table. That column is completely schema-free so that JSON documents of any schema can be stored and hence the schema is local.

Explicit vs. Implicit Document Schema

A second important distinction is an explicit vs. an implicit schema (or extensional vs. intentional schema). An extensional schema is defined through a schema representation format (e.g., Avro [http://avro.apache.org/]) or SQL-style DDL statements. An intentional schema does not have a separate representation, but it can be derived from the structure of a document instance (not always unambiguously).

In terms of system examples:

  • Oracle NoSQL: supports both, explicit and implicit schemas
  • MongoDB: implicit schema
  • FoundationDB: explicit schema
  • Oracle 12c: implicit schema

From a different perspective an implicit schema means that a document can be stored as it is without having to define a schema for it and without checking that it conforms to a schema. An explicit schema requires the schema to be defined and that documents are compliant before they can be stored successfully.

The ‘edge case’ is MongoDB that enforces the property ‘_id’ in every document. If it is not present, it will be automatically added. While MongoDB supports mainly an implicit schema, ‘_id’ is the exception.

Mandatory Document Schema

A third important distinction is the requirement for a mandatory schema before documents can be stored. If the database requires a schema, documents cannot be inserted before the schema is specified. If a schema is not mandatory, documents can be stored without having a schema in place.

In terms of system examples:

  • Oracle NoSQL: no mandatory schema
  • MongoDB: no mandatory schema
  • FoundationDB: requires a mandatory schema
  • Oracle 12c: no mandatory schema

Roles of Database Users

Even though one could get the impression, not everybody using a database in his or her role necessarily likes a document database where every document can have its own schema (schema-less, or better, schema-varying database). Some users do, some users don’t. For the sake of discussion, let’s distinguish two roles in this blog:

  • Data Collector. A data collector is a role for collecting data initially. A data collector determines interesting data to store and that data might or might not be used downstream for further processing. However, it is important for some data to be collected in case it becomes important down the road. And it is important to store data unmodified as the rules of modification (e.g., cleansing, transformation) might not be known at the time of the data collection.
  • Data User. A data user fundamentally applies all CRUD (create, read, update, delete) operations on a data set and in most cases through an application system that implements the business logic as well as the business rules. The data user is familiar with the business logic and the business rules in context of the application domain, like a financial application or a forecasting tool.

A data user can be a data collector also as the C (create) function creates data. In this case a data user can be a data collector as well.

Schema First or Schema Second?

From the viewpoint of the two roles, schema management plays an important role. In a black-and-white categorization, the two roles have the following desires:

  • Data Collector. A data collector’s goal is to collect relevant or potentially relevant data. Depending on the data sources, there might not be time or opportunity to define a schema first, and it might not be feasible to maintain a schema in the long run (including schema migration). A data collector therefore likes
    • local schema – implicit schema – not mandatory schema (‘schema second’)
  • Data User. A data user has to accomplish work and prefers clear-cut business rules and business functionality so that the application semantics is clear. A data user therefore likes
    • global schema – explicit schema – mandatory schema (‘schema first’)

The data collector stores data into a data collector database, and the database of the data user is called data user database. Both databases can be the same, or they can be two different databases, as discussed later.

A data collector, after having collected data, might want to query it for e.g. statistical and analytical purposes (e.g., how much data was collected, how many documents have a reference to a product description, etc.). So a data collector would prefer to have a schema after the collection of data in support of query formulation; so schema second.

A data user rarely operates on data directly, however, software engineers have to implement the business logic and business rules. Software engineers, for sure, enjoy certainty when it comes to a schema as variations cause significant code complexity in the general case; so schema first.

Co-existence of Schema First and Schema Second?

In the ideal case, both approaches, schema first and schema second are supported at the same time. This would make both roles happy and support their particular use cases. Let’s explore a few options:

  • ETL (extract – transform – load) from data collector database to data user database
    • This approach suggests an explicit transformation step that extracts data from the data collector database and adds it to the data user database. Along the way data type transformations can take place as well as handling of null values, absent properties, and other data modeling specifics. The extraction can be partial so that only relevant data are extracted. However, data duplication (at least partial) is one downside, amongst others.
  • View on data collector database
    • This approach creates a view through which access is provided on a single database (data collector and data user combined). A view could deal with the various transformation tasks; however, updates and deletions might be difficult or even impossible. So this is a potential solution only for the case of read access.
  • Automatic schema extraction
    • A schema can be derived from an document. It is therefore possible that for the data collector database the set of all possible schema can be made available to the data user. If the data user creates a super-set then a global schema is available. Of course, for a given document the schema only partially applies and the business functionality and business rules have to be aware of this. In this case also, the data collector and data user database are the same.
  • Intelligent ORM Layer
    • An ORM layer could provide the impression of a fixed document structure towards the data user, while being able to deal with the heterogeneous document schemas internally. If the ORM layer is flexible enough, it can provide updates as well as delete functionality, and if necessary, an extension mechanism to add custom code in order to make the delete or update functionality specific to the given document set. In this case also, the data collector and data user database are the same.

From an implementation perspective an ORM layer seems to be a practical approach as it allows to separate the transformation and update/delete logic from the application logic, while operating on a single database. However, every data access has to execute some transformation logic in general.

If space is of secondary concern or if the data set for the data user is a lot smaller than that of the data collector, the ETL approach might be preferable as the transformation logic is separate from the data access logic of the application systems.

Automatic schema extraction is certainly helpful in all cases as the schema has to be known in order to implement the ORM layer or the ETL component. If NoSQL databases start implementing a view mechanism then this might be preferable for read-only access situations.

Schema-Varying Languages?

It is, of course, tempting to ask if there is an easy and elegant way to deal with local schemas in application systems? Is is possible to write an application system that does not require a schema in the first place?

While this is a huge topic on its own, server-side JavaScript might be a good place to start as the language is not based on a class/instance paradigm, but prototype approach. It’s type system is almost equivalent with JSON. The language, therefore, is able to represent documents with local schema easily and effortlessly. Since JavaScript can introspect objects and since it implements the prototype mechanism it is possible to represent local functionality for documents with a local schema. One of the bigger questions is how to represent this flexibility to the end user on user interfaces in an ergonomic way.

However, this is a discussion on its own and I’ll save it for a later blog.

Summary

When is a good time to deal with the notion of ‘schema’ in NoSQL database projects? As the discussion has shown, different databases provide different schema support and different users look for schema support at different point in the data life cycle, if at all.

So unless schema is completely irrelevant in your project (and will stay irrelevant for sure), the discussion cannot start early enough because depending on its importance it might influence the database selection as well as the overall product architecture and implementation effort around schema maintenance and enforcement.