SQL for JSON and Schema Support (Part 7): Partial Schemas for JSON Documents

March 29, 2018 by chbussler

This blog concludes the series on JSON schema support in context of databases with a proposal: partial schemas.

Black and White: Rigorous Schema or No Schema At All

So far the discussion around JSON schema support in databases was a black and white (or a all-or-nothing) discussion, aka, a JSON document can either change at any time or it must comply completely with a rigorous schema (structure, data types and possibly values) at any time.

The any-change-at-any-time situation is the “wild west” in terms of structure and data type variation in the sense that code accessing JSON documents has to “protect” itself by e.g. means of assertions. Code has to assume that any structure can be present and therefore has to introspect each document at every access in order to determine if it can process it. Complex error handling is necessary if the code cannot process a specific JSON document (as discussed in an earlier blog).

More than All-Or-Nothing Approach: Rationale for Reconsideration

The ability to change JSON documents randomly is portrait as a desirable and positive feature by many JSON supporting databases, however, from an engineering perspective this “randomness” is adding significant engineering and coding complexity (and significant effort to quality assurance as well).

Are random (“dynamic”) document structures really the 80% case in real situations? Or is in reality a more nuanced situation at hand?

Many documents implement the content and state of business data (functional data). In most cases there are basic data sets to be supported. For example, purchase orders must have an overall order value field, health records must have a unique record identifier including some basic patient data, user accounts must have a unique account identifier and at least an account login name. There is usually a core set of attributes that represents the commonly agreed set of business data and those are mandatory as well in order for the business data to be meaningful.

Each (business) domain has its own commonly agreed core semantics and therefore commonly agreed data attributes. Code should be able to assume that the core semantics is complied to and that the core attributes are present in every JSON document representing the domain semantics.

So maybe it is not it really all-or-nothing in terms of the dynamic structure of JSON documents all the time. Maybe it would be really useful to be able to specify a schema that applies to (only) a part of a JSON document at any time without enforcing complete compliance: partial schemas.

Partial Schema Rationalization

If a JSON document must be partially compliant, then one part of it must comply to a given partial schema, and it can have any form outside of the part governed by the partial schema.

For example, a partial schema for a purchase order could enforce that a purchase order must have an purchase order identifier, a buyer identifier, a set of line items, and a purchase order amount at any point in time. A purchase order could contain additional supplemental data that are not governed by the schema (the dynamic part).

With a rigorous partial schema in place code is guaranteed that the attributes defined by the schema are present. This makes the executions of assertions unnecessary and code can be sure that all documents are compliant with the partial schema.

Optionally a partial schema could define that some attributes must not always be present, but in case they are present, they have to comply to a given schema.

For example, if the full specification of a buyer is provided, not just the buyer identifier, then it must contain the buyer’s legal name, the legally registered address as well as a phone number that reaches the buyer. Similar for a shipping address: if one is provided, it has to be a full shipping address denoting a physical location.

With such a schema in place, code can check if specific optional items are present, and if they are, their schema is guaranteed, meaning, these items comply with their definitions in the partial schema.

Finally, a partial schema does not govern any structure outside the mandatory or optional items and it is possible for a JSON document to contain additional data that is not enforced in terms of presence or structure.

For example, items not enforced by a schema could be supporting documentation for line items like drawings, notes to shippers asking for specific handling, or a trustworthiness rating of a buyer in a purchase order. A user account could have a history of logins or a health record could contain the list of doctors that were involved in treatments over time.

For data items that are not enforced, code would have to use assertions in order to establish their presence (and their data types).

Partial Schema Specification

How would a partial schema specification look like? In principle, it could be an extension what has been discussed in earlier blogs of this series with a different semantics. First, any schema would be considered partial, meaning, a JSON document can contain data items that are not contained in the schema in addition to those specified by the partial schema.

Second, each specification in the schema has a “mandatory” and an “optional” designation defining the mandatory items, as well as the optional items. The following schema shows the possible extensions. There are two designations, “mandatory” and “optional” that indicate what the partial schema governs.

{
  "type": "object",
  "mandatory": [
    "orderId",
    "orderDate",
    "orderLineItems"
  ],
  "optional": [
    "buyerId"
  ],
  "properties": {
    "orderId": {
      "type": "number",
      "description": "Order Identifier: must be of 
                      type number and is mandatory"
    },
    "orderDate": {
      "type": "string",
      "description": "Order Date: must be of 
                      type string and is mandatory"
    },
    "orderLineItems": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "itemId": {
            "type": "number"
          },
          "numberOrdered": {
            "type": "number"
          }
        }
      },
      "description": "Order Line Items: must be of 
                      type array and is mandatory"
    },
    "buyer": {
      "type": "number",
      "description": "Buyer Identifier: must be of 
                      type number and is optional"
    }
  }
}

The above schema enforces the structure of JSON documents in part and specifies mandatory as well as optional attributes. There are three mandatory attributes and one optional attribute. Such a schema could be added to an identifiable collection of JSON documents ensuring thereby their partial compliance.

A change in a partial schema cannot take place without modifying the JSON documents as well since the JSON documents must be compliant to a partial schema at any time. Again, code is assured compliance of JSON documents with the attributes defined by the partial schema and code therefore does not have to protect itself.

Summary

This conclusion of the blog series on JSON schema support in databases provided a proposal for partial schemas. Partial schemas support structure and data type guarantees for code as well as support the freedom for dynamic schema changes outside the part of a JSON document that has to be compliant with a partial schema.

Following this approach

Code is guaranteed that JSON documents comply to a partial schema
An application has the freedom to add additional data dynamically not governed by the partial schema
Over time, as dynamic structures in a JSON document stabilize, those can be moved into the stable partial schema part of a JSON document.

Partial schemas are a win-win for both, databases supporting JSON documents as well as code accessing these databases.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 6): Dynamic Schema Change and Symbolic Computation

February 27, 2018 by chbussler

How does software engineering look like in context of the absence of schema support? What now?

Summary – Where Are We?

The structure of existing JSON objects can change at any time in a NoSQL database (in general). New, possibly by the code so far unseen structures of JSON objects can appear at any time in existing as well as in new documents.

Schema support does not provide any guarantees (see the MongoDB discussion in previous blogs) – neither for current objects’ structure nor for future objects’ structure, unless a NoSQL database enforces the schema as rigorously as a relational database management system (RDBMS).

Most NoSQL database management systems provide zero schema supervision in the strict sense (no guarantee that stored JSON documents are guaranteed to be schema compliant).

A developer cannot make any assumptions whatsoever about the schema of the stored JSON documents (maybe with the exception of the mandatory existence of an identifier property) unless the structure is rigorously enforced by every piece of code inserting or updating documents (in general a hard problem to solve).

Code makes assumptions when accessing JSON documents as it accesses properties at certain locations inside the JSON document and based on a specific data type. Those assumptions about location and type might or might not be true for every single JSON document that the code accesses.

So where are we? Basically, the code is on its own to deal with the structure and data type uncertainty. The following discussion is based on the situation that the underlying NoSQL database does not provide any strict schema guarantees.

Engineering Approaches

If there is no guarantee provided by a database system, the compliance check for each document has to be done by the code via assertions on the required structure and data types of values. Before a document is processed, it has to be asserted that the expected structure and data types exist as otherwise the processing would fail.

The following sections discuss one approach of run-time assertions and the phases of processing:

Run-time assertions
Assertion execution frequency
JSON document non-compliance
Non-compliance analysis and mitigation

In addition, alternative approaches to run-time assertions are discussed in order to improve efficiency.

As a side note, if code inserting or updating documents would ensure a specific schema rigorously, then code accessing documents will not have to implement assertions. However, this requires a very strict engineering discipline around current and future code that inserts or updates (including 3rd party tools that support direct database access). In many cases this is impossible, hence the emphasis here on what the accessing code can or has to do. This approach of guarantee by inserting or updating code will not be discussed further here.

Run-Time Assertions

Code accessing a document requires specific properties to be present and to be of a specific data type in order for it to execute its logic.

For example, if code needs to access a property like the value of an order, then the code needs to assert the existence of that property and the acceptable data types. Those might be number or string (as a string value could be converted to a number in the code):

assert(exists_path(order.value))
assert(is_of_type(order.value, number) 
       || is_of_type(order.value, string))

Code must assert expectations before access in order to ensure that the document matches those expectations and does not fail because of document non-conformance. If there is no “value” property in an order, the code would not be able to access is.

The blog https://realprogrammer.wordpress.com/2016/11/29/sql-for-json-rationalization-part-11-json-specific-predicates/ introduced the two essential predicates used above (exists_path() and is_of_type()) in context of NoSQL Query.

Assertion Execution Frequency

Checking assertions affects code efficiency due to additional execution, so how often do assertions have to be checked? In principle, before every document access.

There is a variation between NoSQL databases that provide single document transactions and those providing multi document transactions. In the single document transaction case, documents can change at any time in a NoSQL database and so between two separate accesses, the document could have been changed by other code. Assertions have to be checked every time before accessing a document.

some_function() {
// retrieve document
// assert(...)
// execute logic
}

some_function_2() {
// retrieve document
// assert(...)
// execute logic
}

In case multi document transactions are supported, all assertions can be checked at once when the transaction is started, and does not have to be checked again within the same transaction.

// start transaction

// assert()

some_function() {
// retrieve document
// execute logic
}

some_function_2() {
// retrieve document
// execute logic
}

// end transaction

In general, code can operate on a single document or on a set of documents. A single document scope means that the code reads and possibly changes a single document only. A multi document scope means that the code reads one or more documents and possibly modifies one or more documents. For example, summing up the value of all not yet shipped orders and storing the sum in an analysis document. In either case, assertions have to be checked before every document access.

Executing assertions before every access is quite resource intensive at high processing rates. Later on variations of assertions are discussed that limit the effort and the resource consumption in order to make the use of assertions more practical or more efficient.

Note: MongoDB realized the enormous need for multi document transactions and is planning to implement those: https://www.mongodb.com/transactions.

JSON Document Non-Compliance (Violated Assertions)

If one or more assertions are violated, then the document is non-compliant from the viewpoint of the code accessing it. Assertion violation handling and analysis has to take place and there are different options:

The document could be ignored (which might or might not be a good option in context of the code’s functionality) if it is irrelevant or negligible for the code’s computation. The code could record the ignored documents for future off-line analysis.
An exception can be thrown and the code aborts its execution. This is a hard failure and ensures that no incorrect result is produced by the code.
Error handling can take place like for example the code “parks” the document for future manual analysis and processing. In this case the code produces a result with the disclaimer that there are unprocessed documents that since not processed might affect the code’s accuracy of the result. Compared to the first case the code returns an error or warning highlighting the potential inaccuracy (and in the best case the degree of inaccuracy).

In case of single document processing the handling is easier compared to multi document processing. In the latter case, the code might have updated several documents before reaching a non-compliant one. If the handling is to abort execution, then the already updated documents might have to be reverted back (compensation logic). If there are multi document transactions available the database takes care of reverting the documents as part of rolling back the transaction.

If multi document transactions are not supported alternative approaches have to be found. For example, the code could remember the previous document state itself for each document updated and therefore would be able to revert the state (code based compensation). However, those documents could have been changed in the meanwhile, adding further complexity to the compensation logic.

Non-Compliance Analysis and Mitigation

When code encounters non-compliant documents then mitigation has to be determined so that this case will not happen in the future anymore. There are several possible approaches (not necessarily complete):

Based on the non-compliant document analysis, the code is changed or extended to be able to process those cases as well. This would change the assertions so that in the future the documents would be considered compliant.
The code that inserts or updates the documents in the first place has to change in order to ensure that the assertions of subsequent accesses will be met. Code accessing the documents will have assurance that this non-compliance will not happen anymore.
It is acceptable to have manual post-processing. Instead of changing the code and its assertions, it is accepted that manual work is performed in order to produce an accurate result. This might require updating the result manually (like the total amount of all open orders).
It is acceptable to ignore non-compliant documents. This is a valid analysis result as well if the logic of the code allows the result to be based only on a subset of the documents.

Different approaches to mitigation have a different impact on the processing time as well as the availability of the improved code base. Of course, avoiding non-compliance in the first place would be best.

Thought Experiments on Alternative Assertion Implementation

Are there possible ways to reduce the amount of assertion processing? In the following approaches are discussed as thought experiment.

The first approach is shadow schema-per-document management for documents. A schema is not declared upfront and enforced by the database, but derived from a document right after its creation. Each document has its own schema and every time the document is updated, its schema is updated as required in order to remain in sync with the document itself.

Code accessing a document can check the document’s schema first before retrieving the document in order to determine if the assertions are met. The assertions are in this case expressed as schema interrogation instead of document structure access. Every time the schema is adjusted to match a document update, its version is increased.

This by itself does not have a huge saving in terms of access, possibly worse, as the schema has to be retrieved in addition to the document itself. However, in the case where many documents have the same schema, the code accessing documents could know which schemas are satisfying the assertions. So just knowing the schema identifier and its version might be sufficient to establish the compliance check. In addition, if a document contains a reference to its schema and version, then verifying assertions is simply comparing a schema name and version. This is highly efficient as the code can keep a list of compliant schemas and their versions (assuming many documents comply to the same schema).

A second and very different approach is to add assertions to the document itself and an indicator for each assertion if it is satisfied or not within the given document. Code accessing a document the first time checks its assertions, and also adds the assertions to the document. Code updating a document re-evaluates the assertions after the update and sets the indicator. Subsequent accesses do not have to evaluate the assertions anymore every time, but lookup the indicator only. This reduces processing at the time of access. A possible optimization is that code updating the document only has to re-evaluate the assertions if the update changes the structure and/or the data types.

There might be further approaches to reduce and to optimize assertion execution. However, nothing beats a supervised and enforced schema when structure and/or data type compliance are paramount.

Summary

When a NoSQL database does not provide rigorous schema enforcement, the code is on its own to establish document structure and data type compliance when accessing documents. The code has to be very defensive in order to avoid execution failure. Due to the execution effort involved, schema support and strict supervision might be really important for the code accessing documents.

In addition, especially when having to abort multi document processing, multi document transactions are essential when several documents are updated during the processing in order to avoid complex compensation logic. Software engineers have to be super careful with compensation logic that cannot rely on multi document transactions provided by the database.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 5): Intermezzo 3 – MongoDB’s $jsonschema

January 29, 2018 by chbussler

The previous blog discussed MongoDB’s $jsonschema behavior with a strict validation level. Let’s look at the moderate validation level in this blog.

Example

As usual, first, let’s create a collection and add a few JSON documents to it. Afterwards a schema validation is added with the moderate setting (the following is based on MongoDB version 3.6.1).

> mongo

> use moderate_exploration

Initially, before adding a schema, two JSON objects are inserted that are not compliant with the schema that is going to be added afterwards. The reason is that we need non-compliant JSON objects to discuss the moderate level later.

> db.orders.insert({
   "orderId": 1,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 20
    }, {
    "itemId": 56,
    "numberOrdered": 21
   }],
   "specialInstructions": "Drop of in front, 
                           not back of location"
  })

WriteResult({ "nInserted" : 1 })

> db.orders.insert({
   "orderId": 2,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 40
    }, {
    "itemId": 56,
    "numberOrdered": 41
   }],
   "preferredColor": "red"
  })

WriteResult({ "nInserted" : 1 })

Now the schema is added:

> db.runCommand({ 
   "collMod": "orders",
   "validator": {  
    "$jsonSchema": {   
      "bsonType": "object",
       "required": ["orderId", "orderDate", "orderLineItems"],
       "properties": {
        "orderId": { 
         "bsonType": "int",
         "description": "Order Identifier: must be of 
                         type int and is required"
        },
        "orderDate": { 
         "bsonType": "date",
         "description": "Order Date: must be of 
                         type date and is required"
        },
        "orderLineItems": { 
         "bsonType": "array",
         "items": {  
          "bsonType": "object",
          "properties": {   
           "itemId": {    
           "bsonType": "int"   
           },
           "numberOrdered": {    
           "bsonType": "int"   
           }  
          } 
         },
         "description": "Order Line Items: must be of 
                         type array and is required"
      }   
     }  
    } 
   },
   "validationLevel": "moderate",
   "validationAction": "error"
  })

{ "ok" : 1 }

After the schema is added, two more JSON objects are inserted, this time being schema compliant.

> db.orders.insert({
   "orderId": NumberInt(3),
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": NumberInt(55),
    "numberOrdered": NumberInt(60)
    }, {
    "itemId": NumberInt(56),
    "numberOrdered": NumberInt(61)
   }]
  })

WriteResult({ "nInserted" : 1 })

> db.orders.insert({
   "orderId": NumberInt(4),
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": NumberInt(55),
    "numberOrdered": NumberInt(80)
    }, {
    "itemId": NumberInt(56),
    "numberOrdered": NumberInt(81)
   }]
  })

WriteResult({ "nInserted" : 1 })

At this point the created collection is governed by a schema, and contains four JSON documents, two are compliant with the schema (orderId 3 and 4), and two are not compliant (orderId 1 and 2).

Analysis

The MongoDB documentation states for “moderate”: “Apply validation rules to inserts and to updates on existing valid documents. Do not apply rules to updates on existing invalid documents.” (https://docs.mongodb.com/manual/reference/command/collMod/#validationLevel).

Let’s explore now the behavior of the moderate validation level.

First, let’s try to insert a non-compliant JSON document. The insert will fail as expected:

> db.orders.insert({
   "orderId": 5,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 40
    }, {
    "itemId": 56,
    "numberOrdered": 41
   }],
   "preferredColor": "red"
  })

WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Second, let’s try to update a compliant JSON document that already exists in the collection in a non-compliant way:

> db.orders.update({  
   "orderId": NumberInt(3) 
   }, {  
   "$set": {   
    "orderDate": "2018-01-09"  
   } 
  })

As expected the update fails:

WriteResult({
 "nMatched" : 0,
 "nUpserted" : 0,
 "nModified" : 0,
 "writeError" : {
  "code" : 121,
  "errmsg" : "Document failed validation"
 }
})

Third, let’s try to update a non-compliant JSON document

> db.orders.update({  
   "orderId": NumberInt(1) 
   }, {  
   "$set": {   
    "orderDate": "2018-01-10"  
   } 
  })

As per the above explanation of moderate this should work and indeed it does:

WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

Bypassing Validation

With the correct permission (https://docs.mongodb.com/manual/reference/privilege-actions/#bypassDocumentValidation) it is possible to bypass document validation.

This allows for the situation that e.g. a collection is governed by a new schema, however, existing application code might have to continue to insert or to update documents with a structure that violates the new schema as the logic cannot be adjusted to the new schema quickly enough (including transforming the non-compliant to compliant JSON documents).

Summary

The brief analysis of MongoDB wrt. document validation in context of JSON schemas added to collections in the last three blogs showed that while schema supervision is possible, it is not as strict as in relational database management systems.

Basically, if a schema is present, a user cannot infer that all documents in that collection comply to that schema. A schema related to a collection can be changed, and existing documents that would violate the new schema on insert will not be discarded from the collection. Furthermore, properties that are not covered by the schema can be added and changed freely.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 4): Intermezzo 2 – MongoDB’s $jsonschema

December 19, 2017 by chbussler

After some initial exploration in the previous blog, more aspects on MongoDB’s $jsonschema are looked at in the following.

Example

First, let’s create a collection as follows. It is governed by a schema, and validation is in the strictest setting (the following is based on MongoDB version 3.6.0).

> mongo

> use more_exploration

> db.createCollection("orders", {
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "object",
       "properties": {
        "itemId": {
         "bsonType": "int"
        },
        "numberOrdered": {
         "bsonType": "int"
        }
       }
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  },
  "validationLevel": "strict",
  "validationAction": "error"
 })

{ "ok" : 1 }

The two documents from the example outlined in the initial blog of series are added next.

> db.orders.insert({
   "orderId": NumberInt(1),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(20)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(21)
    }
   ]
  })

WriteResult({ "nInserted" : 1 })

> db.orders.insert({
   "orderId": NumberInt(2),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(30)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(31)
    }
   ]
  })

WriteResult({ "nInserted" : 1 })

Insert Strictness and Partial Schema Coverage

The validator is in place on the collection “orders”. This can be verified with the command

> db.getCollectionInfos({name: "orders"})

Now let’s try and add a document that has additional properties in addition to those that comply with the schema as follows:

> db.orders.insert({
   "orderId": NumberInt(3),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(40)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(41)
    }
   ],
   "preferredColor": "red"
  })

WriteResult({ "nInserted" : 1 })

It appears that as long as the schema is satisfied, additional properties can be inserted. So the schema is not completely covering the object to be inserted, but only those properties that are defined in the schema (validator). It is a partial schema coverage.

Here is the counter example: the value of the property “orderLineItems” is not in compliance, and so the insertion fails:

> db.orders.insert({
   "orderId": NumberInt(4),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": ["b", "g"],
   "preferredColor": "red"
  })

WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Update Strictness and Partial Schema Coverage

The following updates an existing document:

> db.orders.update({
   "orderId": NumberInt(2)
  }, {
   "$set": {
    "orderDate": new Date("2017-10-01")
   }
  })

WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

In part 1 of this blog series the order with identifier 1 was updated to add a property “specialInstructions”. This is not schema compliant, however, the update is possible as it does not violate that part of the document that is covered by the schema.

> db.orders.update({
   "orderId": NumberInt(1)
   }, {
   "$set": {
    "specialInstructions": "Drop of in front, 
                           not back of location"
   }
  })

WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

Partial schema coverage applies to update as well, not just to inserts.

An example of a non-compliant update is the following:

> db.orders.update({
   "orderId": NumberInt(2)
  }, {
   "$set": {
    "orderDate": "2017-09-30"
   }
  })

WriteResult({
 "nMatched": 0,
 "nUpserted": 0,
 "nModified": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Summary

MongoDB supports partial schema coverage in strict mode, meaning, properties defined in the schema must match the schema, however, properties not specified in the schema can be added or modified without rejection.

This means (again) that examining the JSON schema validator of a MongoDB collection only indicates properties common to all documents, but not the complete set of properties of all documents.

The next blog examines the non-strict validation setting of a JSON schema validator in MongoDB.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 3): Intermezzo 1 – MongoDB’s $jsonschema

November 30, 2017 by chbussler

MongoDB introduced support for JSON Schema through $jsonschema. Let’s explore this new functionality a bit in this blog.

$jsonschema

The functionality is introduced here: https://docs.mongodb.com/master/reference/operator/query/jsonSchema/#op._S_jsonSchema It states “$jsonSchema can be used in a document validator, which enforces that inserted or updated documents are valid against the schema.”

A first item to note is that this approach is supporting BSON types (http://bsonspec.org/, https://docs.mongodb.com/master/reference/operator/query/type/), not just JSON structures (https://www.json.org/), using a specific property “bsonType” that is not part of the JSON Schema standard (http://json-schema.org/).

A second observation is that the schema specification is inline with the collection creation and cannot refer to a separate JSON schema file or JSON object representing a JSON schema.

JSON Schema Validator Example

Let’s use the example of the first blog in this series, create a schema for it and use that as a constraint for the “orders” collection. Then documents are added to the collection (and there seem to be errors as well). For reference the version used is: MongoDB server version: 3.6.0.

> mongo

> use schema_exploration

> db.createCollection("orders", {
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "string"
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  }
 })

{ "ok" : 1 }

A quick note: “bsonType” can be used in all levels in order to refer to BSON types, not just on the top level.

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": [{
   "itemId": 55,
   "numberOrdered": 30
  }, {
   "itemId": 56,
   "numberOrdered": 31
  }]
 })

WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Along the way I ran into a validation issue as I constraint the array elements to strings, rather than objects, as used in the example of the first blog in this series. So I made a schema definition mistake.

To note is that the response on the shell does not indicate what the problem was making debugging hard, especially when large and complex schemas are to be debugged.

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": ["a", "b"]
 })

WriteResult({
 "nInserted": 1
})

Once I realized the mistake I made, I inserted a document complying to the schema in order to make sure I identified the issue correctly.

JSON Schema Validator Update

Obviously, after defining a wrong schema, the correct schema should be used as validator.

This is the correct schema:

{
 "bsonType": "object",
 "required": ["orderId", "orderDate", "orderLineItems"],
 "properties": {
  "orderId": {
   "bsonType": "int",
   "description": "Order Identifier: must be of 
                  type int and is required"
  },
  "orderDate": {
   "bsonType": "date",
   "description": "Order Date: must be of 
                  type date and is required"
  },
  "orderLineItems": {
   "bsonType": "array",
   "items": {
    "bsonType": "object",
    "properties": {
     "itemId": {
      "bsonType": "int"
     },
     "numberOrdered": {
      "bsonType": "int"
     }
    }
   },
   "description": "Order Line Items: must be of 
                  type array and is required"
   }
  }
 }

And this is the command to update the validator:

> db.runCommand({
  "collMod": "orders",
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "object",
       "properties": {
        "itemId": {
         "bsonType": "int"
        },
        "numberOrdered": {
         "bsonType": "int"
        }
       }
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  },
  "validationLevel": "strict"
 })

{ "ok" : 1 }

Some background on the command used is here: https://docs.mongodb.com/master/reference/command/collMod/.

Following is an attempt to add one more of the (now mismatching) documents:

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": ["a", "b"]
 })

WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

As it should be, the insert fails.

And here the insert of a now correct document:

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": [{
   "itemId": NumberInt(55),
   "numberOrdered": NumberInt(20)
  }, {
   "itemId": NumberInt(56),
   "numberOrdered": NumberInt(21)
  }]
 });

WriteResult({
 "nInserted": 1
})

Collection Inconsistency: Mismatch of Schema and Documents

There is an interesting issue appearing at this point. The new schema does not match all existing documents in the collection. Or the other way around: the collection now contains documents that do not match that schema.

> db.orders.find()

{
 "_id": ObjectId("5a2022c3fb460d15db9ec73e"),
 "orderId": 1,
 "orderDate": ISODate("2017-09-30T00:00:00Z"),
 "orderLineItems": ["a", "b"]
} {
 "_id": ObjectId("5a202322fb460d15db9ec741"),
 "orderId": 1,
 "orderDate": ISODate("2017-09-30T00:00:00Z"),
 "orderLineItems": [{
  "itemId": 55,
  "numberOrdered": 20
 }, {
  "itemId": 56,
  "numberOrdered": 21
 }]
}

MongoDB did not flag that there are documents in the collection that will not match the new schema (even though the validation level strict was used).

Adding a validation action with value of “error” does not change the situation, either.

Implication to Semantics

Given that the schema of a collection can be changed at any time, and given that MongoDB does not fail the schema update based on mismatching documents already in the collection, examining the schema is insufficient to understand the structure of the documents in a collection.

So a collection with a schema does not ensure that all documents in that collection are schema compliant. It rather insures that from the point in time the schema was added or updated documents will have to comply. Previous documents in the collections are not affected.

Summary

The notion of “schema” in context of MongoDB is very different from the notion of “schema” in context of relational database management systems. In MongoDB the documents in the collection do not have to comply to the schema; they only do have to comply at time of insertion.

There will be more exploration coming up in the next blog on this topic in order to further understand the semantics of “schema” in context of MongoDB.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 2): Where does the “Interesting” Code go?

November 6, 2017 by chbussler

The previous blog found that the “generic” indirect representation of JSON data is one way of supporting “schema-free” JSON objects or documents. Where does the “interesting” functional code live?

Indirect Representation

To recap, the indirect representation is a set of classes, functions, etc. (depending on programming language) that can manage JSON objects or JSON documents. All or most languages have libraries supporting JSON manipulation. For example, Jackson is such a library for Java.

These JSON libraries can manage any valid JSON structure, and they do not require a schema or the JSON objects being homogeneous. Two JSON objects representing the same concept like an order with different attributes (as shown in the previous blog) can be managed by such JSON libraries.

Structural Manipulation

Structural manipulation of JSON objects supports the addition, update or deletion of properties (members) as well as JSON array elements. Property values can be replaced, for example, a JSON string with a JSON object.

Through structural manipulation it is possible to change a JSON object as needed, when e.g. new details appear in form of additional properties.

Structural manipulation was demonstrated in a database context in the last blog: properties were added through the update statement. The same is possible in the indirect representation libraries in the various programming languages.

Computation

Structural manipulation is not the only code that is required as structural manipulation does not allow to compute any specific application semantics. For example, in context of orders, the total value of not yet shipped orders might be a value that needs to be computed.

In a database context this would be an aggregation query that sums up the amount of all orders that do not have the status of shipped.

In context of a programming language it would require a function that iterates through all orders and, like in the database aggregation approach, adds up the sum of those orders that have not shipped yet.

It probably would be implemented as a set of cooperating functions, like

DollarAmount getValueOfOrdersNotShipped(JSONArray orders)
boolean hasOrderShipped(JSONObject order)
DollarAmount getValueOfOrder(JSONObject order)

JSONArray as well as JSONObject are an example of an indirect representation holding order data as a JSON structure.

Note: of course, in the absence of a schema (which is assumed here), there is no assurance that the JSONArray or the JSONObject contain only orders or that the orders are homogeneous in structure. There has to be “trust” that this is indeed the case.

If validation is desired, and if no schema is available, then the only alternative is validating values in one or more JSON object properties. For example, order identifiers might be of a specific structure that uniquely identifies an identifier being an order identifier. This would require trust that the algorithms creating identifiers are correct.

Separation of Manipulation and Computation

The JSON libraries supporting the indirect representation are separate from the functional code (like the summing up of order values). The software architecture and design has to structure this separation and ideally ensures that all functions concerned with orders are “close” from a code structure or software architecture perspective.

There might be functions that can be reused across different concepts (like orders, returns, shipments, etc.), and they can be refactored out, of course, as in “normal” functional code.

Given the above rationalization, how does the absence of a schema come into the picture?

Implication of Schema Free JSON Objects

Since there is no schema, JSON objects can have a different structure even though they represent the same concepts. In context of orders, let’s look at two use cases:

An order does not have a shipping status
An order does have a value but in a variety of data types

In a world without schema these are possible use cases and the functional code needs to check for those.

Addressing the first use case can be accomplished by checking for existence. Code can check if a property is present and react accordingly. In the above example, the code designer can choose to have hasOrderShipped() return false or throw an error in case there is no shipping status.

The second use case can be addressed by checking for the type of the value of the order. If possible, value transformations can be implemented in getValueOfOrder(), e.g., string to number; if it is not possible to transform, an error can be thrown.

Summary

In a schema free JSON context there are several aspects from a code perspective: functional code implementing application semantics is separate from the code that manages the structure of JSON objects. That separation must be carefully managed from an architectural perspective.

The functional code must anticipate non-homogeneous JSON objects and check for variation in order to be able to implement the functionality accurately.

But wait, there is more:-) The next blog will venture into more nuances.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 1): Preliminaries

September 30, 2017 by chbussler

Missing schema support and schema enforcement is touted as a good thing: is it really?

“Schema Free”, “Flexible Schema”, “Schema Per Document”

What is meant by a database supporting the JSON data structure without providing support and enforcement for schemas? Such a database is often characterized as “schema free”, or supporting a “flexible schema” or “schema per document”. What does it mean?

No matter how such a database supporting JSON it is labelled, it does not provide an interface to define, to manage or to enforce schema(s) for the data, aka JSON documents, it is managing (“enforcing” is used in the semantics a relational database enforces a schema). This means that a client (e.g., application code) can store JSON documents that have any form as long as those comply to the JSON (syntax) standard (and possibly proprietary extensions by the database system).

It furthermore means that JSON documents representing instances of the same concepts (like e.g. orders or games or employees) do not have to have the same structure. Those JSON documents can be different from each other, not only in values, but also in structure.

An example follows of a possible scenario (using MongoDB).

Example

The example stores initially two documents that have the same structure, and subsequently their structure diverges through updates. No schema enforcement prevents the changes.

use blog;

db.blogColl.insert({
  "orderId": 1,
  "orderDate": "9/30/2017",
  orderLineItems: [{
      "itemId": 55,
      "numberOrdered": 20
    },
    {
      "itemId": 56,
      "numberOrdered": 21
  }]
});

db.blogColl.insert({
  "orderId": 2,
  "orderDate": "9/30/2017",
  orderLineItems: [{
      "itemId": 55,
      "numberOrdered": 30
    }, 
    {
      "itemId": 56,
      "numberOrdered": 31
  }]
});

db.blogColl.update({
    "orderId": 1
  }, {
    "$set": {
    "specialInstructions": 
      "Drop of in front, not back of location"
  }
});

db.blogColl.update({
    "orderId": 2,
    "orderLineItems.itemId": 55
  }, {
    $set: {
      "orderLineItems.$.color": "transparent"
  }
});

The ability to store different JSON documents with different structures, even if they represent (instances of) the same concept, can be seen as a powerful feature. It allows modifying the data as needed to represent changing requirements or specific representation needs. Data migration is easier, too, as data can be changed in place.

This flexibility also has downsides and (engineering) cost that need to be considered and dealt with in a concrete implementation.

Application Implementation

Application code accessing a database has a full or partial representation of the data it queries (or in general manages) in the type system of the deployed programming language.

There are basically two choices an application (short for application code) has to represent data:

Direct representation
Indirect representation

In a direct representation the concept as stored in the database is defined as data structure in the programming language. For example, using Java as the programming language example, an order is represented as a Java class “Order”. This class has all the methods required to access the various elements of an order (that might be implemented as Java classes themselves). In this approach an order stored in the database, when queried, will be managed as an instance of the Java class Order in the application code. Methods support access or modification to the instance of order, and the methods are order semantics specific, like getOrderDate() or updateOrderLineItem() or totalNumberInidividualItems().

In the indirect representation, an order would be represented not as instance of a Java class that reflects the concept, but an instance of a “meta” Java class. This “meta” class is able to store all data from the database, not just orders. Such a class would have methods like createInstance(), setIdentifier(), setType(), addAttribute(), etc. A type would be “Order”, an attribute would be “lineItem”, etc.

Consequences Of Choice

Given the two representations discussed earlier (direct and indirect), the choice seems to be clear. While the direct representation can capture the semantics of a concept directly (aka, a Java class “Order” can implement order specific methods), this approach would not be able to easily (or at all) deal with changes in the database representation of orders. For example, if an additional attribute is added to the JSON document representing an order (as shown above), the Java class would not be able to change dynamically and capture it.

The indirect representation, however, would not have any problems representing order JSON objects with different structure as e.g. attributes can be dynamically added. This means that instances of the “meta” Java class can represent any JSON document as stored in the database.

These “meta” Java classes are actually already available in form of JSON processing libraries. Such libraries support the creation of JSON structure representations and they can represent any JSON object (or JSON array) that implements a correct JSON syntax.

Summary

At a first glance it looks like databases that support JSON without enforcing a schema at the same time are a good choice for ease of data management. And, using the indirect representation approach applications can deal with dynamically changing JSON objects or JSON objects of different structure representing the same concept.

However, as always, there are more details to discuss and additional aspects are going to be examined in the next blog.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Schema-free Database (Part 1): An Oxymoron

November 2, 2013 by chbussler

The notion of a ‘schema-free database’ keeps coming up, most recently in a meetup I attended a few days ago. Some rationalization follows divided up into the categories of ‘document’ and ‘database’.

While a generalization is easily possible, the context here will be JSON and MongoDB as these are two practical implementations that are available and often used as examples of a ‘schema-free database’. Those provide a nice constraint technology set as an example, while the principles apply to a whole range of other technology, of course.

Document

A JSON document, in short: document, follows a set of construction principles outlined here: http://www.json.org/. This is a rather informal grammar that defines how a valid JSON document is constructed. There are no data type generators and so new data types cannot be introduced; therefore, every document is constructed from the fixed set of types enumerated on that web page.

Document Schema

An attempt has been made to create a more formal mechanism to define a schema for JSON documents: http://json-schema.org/. This approach provides a formal language to describe the schema of a JSON document explicitly.

The json-schema approach combined with the fixed set of types available to create a JSON document means that every JSON document can be described explicitly using json-schema without exceptions. This in turns means that every JSON document has at least an implicit schema, unless it is additionally made explicit with e.g. json-schema.

Therefore, JSON documents have a schema, an implicit one and optionally an explicit one. Depending on the particular schema definition language approach itself, a document might match more than one schema, but that is left for a separate discussion.

Set of Document Schemas

Given a set of JSON documents it is now possible to characterize their relationship to schemas. In the ‘best’ case, all documents follow the same schema; in the ‘worst’ case, each document follows its own schema. And there are cases in-between where a subset of the documents validates against a schema, and another subset against another schema. Depending on the design, one JSON document might validate against different schemas.

The relationship between documents and schemas is n:m in general.

Database

In context of a database, there are a few interesting questions in this context:

Does the database understand a document representation (e.g. JSON)?
Does the database enforce a document representation?

And:

Does the database enforce a schema?
Does the database understand a schema?

Let’s answer these questions for MongoDB specifically:

MongoDB understands several document representations, JSON and BSON (http://bsonspec.org/).
MongoDB enforces a document representation. If an attempt is made to insert documents following a representation different from JSON or BSON, MongoDB rejects it (there is a slight exception to this as MongoDB interprets JavaScript constants also: https://realprogrammer.wordpress.com/2012/10/11/null-undefined-nan-and-missing-property-goto-considered-harmful-part-1/)

And:

MongoDB enforces a partial schema. Each document must have a property called “_id”; if the document being inserted does not have such a property, one is automatically added.
MongoDB does not understand an explicit schema as it does not provide for a mechanism to load a schema definition language.

MongoDB, however, understands implicit schemas as MongoDB does allow to e.g. create an index on any property of documents. So MongoDB recognizes properties.

Furthermore, MongoDB supports aggregation functions and supports e.g. the sum of properties across documents (https://realprogrammer.wordpress.com/2012/11/04/null-undefined-nan-and-missing-property-goto-considered-harmful-part-2/). So it is data type aware and implements operators (e.g. sum) on those.

Conclusion: The Notion of ‘Schema’ is Changing

This rather brief discussion clearly rationalizes that the label ‘schema-free database’ is not applicable to technologies such as those discussed in this blog (JSON/BSON, MongoDB).

Hence these technologies are not an example of ‘schema-free database’, to the contrary: they demonstrate that the notion of ‘schema’ can have a wider and more flexible interpretation then what relational databases bring forward.

Schema Derivation: Schema-less vs. Schema-based Documents

July 4, 2012 by chbussler

In the world of document-oriented databases there is the notion that documents are ‘schema-less’ because no schema has to be defined and registered with the database before documents can be stored. Sometimes this is also referred to as ‘schema-free’. Let’s investigate this further: does ‘schema-free’ really exist?

JSON

In MongoDB, for example, documents use the JSON (http://www.json.org/) structure (with an extended set of types as defined in BSON http://bsonspec.org/). So all document in the database follow the same underlying data type structure. The database will reject a document that is not JSON/BSON compliant.

Schema Enforcement vs. Schema Freedom

In a relational database the notion of schema means that the database knows about the schema (e.g. structure of a table) and all data stored must follow that structure. The database management system is the enforcing entity of the schema. Data not following the schema will be rejected.

In a document-oriented system like MongoDB the notion of schema is not known to the database and therefore not enforced by the database. Any document can have any structure as long as it is valid according to JSON.

Explicit Schema vs. Implicit Schema

In the case of relational databases, the schema is explicit as it is possible to retrieve the schema as a separate data structure from the database. Relational databases have system tables that contain their schema. So the database is aware of the schema in such a way that it can be retrieved.

In a document-oriented database it is not possible to inquire the schema as it has not been defined to the database. However, it could be the case that all documents follow a schema (that is defined outside the database). If this is the case, the schema of the documents can be called an implicit schema. It is not necessary for the database to enforce a schema for the documents to comply to one.

Schema-less, Schema-free

The fact that a document-oriented database does not enforce a schema of its documents does not mean that there is no schema. It only means that the database does not enforce it. However, it could be enforced elsewhere. For example, the software that writes the documents can enforce their schema. Or the software that reads the documents enforces the schema by only accepting those documents that comply to its expectations.

But what if a user manually defines a documents and stores it?

Individual Schema vs. Global Schema

A different viewpoint is possible. If a user stores a manually designed document then it is possible to view this document of having an individual schema. It is the only one with that schema. If the user now adds a second document that has the exact same structure, then it is possible to view these as two documents with the same implicit schema. At this point there are two documents following the same (implicit) schema. It got a bit more global, tough, as now two documents are following it. If all documents follow the same (implicit) schema, then there is an (implicit) global schema.

Different documents can follow different (implicit) schemas. In the extreme case, all documents are different from each other in their structure and each has its own schema.

Schema Derivation

JSON is a construction principle for hierarchical data structures. Each document following the JSON model is hierarchical and finite as JSON does not support (cyclic) references.

Based on the hierarchical and finite structure of JSON it is possible to derive the implicit schema of the JSON documents and make it explicit. On a very high level, the schema derivation algorithm iterates over all documents in the database and determines the schema for each of it. Documents that follow the same schema are grouped together. After all documents have been inspected, there will be one (and possibly more) schema and the associated complying documents.

Details of this algorithm will be discussed in a different blog at some later time.

Summary

Documents have a schema, there are not schema-less or schema-free. The schema of a document is implicit (and can be made explicit). Several documents can follow the same schema. While the database is not enforcing the schema, it could be enforced by a separate layer outside the database.

So, not-enforcement of a schema by the database != schema-less/schema-free.