SQL for JSON and Schema Support (Part 6): Dynamic Schema Change and Symbolic Computation

How does software engineering look like in context of the absence of schema support? What now?

Summary – Where Are We?

The structure of existing JSON objects can change at any time in a NoSQL database (in general). New, possibly by the code so far unseen structures of JSON objects can appear at any time in existing as well as in new documents.

Schema support does not provide any guarantees (see the MongoDB discussion in previous blogs) – neither for current objects’ structure nor for future objects’ structure, unless a NoSQL database enforces the schema as rigorously as a relational database management system (RDBMS).

Most NoSQL database management systems provide zero schema supervision in the strict sense (no guarantee that stored JSON documents are guaranteed to be schema compliant).

A developer cannot make any assumptions whatsoever about the schema of the stored JSON documents (maybe with the exception of the mandatory existence of an identifier property) unless the structure is rigorously enforced by every piece of code inserting or updating documents (in general a hard problem to solve).

Code makes assumptions when accessing JSON documents as it accesses properties at certain locations inside the JSON document and based on a specific data type. Those assumptions about location and type might or might not be true for every single JSON document that the code accesses.

So where are we? Basically, the code is on its own to deal with the structure and data type uncertainty. The following discussion is based on the situation that the underlying NoSQL database does not provide any strict schema guarantees.

Engineering Approaches

If there is no guarantee provided by a database system, the compliance check for each document has to be done by the code via assertions on the required structure and data types of values. Before a document is processed, it has to be asserted that the expected structure and data types exist as otherwise the processing would fail.

The following sections discuss one approach of run-time assertions and the phases of processing:

  • Run-time assertions
  • Assertion execution frequency
  • JSON document non-compliance
  • Non-compliance analysis and mitigation

In addition, alternative approaches to run-time assertions are discussed in order to improve efficiency.

As a side note, if code inserting or updating documents would ensure a specific schema rigorously, then code accessing documents will not have to implement assertions. However, this requires a very strict engineering discipline around current and future code that inserts or updates (including 3rd party tools that support direct database access). In many cases this is impossible, hence the emphasis here on what the accessing code can or has to do. This approach of guarantee by inserting or updating code will not be discussed further here.

Run-Time Assertions

Code accessing a document requires specific properties to be present and to be of a specific data type in order for it to execute its logic.

For example, if code needs to access a property like the value of an order, then the code needs to assert the existence of that property and the acceptable data types. Those might be number or string (as a string value could be converted to a number in the code):

assert(exists_path(order.value))
assert(is_of_type(order.value, number) 
       || is_of_type(order.value, string))

Code must assert expectations before access in order to ensure that the document matches those expectations and does not fail because of document non-conformance. If there is no “value” property in an order, the code would not be able to access is.

The blog https://realprogrammer.wordpress.com/2016/11/29/sql-for-json-rationalization-part-11-json-specific-predicates/ introduced the two essential predicates used above (exists_path() and is_of_type()) in context of NoSQL Query.

Assertion Execution Frequency

Checking assertions affects code efficiency due to additional execution, so how often do assertions have to be checked? In principle, before every document access.

There is a variation between NoSQL databases that provide single document transactions and those providing multi document transactions. In the single document transaction case, documents can change at any time in a NoSQL database and so between two separate accesses, the document could have been changed by other code. Assertions have to be checked every time before accessing a document.

some_function() {
// retrieve document
// assert(...)
// execute logic
}

some_function_2() {
// retrieve document
// assert(...)
// execute logic
}

In case multi document transactions are supported, all assertions can be checked at once when the transaction is started, and does not have to be checked again within the same transaction.

// start transaction

// assert()

some_function() {
// retrieve document
// execute logic
}

some_function_2() {
// retrieve document
// execute logic
}

// end transaction

In general, code can operate on a single document or on a set of documents. A single document scope means that the code reads and possibly changes a single document only. A multi document scope means that the code reads one or more documents and possibly modifies one or more documents. For example, summing up the value of all not yet shipped orders and storing the sum in an analysis document. In either case, assertions have to be checked before every document access.

Executing assertions before every access is quite resource intensive at high processing rates. Later on variations of assertions are discussed that limit the effort and the resource consumption in order to make the use of assertions more practical or more efficient.

Note: MongoDB realized the enormous need for multi document transactions and is planning to implement those: https://www.mongodb.com/transactions.

JSON Document Non-Compliance (Violated Assertions)

If one or more assertions are violated, then the document is non-compliant from the viewpoint of the code accessing it. Assertion violation handling and analysis has to take place and there are different options:

  • The document could be ignored (which might or might not be a good option in context of the code’s functionality) if it is irrelevant or negligible for the code’s computation. The code could record the ignored documents for future off-line analysis.
  • An exception can be thrown and the code aborts its execution. This is a hard failure and ensures that no incorrect result is produced by the code.
  • Error handling can take place like for example the code “parks” the document for future manual analysis and processing. In this case the code produces a result with the disclaimer that there are unprocessed documents that since not processed might affect the code’s accuracy of the result. Compared to the first case the code returns an error or warning highlighting the potential inaccuracy (and in the best case the degree of inaccuracy).

In case of single document processing the handling is easier compared to multi document processing. In the latter case, the code might have updated several documents before reaching a non-compliant one. If the handling is to abort execution, then the already updated documents might have to be reverted back (compensation logic). If there are multi document transactions available the database takes care of reverting the documents as part of rolling back the transaction.

If multi document transactions are not supported alternative approaches have to be found. For example, the code could remember the previous document state itself for each document updated and therefore would be able to revert the state (code based compensation). However, those documents could have been changed in the meanwhile, adding further complexity to the compensation logic.

Non-Compliance Analysis and Mitigation

When code encounters non-compliant documents then mitigation has to be determined so that this case will not happen in the future anymore. There are several possible approaches (not necessarily complete):

  • Based on the non-compliant document analysis, the code is changed or extended to be able to process those cases as well. This would change the assertions so that in the future the documents would be considered compliant.
  • The code that inserts or updates the documents in the first place has to change in order to ensure that the assertions of subsequent accesses will be met. Code accessing the documents will have assurance that this non-compliance will not happen anymore.
  • It is acceptable to have manual post-processing. Instead of changing the code and its assertions, it is accepted that manual work is performed in order to produce an accurate result. This might require updating the result manually (like the total amount of all open orders).
  • It is acceptable to ignore non-compliant documents. This is a valid analysis result as well if the logic of the code allows the result to be based only on a subset of the documents.

Different approaches to mitigation have a different impact on the processing time as well as the availability of the improved code base. Of course, avoiding non-compliance in the first place would be best.

Thought Experiments on Alternative Assertion Implementation

Are there possible ways to reduce the amount of assertion processing? In the following approaches are discussed as thought experiment.

The first approach is shadow schema-per-document management for documents. A schema is not declared upfront and enforced by the database, but derived from a document right after its creation. Each document has its own schema and every time the document is updated, its schema is updated as required in order to remain in sync with the document itself.

Code accessing a document can check the document’s schema first before retrieving the document in order to determine if the assertions are met. The assertions are in this case expressed as schema interrogation instead of document structure access. Every time the schema is adjusted to match a document update, its version is increased.

This by itself does not have a huge saving in terms of access, possibly worse, as the schema has to be retrieved in addition to the document itself. However, in the case where many documents have the same schema, the code accessing documents could know which schemas are satisfying the assertions. So just knowing the schema identifier and its version might be sufficient to establish the compliance check. In addition, if a document contains a reference to its schema and version, then verifying assertions is simply comparing a schema name and version. This is highly efficient as the code can keep a list of compliant schemas and their versions (assuming many documents comply to the same schema).

A second and very different approach is to add assertions to the document itself and an indicator for each assertion if it is satisfied or not within the given document. Code accessing a document the first time checks its assertions, and also adds the assertions to the document. Code updating a document re-evaluates the assertions after the update and sets the indicator. Subsequent accesses do not have to evaluate the assertions anymore every time, but lookup the indicator only. This reduces processing at the time of access. A possible optimization is that code updating the document only has to re-evaluate the assertions if the update changes the structure and/or the data types.

There might be further approaches to reduce and to optimize assertion execution. However, nothing beats a supervised and enforced schema when structure and/or data type compliance are paramount.

Summary

When a NoSQL database does not provide rigorous schema enforcement, the code is on its own to establish document structure and data type compliance when accessing documents. The code has to be very defensive in order to avoid execution failure. Due to the execution effort involved, schema support and strict supervision might be really important for the code accessing documents.

In addition, especially when having to abort multi document processing, multi document transactions are essential when several documents are updated during the processing in order to avoid complex compensation logic. Software engineers have to be super careful with compensation logic that cannot rely on multi document transactions provided by the database.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Advertisements

SQL for JSON and Schema Support (Part 5): Intermezzo 3 – MongoDB’s $jsonschema

The previous blog discussed MongoDB’s $jsonschema behavior with a strict validation level. Let’s look at the moderate validation level in this blog.

Example

As usual, first, let’s create a collection and add a few JSON documents to it. Afterwards a schema validation is added with the moderate setting (the following is based on MongoDB version 3.6.1).

> mongo
> use moderate_exploration

Initially, before adding a schema, two JSON objects are inserted that are not compliant with the schema that is going to be added afterwards. The reason is that we need non-compliant JSON objects to discuss the moderate level later.

> db.orders.insert({
   "orderId": 1,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 20
    }, {
    "itemId": 56,
    "numberOrdered": 21
   }],
   "specialInstructions": "Drop of in front, 
                           not back of location"
  })
WriteResult({ "nInserted" : 1 })
> db.orders.insert({
   "orderId": 2,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 40
    }, {
    "itemId": 56,
    "numberOrdered": 41
   }],
   "preferredColor": "red"
  })
WriteResult({ "nInserted" : 1 })

Now the schema is added:

> db.runCommand({ 
   "collMod": "orders",
   "validator": {  
    "$jsonSchema": {   
      "bsonType": "object",
       "required": ["orderId", "orderDate", "orderLineItems"],
       "properties": {
        "orderId": { 
         "bsonType": "int",
         "description": "Order Identifier: must be of 
                         type int and is required"
        },
        "orderDate": { 
         "bsonType": "date",
         "description": "Order Date: must be of 
                         type date and is required"
        },
        "orderLineItems": { 
         "bsonType": "array",
         "items": {  
          "bsonType": "object",
          "properties": {   
           "itemId": {    
           "bsonType": "int"   
           },
           "numberOrdered": {    
           "bsonType": "int"   
           }  
          } 
         },
         "description": "Order Line Items: must be of 
                         type array and is required"
      }   
     }  
    } 
   },
   "validationLevel": "moderate",
   "validationAction": "error"
  })
{ "ok" : 1 }

After the schema is added, two more JSON objects are inserted, this time being schema compliant.

> db.orders.insert({
   "orderId": NumberInt(3),
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": NumberInt(55),
    "numberOrdered": NumberInt(60)
    }, {
    "itemId": NumberInt(56),
    "numberOrdered": NumberInt(61)
   }]
  })
WriteResult({ "nInserted" : 1 })
> db.orders.insert({
   "orderId": NumberInt(4),
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": NumberInt(55),
    "numberOrdered": NumberInt(80)
    }, {
    "itemId": NumberInt(56),
    "numberOrdered": NumberInt(81)
   }]
  })
WriteResult({ "nInserted" : 1 })

At this point the created collection is governed by a schema, and contains four JSON documents, two are compliant with the schema (orderId 3 and 4), and two are not compliant (orderId 1 and 2).

Analysis

The MongoDB documentation states for “moderate”: “Apply validation rules to inserts and to updates on existing valid documents. Do not apply rules to updates on existing invalid documents.” (https://docs.mongodb.com/manual/reference/command/collMod/#validationLevel).

Let’s explore now the behavior of the moderate validation level.

First, let’s try to insert a non-compliant JSON document. The insert will fail as expected:

> db.orders.insert({
   "orderId": 5,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 40
    }, {
    "itemId": 56,
    "numberOrdered": 41
   }],
   "preferredColor": "red"
  })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Second, let’s try to update a compliant JSON document that already exists in the collection in a non-compliant way:

> db.orders.update({  
   "orderId": NumberInt(3) 
   }, {  
   "$set": {   
    "orderDate": "2018-01-09"  
   } 
  })

As expected the update fails:

WriteResult({
 "nMatched" : 0,
 "nUpserted" : 0,
 "nModified" : 0,
 "writeError" : {
  "code" : 121,
  "errmsg" : "Document failed validation"
 }
})

Third, let’s try to update a non-compliant JSON document

> db.orders.update({  
   "orderId": NumberInt(1) 
   }, {  
   "$set": {   
    "orderDate": "2018-01-10"  
   } 
  })

As per the above explanation of moderate this should work and indeed it does:

WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

Bypassing Validation

With the correct permission (https://docs.mongodb.com/manual/reference/privilege-actions/#bypassDocumentValidation) it is possible to bypass document validation.

This allows for the situation that e.g. a collection is governed by a new schema, however, existing application code might have to continue to insert or to update documents with a structure that violates the new schema as the logic cannot be adjusted to the new schema quickly enough (including transforming the non-compliant to compliant JSON documents).

Summary

The brief analysis of MongoDB wrt. document validation in context of JSON schemas added to collections in the last three blogs showed that while schema supervision is possible, it is not as strict as in relational database management systems.

Basically, if a schema is present, a user cannot infer that all documents in that collection comply to that schema. A schema related to a collection can be changed, and existing documents that would violate the new schema on insert will not be discarded from the collection. Furthermore, properties that are not covered by the schema can be added and changed freely.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 4): Intermezzo 2 – MongoDB’s $jsonschema

After some initial exploration in the previous blog, more aspects on MongoDB’s $jsonschema are looked at in the following.

Example

First, let’s create a collection as follows. It is governed by a schema, and validation is in the strictest setting (the following is based on MongoDB version 3.6.0).

> mongo
> use more_exploration
> db.createCollection("orders", {
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "object",
       "properties": {
        "itemId": {
         "bsonType": "int"
        },
        "numberOrdered": {
         "bsonType": "int"
        }
       }
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  },
  "validationLevel": "strict",
  "validationAction": "error"
 })
{ "ok" : 1 }

The two documents from the example outlined in the initial blog of series are added next.

> db.orders.insert({
   "orderId": NumberInt(1),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(20)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(21)
    }
   ]
  })
WriteResult({ "nInserted" : 1 })
> db.orders.insert({
   "orderId": NumberInt(2),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(30)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(31)
    }
   ]
  })
WriteResult({ "nInserted" : 1 })

Insert Strictness and Partial Schema Coverage

The validator is in place on the collection “orders”. This can be verified with the command

> db.getCollectionInfos({name: "orders"})

Now let’s try and add a document that has additional properties in addition to those that comply with the schema as follows:

> db.orders.insert({
   "orderId": NumberInt(3),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(40)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(41)
    }
   ],
   "preferredColor": "red"
  })
WriteResult({ "nInserted" : 1 })

It appears that as long as the schema is satisfied, additional properties can be inserted. So the schema is not completely covering the object to be inserted, but only those properties that are defined in the schema (validator). It is a partial schema coverage.

Here is the counter example: the value of the property “orderLineItems” is not in compliance, and so the insertion fails:

> db.orders.insert({
   "orderId": NumberInt(4),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": ["b", "g"],
   "preferredColor": "red"
  })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Update Strictness and Partial Schema Coverage

The following updates an existing document:

> db.orders.update({
   "orderId": NumberInt(2)
  }, {
   "$set": {
    "orderDate": new Date("2017-10-01")
   }
  })
WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

In part 1 of this blog series the order with identifier 1 was updated to add a property “specialInstructions”. This is not schema compliant, however, the update is possible as it does not violate that part of the document that is covered by the schema.

> db.orders.update({
   "orderId": NumberInt(1)
   }, {
   "$set": {
    "specialInstructions": "Drop of in front, 
                           not back of location"
   }
  })
WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

Partial schema coverage applies to update as well, not just to inserts.

An example of a non-compliant update is the following:

> db.orders.update({
   "orderId": NumberInt(2)
  }, {
   "$set": {
    "orderDate": "2017-09-30"
   }
  })
WriteResult({
 "nMatched": 0,
 "nUpserted": 0,
 "nModified": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Summary

MongoDB supports partial schema coverage in strict mode, meaning, properties defined in the schema must match the schema, however, properties not specified in the schema can be added or modified without rejection.

This means (again) that examining the JSON schema validator of a MongoDB collection only indicates properties common to all documents, but not the complete set of properties of all documents.

The next blog examines the non-strict validation setting of a JSON schema validator in MongoDB.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 3): Intermezzo 1 – MongoDB’s $jsonschema

MongoDB introduced support for JSON Schema through $jsonschema. Let’s explore this new functionality a bit in this blog.

$jsonschema

The functionality is introduced here: https://docs.mongodb.com/master/reference/operator/query/jsonSchema/#op._S_jsonSchema It states “$jsonSchema can be used in a document validator, which enforces that inserted or updated documents are valid against the schema.”

A first item to note is that this approach is supporting BSON types (http://bsonspec.org/, https://docs.mongodb.com/master/reference/operator/query/type/), not just JSON structures (https://www.json.org/), using a specific property “bsonType” that is not part of the JSON Schema standard (http://json-schema.org/).

A second observation is that the schema specification is inline with the collection creation and cannot refer to a separate JSON schema file or JSON object representing a JSON schema.

JSON Schema Validator Example

Let’s use the example of the first blog in this series, create a schema for it and use that as a constraint for the “orders” collection. Then documents are added to the collection (and there seem to be errors as well). For reference the version used is: MongoDB server version: 3.6.0.

> mongo
> use schema_exploration
> db.createCollection("orders", {
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "string"
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  }
 })
{ "ok" : 1 }

A quick note: “bsonType” can be used in all levels in order to refer to BSON types, not just on the top level.

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": [{
   "itemId": 55,
   "numberOrdered": 30
  }, {
   "itemId": 56,
   "numberOrdered": 31
  }]
 })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Along the way I ran into a validation issue as I constraint the array elements to strings, rather than objects, as used in the example of the first blog in this series. So I made a schema definition mistake.

To note is that the response on the shell does not indicate what the problem was making debugging hard, especially when large and complex schemas are to be debugged.

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": ["a", "b"]
 })
WriteResult({
 "nInserted": 1
})

Once I realized the mistake I made, I inserted a document complying to the schema in order to make sure I identified the issue correctly.

JSON Schema Validator Update

Obviously, after defining a wrong schema, the correct schema should be used as validator.

This is the correct schema:

{
 "bsonType": "object",
 "required": ["orderId", "orderDate", "orderLineItems"],
 "properties": {
  "orderId": {
   "bsonType": "int",
   "description": "Order Identifier: must be of 
                  type int and is required"
  },
  "orderDate": {
   "bsonType": "date",
   "description": "Order Date: must be of 
                  type date and is required"
  },
  "orderLineItems": {
   "bsonType": "array",
   "items": {
    "bsonType": "object",
    "properties": {
     "itemId": {
      "bsonType": "int"
     },
     "numberOrdered": {
      "bsonType": "int"
     }
    }
   },
   "description": "Order Line Items: must be of 
                  type array and is required"
   }
  }
 }

And this is the command to update the validator:

> db.runCommand({
  "collMod": "orders",
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "object",
       "properties": {
        "itemId": {
         "bsonType": "int"
        },
        "numberOrdered": {
         "bsonType": "int"
        }
       }
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  },
  "validationLevel": "strict"
 })
{ "ok" : 1 }

Some background on the command used is here: https://docs.mongodb.com/master/reference/command/collMod/.

Following is an attempt to add one more of the (now mismatching) documents:

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": ["a", "b"]
 })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

As it should be, the insert fails.

And here the insert of a now correct document:

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": [{
   "itemId": NumberInt(55),
   "numberOrdered": NumberInt(20)
  }, {
   "itemId": NumberInt(56),
   "numberOrdered": NumberInt(21)
  }]
 });
WriteResult({
 "nInserted": 1
})

Collection Inconsistency: Mismatch of Schema and Documents

There is an interesting issue appearing at this point. The new schema does not match all existing documents in the collection. Or the other way around: the collection now contains documents that do not match that schema.

> db.orders.find()
{
 "_id": ObjectId("5a2022c3fb460d15db9ec73e"),
 "orderId": 1,
 "orderDate": ISODate("2017-09-30T00:00:00Z"),
 "orderLineItems": ["a", "b"]
} {
 "_id": ObjectId("5a202322fb460d15db9ec741"),
 "orderId": 1,
 "orderDate": ISODate("2017-09-30T00:00:00Z"),
 "orderLineItems": [{
  "itemId": 55,
  "numberOrdered": 20
 }, {
  "itemId": 56,
  "numberOrdered": 21
 }]
}

MongoDB did not flag that there are documents in the collection that will not match the new schema (even though the validation level strict was used).

Adding a validation action with value of “error” does not change the situation, either.

Implication to Semantics

Given that the schema of a collection can be changed at any time, and given that MongoDB does not fail the schema update based on mismatching documents already in the collection, examining the schema is insufficient to understand the structure of the documents in a collection.

So a collection with a schema does not ensure that all documents in that collection are schema compliant. It rather insures that from the point in time the schema was added or updated documents will have to comply. Previous documents in the collections are not affected.

Summary

The notion of “schema” in context of MongoDB is very different from the notion of “schema” in context of relational database management systems. In MongoDB the documents in the collection do not have to comply to the schema; they only do have to comply at time of insertion.

There will be more exploration coming up in the next blog on this topic in order to further understand the semantics of “schema” in context of MongoDB.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 2): Where does the “Interesting” Code go?

The previous blog found that the “generic” indirect representation of JSON data is one way of supporting “schema-free” JSON objects or documents. Where does the “interesting” functional code live?

Indirect Representation

To recap, the indirect representation is a set of classes, functions, etc. (depending on programming language) that can manage JSON objects or JSON documents. All or most languages have libraries supporting JSON manipulation. For example, Jackson is such a library for Java.

These JSON libraries can manage any valid JSON structure, and they do not require a schema or the JSON objects being homogeneous. Two JSON objects representing the same concept like an order with different attributes (as shown in the previous blog) can be managed by such JSON libraries.

Structural Manipulation

Structural manipulation of JSON objects supports the addition, update or deletion of properties (members) as well as JSON array elements. Property values can be replaced, for example, a JSON string with a JSON object.

Through structural manipulation it is possible to change a JSON object as needed, when e.g. new details appear in form of additional properties.

Structural manipulation was demonstrated in a database context in the last blog: properties were added through the update statement. The same is possible in the indirect representation libraries in the various programming languages.

Computation

Structural manipulation is not the only code that is required as structural manipulation does not allow to compute any specific application semantics. For example, in context of orders, the total value of not yet shipped orders might be a value that needs to be computed.

In a database context this would be an aggregation query that sums up the amount of all orders that do not have the status of shipped.

In context of a programming language it would require a function that iterates through all orders and, like in the database aggregation approach, adds up the sum of those orders that have not shipped yet.

It probably would be implemented as a set of cooperating functions, like

DollarAmount getValueOfOrdersNotShipped(JSONArray orders)
boolean hasOrderShipped(JSONObject order)
DollarAmount getValueOfOrder(JSONObject order)

JSONArray as well as JSONObject are an example of an indirect representation holding order data as a JSON structure.

Note: of course, in the absence of a schema (which is assumed here), there is no assurance that the JSONArray or the JSONObject contain only orders or that the orders are homogeneous in structure. There has to be “trust” that this is indeed the case.

If validation is desired, and if no schema is available, then the only alternative is validating values in one or more JSON object properties. For example, order identifiers might be of a specific structure that uniquely identifies an identifier being an order identifier. This would require trust that the algorithms creating identifiers are correct.

Separation of Manipulation and Computation

The JSON libraries supporting the indirect representation are separate from the functional code (like the summing up of order values). The software architecture and design has to structure this separation and ideally ensures that all functions concerned with orders are “close” from a code structure or software architecture perspective.

There might be functions that can be reused across different concepts (like orders, returns, shipments, etc.), and they can be refactored out, of course, as in “normal” functional code.

Given the above rationalization, how does the absence of a schema come into the picture?

Implication of Schema Free JSON Objects

Since there is no schema, JSON objects can have a different structure even though they represent the same concepts. In context of orders,  let’s look at two use cases:

  • An order does not have a shipping status
  • An order does have a value but in a variety of data types

In a world without schema these are possible use cases and the functional code needs to check for those.

Addressing the first use case can be accomplished by checking for existence. Code can check if a property is present and react accordingly. In the above example, the code designer can choose to have hasOrderShipped() return false or throw an error in case there is no shipping status.

The second use case can be addressed by checking for the type of the value of the order. If possible, value transformations can be implemented in getValueOfOrder(), e.g., string to number; if it is not possible to transform, an error can be thrown.

Summary

In a schema free JSON context there are several aspects from a code perspective: functional code implementing application semantics is separate from the code that manages the structure of JSON objects. That separation must be carefully managed from an architectural perspective.

The functional code must anticipate non-homogeneous JSON objects and check for variation in order to be able to implement the functionality accurately.

But wait, there is more:-) The next blog will venture into more nuances.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 1): Preliminaries

Missing schema support and schema enforcement is touted as a good thing: is it really?

“Schema Free”, “Flexible Schema”, “Schema Per Document”

What is meant by a database supporting the JSON data structure without providing support and enforcement for schemas? Such a database is often characterized as “schema free”, or supporting a “flexible schema” or “schema per document”. What does it mean?

No matter how such a database supporting JSON it is labelled, it does not provide an interface to define, to manage or to enforce schema(s) for the data, aka JSON documents, it is managing (“enforcing” is used in the semantics a relational database enforces a schema). This means that a client (e.g., application code) can store JSON documents that have any form as long as those comply to the JSON (syntax) standard (and possibly proprietary extensions by the database system).

It furthermore means that JSON documents representing instances of the same concepts (like e.g. orders or games or employees) do not have to have the same structure. Those JSON documents can be different from each other, not only in values, but also in structure.

An example follows of a possible scenario (using MongoDB).

Example

The example stores initially two documents that have the same structure, and subsequently their structure diverges through updates. No schema enforcement prevents the changes.

use blog;
db.blogColl.insert({
  "orderId": 1,
  "orderDate": "9/30/2017",
  orderLineItems: [{
      "itemId": 55,
      "numberOrdered": 20
    },
    {
      "itemId": 56,
      "numberOrdered": 21
  }]
});
db.blogColl.insert({
  "orderId": 2,
  "orderDate": "9/30/2017",
  orderLineItems: [{
      "itemId": 55,
      "numberOrdered": 30
    }, 
    {
      "itemId": 56,
      "numberOrdered": 31
  }]
});
db.blogColl.update({
    "orderId": 1
  }, {
    "$set": {
    "specialInstructions": 
      "Drop of in front, not back of location"
  }
});
db.blogColl.update({
    "orderId": 2,
    "orderLineItems.itemId": 55
  }, {
    $set: {
      "orderLineItems.$.color": "transparent"
  }
});

The ability to store different JSON documents with different structures, even if they represent (instances of) the same concept, can be seen as a powerful feature. It allows modifying the data as needed to represent changing requirements or specific representation needs. Data migration is easier, too, as data can be changed in place.

This flexibility also has downsides and (engineering) cost that need to be considered and dealt with in a concrete implementation.

Application Implementation

Application code accessing a database has a full or partial representation of the data it queries (or in general manages) in the type system of the deployed programming language.

There are basically two choices an application (short for application code) has to represent data:

  • Direct representation
  • Indirect representation

In a direct representation the concept as stored in the database is defined as data structure in the programming language. For example, using Java as the programming language example, an order is represented as a Java class “Order”. This class has all the methods required to access the various elements of an order (that might be implemented as Java classes themselves). In this approach an order stored in the database, when queried, will be managed as an instance of the Java class Order in the application code. Methods support access or modification to the instance of order, and the methods are order semantics specific, like getOrderDate() or updateOrderLineItem() or totalNumberInidividualItems().

In the indirect representation, an order would be represented not as instance of a Java class that reflects the concept, but an instance of a “meta” Java class. This “meta” class is able to store all data from the database, not just orders. Such a class would have methods like createInstance(), setIdentifier(), setType(), addAttribute(), etc. A type would be “Order”, an attribute would be “lineItem”, etc.

Consequences Of Choice

Given the two representations discussed earlier (direct and indirect), the choice seems to be clear. While the direct representation can capture the semantics of a concept directly (aka, a Java class “Order” can implement order specific methods), this approach would not be able to easily (or at all) deal with changes in the database representation of orders. For example, if an additional attribute is added to the JSON document representing an order (as shown above), the Java class would not be able to change dynamically and capture it.

The indirect representation, however, would not have any problems representing order JSON objects with different structure as e.g. attributes can be dynamically added. This means that instances of the “meta” Java class can represent any JSON document as stored in the database.

These “meta” Java classes are actually already available in form of JSON processing libraries. Such libraries support the creation of JSON structure representations and they can represent any JSON object (or JSON array) that implements a correct JSON syntax.

Summary

At a first glance it looks like databases that support JSON without enforcing a schema at the same time are a good choice for ease of data management. And, using the indirect representation approach applications can deal with dynamically changing JSON objects or JSON objects of different structure representing the same concept.

However, as always, there are more details to discuss and additional aspects are going to be examined in the next blog.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

SQL for JSON Rationalization Part 19: Implementation Notes

What was the decision process wrt. SQL syntax and execution semantics in context of JSON SQL?

Syntax

There are only two choices when adding JSON support to SQL: (a) design a new syntax (read: grammar) from scratch, or (b) extend an existing grammar.

Many approaches opted for the alternative (a). For JSON SQL that has been discussed in this blog series, I opted for the alternative (b).

The basic approach was to take an existing SQL grammar, in this case SQL 92 (https://ronsavage.github.io/SQL/sql-92.bnf.html) and implement it in ANTLR4. In the course of this implementation the JSON specific support required was added.

This comprised of only a few types of extensions or additions to the syntax. The main modifications are (all were discussed in individual blogs in the blog series):

  • Functions like exists_path() and is_of_type()
  • Constants like JSON true, JSON false and JSON null
  • Paths in order to refer to any property at any level into a JSON object
  • Constructors for JSON array and JSON object (like {} and []) asides from numbers and strings

In addition, the projection syntax was extended to support { and } in order to indicate that JSON objects are to be returned, and not tables.

Execution Semantics

Implementing the execution semantics for JSON SQL is based on the relational execution semantics. A direct mapping was implemented that maps SQL queries with JSON elements to SQL queries without JSON support.

This approach was chosen since re-implementing the relational semantics is not really necessary. It only needs to be extended for the JSON specific support and semantics. Existing relational databases put quite a bit of work in optimizing execution, and so a direct mapping allowed to take advantage of all the work that has been done.

Furthermore, mapping a SQL query to a SQL query allows easy semantic description and debugging as the semantics stays on a declarative level without going into specific algorithms.

In addition, this approach proves that JSON SQL does not require a different SQL execution semantics, only an extension.

Summary

In summary, JSON support in context of SQL can be a pure extension of the SQL syntax (language) as well as a pure extension of the relational semantics. It is unnecessary to define a special variant of SQL; extending a given (standard) syntax is possible and sufficient. The same applies to the JSON execution semantics – it can be a pure extension of the relational execution semantics.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.