SQL for JSON and Schema Support (Part 6): Dynamic Schema Change and Symbolic Computation

How does software engineering look like in context of the absence of schema support? What now?

Summary – Where Are We?

The structure of existing JSON objects can change at any time in a NoSQL database (in general). New, possibly by the code so far unseen structures of JSON objects can appear at any time in existing as well as in new documents.

Schema support does not provide any guarantees (see the MongoDB discussion in previous blogs) – neither for current objects’ structure nor for future objects’ structure, unless a NoSQL database enforces the schema as rigorously as a relational database management system (RDBMS).

Most NoSQL database management systems provide zero schema supervision in the strict sense (no guarantee that stored JSON documents are guaranteed to be schema compliant).

A developer cannot make any assumptions whatsoever about the schema of the stored JSON documents (maybe with the exception of the mandatory existence of an identifier property) unless the structure is rigorously enforced by every piece of code inserting or updating documents (in general a hard problem to solve).

Code makes assumptions when accessing JSON documents as it accesses properties at certain locations inside the JSON document and based on a specific data type. Those assumptions about location and type might or might not be true for every single JSON document that the code accesses.

So where are we? Basically, the code is on its own to deal with the structure and data type uncertainty. The following discussion is based on the situation that the underlying NoSQL database does not provide any strict schema guarantees.

Engineering Approaches

If there is no guarantee provided by a database system, the compliance check for each document has to be done by the code via assertions on the required structure and data types of values. Before a document is processed, it has to be asserted that the expected structure and data types exist as otherwise the processing would fail.

The following sections discuss one approach of run-time assertions and the phases of processing:

  • Run-time assertions
  • Assertion execution frequency
  • JSON document non-compliance
  • Non-compliance analysis and mitigation

In addition, alternative approaches to run-time assertions are discussed in order to improve efficiency.

As a side note, if code inserting or updating documents would ensure a specific schema rigorously, then code accessing documents will not have to implement assertions. However, this requires a very strict engineering discipline around current and future code that inserts or updates (including 3rd party tools that support direct database access). In many cases this is impossible, hence the emphasis here on what the accessing code can or has to do. This approach of guarantee by inserting or updating code will not be discussed further here.

Run-Time Assertions

Code accessing a document requires specific properties to be present and to be of a specific data type in order for it to execute its logic.

For example, if code needs to access a property like the value of an order, then the code needs to assert the existence of that property and the acceptable data types. Those might be number or string (as a string value could be converted to a number in the code):

assert(exists_path(order.value))
assert(is_of_type(order.value, number) 
       || is_of_type(order.value, string))

Code must assert expectations before access in order to ensure that the document matches those expectations and does not fail because of document non-conformance. If there is no “value” property in an order, the code would not be able to access is.

The blog https://realprogrammer.wordpress.com/2016/11/29/sql-for-json-rationalization-part-11-json-specific-predicates/ introduced the two essential predicates used above (exists_path() and is_of_type()) in context of NoSQL Query.

Assertion Execution Frequency

Checking assertions affects code efficiency due to additional execution, so how often do assertions have to be checked? In principle, before every document access.

There is a variation between NoSQL databases that provide single document transactions and those providing multi document transactions. In the single document transaction case, documents can change at any time in a NoSQL database and so between two separate accesses, the document could have been changed by other code. Assertions have to be checked every time before accessing a document.

some_function() {
// retrieve document
// assert(...)
// execute logic
}

some_function_2() {
// retrieve document
// assert(...)
// execute logic
}

In case multi document transactions are supported, all assertions can be checked at once when the transaction is started, and does not have to be checked again within the same transaction.

// start transaction

// assert()

some_function() {
// retrieve document
// execute logic
}

some_function_2() {
// retrieve document
// execute logic
}

// end transaction

In general, code can operate on a single document or on a set of documents. A single document scope means that the code reads and possibly changes a single document only. A multi document scope means that the code reads one or more documents and possibly modifies one or more documents. For example, summing up the value of all not yet shipped orders and storing the sum in an analysis document. In either case, assertions have to be checked before every document access.

Executing assertions before every access is quite resource intensive at high processing rates. Later on variations of assertions are discussed that limit the effort and the resource consumption in order to make the use of assertions more practical or more efficient.

Note: MongoDB realized the enormous need for multi document transactions and is planning to implement those: https://www.mongodb.com/transactions.

JSON Document Non-Compliance (Violated Assertions)

If one or more assertions are violated, then the document is non-compliant from the viewpoint of the code accessing it. Assertion violation handling and analysis has to take place and there are different options:

  • The document could be ignored (which might or might not be a good option in context of the code’s functionality) if it is irrelevant or negligible for the code’s computation. The code could record the ignored documents for future off-line analysis.
  • An exception can be thrown and the code aborts its execution. This is a hard failure and ensures that no incorrect result is produced by the code.
  • Error handling can take place like for example the code “parks” the document for future manual analysis and processing. In this case the code produces a result with the disclaimer that there are unprocessed documents that since not processed might affect the code’s accuracy of the result. Compared to the first case the code returns an error or warning highlighting the potential inaccuracy (and in the best case the degree of inaccuracy).

In case of single document processing the handling is easier compared to multi document processing. In the latter case, the code might have updated several documents before reaching a non-compliant one. If the handling is to abort execution, then the already updated documents might have to be reverted back (compensation logic). If there are multi document transactions available the database takes care of reverting the documents as part of rolling back the transaction.

If multi document transactions are not supported alternative approaches have to be found. For example, the code could remember the previous document state itself for each document updated and therefore would be able to revert the state (code based compensation). However, those documents could have been changed in the meanwhile, adding further complexity to the compensation logic.

Non-Compliance Analysis and Mitigation

When code encounters non-compliant documents then mitigation has to be determined so that this case will not happen in the future anymore. There are several possible approaches (not necessarily complete):

  • Based on the non-compliant document analysis, the code is changed or extended to be able to process those cases as well. This would change the assertions so that in the future the documents would be considered compliant.
  • The code that inserts or updates the documents in the first place has to change in order to ensure that the assertions of subsequent accesses will be met. Code accessing the documents will have assurance that this non-compliance will not happen anymore.
  • It is acceptable to have manual post-processing. Instead of changing the code and its assertions, it is accepted that manual work is performed in order to produce an accurate result. This might require updating the result manually (like the total amount of all open orders).
  • It is acceptable to ignore non-compliant documents. This is a valid analysis result as well if the logic of the code allows the result to be based only on a subset of the documents.

Different approaches to mitigation have a different impact on the processing time as well as the availability of the improved code base. Of course, avoiding non-compliance in the first place would be best.

Thought Experiments on Alternative Assertion Implementation

Are there possible ways to reduce the amount of assertion processing? In the following approaches are discussed as thought experiment.

The first approach is shadow schema-per-document management for documents. A schema is not declared upfront and enforced by the database, but derived from a document right after its creation. Each document has its own schema and every time the document is updated, its schema is updated as required in order to remain in sync with the document itself.

Code accessing a document can check the document’s schema first before retrieving the document in order to determine if the assertions are met. The assertions are in this case expressed as schema interrogation instead of document structure access. Every time the schema is adjusted to match a document update, its version is increased.

This by itself does not have a huge saving in terms of access, possibly worse, as the schema has to be retrieved in addition to the document itself. However, in the case where many documents have the same schema, the code accessing documents could know which schemas are satisfying the assertions. So just knowing the schema identifier and its version might be sufficient to establish the compliance check. In addition, if a document contains a reference to its schema and version, then verifying assertions is simply comparing a schema name and version. This is highly efficient as the code can keep a list of compliant schemas and their versions (assuming many documents comply to the same schema).

A second and very different approach is to add assertions to the document itself and an indicator for each assertion if it is satisfied or not within the given document. Code accessing a document the first time checks its assertions, and also adds the assertions to the document. Code updating a document re-evaluates the assertions after the update and sets the indicator. Subsequent accesses do not have to evaluate the assertions anymore every time, but lookup the indicator only. This reduces processing at the time of access. A possible optimization is that code updating the document only has to re-evaluate the assertions if the update changes the structure and/or the data types.

There might be further approaches to reduce and to optimize assertion execution. However, nothing beats a supervised and enforced schema when structure and/or data type compliance are paramount.

Summary

When a NoSQL database does not provide rigorous schema enforcement, the code is on its own to establish document structure and data type compliance when accessing documents. The code has to be very defensive in order to avoid execution failure. Due to the execution effort involved, schema support and strict supervision might be really important for the code accessing documents.

In addition, especially when having to abort multi document processing, multi document transactions are essential when several documents are updated during the processing in order to avoid complex compensation logic. Software engineers have to be super careful with compensation logic that cannot rely on multi document transactions provided by the database.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Advertisement

SQL for JSON and Schema Support (Part 5): Intermezzo 3 – MongoDB’s $jsonschema

The previous blog discussed MongoDB’s $jsonschema behavior with a strict validation level. Let’s look at the moderate validation level in this blog.

Example

As usual, first, let’s create a collection and add a few JSON documents to it. Afterwards a schema validation is added with the moderate setting (the following is based on MongoDB version 3.6.1).

> mongo
> use moderate_exploration

Initially, before adding a schema, two JSON objects are inserted that are not compliant with the schema that is going to be added afterwards. The reason is that we need non-compliant JSON objects to discuss the moderate level later.

> db.orders.insert({
   "orderId": 1,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 20
    }, {
    "itemId": 56,
    "numberOrdered": 21
   }],
   "specialInstructions": "Drop of in front, 
                           not back of location"
  })
WriteResult({ "nInserted" : 1 })
> db.orders.insert({
   "orderId": 2,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 40
    }, {
    "itemId": 56,
    "numberOrdered": 41
   }],
   "preferredColor": "red"
  })
WriteResult({ "nInserted" : 1 })

Now the schema is added:

> db.runCommand({ 
   "collMod": "orders",
   "validator": {  
    "$jsonSchema": {   
      "bsonType": "object",
       "required": ["orderId", "orderDate", "orderLineItems"],
       "properties": {
        "orderId": { 
         "bsonType": "int",
         "description": "Order Identifier: must be of 
                         type int and is required"
        },
        "orderDate": { 
         "bsonType": "date",
         "description": "Order Date: must be of 
                         type date and is required"
        },
        "orderLineItems": { 
         "bsonType": "array",
         "items": {  
          "bsonType": "object",
          "properties": {   
           "itemId": {    
           "bsonType": "int"   
           },
           "numberOrdered": {    
           "bsonType": "int"   
           }  
          } 
         },
         "description": "Order Line Items: must be of 
                         type array and is required"
      }   
     }  
    } 
   },
   "validationLevel": "moderate",
   "validationAction": "error"
  })
{ "ok" : 1 }

After the schema is added, two more JSON objects are inserted, this time being schema compliant.

> db.orders.insert({
   "orderId": NumberInt(3),
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": NumberInt(55),
    "numberOrdered": NumberInt(60)
    }, {
    "itemId": NumberInt(56),
    "numberOrdered": NumberInt(61)
   }]
  })
WriteResult({ "nInserted" : 1 })
> db.orders.insert({
   "orderId": NumberInt(4),
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": NumberInt(55),
    "numberOrdered": NumberInt(80)
    }, {
    "itemId": NumberInt(56),
    "numberOrdered": NumberInt(81)
   }]
  })
WriteResult({ "nInserted" : 1 })

At this point the created collection is governed by a schema, and contains four JSON documents, two are compliant with the schema (orderId 3 and 4), and two are not compliant (orderId 1 and 2).

Analysis

The MongoDB documentation states for “moderate”: “Apply validation rules to inserts and to updates on existing valid documents. Do not apply rules to updates on existing invalid documents.” (https://docs.mongodb.com/manual/reference/command/collMod/#validationLevel).

Let’s explore now the behavior of the moderate validation level.

First, let’s try to insert a non-compliant JSON document. The insert will fail as expected:

> db.orders.insert({
   "orderId": 5,
   "orderDate": ISODate("2017-09-30T00:00:00Z"),
   "orderLineItems": [{
    "itemId": 55,
    "numberOrdered": 40
    }, {
    "itemId": 56,
    "numberOrdered": 41
   }],
   "preferredColor": "red"
  })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Second, let’s try to update a compliant JSON document that already exists in the collection in a non-compliant way:

> db.orders.update({  
   "orderId": NumberInt(3) 
   }, {  
   "$set": {   
    "orderDate": "2018-01-09"  
   } 
  })

As expected the update fails:

WriteResult({
 "nMatched" : 0,
 "nUpserted" : 0,
 "nModified" : 0,
 "writeError" : {
  "code" : 121,
  "errmsg" : "Document failed validation"
 }
})

Third, let’s try to update a non-compliant JSON document

> db.orders.update({  
   "orderId": NumberInt(1) 
   }, {  
   "$set": {   
    "orderDate": "2018-01-10"  
   } 
  })

As per the above explanation of moderate this should work and indeed it does:

WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

Bypassing Validation

With the correct permission (https://docs.mongodb.com/manual/reference/privilege-actions/#bypassDocumentValidation) it is possible to bypass document validation.

This allows for the situation that e.g. a collection is governed by a new schema, however, existing application code might have to continue to insert or to update documents with a structure that violates the new schema as the logic cannot be adjusted to the new schema quickly enough (including transforming the non-compliant to compliant JSON documents).

Summary

The brief analysis of MongoDB wrt. document validation in context of JSON schemas added to collections in the last three blogs showed that while schema supervision is possible, it is not as strict as in relational database management systems.

Basically, if a schema is present, a user cannot infer that all documents in that collection comply to that schema. A schema related to a collection can be changed, and existing documents that would violate the new schema on insert will not be discarded from the collection. Furthermore, properties that are not covered by the schema can be added and changed freely.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 4): Intermezzo 2 – MongoDB’s $jsonschema

After some initial exploration in the previous blog, more aspects on MongoDB’s $jsonschema are looked at in the following.

Example

First, let’s create a collection as follows. It is governed by a schema, and validation is in the strictest setting (the following is based on MongoDB version 3.6.0).

> mongo
> use more_exploration
> db.createCollection("orders", {
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "object",
       "properties": {
        "itemId": {
         "bsonType": "int"
        },
        "numberOrdered": {
         "bsonType": "int"
        }
       }
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  },
  "validationLevel": "strict",
  "validationAction": "error"
 })
{ "ok" : 1 }

The two documents from the example outlined in the initial blog of series are added next.

> db.orders.insert({
   "orderId": NumberInt(1),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(20)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(21)
    }
   ]
  })
WriteResult({ "nInserted" : 1 })
> db.orders.insert({
   "orderId": NumberInt(2),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(30)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(31)
    }
   ]
  })
WriteResult({ "nInserted" : 1 })

Insert Strictness and Partial Schema Coverage

The validator is in place on the collection “orders”. This can be verified with the command

> db.getCollectionInfos({name: "orders"})

Now let’s try and add a document that has additional properties in addition to those that comply with the schema as follows:

> db.orders.insert({
   "orderId": NumberInt(3),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": [{
     "itemId": NumberInt(55),
     "numberOrdered": NumberInt(40)
    },
    {
     "itemId": NumberInt(56),
     "numberOrdered": NumberInt(41)
    }
   ],
   "preferredColor": "red"
  })
WriteResult({ "nInserted" : 1 })

It appears that as long as the schema is satisfied, additional properties can be inserted. So the schema is not completely covering the object to be inserted, but only those properties that are defined in the schema (validator). It is a partial schema coverage.

Here is the counter example: the value of the property “orderLineItems” is not in compliance, and so the insertion fails:

> db.orders.insert({
   "orderId": NumberInt(4),
   "orderDate": new Date("2017-09-30"),
   "orderLineItems": ["b", "g"],
   "preferredColor": "red"
  })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Update Strictness and Partial Schema Coverage

The following updates an existing document:

> db.orders.update({
   "orderId": NumberInt(2)
  }, {
   "$set": {
    "orderDate": new Date("2017-10-01")
   }
  })
WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

In part 1 of this blog series the order with identifier 1 was updated to add a property “specialInstructions”. This is not schema compliant, however, the update is possible as it does not violate that part of the document that is covered by the schema.

> db.orders.update({
   "orderId": NumberInt(1)
   }, {
   "$set": {
    "specialInstructions": "Drop of in front, 
                           not back of location"
   }
  })
WriteResult({
 "nMatched": 1,
 "nUpserted": 0,
 "nModified": 1
})

Partial schema coverage applies to update as well, not just to inserts.

An example of a non-compliant update is the following:

> db.orders.update({
   "orderId": NumberInt(2)
  }, {
   "$set": {
    "orderDate": "2017-09-30"
   }
  })
WriteResult({
 "nMatched": 0,
 "nUpserted": 0,
 "nModified": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Summary

MongoDB supports partial schema coverage in strict mode, meaning, properties defined in the schema must match the schema, however, properties not specified in the schema can be added or modified without rejection.

This means (again) that examining the JSON schema validator of a MongoDB collection only indicates properties common to all documents, but not the complete set of properties of all documents.

The next blog examines the non-strict validation setting of a JSON schema validator in MongoDB.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 3): Intermezzo 1 – MongoDB’s $jsonschema

MongoDB introduced support for JSON Schema through $jsonschema. Let’s explore this new functionality a bit in this blog.

$jsonschema

The functionality is introduced here: https://docs.mongodb.com/master/reference/operator/query/jsonSchema/#op._S_jsonSchema It states “$jsonSchema can be used in a document validator, which enforces that inserted or updated documents are valid against the schema.”

A first item to note is that this approach is supporting BSON types (http://bsonspec.org/, https://docs.mongodb.com/master/reference/operator/query/type/), not just JSON structures (https://www.json.org/), using a specific property “bsonType” that is not part of the JSON Schema standard (http://json-schema.org/).

A second observation is that the schema specification is inline with the collection creation and cannot refer to a separate JSON schema file or JSON object representing a JSON schema.

JSON Schema Validator Example

Let’s use the example of the first blog in this series, create a schema for it and use that as a constraint for the “orders” collection. Then documents are added to the collection (and there seem to be errors as well). For reference the version used is: MongoDB server version: 3.6.0.

> mongo
> use schema_exploration
> db.createCollection("orders", {
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "string"
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  }
 })
{ "ok" : 1 }

A quick note: “bsonType” can be used in all levels in order to refer to BSON types, not just on the top level.

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": [{
   "itemId": 55,
   "numberOrdered": 30
  }, {
   "itemId": 56,
   "numberOrdered": 31
  }]
 })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Along the way I ran into a validation issue as I constraint the array elements to strings, rather than objects, as used in the example of the first blog in this series. So I made a schema definition mistake.

To note is that the response on the shell does not indicate what the problem was making debugging hard, especially when large and complex schemas are to be debugged.

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": ["a", "b"]
 })
WriteResult({
 "nInserted": 1
})

Once I realized the mistake I made, I inserted a document complying to the schema in order to make sure I identified the issue correctly.

JSON Schema Validator Update

Obviously, after defining a wrong schema, the correct schema should be used as validator.

This is the correct schema:

{
 "bsonType": "object",
 "required": ["orderId", "orderDate", "orderLineItems"],
 "properties": {
  "orderId": {
   "bsonType": "int",
   "description": "Order Identifier: must be of 
                  type int and is required"
  },
  "orderDate": {
   "bsonType": "date",
   "description": "Order Date: must be of 
                  type date and is required"
  },
  "orderLineItems": {
   "bsonType": "array",
   "items": {
    "bsonType": "object",
    "properties": {
     "itemId": {
      "bsonType": "int"
     },
     "numberOrdered": {
      "bsonType": "int"
     }
    }
   },
   "description": "Order Line Items: must be of 
                  type array and is required"
   }
  }
 }

And this is the command to update the validator:

> db.runCommand({
  "collMod": "orders",
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "object",
       "properties": {
        "itemId": {
         "bsonType": "int"
        },
        "numberOrdered": {
         "bsonType": "int"
        }
       }
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  },
  "validationLevel": "strict"
 })
{ "ok" : 1 }

Some background on the command used is here: https://docs.mongodb.com/master/reference/command/collMod/.

Following is an attempt to add one more of the (now mismatching) documents:

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": ["a", "b"]
 })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

As it should be, the insert fails.

And here the insert of a now correct document:

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": [{
   "itemId": NumberInt(55),
   "numberOrdered": NumberInt(20)
  }, {
   "itemId": NumberInt(56),
   "numberOrdered": NumberInt(21)
  }]
 });
WriteResult({
 "nInserted": 1
})

Collection Inconsistency: Mismatch of Schema and Documents

There is an interesting issue appearing at this point. The new schema does not match all existing documents in the collection. Or the other way around: the collection now contains documents that do not match that schema.

> db.orders.find()
{
 "_id": ObjectId("5a2022c3fb460d15db9ec73e"),
 "orderId": 1,
 "orderDate": ISODate("2017-09-30T00:00:00Z"),
 "orderLineItems": ["a", "b"]
} {
 "_id": ObjectId("5a202322fb460d15db9ec741"),
 "orderId": 1,
 "orderDate": ISODate("2017-09-30T00:00:00Z"),
 "orderLineItems": [{
  "itemId": 55,
  "numberOrdered": 20
 }, {
  "itemId": 56,
  "numberOrdered": 21
 }]
}

MongoDB did not flag that there are documents in the collection that will not match the new schema (even though the validation level strict was used).

Adding a validation action with value of “error” does not change the situation, either.

Implication to Semantics

Given that the schema of a collection can be changed at any time, and given that MongoDB does not fail the schema update based on mismatching documents already in the collection, examining the schema is insufficient to understand the structure of the documents in a collection.

So a collection with a schema does not ensure that all documents in that collection are schema compliant. It rather insures that from the point in time the schema was added or updated documents will have to comply. Previous documents in the collections are not affected.

Summary

The notion of “schema” in context of MongoDB is very different from the notion of “schema” in context of relational database management systems. In MongoDB the documents in the collection do not have to comply to the schema; they only do have to comply at time of insertion.

There will be more exploration coming up in the next blog on this topic in order to further understand the semantics of “schema” in context of MongoDB.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.