How does software engineering look like in context of the absence of schema support? What now?
Summary – Where Are We?
The structure of existing JSON objects can change at any time in a NoSQL database (in general). New, possibly by the code so far unseen structures of JSON objects can appear at any time in existing as well as in new documents.
Schema support does not provide any guarantees (see the MongoDB discussion in previous blogs) – neither for current objects’ structure nor for future objects’ structure, unless a NoSQL database enforces the schema as rigorously as a relational database management system (RDBMS).
Most NoSQL database management systems provide zero schema supervision in the strict sense (no guarantee that stored JSON documents are guaranteed to be schema compliant).
A developer cannot make any assumptions whatsoever about the schema of the stored JSON documents (maybe with the exception of the mandatory existence of an identifier property) unless the structure is rigorously enforced by every piece of code inserting or updating documents (in general a hard problem to solve).
Code makes assumptions when accessing JSON documents as it accesses properties at certain locations inside the JSON document and based on a specific data type. Those assumptions about location and type might or might not be true for every single JSON document that the code accesses.
So where are we? Basically, the code is on its own to deal with the structure and data type uncertainty. The following discussion is based on the situation that the underlying NoSQL database does not provide any strict schema guarantees.
Engineering Approaches
If there is no guarantee provided by a database system, the compliance check for each document has to be done by the code via assertions on the required structure and data types of values. Before a document is processed, it has to be asserted that the expected structure and data types exist as otherwise the processing would fail.
The following sections discuss one approach of run-time assertions and the phases of processing:
- Run-time assertions
- Assertion execution frequency
- JSON document non-compliance
- Non-compliance analysis and mitigation
In addition, alternative approaches to run-time assertions are discussed in order to improve efficiency.
As a side note, if code inserting or updating documents would ensure a specific schema rigorously, then code accessing documents will not have to implement assertions. However, this requires a very strict engineering discipline around current and future code that inserts or updates (including 3rd party tools that support direct database access). In many cases this is impossible, hence the emphasis here on what the accessing code can or has to do. This approach of guarantee by inserting or updating code will not be discussed further here.
Run-Time Assertions
Code accessing a document requires specific properties to be present and to be of a specific data type in order for it to execute its logic.
For example, if code needs to access a property like the value of an order, then the code needs to assert the existence of that property and the acceptable data types. Those might be number or string (as a string value could be converted to a number in the code):
assert(exists_path(order.value)) assert(is_of_type(order.value, number) || is_of_type(order.value, string))
Code must assert expectations before access in order to ensure that the document matches those expectations and does not fail because of document non-conformance. If there is no “value” property in an order, the code would not be able to access is.
The blog https://realprogrammer.wordpress.com/2016/11/29/sql-for-json-rationalization-part-11-json-specific-predicates/ introduced the two essential predicates used above (exists_path() and is_of_type()) in context of NoSQL Query.
Assertion Execution Frequency
Checking assertions affects code efficiency due to additional execution, so how often do assertions have to be checked? In principle, before every document access.
There is a variation between NoSQL databases that provide single document transactions and those providing multi document transactions. In the single document transaction case, documents can change at any time in a NoSQL database and so between two separate accesses, the document could have been changed by other code. Assertions have to be checked every time before accessing a document.
some_function() { // retrieve document // assert(...) // execute logic } some_function_2() { // retrieve document // assert(...) // execute logic }
In case multi document transactions are supported, all assertions can be checked at once when the transaction is started, and does not have to be checked again within the same transaction.
// start transaction // assert() some_function() { // retrieve document // execute logic } some_function_2() { // retrieve document // execute logic } // end transaction
In general, code can operate on a single document or on a set of documents. A single document scope means that the code reads and possibly changes a single document only. A multi document scope means that the code reads one or more documents and possibly modifies one or more documents. For example, summing up the value of all not yet shipped orders and storing the sum in an analysis document. In either case, assertions have to be checked before every document access.
Executing assertions before every access is quite resource intensive at high processing rates. Later on variations of assertions are discussed that limit the effort and the resource consumption in order to make the use of assertions more practical or more efficient.
Note: MongoDB realized the enormous need for multi document transactions and is planning to implement those: https://www.mongodb.com/transactions.
JSON Document Non-Compliance (Violated Assertions)
If one or more assertions are violated, then the document is non-compliant from the viewpoint of the code accessing it. Assertion violation handling and analysis has to take place and there are different options:
- The document could be ignored (which might or might not be a good option in context of the code’s functionality) if it is irrelevant or negligible for the code’s computation. The code could record the ignored documents for future off-line analysis.
- An exception can be thrown and the code aborts its execution. This is a hard failure and ensures that no incorrect result is produced by the code.
- Error handling can take place like for example the code “parks” the document for future manual analysis and processing. In this case the code produces a result with the disclaimer that there are unprocessed documents that since not processed might affect the code’s accuracy of the result. Compared to the first case the code returns an error or warning highlighting the potential inaccuracy (and in the best case the degree of inaccuracy).
In case of single document processing the handling is easier compared to multi document processing. In the latter case, the code might have updated several documents before reaching a non-compliant one. If the handling is to abort execution, then the already updated documents might have to be reverted back (compensation logic). If there are multi document transactions available the database takes care of reverting the documents as part of rolling back the transaction.
If multi document transactions are not supported alternative approaches have to be found. For example, the code could remember the previous document state itself for each document updated and therefore would be able to revert the state (code based compensation). However, those documents could have been changed in the meanwhile, adding further complexity to the compensation logic.
Non-Compliance Analysis and Mitigation
When code encounters non-compliant documents then mitigation has to be determined so that this case will not happen in the future anymore. There are several possible approaches (not necessarily complete):
- Based on the non-compliant document analysis, the code is changed or extended to be able to process those cases as well. This would change the assertions so that in the future the documents would be considered compliant.
- The code that inserts or updates the documents in the first place has to change in order to ensure that the assertions of subsequent accesses will be met. Code accessing the documents will have assurance that this non-compliance will not happen anymore.
- It is acceptable to have manual post-processing. Instead of changing the code and its assertions, it is accepted that manual work is performed in order to produce an accurate result. This might require updating the result manually (like the total amount of all open orders).
- It is acceptable to ignore non-compliant documents. This is a valid analysis result as well if the logic of the code allows the result to be based only on a subset of the documents.
Different approaches to mitigation have a different impact on the processing time as well as the availability of the improved code base. Of course, avoiding non-compliance in the first place would be best.
Thought Experiments on Alternative Assertion Implementation
Are there possible ways to reduce the amount of assertion processing? In the following approaches are discussed as thought experiment.
The first approach is shadow schema-per-document management for documents. A schema is not declared upfront and enforced by the database, but derived from a document right after its creation. Each document has its own schema and every time the document is updated, its schema is updated as required in order to remain in sync with the document itself.
Code accessing a document can check the document’s schema first before retrieving the document in order to determine if the assertions are met. The assertions are in this case expressed as schema interrogation instead of document structure access. Every time the schema is adjusted to match a document update, its version is increased.
This by itself does not have a huge saving in terms of access, possibly worse, as the schema has to be retrieved in addition to the document itself. However, in the case where many documents have the same schema, the code accessing documents could know which schemas are satisfying the assertions. So just knowing the schema identifier and its version might be sufficient to establish the compliance check. In addition, if a document contains a reference to its schema and version, then verifying assertions is simply comparing a schema name and version. This is highly efficient as the code can keep a list of compliant schemas and their versions (assuming many documents comply to the same schema).
A second and very different approach is to add assertions to the document itself and an indicator for each assertion if it is satisfied or not within the given document. Code accessing a document the first time checks its assertions, and also adds the assertions to the document. Code updating a document re-evaluates the assertions after the update and sets the indicator. Subsequent accesses do not have to evaluate the assertions anymore every time, but lookup the indicator only. This reduces processing at the time of access. A possible optimization is that code updating the document only has to re-evaluate the assertions if the update changes the structure and/or the data types.
There might be further approaches to reduce and to optimize assertion execution. However, nothing beats a supervised and enforced schema when structure and/or data type compliance are paramount.
Summary
When a NoSQL database does not provide rigorous schema enforcement, the code is on its own to establish document structure and data type compliance when accessing documents. The code has to be very defensive in order to avoid execution failure. Due to the execution effort involved, schema support and strict supervision might be really important for the code accessing documents.
In addition, especially when having to abort multi document processing, multi document transactions are essential when several documents are updated during the processing in order to avoid complex compensation logic. Software engineers have to be super careful with compensation logic that cannot rely on multi document transactions provided by the database.
Go [ JSON | Relational ] SQL!
Disclaimer
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.