SQL for JSON Rationalization Part 4: Projection – General Functionality

After the demo in the last blog (Part 3) it is time to discuss some of the assumptions and the projection functionality in more detail – here and in the next blog.

Assumptions: Array Start Index, JSON Literals and JSON Value Equality

The JSON standard does not define the starting index of the first array element. The assumption made here is that the first index is 0 (zero).

The JSON standard requires the literals "null", "true" and "false" to be lower case. However, for convenience, the assumption made here is that any combination of lower and upper case letters is accepted as well, e.g., "True".

Another aspect the JSON standard does not define is equality on JSON values. There are many ways to define when two JSON values are equal. Here, two JSON values are considered equal if their string representations are identical, where those representations contain no white space and the property names in JSON objects are sorted alphabetically.
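
To make the equality definition concrete, here is a minimal sketch in Python (using the standard json module; the function names are just illustrative):

import json

def canonical(value):
    # Serialize with no white space and alphabetically sorted property names.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))

def json_equal(v1, v2):
    # Two JSON values are equal if their canonical string representations match.
    return canonical(v1) == canonical(v2)

print(json_equal({"b": 1, "a": [True, None]}, {"a": [True, None], "b": 1}))   # True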

Definitions: Full and Partial Path

A full path is a sequence of property names as well as array indexes from the root of a JSON document all the way to one of its leaves. The elements of a path are separated by “.”. For example, “c.[3].e” is a full path from the root “c” to the leaf “e” in one of the demo documents of the previous blog. A path must start with a property name and cannot start with an array index. A path cannot be empty and the minimum path consists of a single property name.

Using "." as separator is an arbitrary choice, but one made by many systems. Enclosing array indexes in "[" and "]" is customary as well. Denoting an array index as a separate path element (i.e., set off by ".") is also a convenient choice.

Given a JSON object, a full path might or might not exist within it. Given a JSON object and a path, using the path as an access structure identifies a value only if the full path exists in the JSON document. If the path does not exist within the JSON document then no value is identified; in particular, not the JSON literal "null".
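
As an illustration of path existence, the following Python sketch (names and error handling purely illustrative) resolves a full path such as "c.[3].e" against a parsed JSON document and reports whether a value was identified at all:

def resolve_path(doc, path):
    # Follow a path like "c.[3].e"; return (found, value).
    current = doc
    for step in path.split("."):
        if step.startswith("[") and step.endswith("]"):
            index = int(step[1:-1])
            if not isinstance(current, list) or index >= len(current):
                return (False, None)   # path does not exist
            current = current[index]
        else:
            if not isinstance(current, dict) or step not in current:
                return (False, None)   # path does not exist
            current = current[step]
    return (True, current)

doc = {"a": 5, "c": [101, 102, {"d": 103}, {"e": 104}]}
print(resolve_path(doc, "c.[3].e"))   # (True, 104)
print(resolve_path(doc, "c.[3].x"))   # (False, None) -- no value identified, not JSON "null"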

A partial path is a sequence of property names and array indexes starting at the root but ending at an intermediate property or array index rather than at a leaf. This supports "reaching into" a JSON document and identifying a JSON value that lies between the root and a leaf.

As with full paths, given a JSON object, a partial path might or might not exist within it. A partial path only identifies a JSON value if the partial path exists within the JSON object; in that case it identifies a composite JSON value.

If a JSON document has only scalar properties, then the root properties are the leaf properties at the same time. Paths in this context are full paths and partial paths cannot exist.

Projection

Unlike in the relational model, in the context of the JSON model the result of a query can be returned either as a relational table or as a set of JSON documents. The choice is made by the query author.

The projection in a select statement contains one or more (full or partial) paths. If the paths are enclosed in "{" and "}" then JSON documents are returned, otherwise a table (the asterisk projection is discussed below).

For example, the query from the previous blog

select a, b.c, d.[3].e from tinycoll

returns a table with three columns.

Semantically, each path in the projection becomes a separate column. Each document from the collection "tinycoll" is taken and a corresponding row is added to the table. For each path of the projection that is found in the document, the value is added to the row. If a path does not exist, no value is added in the column corresponding to that path. Therefore, a row can have values in every column, in some columns, or in no column, depending on whether the paths exist in the document.

As in relational SQL, the order of the paths matters as the corresponding columns will be created in that order.

The column names are created from the paths by replacing "." with "_", as many relational systems do not support "." in column names.
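
The row construction can be sketched as follows (Python, reusing the illustrative resolve_path helper from the path discussion above); a path that is not found simply contributes nothing to its column:

def project_row(doc, paths):
    row = {}
    for path in paths:
        found, value = resolve_path(doc, path)
        if found:
            # Column name: path with "." replaced by "_".
            row[path.replace(".", "_")] = value
        # If the path is not found, no value is added for that column.
    return row

print(project_row({"a": 5, "b": {"c": 10, "d": 11}}, ["a", "b.c", "d.[3].e"]))
# {'a': 5, 'b_c': 10}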

The query

select {a, b.c, d.[3].e} from tinycoll

returns a set of JSON documents.

Semantically, each document from the collection "tinycoll" is taken and an empty result document is created for it. Each of the paths from the projection is followed within the document from the collection. If a value is found, the path with the corresponding value is added to the result document. A document from the collection may contain all, some, or none of the paths from the projection. Therefore, the result document might contain all, some, or none of the paths (an empty document in the latter case).

The order of the paths in the projection does not matter, as JSON documents are created and the order of properties / paths is not defined for JSON objects.

As a note, following the construction principle of the result JSON documents, the paths in the projection of the select statement and the paths in the result JSON documents are exactly the same (if they exist). From the client's viewpoint, no translation is necessary between the paths in the query and the paths in the result documents.
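
A sketch of this construction principle in Python (again reusing the illustrative resolve_path helper; insert_at_path is a hypothetical helper that rebuilds the original nesting and pads missing array slots with a placeholder so that indexes remain correct):

def insert_at_path(result, path, value):
    current = result
    steps = path.split(".")
    for i, step in enumerate(steps):
        last = (i == len(steps) - 1)
        if step.startswith("[") and step.endswith("]"):
            index = int(step[1:-1])
            while len(current) <= index:
                current.append("<>")           # placeholder for "value does not exist"
            if last:
                current[index] = value
            else:
                if current[index] == "<>":
                    current[index] = [] if steps[i + 1].startswith("[") else {}
                current = current[index]
        else:
            if last:
                current[step] = value
            else:
                if step not in current:
                    current[step] = [] if steps[i + 1].startswith("[") else {}
                current = current[step]

def project_document(doc, paths):
    result = {}
    for path in paths:
        found, value = resolve_path(doc, path)
        if found:
            insert_at_path(result, path, value)
    return result

doc = {"a": 5, "b": {"c": 10, "d": 11}, "c": [101, 102, {"d": 103}, {"e": 104}]}
print(project_document(doc, ["a", "b.c", "c.[3].e"]))
# {'a': 5, 'b': {'c': 10}, 'c': ['<>', '<>', '<>', {'e': 104}]}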

Asterisk Projection

The asterisk projection is supported. The query

select {*} from tinycoll

returns all documents stored in the collection "tinycoll" as they are, without any modification.

The query

select * from tinycoll

returns a table that has as many columns as there are full and partial paths into all documents of the collection "tinycoll".

Semantically, each document from the collection "tinycoll" is taken and a row is created for it. For each full as well as partial path in the document, the value is retrieved and put into the corresponding column of the row. There is a column for each possible path, and the set of columns can be predetermined or dynamically added to the result table as needed. As before, the column names are the paths with "." replaced by "_".
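
A sketch of how all full and partial paths of a document can be enumerated (Python, names illustrative); each enumerated path becomes a column, with "." replaced by "_":

def all_paths(value, prefix=""):
    # Enumerate every partial and full path of a JSON value,
    # paired with the value found at that path.
    paths = []
    if isinstance(value, dict):
        children = list(value.items())
    elif isinstance(value, list):
        children = [("[%d]" % i, v) for i, v in enumerate(value)]
    else:
        return paths                      # scalars contribute no further paths
    for step, child in children:
        path = step if prefix == "" else prefix + "." + step
        paths.append((path, child))
        paths.extend(all_paths(child, path))
    return paths

for path, value in all_paths({"a": 5, "b": {"c": 10, "d": 11}}):
    print(path.replace(".", "_"), "=", value)
# a = 5
# b = {'c': 10, 'd': 11}
# b_c = 10
# b_d = 11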

Summary

This was a first closer look into the details of projection in the context of JSON SQL, and the next blog will continue the projection discussion. The key takeaway is that JSON SQL can return JSON documents as well as tables, based on a well-defined execution semantics centered around JSON paths.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

SQL for JSON Rationalization Part 3: Demo

The previous blog gave an initial glimpse of a JSON SQL query language and how it works when applied to JSON documents. In the following, a demo shows a concrete implementation.

Command Line Interface

The command line interface provides a few basic commands as follows:

JQDR> help
JQDR - JSON Query Done Right
Commands:
    help
    exit
    executequery <JSON SQL query>
    load <table name> <file name>
    deleteload <table name> <file name>
    createtable <table name>
    droptable <table name>
    deletetable <table name>
    existstable <table name>
    listtables
    <JSON SQL query>
JQDR> 

Document Collection

The following very small collection “tinycoll” contains two documents:

{"a": 5,
 "b": {"c": 10, "d": 11},
 "c": [101, 102, {"d": 103}, {"e": 104}]
}
{"a": 5,
 "b2": [10, 11],
 "c": [101, 102, {"d": 103}, {"e": 104}]
}

Loading those into the database is accomplished by adding the documents to a file and then loading the documents from the file into the database after the table "tinycoll" has been created:

JQDR> load tinycoll src/test/resources/blog/tinycoll.txt
JQDR> 

Currently the language does not support an insert statement; however, this is planned.

select {*}

A first JSON query is to select all the documents and output those as documents (requested by the { and } in the projection clause):

JQDR> select {*} from tinycoll
{"a":5,"b":{"c":10,"d":11},"c":[101,102,{"d":103},{"e":104}]}
{"a":5,"b2":[10,11],"c":[101,102,{"d":103},{"e":104}]}
JQDR>

select {a, b.c, c.[3].e}

A more interesting query is a projection that reaches into the documents:

JQDR> select {a, b.c, c.[3].e} from tinycoll
{"a":5,"b":{"c":10},"c":["<>","<>","<>",{"e":104}]}
{"a":5,"c":["<>","<>","<>",{"e":104}]}
JQDR> 

A few observations are:

  • Full paths are inserted into the result documents. This allows accessing the result documents using the same paths that were used in the projection (i.e., a, b.c and c.[3].e).
  • JSON does not have a "value does not exist" representation. Therefore, the JSON query processor inserts "<>" for array values that do not exist but need to be present in order to preserve correct array indexes.

For example, only the 4th array element was projected, so the first three must be part of the result, but represented as "value does not exist" since they were not requested in the projection. In the Relational SQL world, SQL NULL would have been used to represent "value does not exist" (or is unknown).
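
A minimal sketch of this padding rule (Python, names purely illustrative): when only the element at index 3 is projected, the first three slots are filled with the placeholder so that the projected value keeps its original index:

def place_at_index(array, index, value, placeholder="<>"):
    # Pad with the placeholder until the target index exists, then set the value.
    while len(array) <= index:
        array.append(placeholder)
    array[index] = value
    return array

print(place_at_index([], 3, {"e": 104}))
# ['<>', '<>', '<>', {'e': 104}]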

select *

This query selects all documents, but the result is in relational table format, not JSON documents (as the { and } are omitted in the projection). Each path into any of the documents of the collection is represented as a separate column. The following shows the resulting 14 columns:

JQDR> select * from tinycoll
|a    |b_c  |b_d  |b               |b2_[0] |b2_[1] |b2      |c_[0] |c_[1] |c_[2]_d |c_[2]     |c_[3]_e |c_[3]     |c                             |
+-----+-----+-----+----------------+-------+-------+--------+------+------+--------+----------+--------+----------+------------------------------+
|5    |10   |11   |{"c":10,"d":11} |<>     |<>     |<>      |101   |102   |103     |{"d":103} |104     |{"e":104} |[101,102,{"d":103},{"e":104}] |
|5    |<>   |<>   |<>              |10     |11     |[10,11] |101   |102   |103     |{"d":103} |104     |{"e":104} |[101,102,{"d":103},{"e":104}] |
JQDR>

It is important to note that the column names are default names generated by the JSON SQL query processor and that they actually represent the paths (however, with a "_" used instead of a "." in the representation).

select a, b.c, c.[3].e

A projection returning a table looks as follows:

JQDR> select a, b.c, c.[3].e from tinycoll
|a    |b_c  |c_[3]_e |
+-----+-----+--------+
|5    |10   |104     |
|5    |<>   |104     |
JQDR> 

In this case only three columns are returned as the projection projects only three paths.

Summary

This demo has provided a first glimpse at a JSON SQL language that supports querying JSON documents and provides the option of returning results as JSON objects or in table form.

The next blog will focus more on projection and the various relevant details of the projection semantics.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

SQL for JSON Rationalization Part 2: Is JSON SQL Superset of Relational SQL?

This is not really a precise question, as it does not distinguish the data model from the query language – a distinction that the question probably intends to make; so let's rationalize both.

Table – Row

A table is defined in its basic form by columns and their types (domains), with columns having unique names within a table definition. The contents of a table are zero or more rows, whereby each row has to have, for each column, either a value (matching the column's domain) or SQL NULL (indicating "unknown"); no deviation from this structure or from the type system is possible.

Collection – Document

In the JSON database world, JSON structures are called documents. Documents are in general grouped together in distinct groups and the grouping mechanism is a named collection. A collection can contain zero or more documents. Each document belongs to exactly one collection.

A collection in general does not impose a constraint on the document structure of its documents, so the whole range from all documents having the same structure to each document having a different structure is possible. That documents in a collection can have different structures is called “schema-free”, or more appropriately “implicit schema-per-document” instead of “schema-per-collection”.

Note: the structure of a document is the set of all paths from the root of the document to the leaves of the document. This definition allows a structural comparison: two documents having the same set of paths are structurally equivalent (or, are of the same schema if there was a schema definition).

Relational SQL

The term "Relational SQL" refers to SQL executing against tables. For example, "select * from abc" selects all rows from the table "abc", with each row having values for each column of the table.

The query “select a, b from abc” would select all rows from the table “abc”, however, only the columns “a” and “b” within each row (projection).

JSON SQL

The term “JSON SQL” refers to SQL executing against collections. For example, in analogy to the Relational SQL query above, “select {*} from def” selects documents (indicated by the “{}”), and within each document all properties (indicated by “*”) from the collection “def”.

The query "select {a, b} from def" would select all objects from the collection "def", however, only the properties "a" and "b", in case these are present (more precisely, all paths starting with either "a" or "b" – this is one possible definition of projection in JSON SQL). If neither of them is present in a document, the result is the empty document. If only one property is present, the result document contains only that one property.

For example, assume the collection “def” consisting of two documents:

{"a": 1, "b": 2, "c": 3}
{"d": 4, "e": 5, "f": 6}

The query “select {a, b} from def” would return

{"a": 1, "b": 2}
{}

JSON SQL with Table Output

What would the query "select a, b from def" return? This looks like a Relational SQL query, however, it queries a collection of documents. The interpretation is to select the values of the properties "a" and "b" from each document, however, not to return a document, but a structure similar to a row with values only.

For example, "select a, b from def" applied to the above collection would return

| a | b |
+---+---+
| 1 | 2 |
|   |   |

The first object would contribute the values "1" and "2"; the second object would not contribute any values at all, so an empty row is returned (i.e., a row where each column is empty and not SQL NULL, as SQL NULL is not a valid value in the JSON structure definition).

Of course, the above example has scalar values for “a” and “b”; non-scalar values are supported as well, without restriction. It is therefore possible to have any JSON type in a column as value.

This small example shows that it is possible to not only return documents as a result of a JSON SQL query, but also tables (set of rows).

Collection with Schema Constraint

It is possible, at least conceptually, to enforce constraints on the documents of a collection with the interpretation that each document in the collection must satisfy the constraint. One possible constraint is that all documents must have the exact same structure (schema) and only the values can differ.

For example, a constraint in the context of collection "xyz" could be that all documents can only have top-level properties with the names "p" and "q" (paths of length 1), that the type of "p" must be a string, and that the type of "q" must be an integer. Basically, this constraint enforces a specific document schema on every document in the collection.

The collection “xyz” at a given point in time could have two documents:

{"p": "today", "q": 68}
{"p": "yesterday", "q": 71} 

Basically, this constraint enforces "flat" documents with scalar types, and this can be interpreted as the equivalent of rows of a table with two columns "p" and "q" of type string and integer.
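
Such a constraint check could be sketched as follows (Python, names illustrative); each document must have exactly the top-level properties "p" (string) and "q" (integer):

def satisfies_constraint(doc):
    return (isinstance(doc, dict)
            and set(doc.keys()) == {"p", "q"}
            and isinstance(doc["p"], str)
            and isinstance(doc["q"], int)
            and not isinstance(doc["q"], bool))   # bool is a subtype of int in Python

print(satisfies_constraint({"p": "today", "q": 68}))            # True
print(satisfies_constraint({"p": "today", "q": 68, "r": 1}))    # False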

Q.E.D.

With the appropriate constraint in place (i.e., forcing "flat" documents with scalar property types), a collection can be interpreted as a table and the documents as rows. Therefore it is possible to provide the equivalent of tables and the equivalent of Relational SQL in the context of collections of documents supported by JSON SQL.

However, this is possible because of a constraint. Lifting that constraint allows documents to have any structure, and to have a representation beyond the table model. Therefore, JSON SQL on document collections is a superset of Relational SQL on tables.

Summary

In this blog it was shown that a constrained document environment can be interpreted as tables and that JSON SQL can be used to accomplish the same functionality as Relational SQL. By lifting the constraint it follows that JSON SQL based on document collections is a superset of Relational SQL based on tables. The answer to the initial question is therefore "YES".

The intriguing question now is, what will JSON SQL have to provide beyond Relational SQL in order to fully support the JSON document model? Future blogs will make an attempt to address this question in a structured and rationalized approach.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON Rationalization Part 1: Setting the Scene

Is it possible to technically rationalize SQL for JSON, aka, extend the SQL syntax and semantics to support JSON structures without flame wars during discussions? Let’s see.

Goal: Extend SQL to Support JSON Data Structures

The goal of the forthcoming blog series is to extend SQL with support for JSON data structures (aka, “JSON documents” or simply “documents”).

There is no consensus in the community that this is the way to go at all; many dispute that extending SQL is the right direction and instead propose different approaches for querying JSON documents, like specialized query APIs, Object-Relational-Mapper technology, or representing queries as JSON documents themselves, to name a few alternatives.

This is fine, as everybody is free to discuss their point of view in their favorite forum – no issues from my side:-)

However, this blog series is about extending SQL with support for JSON data structures and there is no justification necessary as this is a purely technical discussion. The attempt is made to impact SQL as little as possible (minimalist approach) while providing full JSON support.

Approaches to Extending SQL

There are different starting points of how to extend SQL to support JSON data structures:

  • start with a non-SQL approach and extend/adjust to SQL along the way
  • start with SQL in the first place

The latter approach is taken here. Given a SQL syntax (not necessarily a specific syntax from a specific standard or a specific system implementation) the discussion will revolve around SQL syntax and SQL semantics extension.

However, the discussion will not revolve around efficiency of SQL execution at this point and what the added cost of JSON data structure support is, if any. This blog series is laser-focused on syntax and semantics (only).

Direction From Here

The direction from here is (a) forward and (b) along the various SQL operations like for example

  • Projection
  • Selection
  • Cartesian Product
  • Join

Organizing the discussion along the SQL syntax and semantics is the framework for the discussion about the extension to support JSON data structures.

In addition, the JSON definition itself will be discussed and some missing pieces will be rationalized that are required in context of SQL. Operations or semantics that are not defined by the JSON standard are introduced as necessary, like for example

  • the starting index of an array (zero or one?)
  • the definition of normalization and equality
  • the difference between JSON null and the absence of a value (or absence of a pair)

Finally, the JSON terminology is incomplete and missing parts will be fit in for the purpose of this blog, for example

  • to name the string and value in a JSON pair
  • to name the elements of an array

This is not meant to be a proposal for refining the current JSON definition (in or outside of a standard). JSON is taken as is even though many wish to change it; however, in order to focus the discussion, no changes to JSON will be directly proposed.

Summary

Extending SQL “properly” to support JSON is going to be fun and I am looking forward to interesting discussions along the ride.

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Flashback/Look-ahead – SQL: The Kingmaker of NoSQL Databases?

Over 1 ½ years ago I suggested that SQL would turn out to be an important factor for the success of NoSQL databases (https://realprogrammer.wordpress.com/2014/03/04/sql-the-kingmaker-of-nosql-databases/). Did that happen?

NoSQL / OldSQL / NewSQL: SQL Language Support on the Rise

When surveying a subset of the various database systems it becomes apparent that the support for relational SQL as well as JSON-supporting SQL is increasing. Right now this does not necessarily mean that there is an agreement on the particular language syntax, its semantic interpretation or execution style (declarative vs. functional vs. procedural). However, there is an (implicit) agreement that a (declarative) query language is necessary or advantageous, that SQL is a guiding principle and that JSON needs to be supported to various extent. A (not necessarily complete) list of different systems follows that have a (JSON-supporting) SQL language implemented in one form or another:

From the above selection it becomes clear that a lot of NoSQL, OldSQL, and NewSQL systems invest in SQL as well as JSON-supporting SQL and that the number of systems that support (JSON-supporting) SQL is definitely increasing.

Given that implementing the efficient execution of SQL is a significant engineering effort (and investment), JSON-supporting SQL is obviously seen as an important feature.

As an aside, an intriguing question is: why do database systems that fall into the category of “NoSQL” databases start providing SQL? It appears that the “No” in “NoSQL” is being re-interpreted from “no-SQL” to “no-Relational Model” with JSON-supporting SQL support. One possible answer to the question is the increasing complexity of queries that are being issued. While simple interfaces are good to get going initially, over time, as applications become more involved, complex queries reflect the increasing domain complexity and that is going to be reflected in the query language as well (for a taste, see here: https://realprogrammer.wordpress.com/2014/10/27/oracle-12c-sql-for-json-part-2-basic-queries/ – toward the end of the blog). Another (additional) possible answer is that existing tools (e.g. reporting tools) that are JDBC compliant and can process relational data should be able to connect also.

JSON-supporting SQL Language: Discussions are Ongoing

Discussions about the benefit of JSON-supporting SQL in the context of NoSQL databases come up repeatedly (e.g., LinkedIn group discussions: https://www.linkedin.com/groups/2085042 (NoSQL Group)). Interestingly enough, most of the time the discussion is not separated or structured by syntax vs. semantics vs. functionality. SQL was developed in the context of the relational model, but that does not preclude complex data structures like JSON; the opposite is true. Object / relational functionality as well as NF2 work deals with complex structures, including arrays and maps. So it would be prudent, in my mind, to separate the discussion into

  • Data structure support (aka, JSON)
  • Query functionality
  • Query language syntax

I expect the discussion of JSON-supporting SQL to lead towards a declarative language with SQL syntax, based on the relational model, paired with complex data types (e.g. JSON) and a schema-per-object or schema-per-row semantics.

The interesting aspect on the SQL syntax and interpretation is that systems that also support the relational model extend those to support JSON, and these systems will extend SQL, not replace it (as can be seen already).

SQL Support for NoSQL Databases: Best Approach Forward?

In my mind the best approach forward would be standing on the shoulders of giants: take SQL and extend it to support JSON in a very careful but very expressive way, learning from the object / relational work as well as the NF2 work that took place a long time ago. Stay true to the lessons about the benefits of a declarative language and the underlying possibility of optimization, be it functional or non-functional. In short, add the domain type JSON and cater for a schema-per-document model.

If JSON support is one of the common functional denominators of most of the databases then the distinction is not really achieved by variations of (JSON-supporting) SQL, but by variations of the non-functional support like scale, throughput, low latency, safety, reliability, etc.

I believe that JSON-supporting SQL will be converging, while the non-functional features will set the various systems further and further apart in order to provide the designed-for sweet spot and establish their competitive advantage.

Summary

Clearly, the jury is still out, however, compared to 1 ½ years ago, whispers are emerging that JSON-supporting SQL is becoming an important feature of NoSQL, OldSQL and NewSQL databases.

Clearly, the pendulum swings toward the SQL language grammar and functionality that supports JSON. And that is a good thing in my opinion for the various obvious reasons.

Clearly, discussions keep coming up about JSON-supporting SQL, the various styles and language proposals. This is another reason to believe that SQL support in NoSQL databases and in context of JSON is important.

Let’s check in on this topic again some time next year. It is always interesting to follow such a hotly debated topic and establish checkpoints along the way.

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Software Design in Context of Database Failover (Part 5): Architecture Using Non-Transactional Persistent Resource Managers

Non-transactional persistent resource managers are tricky in context of database fail-over. A short discussion shines the light on some of the problems.

Non-Transactional Persistent Resource Managers

File systems and queuing systems are in general non-transactional persistent resource managers that are typically used outside transaction control, in many cases because the technology chosen does not support transactions in the first place (some implement transaction functionality, many do not; here I focus on those that do not). Files can be created, deleted or their contents manipulated; queue elements can be enqueued or dequeued from queues (in FIFO order, priority order, or any other supported queuing policy).

Because operations like inserting, updating or removing data items from non-transactional persistent resource managers are executed outside transaction control, they (a) cannot be coordinated in the same transactions as those of transactional resource managers, and (b) in case of a failure, the state of the data items is unclear, since the non-transactional resource manager does not provide any transactional guarantees in failure situations (see the appendix below for the taxonomy on data consistency).

For example, when a data item is enqueued in a queue and at that moment the queuing system fails (by itself or because the underlying hardware failed), then there is no defined outcome for that enqueue operation. It could be that the data item is properly enqueued and all is fine. It is possible that the data item was not enqueued at all. Worse, it might be that the data item was enqueued, but not all supporting data structures were fully updated inside the queuing system, leaving the system itself in an inconsistent state that still requires recovery actions.

Best Effort Consistency

From a software architecture viewpoint, non-transactional persistent resource managers are best-effort systems and this has to be kept in mind when they are used and deployed. With the rather reliable hardware servers nowadays this limiting property can easily be forgotten or pushed aside as the mean time between failures is rather long.

However, when data consistency is critical, and non-transactional resource managers have to be used, effort can be put in place to mitigate at least some of the potential data consistency problems.

Data Consistency through Data Duplication

A fairly straightforward approach to integrating non-transactional resource managers is to store the data items twice: once in the non-transactional resource manager, and once in a transactional resource manager. Any update or delete must take place in both systems (however, as discussed, it cannot be done inside a single transaction across both).

For example, every item enqueued in a non-transactional queue could also be inserted into a relational table. If an item is dequeued, it is also removed from the transactional table. The logic could first enqueue the data item into the queue and, when successful, insert it into the table. Upon dequeue, the same applies: the item is first dequeued from the queue and, if successful, removed from the table. The invariant is that a queued data item is only considered consistent if it is enqueued in the queue and also present in the table at the same time.
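
A sketch of this dual-write logic (Python-like pseudocode; the queue and table objects and their methods are hypothetical stand-ins for whatever non-transactional queue and transactional database are actually used):

def enqueue_with_duplicate(queue, table, item_id, payload):
    # Invariant: an item is consistent only if it is in the queue AND in the table.
    queue.enqueue(item_id, payload)     # non-transactional resource manager first
    table.insert(item_id, payload)      # then the transactional resource manager

def dequeue_with_duplicate(queue, table):
    item_id, payload = queue.dequeue()  # non-transactional resource manager first
    table.delete(item_id)               # then the transactional resource manager
    return payload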

Studying only the enqueue case, there are several failure cases:

  • Enqueue fails.
  • Enqueue succeeds, but system fails before the insert into the table took place.
  • Enqueue succeeds, but insert in the table fails.

In the first case, the enqueue fails and the insert into the table never takes place, so the invariant is fulfilled. In the second case the item is in the queue, but not in the table; the invariant is violated. The system can now decide how to deal with the situation after the failure: it can remove the queue item, or it can insert it into the table. The same situation occurs in the third case.

Underlying this approach is the requirement that it is possible to check for the presence of the data item in the non-transactional as well as the transactional resource manager in order to determine if the data item is present after a failure. For example, in the second case above, after the system comes back up, the following check has to occur: (a) for every item in the queue an equivalent item must be in the table, and (b) for each item in the table there must be an item in the queue. If there is a data item in one, but not the other resource manager, the invariant is violated and the inconsistency has to be removed by either removing or adding the data item so that none or both of the resource managers have the data item (and therefore the invariant holds).
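
The reconciliation after a failure could be sketched as follows (same hypothetical queue and table APIs as above, extended with illustrative methods all_item_ids and payload_of; the repair strategy – add the missing item or remove the existing one – is a policy decision):

def reconcile(queue, table, add_missing=True):
    queue_ids = set(queue.all_item_ids())
    table_ids = set(table.all_item_ids())
    for item_id in queue_ids - table_ids:          # in the queue, missing in the table
        if add_missing:
            table.insert(item_id, queue.payload_of(item_id))
        else:
            queue.remove(item_id)
    for item_id in table_ids - queue_ids:          # in the table, missing in the queue
        if add_missing:
            queue.enqueue(item_id, table.payload_of(item_id))
        else:
            table.delete(item_id)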

Database Fail-over Data Loss Recovery

When a database fails over, as discussed in earlier blogs, it is possible that no data loss occurs. In this case the transactional resource manager remains consistent with the non-transactional resource manager.

However, a data loss is possible. This means that one or more data items in the transactional resource manager that were present before the fail-over are not present anymore after the fail-over. In this case either the items need to be added to the transactional resource manager again (based on the content in the non-transactional resource manager), or they have to be removed from the non-transactional resource manager also.

The particular strategy is to be determined in a specific software architecture, however, with data duplication this choice can be made as needed and required by the application system at hand.

Effort vs. Consistency

The outlined approach above (which can be applied to any non-transactional resource manager that can check for the existence of data items) fundamentally makes the non-transactional resource manager pseudo transactional by pairing it up with a transactional resource manager. This is implementation effort made in order to provide data consistency.

Less effort might mean less consistency. For example, it would be sufficient to store the data item in a table only until the data item is safely enqueued in the queue. Once that is the case, the data item could be removed from the table. While this would ensure consistency during the enqueue operation, it does not ensure consistency during the dequeue operation, since not every type of failure during a dequeue operation would be recoverable.

In a data loss situation because of a database fail-over there would be no way to reconcile the two resource managers if not all data is available in both at the same time. In a particular situation that might be acceptable, but in general it would not.

Effort vs. Performance/Throughput

Additional effort takes additional resources, both in storage space and in processing time. The price for making the system more consistent is possibly a slower system, with less throughput and higher storage space requirements. Again, this is trading off non-functional with functional properties.

Summary

The outlined issues are present in all systems that deploy non-transactional persistent resource managers. The outlined solution is a generic one and in specific situations alternative, more efficient approaches might be possible in order to guarantee data consistency.

As discussed, the choice of non-transactional persistent resource managers paired with the need for data consistency can be expensive from an engineering and system resource perspective. While the choice is not always a real choice, it is definitely worth evaluating alternatives that provide the required functionality in a transactional resource manager for comparison.

This blog is the last in the series around transactional and non-transactional resource manager use in context of database fail-over. Clearly, the software architecture has to be aware of the fact that a database fail-over is possible and that a data loss might occur along the way due to network errors, increased network latency and system failures. Some of the architectural problems have been discussed, and some solutions presented as starting points for your specific application system development efforts.

Go SQL!

Appendix: Taxonomy

The software architecture taxonomy relevant for database fail-over can be built based on the combinations of resource manager types used. In the following, the various combinations are discussed at a high level ("x" means that the software architecture uses one or more of the indicated resource manager types).

|Software Architecture |Transactional Persistent |Non-transactional Persistent |Non-transactional, Non-persistent, Rebuildable |Non-transactional, Non-persistent, Non-rebuildable |
+----------------------+-------------------------+-----------------------------+-----------------------------------------------+----------------------------------------------------+
|Consistent            |x                        |                             |                                               |                                                    |
|Consistent            |x                        |                             |x                                              |                                                    |
|Possibly consistent   |x                        |x                            |                                               |                                                    |
|Possibly consistent   |x                        |x                            |x                                              |                                                    |
|Possibly consistent   |x                        |x                            |x                                              |x                                                   |

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Software Design in Context of Database Failover (Part 4): Architecture on Transactional Persistent Resource Managers in Conjunction with Caches

If transactional resource managers are available, then all is good (see the last blog). But what if this is not the case? This blog starts discussing non-transactional resource managers.

Non-Transactional Resource Managers

There are different types of non-transactional resource managers (see the taxonomy in the appendix of this blog):

  • Non-transactional and persistent. Data is stored on persistent storage, but outside transaction protection.
  • Non-transactional, non-persistent and non-rebuildable. Data is managed in volatile memory, outside transaction protection, and if lost it cannot be rebuilt (it will be truly lost).
  • Non-transactional, non-persistent and rebuildable. Data is managed in volatile memory, outside transaction protection, but can be rebuilt if lost or intentionally deleted.

A non-persistent resource manager is usually called a cache. An example of a cache is a list of values in a Java singleton caching a subset of the data of a database table column. Another example is the user interface that caches a list of entries like sales opportunities and displays them to a user.

Properly implemented caches can be invalidated (i.e., emptied out) and rebuilt based on data stored in one or several database tables, as they are non-functional components put in place for performance reasons. Caches that can be invalidated are called "rebuildable caches".

Caches that cannot be invalidated without losing data are called "non-rebuildable caches". Those should not be part of the software architecture and have to be avoided, as lost data cannot be recovered. The impact of this is clear: you don't want your checking account information to be managed in such a cache.

In the following, caches are discussed; the third type of resource manager, non-transactional and persistent, is going to be discussed in the next blog.

Rebuildable Cache and Database Fail-over

A rebuildable cache contains copies of data from the database. If a cache is accessed and it contains the requested data (“cache hit”), then the data is returned to the caller. If the cache does not contain the requested data (“cache miss”) the cache fetches the data first from the database and stores it before passing it on to the caller. In summary, the cache either has the data or knows how to get it from the database.

In the context of a database fail-over it is possible that the primary and secondary database have the exact same data set at the time of the fail-over [see blog https://realprogrammer.wordpress.com/2015/08/31/software-design-in-context-of-database-failover-part-2-software-architecture-taxonomy/]. In this case the cache contents remain consistent with the database state after a fail-over, as all data in the cache has a corresponding data item in the database.

However, it is possible that a data loss occurs if the secondary database is lagging the primary database at the time of the fail-over. In this case some data might not be available anymore after the fail-over as this data was not replicated before the fail-over occurred.

A cache that holds data that was available before the fail-over but was lost during the fail-over is therefore inconsistent with the database after the fail-over. A cache hit might occur even though the cached data is not in the database. This would be incorrect behavior, since the premise of a cache is that it caches data consistent with the database state.

The solution for the fail-over case is that all cache entries are invalidated (in all caches) right after the database fail-over took place, before clients continue processing. Since each cache starts empty, many accesses will initially be cache misses until the caches have built up the working set again. However, the caches will be consistent with the database state and the application system accesses consistent data.
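
A sketch of a rebuildable cache (Python; the database accessor is a hypothetical stand-in): invalidating it right after a fail-over empties it, so every subsequent access rebuilds the entry from the database:

class RebuildableCache:
    def __init__(self, database):
        self.database = database          # hypothetical accessor for the database
        self.entries = {}

    def get(self, key):
        if key in self.entries:           # cache hit
            return self.entries[key]
        value = self.database.fetch(key)  # cache miss: fetch from the database
        self.entries[key] = value
        return value

    def invalidate(self):
        # Called right after a database fail-over, before clients continue.
        self.entries.clear()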

From a software architecture approach it is very important to know all implemented caches and to ensure that each is a rebuildable cache.

Non-rebuildable Cache and Database Failover

Even if not desirable, non-rebuildable caches might exist, and the cache invalidation functionality discussed above is not available. If the cache cannot be invalidated, it is impossible to remove its contents after a database fail-over. Therefore it might be that the cache returns data to the client that is no longer present in the database or not consistent with the data in the database. The client would have to be aware of this possibility and be able to deal with such an inconsistency.

A special case needs to be discussed. It is possible that a non-rebuildable cache stores data from the database but is only missing the externally available functionality of cache invalidation. So in principle it could be rebuilt, but the rebuild cannot be externally triggered due to a missing programming interface.

A possible approach in this case is to implement a workaround based on the assumption that a cache has limited capacity and that there is a lot more data in the database than the cache can hold. A brute-force attempt to make the cache consistent again would be to implement a client that, after fail-over, requests from the cache every data item that is in the database. At some point the cache will be full without any capacity left and, as a consequence, the cache has to evict entries. As this happens, inconsistent entries will be removed to make room for those entries the client requested and that caused a cache miss. Once all data has been requested, the cache is consistent again.

Despite workarounds in special cases, there is no general approach that can address non-rebuildable caches. They have to be avoided if transactional consistency is important as consistency with these caches cannot be provided.

Summary

If caches are used in the software architecture, the best form is rebuildable caches that can recreate their state from persistent storage (i.e., the database) so that a consistent data state can be recreated after a database fail-over with data loss. Those caches must have external interfaces so that an invalidation can be triggered right after the database fail-over completed.

Non-rebuildable caches might be managed in special cases with a workaround after a fail-over; however, as no general mechanism exists, they should be avoided altogether. Avoiding non-rebuildable caches is a safe architectural approach.

The next blog will discuss the third category of resource managers: non-transactional and persistent resource managers.

Go SQL!

Appendix: Taxonomy

The software architecture taxonomy relevant for database fail-over can be built based on the combinations of resource manager types used. In the following, the various combinations are discussed at a high level ("x" means that the software architecture uses one or more of the indicated resource manager types).

|Software Architecture |Transactional Persistent |Non-transactional Persistent |Non-transactional, Non-persistent, Rebuildable |Non-transactional, Non-persistent, Non-rebuildable |
+----------------------+-------------------------+-----------------------------+-----------------------------------------------+----------------------------------------------------+
|Consistent            |x                        |                             |                                               |                                                    |
|Consistent            |x                        |                             |x                                              |                                                    |
|Possibly consistent   |x                        |x                            |                                               |                                                    |
|Possibly consistent   |x                        |x                            |x                                              |                                                    |
|Possibly consistent   |x                        |x                            |x                                              |x                                                   |

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.