To discuss software design in the context of database fail-over, I will present a taxonomy of software architectures that manage data. This taxonomy is the basis for discussing all possible database fail-over cases and their specific issues. Let’s go.
The previous blog discussed the ideal database software architecture (“Happy Path”; [https://realprogrammer.wordpress.com/2015/07/30/software-design-in-context-of-database-failover-part-1-the-happy-path/]). Why is this not sufficient to cover all types of software architectures? The reason lies in two main aspects:
- Asynchronous replication mode. Ideally, replication from a primary database to a secondary database is synchronous and complete, meaning that the primary and secondary databases are exact replicas of each other at every point in time (that is, at every transaction commit). However, this is not always possible due to network latency, network failures, server outages, etc. More often than not the secondary database lags behind the primary database, if only by a few transactions, and is missing those transactions when a failure occurs. After a fail-over in such a situation, applications will not find committed data in the secondary database that they have already seen committed in the primary database: data loss or, more precisely, transaction loss has occurred.
- Non-transactional resource management. Many applications manage data outside the scope of transactions. In case of failures, this non-transactional data can be out of sync with the transactional data (e.g., in the primary database). And in case of data loss during a fail-over, the non-transactional data might also be inconsistent with the state of the secondary database. No replication mechanism between databases will include non-transactional data that is managed outside the database.
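The transaction loss described in the first point can be illustrated with a minimal in-memory sketch. The `Primary` and `Secondary` classes, the replication queue, and the transaction names are hypothetical and only model the lag; they do not represent any particular database product.

```python
from collections import deque


class Primary:
    """Toy primary database: commits locally, then queues changes for shipping."""

    def __init__(self):
        self.committed = []               # transactions committed on the primary
        self.replication_queue = deque()  # committed, but not yet applied on the secondary

    def commit(self, txn):
        self.committed.append(txn)
        self.replication_queue.append(txn)  # asynchronous: the commit does not wait


class Secondary:
    """Toy secondary database that applies queued transactions with a delay."""

    def __init__(self):
        self.committed = []

    def apply(self, primary, max_txns):
        """Apply at most max_txns pending transactions (models replication lag)."""
        for _ in range(min(max_txns, len(primary.replication_queue))):
            self.committed.append(primary.replication_queue.popleft())


def lost_on_failover(primary, secondary):
    """Transactions seen as committed on the primary that never reached the secondary."""
    return [t for t in primary.committed if t not in secondary.committed]


primary, secondary = Primary(), Secondary()
for txn in ["t1", "t2", "t3"]:
    primary.commit(txn)
secondary.apply(primary, max_txns=2)         # the secondary lags by one transaction
lost = lost_on_failover(primary, secondary)  # assume the primary fails at this point
print(lost)  # ['t3']
```

After the fail-over, an application that already saw `t3` committed will not find it in the secondary.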
A background discussion on resource managers and transactions follows now as preparation for the software architecture taxonomy discussion.
Background: Resource Manager
A resource manager in the context of database software architectures is a component that manages state (data). In general, this means creating, updating, deleting and reading data. For example, a database is a resource manager, and so is a file system. Resource managers are:
- Transactional or non-transactional. Transactional resource managers are able to participate in a transaction protocol (e.g., local transactions, distributed transactions, autonomous transactions or chained transactions, to name a few).
- Persistent or non-persistent. For example, a database is (usually) persistent, a software cache is (usually) non-persistent.
- Non-persistent rebuildable or non-persistent not rebuildable. A non-persistent resource manager might be able to rebuild data from persistently stored data.
For example, the file system on Linux is non-transactional, whereas Oracle’s DBFS is transactional [http://docs.oracle.com/cd/E11882_01/appdev.112/e18294/adlob_fs.htm#ADLOB45943].
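The three classification dimensions above could be captured in a small data structure. This is only a sketch: the `ResourceManager` class and its field names are made up for illustration, and marking the software cache as rebuildable is an assumption, not something every cache guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceManager:
    """Hypothetical descriptor for a component that manages state."""

    name: str
    transactional: bool
    persistent: bool
    rebuildable: bool = False  # only meaningful when persistent is False

    def kind(self) -> str:
        tx = "transactional" if self.transactional else "non-transactional"
        if self.persistent:
            return f"{tx}, persistent"
        rb = "rebuildable" if self.rebuildable else "non-rebuildable"
        return f"{tx}, non-persistent, {rb}"


database = ResourceManager("database", transactional=True, persistent=True)
linux_fs = ResourceManager("Linux file system", transactional=False, persistent=True)
# Assumption: this particular cache can be refilled from persistently stored data.
cache = ResourceManager("software cache", transactional=False, persistent=False, rebuildable=True)

for rm in (database, linux_fs, cache):
    print(f"{rm.name}: {rm.kind()}")
```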
Resource Managers Are Everywhere
Resource managers are everywhere in a software architecture. Not only databases and file systems are resource managers, but also queuing systems, software caches, user interfaces, JVM caches, and Java class variables. Everything that manages state is to be considered a resource manager in the context of the database fail-over discussion.
Most resource managers are non-transactional (they cannot participate in a transaction protocol), and many are non-persistent. An argument can be made that all resource managers could/should participate in transactions to avoid inconsistencies anywhere in the software stack (transactional memory would be an interesting tool in this regard). That would be ideal as any data state change would be consistent across all resource managers.
Resource Managers and Transactions
However, in reality this is not the case: most resource managers are non-transactional, and consequently the data that they manage is not under transaction control. Access to resource managers happens both within and outside transactions, and not all resource managers can participate in a transaction protocol.
Worse, the same data item might be transactionally committed, and additionally stored in a non-transactional resource manager as a duplicate. In non-failure cases this works out fine, but sometimes, when failures occur, only one or the other update happens, leaving an inconsistent system state behind, as the data item is only in one location and not duplicated in others as expected. If an assumption is made in the software architecture that the data in all resource managers is always consistent, then this situation might cause significant system degradation or errors when failures occur. In the worst case inconsistent application state is left behind.
The same situation occurs if the application writes two data items that must be consistent with each other, one through a transactional resource manager and one through a non-transactional resource manager. If a failure occurs after the first update, the second will not happen and the state is inconsistent.
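The duplicate-write problem can be reproduced in a few lines. The sketch below uses an in-memory SQLite database as the transactional resource manager and a plain dictionary as the non-transactional one; the `orders` table, the `save_order` function, and the `fail_before_cache` flag are all hypothetical and exist only to simulate a failure between the two updates.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # transactional resource manager
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.commit()
cache = {}                        # non-transactional resource manager


def save_order(order_id, status, fail_before_cache=False):
    with db:  # commits on success, rolls back if the block raises
        db.execute("INSERT OR REPLACE INTO orders VALUES (?, ?)", (order_id, status))
    # The transaction has committed; the cache update is outside its scope.
    if fail_before_cache:
        raise RuntimeError("simulated failure between the two updates")
    cache[order_id] = status


save_order(1, "new")
try:
    save_order(1, "shipped", fail_before_cache=True)
except RuntimeError:
    pass

db_status = db.execute("SELECT status FROM orders WHERE id = 1").fetchone()[0]
print(db_status, cache[1])  # shipped new
```

The database holds `shipped` while the cache still holds `new`. Any code that assumes both are always in sync now reads stale data.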
System Semantics: Consistency
For the purposes of this blog the system semantics with respect to state consistency can be reduced to two cases:
- Consistent system: The overall system is consistent when all data in resource managers are consistent with each other (or can be made consistent after a failure before resuming processing)
- Inconsistent system: Otherwise.
The following system consistency guarantees can be given.
- Consistent
  - Only transactional and persistent resource managers are used and all changes happen in the context of a transaction protocol. This is the Happy Path case from the previous blog, as the complete state is consistent at all times.
  - If, additionally, non-persistent and rebuildable resource managers are used, a consistent state can be re-created by removing all data from the non-persistent and rebuildable resource managers and rebuilding it from the persistently stored data.
- Possibly consistent
  - Transactional and non-transactional persistent resource managers are used. This might result in consistent state changes, or might not, depending on the timing of the failure. Because of the persistent nature it might be non-trivial to clean up persistent inconsistent state, and possibly complex compensation functions have to be developed (if at all possible) to do so.
  - Non-persistent and non-rebuildable resource managers are used in conjunction with persistent resource managers. As before, this might be non-trivial to clean up and an inconsistent state might be left behind.
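The guarantees above can be read as a simple decision rule over the resource managers an architecture uses. The following sketch encodes that reading. The tuple layout and the `guarantee` function are hypothetical, and the rule assumes that all transactional resource managers take part in one single transaction.

```python
from enum import Enum


class Guarantee(Enum):
    CONSISTENT = "consistent"
    POSSIBLY_CONSISTENT = "possibly consistent"


def guarantee(resource_managers):
    """Classify an architecture from its resource managers.

    Each resource manager is a tuple (transactional, persistent, rebuildable).
    Assumption: all transactional managers join the same transaction.
    """
    for transactional, persistent, rebuildable in resource_managers:
        if transactional and persistent:
            continue  # changes are covered by the transaction protocol
        if not persistent and rebuildable:
            continue  # can be emptied and rebuilt after a failure
        # Everything else is treated conservatively as a possible inconsistency.
        return Guarantee.POSSIBLY_CONSISTENT
    return Guarantee.CONSISTENT


DATABASE = (True, True, False)
REBUILDABLE_CACHE = (False, False, True)
FILE_SYSTEM = (False, True, False)

print(guarantee([DATABASE, REBUILDABLE_CACHE]).value)  # consistent
print(guarantee([DATABASE, FILE_SYSTEM]).value)        # possibly consistent
```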
If several transactional resource managers are used to change state consistently, but cannot be invoked in context of a single transaction in one transaction protocol, then any failure might leave an inconsistent state behind. This must be called out explicitly here in order to avoid the assumption that transactional means consistent in itself.
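A minimal sketch of this caveat uses two independent in-memory SQLite databases. Each one is transactional on its own, but the two commits are not coordinated. The table names, the `withdraw` function, and the `fail_between` flag are hypothetical.

```python
import sqlite3

accounts = sqlite3.connect(":memory:")  # first transactional resource manager
audit = sqlite3.connect(":memory:")     # second transactional resource manager
accounts.execute("CREATE TABLE balance (owner TEXT PRIMARY KEY, amount INTEGER)")
audit.execute("CREATE TABLE log (entry TEXT)")
accounts.execute("INSERT INTO balance VALUES ('alice', 100)")
accounts.commit()
audit.commit()


def withdraw(amount, fail_between=False):
    with accounts:  # first local transaction, committed on its own
        accounts.execute(
            "UPDATE balance SET amount = amount - ? WHERE owner = 'alice'", (amount,)
        )
    if fail_between:
        raise RuntimeError("simulated failure between the two commits")
    with audit:     # second, independent local transaction
        audit.execute("INSERT INTO log VALUES (?)", (f"alice withdrew {amount}",))


try:
    withdraw(30, fail_between=True)
except RuntimeError:
    pass

balance = accounts.execute("SELECT amount FROM balance").fetchone()[0]
log_rows = audit.execute("SELECT COUNT(*) FROM log").fetchone()[0]
print(balance, log_rows)  # 70 0
```

The balance changed but the audit log has no matching entry, even though both resource managers are transactional.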
The software architecture taxonomy relevant for database fail-over can be built based on the combinations of resource manager types used. In the following, the various combinations are discussed on a high level (X means that the software architecture uses one or more of the indicated resource manager types).
| Software Architecture | Transactional Persistent | Non-transactional Persistent | Non-transactional and Non-persistent and Rebuildable | Non-transactional and Non-persistent and Non-rebuildable |
Each of these will be discussed separately in an upcoming blog.
In terms of database fail-over, only those software architectures that can maintain a consistent state across all data, or can rebuild a consistent state after a failure, can fail over without any state inconsistency. The reason is that only consistent state is replicated and any state outside the database is consistent as well.
The reasoning in this blog shows that software applications will have to deal with inconsistent data states once non-transactional resource managers are used to manage data. In the context of fail-over this becomes an even bigger problem, and the next blogs will discuss strategies to deal with data state inconsistencies in software architectures.
This was a high-level discussion only and future blogs will discuss each software architecture in more detail to shine a light on the important and interesting details in case of database fail-over.
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.