Schema-free Database (Part 2): Relational Database Management System (RDBMS)

As outlined in Part 1 of this series (https://realprogrammer.wordpress.com/2013/11/02/schema-free-database-part-1-an-oxymoron/), a ‘schema-free database’ is an oxymoron and in fact the notion of schema is changing from a more restrictive to a more flexible interpretation in context of NoSQL database technology.

So it is only consequential to ask the question the other way around (as a thought experiment): is it possible to build a relational database management system that does not enforce a schema, and if so, how would such a system look like on an abstract level?

Yes, it is possible to have a non-schema-enforcing RDBMS. Let’s discuss two variations next.

Definition of No-Schema-Enforcing Relational Database Management System

What functionality would be altered in order to provide a no-schema-enforcing RDBMS? If it were possible to create a table without specifying columns (aka, only a table name), and then to insert, update and delete rows, then a ‘schema-free’ RDBMS would be in place. This would mean in detail:

  • Rows do not have to comply to a schema when inserted into a table. Different rows in the same table could have different attributes (columns) and the same attributes of different rows could have different domains (flexible type system).
  • By defining a table without specifying columns (names and domains), a table would not define a primary key, either (neither a simple, nor a composite key). Applications inserting or updating rows can behave nicely and add properties with values that comply to the primary key semantics, but the RDBMS would not be aware of it and consequently would not enforce primary key compliance.
  • By the same token, foreign keys would not be enforced by the RDBMS for the same reasons.
  • Since no primary key enforcement is in place, duplicate rows will not be prevented by the RDBMS and any supervision is left to the application systems.
  • Indexes are independent of schema specification and assuming that indexes are maintained on tables, not all rows might be present in an index if the attributes defined by the index are not contained in a row.

Surprisingly (or not), defining a no-schema-enforcing RDBMS is pretty straight forward.

Variation on No-Schema-Enforcing RDBMS

An interesting variation of a no-schema-enforcing RDBMS could be that a schema, primary keys, foreign keys, etc., are specified as usual, however, without being actively enforced; instead, warnings are given by the RDBMS. For example, a row not complying to the schema can actually be inserted, but the result would not be a ‘OK’, but a warning indicating a schema violation.

This can be described as a ‘middle ground’ in widening the schema interpretation where the RDBMS is aware of a schema and warns of violations without rejecting the various DML operations.

Characterization of No-Schema-Enforcing RDBMS

Could a no-schema-enforcing RDBMS (any of the variations) be a useful database management system? Yes, as it would be the equivalent (on the relational model) to NoSQL databases (on JSON/BSON model or key value model).

For use cases where the flexible schema interpretation is key, such a no-schema-enforcing RDBMS could fit the bill (possibly better) than a NoSQL database system if the use case is fundamentally relational in nature (as opposed to e.g. hierarchical or key/value) and if SQL as the query language is important.

Further Exploration

There are additional areas in a RDBMS that will have to change their behavior in a no-schema-enforcing implementation. Only briefly (and not exhaustively), these are

  • Triggers. Triggers are specified on tables and state changes of rows. If particular attributes are referenced inside the trigger, then not every update, insert, read or delete will execute the trigger logic.
  • Stored procedures. Stored procedures often have parameters of specific types and assume a specific set of attributes when processing rows. In a no-schema-enforcing situation the stored procedure has to be able to deal with variations of rows.
  • Functions and function extensions. Functions have to be changed similarly to stored procedures. Not only from the viewpoint of parameters, but also the processing logic.
  • Aggregation. Aggregation will have to change in various ways as the various aggregation functions cannot assume that all attributes are of the same type. Neither can they assume that all attributes are actually present in all rows of a table.

In principle, every concept and every implementation aspect of a RDBMS needs to be re-examined wrt. a wider and more flexible interpretation of ‘schema’. NoSQL systems, by their definition and approach, started with a wider interpretation and consequently made all the conceptual and implementation decisions. They are one source of approach in this regard.

Contact Me

If you plan to explore or to build a no-schema-enforcing RDBMS, please contact me.

Advertisements

JSON Graph Structures

As discussed in another blog on this site, JSON is by-value only and does not have reference types built in. So how are graph (reference) structures represented?

Graph Structures

In short, a graph (http://en.wikipedia.org/wiki/Graph_(data_structure)) has nodes and edges between nodes. They might be directional or not directional. A graph might have nodes without any edges. A graph might have cycles or might be acyclic. Independent of the type of graph, how is a graph represented in JSON?

Graph Node Identity and Edges

Before discussing the graph representations themselves, a precursor discussion on edges is in order: identity and existence.

  • Identity. Edges point at nodes and originate from nodes. An edge pointing to a node must be able to identify that node uniquely. One option is to add a unique identifier to every node. An edge pointing to that node then refers to this unique identifier. An alternative approach is to determine if one or several existing properties of a node uniquely identify that node (“composite key”). In the case an edge can refer to such a uniquely identifying set of properties and values.
  • Existence. An edge originates at a node and points to a node; so an edge has a relationship to two nodes. One possibility is to represent an edge explicitly as an object, and that object internally has properties that indicates the two nodes. This approach represents the edge explicitly. Alternatively, edges can be implicit by the edge data being part of the nodes themselves. An edge pointing from node A to node B could be implemented by identifying B in A to indicate that there is an edge from A to B and in B identifying A to indicate that an edge is originating from A to B. In this case the edge is not represented explicitly.

The choice of identifier and edge representation depends on the circumstances. If a composite key is available, adding unique identifiers is not necessary. If edges have to carry their own information (e.g. edge weight), their explicit representation might be more appropriate if this would make their processing easier.

Although easily to assume, graph nodes do not have to be top level documents, they can be embedded documents also. In the latter case, the identification might be more difficult as the embedding location might have to be taken into consideration also.

Alternative Graph Representations in JSON

There are different alternatives representing graphs in JSON. Some of those will be briefly discussed next. The list is by no means exhaustive and specific circumstances might warrant different representations.

Normalized Graph Representation

The most normalized representation is to represent graph nodes explicitly as top-level documents and graph edges explicitly as top level documents. In this case there are as many documents as there are nodes and edges (independent of how the identity is achieved).

In this approach, nodes might contain references to their edges or not. Edges might or might not contain references to the nodes. It is possible that nodes refer to their edges and the edges to their nodes. Any duplication makes the access faster, but the update will require more effort to ensure consistency.

Non-Normalized Graph Representation

In the non-normalized case, either nodes or edges are co-located within a documents. Some possibilities are

  • Node contains its outgoing edges. In this case, the edges are not explicit, but within the node from which the edge originates. This makes accessing outgoing edges fast, but in order to determine incoming edges, all nodes have to be inspected.
  • Node has its incoming edges. This is the opposite case; accessing the incoming edges would be fast, the outgoing would require traversal of all nodes.
  • Node has incoming and outgoing edges. In this case the look-up of edges from a node’s viewpoint is straight forward and fast, however, any change in edges has to be applied to two nodes. So the read access is fast, but the update access requires more effort.
  • Node has complete sub-graph reachable from it. This makes the traversal super-fast if started from a given node as the whole sub-graph is in the node. If the edges have no direction, this would mean the whole graph is in every node (unless the graph is partitioned). At the same time, this means that every node has a potentially huge amount of redundant data representing parts of the graph or even the whole graph.

The reverse can be modeled also:

  • Edges contains related nodes. In this case the nodes are sub-documents of edges and are not represented on their own. If there are many edges that share many nodes, this causes quite a duplication of data. If the graph is rather sparse and separated, this might be quite efficient.

A final approach could be to model the graph as a whole:

  • Complete graph in one document. In this case the whole graph is within one document. Any operation on the graph has to access only this one document. A sub-case would be that any non-graph related data on nodes and edges is outside this one document. So it only contains the graph structure, but all the ‘payload’ of nodes and edges would be in separate documents. This nicely separates the processing of the graph from the processing of node payloads.

There are many addition variations possible in addition to this list. The major point is that the access patterns (like payload, graph operations, etc.) play a major role in deciding on the final structure.

Conclusion

Graphs are important data structures and due to their particular complexity and context in which they are being used in it is not possible to derive to the one best graph representation. It is really necessary to contemplate the use of a graph in a specific context and model it in JSON accordingly.