Schema-free Database (Part 2): Relational Database Management System (RDBMS)

As outlined in Part 1 of this series (https://realprogrammer.wordpress.com/2013/11/02/schema-free-database-part-1-an-oxymoron/), a ‘schema-free database’ is an oxymoron and in fact the notion of schema is changing from a more restrictive to a more flexible interpretation in context of NoSQL database technology.

So it is only consequential to ask the question the other way around (as a thought experiment): is it possible to build a relational database management system that does not enforce a schema, and if so, how would such a system look like on an abstract level?

Yes, it is possible to have a non-schema-enforcing RDBMS. Let’s discuss two variations next.

Definition of No-Schema-Enforcing Relational Database Management System

What functionality would be altered in order to provide a no-schema-enforcing RDBMS? If it were possible to create a table without specifying columns (aka, only a table name), and then to insert, update and delete rows, then a ‘schema-free’ RDBMS would be in place. This would mean in detail:

  • Rows do not have to comply to a schema when inserted into a table. Different rows in the same table could have different attributes (columns) and the same attributes of different rows could have different domains (flexible type system).
  • By defining a table without specifying columns (names and domains), a table would not define a primary key, either (neither a simple, nor a composite key). Applications inserting or updating rows can behave nicely and add properties with values that comply to the primary key semantics, but the RDBMS would not be aware of it and consequently would not enforce primary key compliance.
  • By the same token, foreign keys would not be enforced by the RDBMS for the same reasons.
  • Since no primary key enforcement is in place, duplicate rows will not be prevented by the RDBMS and any supervision is left to the application systems.
  • Indexes are independent of schema specification and assuming that indexes are maintained on tables, not all rows might be present in an index if the attributes defined by the index are not contained in a row.

Surprisingly (or not), defining a no-schema-enforcing RDBMS is pretty straight forward.

Variation on No-Schema-Enforcing RDBMS

An interesting variation of a no-schema-enforcing RDBMS could be that a schema, primary keys, foreign keys, etc., are specified as usual, however, without being actively enforced; instead, warnings are given by the RDBMS. For example, a row not complying to the schema can actually be inserted, but the result would not be a ‘OK’, but a warning indicating a schema violation.

This can be described as a ‘middle ground’ in widening the schema interpretation where the RDBMS is aware of a schema and warns of violations without rejecting the various DML operations.

Characterization of No-Schema-Enforcing RDBMS

Could a no-schema-enforcing RDBMS (any of the variations) be a useful database management system? Yes, as it would be the equivalent (on the relational model) to NoSQL databases (on JSON/BSON model or key value model).

For use cases where the flexible schema interpretation is key, such a no-schema-enforcing RDBMS could fit the bill (possibly better) than a NoSQL database system if the use case is fundamentally relational in nature (as opposed to e.g. hierarchical or key/value) and if SQL as the query language is important.

Further Exploration

There are additional areas in a RDBMS that will have to change their behavior in a no-schema-enforcing implementation. Only briefly (and not exhaustively), these are

  • Triggers. Triggers are specified on tables and state changes of rows. If particular attributes are referenced inside the trigger, then not every update, insert, read or delete will execute the trigger logic.
  • Stored procedures. Stored procedures often have parameters of specific types and assume a specific set of attributes when processing rows. In a no-schema-enforcing situation the stored procedure has to be able to deal with variations of rows.
  • Functions and function extensions. Functions have to be changed similarly to stored procedures. Not only from the viewpoint of parameters, but also the processing logic.
  • Aggregation. Aggregation will have to change in various ways as the various aggregation functions cannot assume that all attributes are of the same type. Neither can they assume that all attributes are actually present in all rows of a table.

In principle, every concept and every implementation aspect of a RDBMS needs to be re-examined wrt. a wider and more flexible interpretation of ‘schema’. NoSQL systems, by their definition and approach, started with a wider interpretation and consequently made all the conceptual and implementation decisions. They are one source of approach in this regard.

Contact Me

If you plan to explore or to build a no-schema-enforcing RDBMS, please contact me.

Schema-free Database (Part 1): An Oxymoron

The notion of a ‘schema-free database’ keeps coming up, most recently in a meetup I attended a few days ago. Some rationalization follows divided up into the categories of ‘document’ and ‘database’.

While a generalization is easily possible, the context here will be JSON and MongoDB as these are two practical implementations that are available and often used as examples of a ‘schema-free database’. Those provide a nice constraint technology set as an example, while the principles apply to a whole range of other technology, of course.

Document

A JSON document, in short: document, follows a set of construction principles outlined here: http://www.json.org/. This is a rather informal grammar that defines how a valid JSON document is constructed. There are no data type generators and so new data types cannot be introduced; therefore, every document is constructed from the fixed set of types enumerated on that web page.

Document Schema

An attempt has been made to create a more formal mechanism to define a schema for JSON documents: http://json-schema.org/. This approach provides a formal language to describe the schema of a JSON document explicitly.

The json-schema approach combined with the fixed set of types available to create a JSON document means that every JSON document can be described explicitly using json-schema without exceptions. This in turns means that every JSON document has at least an implicit schema, unless it is additionally made explicit with e.g. json-schema.

Therefore, JSON documents have a schema, an implicit one and optionally an explicit one. Depending on the particular schema definition language approach itself, a document might match more than one schema, but that is left for a separate discussion.

Set of Document Schemas

Given a set of JSON documents it is now possible to characterize their relationship to schemas. In the ‘best’ case, all documents follow the same schema; in the ‘worst’ case, each document follows its own schema. And there are cases in-between where a subset of the documents validates against a schema, and another subset against another schema. Depending on the design, one JSON document might validate against different schemas.

The relationship between documents and schemas is n:m in general.

Database

In context of a database, there are a few interesting questions in this context:

  • Does the database understand a document representation (e.g. JSON)?
  • Does the database enforce a document representation?

And:

  • Does the database enforce a schema?
  • Does the database understand a schema?

Let’s answer these questions for MongoDB specifically:

And:

  • MongoDB enforces a partial schema. Each document must have a property called “_id”; if the document being inserted does not have such a property, one is automatically added.
  • MongoDB does not understand an explicit schema as it does not provide for a mechanism to load a schema definition language.

MongoDB, however, understands implicit schemas as MongoDB does allow to e.g. create an index on any property of documents. So MongoDB recognizes properties.

Furthermore, MongoDB supports aggregation functions and supports e.g. the sum of properties across documents (https://realprogrammer.wordpress.com/2012/11/04/null-undefined-nan-and-missing-property-goto-considered-harmful-part-2/). So it is data type aware and implements operators (e.g. sum) on those.

Conclusion: The Notion of ‘Schema’ is Changing

This rather brief discussion clearly rationalizes that the label ‘schema-free database’ is not applicable to technologies such as those discussed in this blog (JSON/BSON, MongoDB).

Hence these technologies are not an example of ‘schema-free database’, to the contrary: they demonstrate that the notion of ‘schema’ can have a wider and more flexible interpretation then what relational databases bring forward.

Schema Derivation: Schema-less vs. Schema-based Documents

In the world of document-oriented databases there is the notion that documents are ‘schema-less’ because no schema has to be defined and registered with the database before documents can be stored. Sometimes this is also referred to as ‘schema-free’. Let’s investigate this further: does ‘schema-free’ really exist?

JSON

In MongoDB, for example, documents use the JSON (http://www.json.org/) structure (with an extended set of types as defined in BSON http://bsonspec.org/). So all document in the database follow the same underlying data type structure. The database will reject a document that is not JSON/BSON compliant.

Schema Enforcement vs. Schema Freedom

In a relational database the notion of schema means that the database knows about the schema (e.g. structure of a table) and all data stored must follow that structure. The database management system is the enforcing entity of the schema. Data not following the schema will be rejected.

In a document-oriented system like MongoDB the notion of schema is not known to the database and therefore not enforced by the database. Any document can have any structure as long as it is valid according to JSON.

Explicit Schema vs. Implicit Schema

In the case of relational databases, the schema is explicit as it is possible to retrieve the schema as a separate data structure from the database. Relational databases have system tables that contain their schema. So the database is aware of the schema in such a way that it can be retrieved.

In a document-oriented database it is not possible to inquire the schema as it has not been defined to the database. However, it could be the case that all documents follow a schema (that is defined outside the database). If this is the case, the schema of the documents can be called an implicit schema. It is not necessary for the database to enforce a schema for the documents to comply to one.

Schema-less, Schema-free

The fact that a document-oriented database does not enforce a schema of its documents does not mean that there is no schema. It only means that the database does not enforce it. However, it could be enforced elsewhere. For example, the software that writes the documents can enforce their schema. Or the software that reads the documents enforces the schema by only accepting those documents that comply to its expectations.

But what if a user manually defines a documents and stores it?

Individual Schema vs. Global Schema

A different viewpoint is possible. If a user stores a manually designed document then it is possible to view this document of having an individual schema. It is the only one with that schema. If the user now adds a second document that has the exact same structure, then it is possible to view these as two documents with the same implicit schema. At this point there are two documents following the same (implicit) schema. It got a bit more global, tough, as now two documents are following it. If all documents follow the same (implicit) schema, then there is an (implicit) global schema.

Different documents can follow different (implicit) schemas. In the extreme case, all documents are different from each other in their structure and each has its own schema.

Schema Derivation

JSON is a construction principle for hierarchical data structures. Each document following the JSON model is hierarchical and finite as JSON does not support (cyclic) references.

Based on the hierarchical and finite structure of JSON it is possible to derive the implicit schema of the JSON documents and make it explicit. On a very high level, the schema derivation algorithm iterates over all documents in the database and determines the schema for each of it. Documents that follow the same schema are grouped together. After all documents have been inspected, there will be one (and possibly more) schema and the associated complying documents.

Details of this algorithm will be discussed in a different blog at some later time.

Summary

Documents have a schema, there are not schema-less or schema-free. The schema of a document is implicit (and can be made explicit). Several documents can follow the same schema. While the database is not enforcing the schema, it could be enforced by a separate layer outside the database.

So, not-enforcement of a schema by the database != schema-less/schema-free.

Relational Data in a Document-oriented NoSQL Database (Part 2): Schema

The ‘schema-less-ness’ of document oriented databases is touted as a major plus and advantage for these systems. Why is that?

Relational Table vs. Document Collection

Terminology-wise, relational tables are in the domain of relational database management systems. Tables contain data in form of rows. Document collections are in the domain of some NoSQL databases that support the document structure (e.g., MongoDB). Document collections contain data in discrete documents. Document collections are used as the basis for the following discussion.

One or Several Schemas?

Upfront, document oriented databases have some form of schema built-in. First, there is the concept of collections. Before any document can be stored, a collection must be in place. Second, data to be stored in collections must be documents complying to a document data structure, in many cases this is JSON.

These constraints means that each document is structured according to JSON. If a given document is seen as an instance of a document schema, then each document actually complies to that schema. That schema is implicit as it is not externalized and represented separately. If each document in a collection is different, then there are as many (implicit) schemas as there are documents. If all documents have the same internal structure, then all comply to the same implicit schema.

In the general case, for a given collection, there can be as many implicit schemas as there are documents. In the minimal case, there is no schema (if the collection is empty) or one schema if all documents comply to the same implicit schema. In contrast, in the relational tables, all rows always comply to the table definition.

Schema Enforcement

A relation enforces the structure of its rows. In contrast, a collections does not enforce the structure of its documents. As long as a document is in a consistent data type (e.g. JSON), it can be stored in any collection.

In a given collection it is possible that all documents have the same internal structure. In this case they would all follow the same schema. In the extreme case, each document has its own structure not shared by any other document and that then means that each document has its own schema. A collection does not enforce the schema of its documents.

If document schemas have to be enforced, it can only be done outside the document database, either by the code that writes documents to collections (database inbound) or by the code that reads documents from collections (database outbound). In the inbound case, this ensures that all documents are actually of a given structure. In the outbound case, documents that do not comply will never be processed or they will be changed on the fly in order to comply. Alternatively, the reading ocde can throw an exception if it finds a non-compliant document and then the document can be processed in order to make it compliant.

Querying and Programming Model

In a relational world database queries can assume that all rows in a table are of the same structure and of the appropriate type. The programs accessing the database and retrieving tuples can assume the same. From a programming model perspective there is no variation as the structure is fixed and therefore predictable.

Accessing a document-oriented database is different as neither the queries nor the accessing programs can assume any particular document structure (as the database systems does not enforce any structure). Assumptions can only be made if a fixed document schema or a fixed set of variations is enforced elsewhere by convention. This requires that the programming model is extremely aware of the potential variety and has to understand how to deal with this variety.

Discussion

With all these dynamic possibilities, it is therefore no problem to store relational data into a document-oriented database management system. A document-oriented database management system can deal with structured data, even though it cannot enforce the structure. Even in the case that relations change over time can be handled by a document-oriented database system as it allows documents changing their shape over time.

Coming back to the initial question: The ‘schema-less-ness’ of document oriented databases is touted as a major plus and advantage for these systems. Why is that? The answer might lie in the ease that allows to store documents with different implicit schemas in the same collection.

Big caveat: Storing documents of different schemas is ‘easy’. However, that pushes the complexity of dealing with documents of different implicit schemas to the programming model: it has to be able to deal with the variation. And for the most part, I believe this is uncharted territory as there are no programming language constructs that allow to characterize variation during processing.