Relational Data in a Document-oriented NoSQL Database (Part 3): Normalization

Relational database systems support the normalization of relational schemas; what about document-oriented databases?

Normalization

Normalization is a data/schema modeling activity that (if fully applied) results in a schema that has no redundancies and the number of dependencies is minimal (see http://en.wikipedia.org/wiki/Database_normalization for a brief discussion).

In very abstract terms, any change in data is only done in one relation and through means of relationships the rest of the schema stays consistent. This means that in order to support a well-normalized schema queries will contain joins to combine data that is spread out over several tables due to the normalization process.

Document Normalization

Given the concept of normalization in relational systems: is is possible or necessary to normalize documents, too?

First of all, it is possible to normalize data that is organized in documents. The naive approach is to build up the documents so that only scalar properties are stored, roughly equating to tables. In addition, in order to establish references between documents, either foreign-key type properties or documents denoting relationships between documents are created. The process of normalization can be followed like in a relational system.

However, in general document-oriented systems lack the join query operator and so combining documents when querying is not possible. Therefore the combination of data has to be done in the application layer (usually resulting in several queries). This is definitely a down-side of trying to normalize data in document-oriented systems.

Furthermore, the idea behind using document-oriented databases is for storing data that ‘naturally’ consists of hierarchical structures that are fetched together (the 80% case) as complete hierarchies and not normalized chunks of data. And example is a blog and its comments. Comments belong to the blog and without the blog they do not make a lot of sense. In general, when a blog is queried, its comments often are required, too, and so having the blog and its comments in one document makes a lot of sense. The blog can be retrieved in one query.

However, there are cases where separating data into different documents might make sense also. For example, a company might build an e-commerce web site and provides a shopping cart. Users add items to shopping carts. A shopping cart might be represented as one document. In this case it does not make a lot of sense to store the complete item description in every shopping cart that has that item in it. It might be better to have the item name and identifier in the cart with a reference to the item description residing in a different document.

Discussion

While in the relational world it is generally advised to normalize a schema, in the document-oriented world the separation is probably driven by the 80/20 rule of data access. Meaning, to structure documents in such a way that 80% of the queries are fast and fetch all required data in one round trip. Over time, more elaborate rules might be established.

In context of MongoDB a very interesting initial discussion can be found in ’50 Tips and Tricks for MongoDB Developers’ (http://shop.oreilly.com/product/0636920019893.do). This book has a few more elaborated rules and it clearly hints that normalization is for sure an important aspect of document-oriented databases.

Relational Data in a Document-oriented NoSQL Database (Part 2): Schema

The ‘schema-less-ness’ of document oriented databases is touted as a major plus and advantage for these systems. Why is that?

Relational Table vs. Document Collection

Terminology-wise, relational tables are in the domain of relational database management systems. Tables contain data in form of rows. Document collections are in the domain of some NoSQL databases that support the document structure (e.g., MongoDB). Document collections contain data in discrete documents. Document collections are used as the basis for the following discussion.

One or Several Schemas?

Upfront, document oriented databases have some form of schema built-in. First, there is the concept of collections. Before any document can be stored, a collection must be in place. Second, data to be stored in collections must be documents complying to a document data structure, in many cases this is JSON.

These constraints means that each document is structured according to JSON. If a given document is seen as an instance of a document schema, then each document actually complies to that schema. That schema is implicit as it is not externalized and represented separately. If each document in a collection is different, then there are as many (implicit) schemas as there are documents. If all documents have the same internal structure, then all comply to the same implicit schema.

In the general case, for a given collection, there can be as many implicit schemas as there are documents. In the minimal case, there is no schema (if the collection is empty) or one schema if all documents comply to the same implicit schema. In contrast, in the relational tables, all rows always comply to the table definition.

Schema Enforcement

A relation enforces the structure of its rows. In contrast, a collections does not enforce the structure of its documents. As long as a document is in a consistent data type (e.g. JSON), it can be stored in any collection.

In a given collection it is possible that all documents have the same internal structure. In this case they would all follow the same schema. In the extreme case, each document has its own structure not shared by any other document and that then means that each document has its own schema. A collection does not enforce the schema of its documents.

If document schemas have to be enforced, it can only be done outside the document database, either by the code that writes documents to collections (database inbound) or by the code that reads documents from collections (database outbound). In the inbound case, this ensures that all documents are actually of a given structure. In the outbound case, documents that do not comply will never be processed or they will be changed on the fly in order to comply. Alternatively, the reading ocde can throw an exception if it finds a non-compliant document and then the document can be processed in order to make it compliant.

Querying and Programming Model

In a relational world database queries can assume that all rows in a table are of the same structure and of the appropriate type. The programs accessing the database and retrieving tuples can assume the same. From a programming model perspective there is no variation as the structure is fixed and therefore predictable.

Accessing a document-oriented database is different as neither the queries nor the accessing programs can assume any particular document structure (as the database systems does not enforce any structure). Assumptions can only be made if a fixed document schema or a fixed set of variations is enforced elsewhere by convention. This requires that the programming model is extremely aware of the potential variety and has to understand how to deal with this variety.

Discussion

With all these dynamic possibilities, it is therefore no problem to store relational data into a document-oriented database management system. A document-oriented database management system can deal with structured data, even though it cannot enforce the structure. Even in the case that relations change over time can be handled by a document-oriented database system as it allows documents changing their shape over time.

Coming back to the initial question: The ‘schema-less-ness’ of document oriented databases is touted as a major plus and advantage for these systems. Why is that? The answer might lie in the ease that allows to store documents with different implicit schemas in the same collection.

Big caveat: Storing documents of different schemas is ‘easy’. However, that pushes the complexity of dealing with documents of different implicit schemas to the programming model: it has to be able to deal with the variation. And for the most part, I believe this is uncharted territory as there are no programming language constructs that allow to characterize variation during processing.

Relational Data in a Document-oriented NoSQL Database (Part 1): Universal Relation

Is there an equivalent to the Universal Relation in the document-oriented database world? There is: the Universal Collection.

Universal Relation

‘In relational databases, the universal relation assumption states that one can place all data attributes into a (possibly very wide) table, which may then be decomposed into smaller tables as needed.’ (from: http://en.wikipedia.org/wiki/Universal_relation_assumption).

There is a bit more behind that statement; for example, there is an assumption that the same type of data are stored in a column with the same name. E.g., if there is the concept of a ‘street name’, then all street names will be in a single column, probably named ‘street name’.

Another assumption is that if a data set does not have data for all columns of the Universal Relation (and hardly any does), then the value  in the column is ‘null’. So in general the Universal Relation is very sparse and has a lot of null values in its columns.

A comprehensive discussion can be found in http://www.informatik.uni-trier.de/~ley/db/books/dbtext/ullman89.html  (for further detailed exploration).

NoSQL Database Equivalent: Universal Collection

What is the equivalent of a Universal Relation in a document-oriented database? One organization scheme of document-oriented databases are collections of documents. In such a case a first approximation is that all documents are stored in the same collection, a Universal Collection.

The second step towards a Universal Collection is that all property names of the documents that contain the same type of data are named the same. This is in general possible as the naming is scoped by documents and sub-collections in documents.

In contrast to the Universal Relation, a document in a Universal Collection only has to have the properties that have values. Other properties do not have to be added with a ‘null’ value, they can simply be left out as the document model does not have a fixed schema; each document can only contain the properties it really requires. If this approach is taken, the Universal Collection is not sparse at all in comparison with the Universal Relation, but very compact.

Since the document model does not enforce a strictly typed schema it is possible that the same property is of different data types (in the sense of the JSON model). So it is possible that a property called ‘address’ can be a ‘string’ in one document and a sub-collection consisting of several strings in another document with both containing perfect address data. In contrast to the relational model, this is valid (and fine) in a document model (if at all, it causes problems during processing, but not from a data model perspective).

Discussion

On this level, it is certainly possible to define the equivalent of an Universal Relation in a document model: the Universal Collection. This is interesting as later on this will serve as a starting point for normalization.

In reality I have seen projects that actually store all documents in a single collection as it made it easier to query the system in comparison of distributing or partitioning documents across several collections. The question is, of course, are the same types of data stored in properties with the same property name to ensure the semantic equivalence or not. This topic will re-appear in a later blog post about document model definition.

Relational Data in a Document-oriented NoSQL Database: Overview

This blog starts a series of discussions around the topic of storing relational data in a document-oriented NoSQL database. Each discussion is going to be a separate blog post.

The idea is not to promote storing relational data in a document-oriented database as such. The goal of this series of blogs is to rationalize (to large extent on a blog-level granularity) the relationship between relational and document data and how a relational world can meet a document world (and vice versa).

The starting point of the discussion are the topics in the blog https://realprogrammer.wordpress.com/2012/04/25/relational-database-management-system-rdbms/:

  1. Universal Relation: https://realprogrammer.wordpress.com/2012/05/16/relational-data-in-a-document-oriented-nosql-database-part-1-universal-relation/
  2. Schema: https://realprogrammer.wordpress.com/2012/05/23/relational-data-in-a-document-oriented-nosql-database-part-2-schema/
  3. Normalization: https://realprogrammer.wordpress.com/2012/05/30/relational-data-in-a-document-oriented-nosql-database-part-3-normalization/
  4. Part-Of Relationship: https://realprogrammer.wordpress.com/2012/06/06/relational-data-in-a-document-oriented-nosql-database-part-4-part-of-relationship/
  5. De-Normalization: https://realprogrammer.wordpress.com/2012/06/13/relational-data-in-a-document-oriented-nosql-database-part-5-de-normalization/

More topics might be added as the discussion unfolds.

Based on these discussions I expect that some guidelines will emerge of how to ‘model’ documents in document-oriented databases ‘properly’ and useful criteria for this modeling task. It is also going to be interesting to see if there is a way to determine the relationship between a relational model and a document model in a consistent way.

Modeling documents in a schema-less/dynamic schema world sounds like an oxymoron; however, in the end, transactional applications and analysis software have to access documents and a ‘good’ modeling practice will certainly help those systems in their design and operation.

Computing Technology Inflection Point: Happening Right Now

I believe that computing in industry is at a computing technology inflection point right now based on some supporting observations:

  • Databases. Through the NoSQL movement in the database world a shift is taking place:
    1. New database management systems are being created, made available and put into production
    2. The new database systems are by enlarge schema-less and/or support dynamic schema changes
    3. Many support JSON as data structures or data model that is inherently supported by Javascript
  • Server. The Javascript language is enabled as server-side programming language:
    1. Google’s V8-Engine together with Node.js and its Eco-system makes Javascript a real contender as a server-side language
    2. Javascript does not have a class/instance model and supports schema-less programming, including dynamic schema changes
    3. Integrated Development Environments (IDEs) provide full Javascript support
  • User Interface. The combination of HTML5/Javascript is gaining traction:
    1. HTML5 in conjunction with Javascript is a real powerful web development technology set
    2. Javascript libraries are being built that can span all types of user interfaces and user devices (including mobile)
    3. It is possible to have a streaming query from the database all the way to the user interface; its frictionless

Combining all these observations, two major characteristics of a new computing technology stack are crystallizing very clearly:

  • One cross-layer programming language (Multi-tier Programming)

    • It is possible to use the same language in the User Interface, Server and Database, i.e., across all architectural layers
  • Inherent schema-less computing and dynamic schema change support
    • All layers shift to schema-less technology and models

I believe that this is a real inflection point happening right now and will cause a major shift in system design, engineering and evolution. It leaves behind the notion that each layer in a typical technology stack must have its specialized language. It also leaves behind the notion that every concept across domains should be best represented in the Class-Instance paradigm. It opens up the ability to dynamically evolve within the systems’ schema-less or dynamic schema change capabilities and it supports the reuse of logic and types across all layers without loosing or changing semantics.

If this combination of technology continues to gain major traction, I would not be surprised to see that some of the first operating system prototypes written in Javascript will become mainstream at some point in time. And, of course, it would be interesting to see how far the processor manufacturers are in their thought processes or research projects with putting dedicated Javascript execution functionality on the processors themselves.

Being around at this computing technology inflection points is exciting and professionally lays before us interesting choices in business decisions as well as career decisions.