SQL for JSON and Schema Support (Part 1): Preliminaries

Missing schema support and schema enforcement is touted as a good thing: is it really?

“Schema Free”, “Flexible Schema”, “Schema Per Document”

What is meant by a database supporting the JSON data structure without providing support and enforcement for schemas? Such a database is often characterized as “schema free”, or supporting a “flexible schema” or “schema per document”. What does it mean?

No matter how such a database supporting JSON it is labelled, it does not provide an interface to define, to manage or to enforce schema(s) for the data, aka JSON documents, it is managing (“enforcing” is used in the semantics a relational database enforces a schema). This means that a client (e.g., application code) can store JSON documents that have any form as long as those comply to the JSON (syntax) standard (and possibly proprietary extensions by the database system).

It furthermore means that JSON documents representing instances of the same concepts (like e.g. orders or games or employees) do not have to have the same structure. Those JSON documents can be different from each other, not only in values, but also in structure.

An example follows of a possible scenario (using MongoDB).

Example

The example stores initially two documents that have the same structure, and subsequently their structure diverges through updates. No schema enforcement prevents the changes.

use blog;
db.blogColl.insert({
  "orderId": 1,
  "orderDate": "9/30/2017",
  orderLineItems: [{
      "itemId": 55,
      "numberOrdered": 20
    },
    {
      "itemId": 56,
      "numberOrdered": 21
  }]
});
db.blogColl.insert({
  "orderId": 2,
  "orderDate": "9/30/2017",
  orderLineItems: [{
      "itemId": 55,
      "numberOrdered": 30
    }, 
    {
      "itemId": 56,
      "numberOrdered": 31
  }]
});
db.blogColl.update({
    "orderId": 1
  }, {
    "$set": {
    "specialInstructions": 
      "Drop of in front, not back of location"
  }
});
db.blogColl.update({
    "orderId": 2,
    "orderLineItems.itemId": 55
  }, {
    $set: {
      "orderLineItems.$.color": "transparent"
  }
});

The ability to store different JSON documents with different structures, even if they represent (instances of) the same concept, can be seen as a powerful feature. It allows modifying the data as needed to represent changing requirements or specific representation needs. Data migration is easier, too, as data can be changed in place.

This flexibility also has downsides and (engineering) cost that need to be considered and dealt with in a concrete implementation.

Application Implementation

Application code accessing a database has a full or partial representation of the data it queries (or in general manages) in the type system of the deployed programming language.

There are basically two choices an application (short for application code) has to represent data:

  • Direct representation
  • Indirect representation

In a direct representation the concept as stored in the database is defined as data structure in the programming language. For example, using Java as the programming language example, an order is represented as a Java class “Order”. This class has all the methods required to access the various elements of an order (that might be implemented as Java classes themselves). In this approach an order stored in the database, when queried, will be managed as an instance of the Java class Order in the application code. Methods support access or modification to the instance of order, and the methods are order semantics specific, like getOrderDate() or updateOrderLineItem() or totalNumberInidividualItems().

In the indirect representation, an order would be represented not as instance of a Java class that reflects the concept, but an instance of a “meta” Java class. This “meta” class is able to store all data from the database, not just orders. Such a class would have methods like createInstance(), setIdentifier(), setType(), addAttribute(), etc. A type would be “Order”, an attribute would be “lineItem”, etc.

Consequences Of Choice

Given the two representations discussed earlier (direct and indirect), the choice seems to be clear. While the direct representation can capture the semantics of a concept directly (aka, a Java class “Order” can implement order specific methods), this approach would not be able to easily (or at all) deal with changes in the database representation of orders. For example, if an additional attribute is added to the JSON document representing an order (as shown above), the Java class would not be able to change dynamically and capture it.

The indirect representation, however, would not have any problems representing order JSON objects with different structure as e.g. attributes can be dynamically added. This means that instances of the “meta” Java class can represent any JSON document as stored in the database.

These “meta” Java classes are actually already available in form of JSON processing libraries. Such libraries support the creation of JSON structure representations and they can represent any JSON object (or JSON array) that implements a correct JSON syntax.

Summary

At a first glance it looks like databases that support JSON without enforcing a schema at the same time are a good choice for ease of data management. And, using the indirect representation approach applications can deal with dynamically changing JSON objects or JSON objects of different structure representing the same concept.

However, as always, there are more details to discuss and additional aspects are going to be examined in the next blog.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

Advertisements

Relational Data in a Document-oriented NoSQL Database (Part 2): Schema

The ‘schema-less-ness’ of document oriented databases is touted as a major plus and advantage for these systems. Why is that?

Relational Table vs. Document Collection

Terminology-wise, relational tables are in the domain of relational database management systems. Tables contain data in form of rows. Document collections are in the domain of some NoSQL databases that support the document structure (e.g., MongoDB). Document collections contain data in discrete documents. Document collections are used as the basis for the following discussion.

One or Several Schemas?

Upfront, document oriented databases have some form of schema built-in. First, there is the concept of collections. Before any document can be stored, a collection must be in place. Second, data to be stored in collections must be documents complying to a document data structure, in many cases this is JSON.

These constraints means that each document is structured according to JSON. If a given document is seen as an instance of a document schema, then each document actually complies to that schema. That schema is implicit as it is not externalized and represented separately. If each document in a collection is different, then there are as many (implicit) schemas as there are documents. If all documents have the same internal structure, then all comply to the same implicit schema.

In the general case, for a given collection, there can be as many implicit schemas as there are documents. In the minimal case, there is no schema (if the collection is empty) or one schema if all documents comply to the same implicit schema. In contrast, in the relational tables, all rows always comply to the table definition.

Schema Enforcement

A relation enforces the structure of its rows. In contrast, a collections does not enforce the structure of its documents. As long as a document is in a consistent data type (e.g. JSON), it can be stored in any collection.

In a given collection it is possible that all documents have the same internal structure. In this case they would all follow the same schema. In the extreme case, each document has its own structure not shared by any other document and that then means that each document has its own schema. A collection does not enforce the schema of its documents.

If document schemas have to be enforced, it can only be done outside the document database, either by the code that writes documents to collections (database inbound) or by the code that reads documents from collections (database outbound). In the inbound case, this ensures that all documents are actually of a given structure. In the outbound case, documents that do not comply will never be processed or they will be changed on the fly in order to comply. Alternatively, the reading ocde can throw an exception if it finds a non-compliant document and then the document can be processed in order to make it compliant.

Querying and Programming Model

In a relational world database queries can assume that all rows in a table are of the same structure and of the appropriate type. The programs accessing the database and retrieving tuples can assume the same. From a programming model perspective there is no variation as the structure is fixed and therefore predictable.

Accessing a document-oriented database is different as neither the queries nor the accessing programs can assume any particular document structure (as the database systems does not enforce any structure). Assumptions can only be made if a fixed document schema or a fixed set of variations is enforced elsewhere by convention. This requires that the programming model is extremely aware of the potential variety and has to understand how to deal with this variety.

Discussion

With all these dynamic possibilities, it is therefore no problem to store relational data into a document-oriented database management system. A document-oriented database management system can deal with structured data, even though it cannot enforce the structure. Even in the case that relations change over time can be handled by a document-oriented database system as it allows documents changing their shape over time.

Coming back to the initial question: The ‘schema-less-ness’ of document oriented databases is touted as a major plus and advantage for these systems. Why is that? The answer might lie in the ease that allows to store documents with different implicit schemas in the same collection.

Big caveat: Storing documents of different schemas is ‘easy’. However, that pushes the complexity of dealing with documents of different implicit schemas to the programming model: it has to be able to deal with the variation. And for the most part, I believe this is uncharted territory as there are no programming language constructs that allow to characterize variation during processing.

Computing Technology Inflection Point: Happening Right Now

I believe that computing in industry is at a computing technology inflection point right now based on some supporting observations:

  • Databases. Through the NoSQL movement in the database world a shift is taking place:
    1. New database management systems are being created, made available and put into production
    2. The new database systems are by enlarge schema-less and/or support dynamic schema changes
    3. Many support JSON as data structures or data model that is inherently supported by Javascript
  • Server. The Javascript language is enabled as server-side programming language:
    1. Google’s V8-Engine together with Node.js and its Eco-system makes Javascript a real contender as a server-side language
    2. Javascript does not have a class/instance model and supports schema-less programming, including dynamic schema changes
    3. Integrated Development Environments (IDEs) provide full Javascript support
  • User Interface. The combination of HTML5/Javascript is gaining traction:
    1. HTML5 in conjunction with Javascript is a real powerful web development technology set
    2. Javascript libraries are being built that can span all types of user interfaces and user devices (including mobile)
    3. It is possible to have a streaming query from the database all the way to the user interface; its frictionless

Combining all these observations, two major characteristics of a new computing technology stack are crystallizing very clearly:

  • One cross-layer programming language (Multi-tier Programming)

    • It is possible to use the same language in the User Interface, Server and Database, i.e., across all architectural layers
  • Inherent schema-less computing and dynamic schema change support
    • All layers shift to schema-less technology and models

I believe that this is a real inflection point happening right now and will cause a major shift in system design, engineering and evolution. It leaves behind the notion that each layer in a typical technology stack must have its specialized language. It also leaves behind the notion that every concept across domains should be best represented in the Class-Instance paradigm. It opens up the ability to dynamically evolve within the systems’ schema-less or dynamic schema change capabilities and it supports the reuse of logic and types across all layers without loosing or changing semantics.

If this combination of technology continues to gain major traction, I would not be surprised to see that some of the first operating system prototypes written in Javascript will become mainstream at some point in time. And, of course, it would be interesting to see how far the processor manufacturers are in their thought processes or research projects with putting dedicated Javascript execution functionality on the processors themselves.

Being around at this computing technology inflection points is exciting and professionally lays before us interesting choices in business decisions as well as career decisions.