NoSQL: No-Tation

This is too good to pass as there is a ton of truth in this simple presentation below. It is cited from http://java.dzone.com/articles/history-databases-”no-tation” without modification:

NoSQL_No_Tation

What Flavor of SQL?

The question now is: what flavor of SQL is the SQL in “No, SQL!”?

One alternative is to start from scratch and invent a SQL grammar and semantics that is not based on the existing SQL standard and SQL implementations. That would not be my choice at all, for the obvious reasons.

Another alternative is to treat JSON ‘just’ as a data structure and embed it into existing SQL. This would be roughly the equivalent of the object/relational model (~NF2) with the variation that there is no global schema and some non-standard SQL types.

The (in my opinion) better alternative is to look at JSON as more than just a data structure and combine JSON with the schema-per-document paradigm as the underlying philosophy.

  • This would be a real game changer as it would cater to the dynamic changing JSON data structures as needed in many projects.
  • As a result, the schema-per-document model would be integrated with the relational model so that a single database management system can support several data models and their paradigms concurrently.

Extending SQL Semantics

Supporting a schema-per-document philosophy requires extending the SQL semantics and its operators to be able to deal with the fact that different JSON structures might have a different schema, even if they are in the same collection or relational table.

Operators like typeOf() come to mind that can test if a path into a JSON structure refers to a specific JSON type or JSON literal name. Another operator is exists() to test if a path into a JSON structure exists. Or array operators that can determine if an array contains a specific JSON structure as an element. containsPath() is an operator that checks if a JSON structure has the path given.

Many more operators can be envisioned to make managing and querying JSON structures in context of SQL expressive and useful.

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

Advertisement

NoSQL Databases: Data First, Schema Second? Or Vice Versa?

When using NoSQL databases, the notion of ‘schema’ enters the picture sooner or later. But when is the best time? And what to do about it?

Global vs. Local vs. Mixed Document Schema

One of the first distinctions is a global vs. local vs. mixed schema. A global schema is a schema that is defined for a given set of documents, e.g., a collection or table of documents. Every document in the collection must comply to the schema defined for the collection.

A local schema is a schema for a single document. Every document can have its own schema. It is possible that several documents follow the same schema. However, those are in general not grouped based on their schema.

A mixed schema is in part a global schema, and in part a local schema. This means that a document must contain certain properties as defined by the global schema, and the local schema allows additional schema elements on a per-document basis.

In terms of system examples:

  • Oracle NoSQL [http://www.oracle.com/us/products/database/nosql/] follows the local schema approach.
    • Documents in Oracle NoSQL are grouped by keys. Each document can have its own schema, aka, a local schema.
  • MongoDB [http://www.mongodb.org/] follows the mixed schema approach.
    • There is one property that must be present: ‘_id’ in all documents across all MongoDB collections and it must be unique. In this sense, MongoDB does not follow a pure local schema approach as one mandatory property is specified globally.
  • FoundationDB [https://foundationdb.com/] follows the global schema approach.
    • FoundationDB implements the concept of ‘table groups’ and supports query results to be serialized as JSON objects. However, from the viewpoint of the data model, it is relational and the hierarchical structure (aka, sub-documents) comes into play through foreign keys and SQL extensions that have been explored a long time ago in context of NF2 relations.
  • Oracle 12c [http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246] follows the local schema approach.
    • A JSON document is stored in a column of a table. That column is completely schema-free so that JSON documents of any schema can be stored and hence the schema is local.

Explicit vs. Implicit Document Schema

A second important distinction is an explicit vs. an implicit schema (or extensional vs. intentional schema). An extensional schema is defined through a schema representation format (e.g., Avro [http://avro.apache.org/]) or SQL-style DDL statements. An intentional schema does not have a separate representation, but it can be derived from the structure of a document instance (not always unambiguously).

In terms of system examples:

  • Oracle NoSQL: supports both, explicit and implicit schemas
  • MongoDB: implicit schema
  • FoundationDB: explicit schema
  • Oracle 12c: implicit schema

From a different perspective an implicit schema means that a document can be stored as it is without having to define a schema for it and without checking that it conforms to a schema. An explicit schema requires the schema to be defined and that documents are compliant before they can be stored successfully.

The ‘edge case’ is MongoDB that enforces the property ‘_id’ in every document. If it is not present, it will be automatically added. While MongoDB supports mainly an implicit schema, ‘_id’ is the exception.

Mandatory Document Schema

A third important distinction is the requirement for a mandatory schema before documents can be stored. If the database requires a schema, documents cannot be inserted before the schema is specified. If a schema is not mandatory, documents can be stored without having a schema in place.

In terms of system examples:

  • Oracle NoSQL: no mandatory schema
  • MongoDB: no mandatory schema
  • FoundationDB: requires a mandatory schema
  • Oracle 12c: no mandatory schema

Roles of Database Users

Even though one could get the impression, not everybody using a database in his or her role necessarily likes a document database where every document can have its own schema (schema-less, or better, schema-varying database). Some users do, some users don’t. For the sake of discussion, let’s distinguish two roles in this blog:

  • Data Collector. A data collector is a role for collecting data initially. A data collector determines interesting data to store and that data might or might not be used downstream for further processing. However, it is important for some data to be collected in case it becomes important down the road. And it is important to store data unmodified as the rules of modification (e.g., cleansing, transformation) might not be known at the time of the data collection.
  • Data User. A data user fundamentally applies all CRUD (create, read, update, delete) operations on a data set and in most cases through an application system that implements the business logic as well as the business rules. The data user is familiar with the business logic and the business rules in context of the application domain, like a financial application or a forecasting tool.

A data user can be a data collector also as the C (create) function creates data. In this case a data user can be a data collector as well.

Schema First or Schema Second?

From the viewpoint of the two roles, schema management plays an important role. In a black-and-white categorization, the two roles have the following desires:

  • Data Collector. A data collector’s goal is to collect relevant or potentially relevant data. Depending on the data sources, there might not be time or opportunity to define a schema first, and it might not be feasible to maintain a schema in the long run (including schema migration). A data collector therefore likes
    • local schema – implicit schema – not mandatory schema (‘schema second’)
  • Data User. A data user has to accomplish work and prefers clear-cut business rules and business functionality so that the application semantics is clear. A data user therefore likes
    • global schema – explicit schema – mandatory schema (‘schema first’)

The data collector stores data into a data collector database, and the database of the data user is called data user database. Both databases can be the same, or they can be two different databases, as discussed later.

A data collector, after having collected data, might want to query it for e.g. statistical and analytical purposes (e.g., how much data was collected, how many documents have a reference to a product description, etc.). So a data collector would prefer to have a schema after the collection of data in support of query formulation; so schema second.

A data user rarely operates on data directly, however, software engineers have to implement the business logic and business rules. Software engineers, for sure, enjoy certainty when it comes to a schema as variations cause significant code complexity in the general case; so schema first.

Co-existence of Schema First and Schema Second?

In the ideal case, both approaches, schema first and schema second are supported at the same time. This would make both roles happy and support their particular use cases. Let’s explore a few options:

  • ETL (extract – transform – load) from data collector database to data user database
    • This approach suggests an explicit transformation step that extracts data from the data collector database and adds it to the data user database. Along the way data type transformations can take place as well as handling of null values, absent properties, and other data modeling specifics. The extraction can be partial so that only relevant data are extracted. However, data duplication (at least partial) is one downside, amongst others.
  • View on data collector database
    • This approach creates a view through which access is provided on a single database (data collector and data user combined). A view could deal with the various transformation tasks; however, updates and deletions might be difficult or even impossible. So this is a potential solution only for the case of read access.
  • Automatic schema extraction
    • A schema can be derived from an document. It is therefore possible that for the data collector database the set of all possible schema can be made available to the data user. If the data user creates a super-set then a global schema is available. Of course, for a given document the schema only partially applies and the business functionality and business rules have to be aware of this. In this case also, the data collector and data user database are the same.
  • Intelligent ORM Layer
    • An ORM layer could provide the impression of a fixed document structure towards the data user, while being able to deal with the heterogeneous document schemas internally. If the ORM layer is flexible enough, it can provide updates as well as delete functionality, and if necessary, an extension mechanism to add custom code in order to make the delete or update functionality specific to the given document set. In this case also, the data collector and data user database are the same.

From an implementation perspective an ORM layer seems to be a practical approach as it allows to separate the transformation and update/delete logic from the application logic, while operating on a single database. However, every data access has to execute some transformation logic in general.

If space is of secondary concern or if the data set for the data user is a lot smaller than that of the data collector, the ETL approach might be preferable as the transformation logic is separate from the data access logic of the application systems.

Automatic schema extraction is certainly helpful in all cases as the schema has to be known in order to implement the ORM layer or the ETL component. If NoSQL databases start implementing a view mechanism then this might be preferable for read-only access situations.

Schema-Varying Languages?

It is, of course, tempting to ask if there is an easy and elegant way to deal with local schemas in application systems? Is is possible to write an application system that does not require a schema in the first place?

While this is a huge topic on its own, server-side JavaScript might be a good place to start as the language is not based on a class/instance paradigm, but prototype approach. It’s type system is almost equivalent with JSON. The language, therefore, is able to represent documents with local schema easily and effortlessly. Since JavaScript can introspect objects and since it implements the prototype mechanism it is possible to represent local functionality for documents with a local schema. One of the bigger questions is how to represent this flexibility to the end user on user interfaces in an ergonomic way.

However, this is a discussion on its own and I’ll save it for a later blog.

Summary

When is a good time to deal with the notion of ‘schema’ in NoSQL database projects? As the discussion has shown, different databases provide different schema support and different users look for schema support at different point in the data life cycle, if at all.

So unless schema is completely irrelevant in your project (and will stay irrelevant for sure), the discussion cannot start early enough because depending on its importance it might influence the database selection as well as the overall product architecture and implementation effort around schema maintenance and enforcement.

Schema-free Database (Part 2): Relational Database Management System (RDBMS)

As outlined in Part 1 of this series (https://realprogrammer.wordpress.com/2013/11/02/schema-free-database-part-1-an-oxymoron/), a ‘schema-free database’ is an oxymoron and in fact the notion of schema is changing from a more restrictive to a more flexible interpretation in context of NoSQL database technology.

So it is only consequential to ask the question the other way around (as a thought experiment): is it possible to build a relational database management system that does not enforce a schema, and if so, how would such a system look like on an abstract level?

Yes, it is possible to have a non-schema-enforcing RDBMS. Let’s discuss two variations next.

Definition of No-Schema-Enforcing Relational Database Management System

What functionality would be altered in order to provide a no-schema-enforcing RDBMS? If it were possible to create a table without specifying columns (aka, only a table name), and then to insert, update and delete rows, then a ‘schema-free’ RDBMS would be in place. This would mean in detail:

  • Rows do not have to comply to a schema when inserted into a table. Different rows in the same table could have different attributes (columns) and the same attributes of different rows could have different domains (flexible type system).
  • By defining a table without specifying columns (names and domains), a table would not define a primary key, either (neither a simple, nor a composite key). Applications inserting or updating rows can behave nicely and add properties with values that comply to the primary key semantics, but the RDBMS would not be aware of it and consequently would not enforce primary key compliance.
  • By the same token, foreign keys would not be enforced by the RDBMS for the same reasons.
  • Since no primary key enforcement is in place, duplicate rows will not be prevented by the RDBMS and any supervision is left to the application systems.
  • Indexes are independent of schema specification and assuming that indexes are maintained on tables, not all rows might be present in an index if the attributes defined by the index are not contained in a row.

Surprisingly (or not), defining a no-schema-enforcing RDBMS is pretty straight forward.

Variation on No-Schema-Enforcing RDBMS

An interesting variation of a no-schema-enforcing RDBMS could be that a schema, primary keys, foreign keys, etc., are specified as usual, however, without being actively enforced; instead, warnings are given by the RDBMS. For example, a row not complying to the schema can actually be inserted, but the result would not be a ‘OK’, but a warning indicating a schema violation.

This can be described as a ‘middle ground’ in widening the schema interpretation where the RDBMS is aware of a schema and warns of violations without rejecting the various DML operations.

Characterization of No-Schema-Enforcing RDBMS

Could a no-schema-enforcing RDBMS (any of the variations) be a useful database management system? Yes, as it would be the equivalent (on the relational model) to NoSQL databases (on JSON/BSON model or key value model).

For use cases where the flexible schema interpretation is key, such a no-schema-enforcing RDBMS could fit the bill (possibly better) than a NoSQL database system if the use case is fundamentally relational in nature (as opposed to e.g. hierarchical or key/value) and if SQL as the query language is important.

Further Exploration

There are additional areas in a RDBMS that will have to change their behavior in a no-schema-enforcing implementation. Only briefly (and not exhaustively), these are

  • Triggers. Triggers are specified on tables and state changes of rows. If particular attributes are referenced inside the trigger, then not every update, insert, read or delete will execute the trigger logic.
  • Stored procedures. Stored procedures often have parameters of specific types and assume a specific set of attributes when processing rows. In a no-schema-enforcing situation the stored procedure has to be able to deal with variations of rows.
  • Functions and function extensions. Functions have to be changed similarly to stored procedures. Not only from the viewpoint of parameters, but also the processing logic.
  • Aggregation. Aggregation will have to change in various ways as the various aggregation functions cannot assume that all attributes are of the same type. Neither can they assume that all attributes are actually present in all rows of a table.

In principle, every concept and every implementation aspect of a RDBMS needs to be re-examined wrt. a wider and more flexible interpretation of ‘schema’. NoSQL systems, by their definition and approach, started with a wider interpretation and consequently made all the conceptual and implementation decisions. They are one source of approach in this regard.

Contact Me

If you plan to explore or to build a no-schema-enforcing RDBMS, please contact me.

Schema-free Database (Part 1): An Oxymoron

The notion of a ‘schema-free database’ keeps coming up, most recently in a meetup I attended a few days ago. Some rationalization follows divided up into the categories of ‘document’ and ‘database’.

While a generalization is easily possible, the context here will be JSON and MongoDB as these are two practical implementations that are available and often used as examples of a ‘schema-free database’. Those provide a nice constraint technology set as an example, while the principles apply to a whole range of other technology, of course.

Document

A JSON document, in short: document, follows a set of construction principles outlined here: http://www.json.org/. This is a rather informal grammar that defines how a valid JSON document is constructed. There are no data type generators and so new data types cannot be introduced; therefore, every document is constructed from the fixed set of types enumerated on that web page.

Document Schema

An attempt has been made to create a more formal mechanism to define a schema for JSON documents: http://json-schema.org/. This approach provides a formal language to describe the schema of a JSON document explicitly.

The json-schema approach combined with the fixed set of types available to create a JSON document means that every JSON document can be described explicitly using json-schema without exceptions. This in turns means that every JSON document has at least an implicit schema, unless it is additionally made explicit with e.g. json-schema.

Therefore, JSON documents have a schema, an implicit one and optionally an explicit one. Depending on the particular schema definition language approach itself, a document might match more than one schema, but that is left for a separate discussion.

Set of Document Schemas

Given a set of JSON documents it is now possible to characterize their relationship to schemas. In the ‘best’ case, all documents follow the same schema; in the ‘worst’ case, each document follows its own schema. And there are cases in-between where a subset of the documents validates against a schema, and another subset against another schema. Depending on the design, one JSON document might validate against different schemas.

The relationship between documents and schemas is n:m in general.

Database

In context of a database, there are a few interesting questions in this context:

  • Does the database understand a document representation (e.g. JSON)?
  • Does the database enforce a document representation?

And:

  • Does the database enforce a schema?
  • Does the database understand a schema?

Let’s answer these questions for MongoDB specifically:

And:

  • MongoDB enforces a partial schema. Each document must have a property called “_id”; if the document being inserted does not have such a property, one is automatically added.
  • MongoDB does not understand an explicit schema as it does not provide for a mechanism to load a schema definition language.

MongoDB, however, understands implicit schemas as MongoDB does allow to e.g. create an index on any property of documents. So MongoDB recognizes properties.

Furthermore, MongoDB supports aggregation functions and supports e.g. the sum of properties across documents (https://realprogrammer.wordpress.com/2012/11/04/null-undefined-nan-and-missing-property-goto-considered-harmful-part-2/). So it is data type aware and implements operators (e.g. sum) on those.

Conclusion: The Notion of ‘Schema’ is Changing

This rather brief discussion clearly rationalizes that the label ‘schema-free database’ is not applicable to technologies such as those discussed in this blog (JSON/BSON, MongoDB).

Hence these technologies are not an example of ‘schema-free database’, to the contrary: they demonstrate that the notion of ‘schema’ can have a wider and more flexible interpretation then what relational databases bring forward.

Schema Derivation: Schema-less vs. Schema-based Documents

In the world of document-oriented databases there is the notion that documents are ‘schema-less’ because no schema has to be defined and registered with the database before documents can be stored. Sometimes this is also referred to as ‘schema-free’. Let’s investigate this further: does ‘schema-free’ really exist?

JSON

In MongoDB, for example, documents use the JSON (http://www.json.org/) structure (with an extended set of types as defined in BSON http://bsonspec.org/). So all document in the database follow the same underlying data type structure. The database will reject a document that is not JSON/BSON compliant.

Schema Enforcement vs. Schema Freedom

In a relational database the notion of schema means that the database knows about the schema (e.g. structure of a table) and all data stored must follow that structure. The database management system is the enforcing entity of the schema. Data not following the schema will be rejected.

In a document-oriented system like MongoDB the notion of schema is not known to the database and therefore not enforced by the database. Any document can have any structure as long as it is valid according to JSON.

Explicit Schema vs. Implicit Schema

In the case of relational databases, the schema is explicit as it is possible to retrieve the schema as a separate data structure from the database. Relational databases have system tables that contain their schema. So the database is aware of the schema in such a way that it can be retrieved.

In a document-oriented database it is not possible to inquire the schema as it has not been defined to the database. However, it could be the case that all documents follow a schema (that is defined outside the database). If this is the case, the schema of the documents can be called an implicit schema. It is not necessary for the database to enforce a schema for the documents to comply to one.

Schema-less, Schema-free

The fact that a document-oriented database does not enforce a schema of its documents does not mean that there is no schema. It only means that the database does not enforce it. However, it could be enforced elsewhere. For example, the software that writes the documents can enforce their schema. Or the software that reads the documents enforces the schema by only accepting those documents that comply to its expectations.

But what if a user manually defines a documents and stores it?

Individual Schema vs. Global Schema

A different viewpoint is possible. If a user stores a manually designed document then it is possible to view this document of having an individual schema. It is the only one with that schema. If the user now adds a second document that has the exact same structure, then it is possible to view these as two documents with the same implicit schema. At this point there are two documents following the same (implicit) schema. It got a bit more global, tough, as now two documents are following it. If all documents follow the same (implicit) schema, then there is an (implicit) global schema.

Different documents can follow different (implicit) schemas. In the extreme case, all documents are different from each other in their structure and each has its own schema.

Schema Derivation

JSON is a construction principle for hierarchical data structures. Each document following the JSON model is hierarchical and finite as JSON does not support (cyclic) references.

Based on the hierarchical and finite structure of JSON it is possible to derive the implicit schema of the JSON documents and make it explicit. On a very high level, the schema derivation algorithm iterates over all documents in the database and determines the schema for each of it. Documents that follow the same schema are grouped together. After all documents have been inspected, there will be one (and possibly more) schema and the associated complying documents.

Details of this algorithm will be discussed in a different blog at some later time.

Summary

Documents have a schema, there are not schema-less or schema-free. The schema of a document is implicit (and can be made explicit). Several documents can follow the same schema. While the database is not enforcing the schema, it could be enforced by a separate layer outside the database.

So, not-enforcement of a schema by the database != schema-less/schema-free.