Flashback/Look-ahead – SQL: The Kingmaker of NoSQL Databases?

Over 1 ½ years ago I suggested that SQL will turn out to be an important factor for the success of NoSQL databases (https://realprogrammer.wordpress.com/2014/03/04/sql-the-kingmaker-of-nosql-databases/). Did that happen?

NoSQL / OldSQL / NewSQL: SQL Language Support on the Rise

When surveying a subset of the various database systems it becomes apparent that the support for relational SQL as well as JSON-supporting SQL is increasing. Right now this does not necessarily mean that there is an agreement on the particular language syntax, its semantic interpretation or execution style (declarative vs. functional vs. procedural). However, there is an (implicit) agreement that a (declarative) query language is necessary or advantageous, that SQL is a guiding principle and that JSON needs to be supported to various extent. A (not necessarily complete) list of different systems follows that have a (JSON-supporting) SQL language implemented in one form or another:

From the above selection it becomes clear that a lot of NoSQL, OldSQL, and NewSQL systems invest in SQL as well as JSON-supporting SQL and that the number of systems that support (JSON-supporting) SQL is definitely increasing.

Given that implementing the efficient execution of SQL is a significant engineering effort (and investment), JSON-supporting SQL is obviously seen as an important feature.

As an aside, an intriguing question is: why do database systems that fall into the category of “NoSQL” databases start providing SQL? It appears that the “No” in “NoSQL” is being re-interpreted from “no-SQL” to “no-Relational Model” with JSON-supporting SQL support. One possible answer to the question is the increasing complexity of queries that are being issued. While simple interfaces are good to get going initially, over time, as applications become more involved, complex queries reflect the increasing domain complexity and that is going to be reflected in the query language as well (for a taste, see here: https://realprogrammer.wordpress.com/2014/10/27/oracle-12c-sql-for-json-part-2-basic-queries/ – toward the end of the blog). Another (additional) possible answer is that existing tools (e.g. reporting tools) that are JDBC compliant and can process relational data should be able to connect also.

JSON-supporting SQL Language: Discussions are Ongoing

Discussions about the benefit of JSON-supporting SQL in context of NoSQL databases are coming up repeatedly (e.g., LinkedIn group discussions: https://www.linkedin.com/groups/2085042 (NoSQL Group)). Interesting enough, most of the time, the discussion is not separated or structured by syntax vs. semantics vs. functionality. SQL is developed in context of the relational model and that does not preclude complex data structures like JSON, the opposite is true. Object / relational functionality as well as NF2 work deals with complex structures, including arrays and maps. So it would be prudent, in my mind, to have a separation in the discussion of

  • Data structure support (aka, JSON)
  • Query functionality
  • Query language syntax

I expect the discussion of JSON-supporting SQL to lead towards a declarative language with SQL syntax, based on the relational model, paired with complex data types (e.g. JSON) and a schema-per-object or schema-per-row semantics.

The interesting aspect on the SQL syntax and interpretation is that systems that also support the relational model extend those to support JSON, and these systems will extend SQL, not replace it (as can be seen already).

SQL Support for NoSQL Databases: Best Approach Forward?

In my mind the best approach forward would be standing on the shoulder of giants: take SQL and extend it to support JSON in a very careful but very expressive way, learning from the object / relational work as well as the NF2 work that took place a long time back. Stay true to the learnings of the benefits of a declarative language and the underlying possibility of optimization, be it functional or non-functional. In short, add the domain type JSON and cater for a schema-per-document model.

If JSON support is one of the common functional denominators of most of the databases then the distinction is not really achieved by variations of (JSON-supporting) SQL, but by variations of the non-functional support like scale, throughput, low latency, safety, reliability, etc.

I believe that JSON-supporting SQL will be converging, while the non-functional features will set the various systems further and further apart in order to provide the designed-for sweet spot and establish their competitive advantage.

Summary

Clearly, the jury is still out, however, compared to 1 ½ years ago, whispers are emerging that JSON-supporting SQL is becoming an important feature of NoSQL, OldSQL and NewSQL databases.

Clearly, the pendulum swings toward the SQL language grammar and functionality that supports JSON. And that is a good thing in my opinion for the various obvious reasons.

Clearly, discussions keep coming up about JSON-supporting SQL, the various styles and language proposals. This is another reason to believe that SQL support in NoSQL databases and in context of JSON is important.

Let’s check in again into this topic next year some time. It is always interesting to follow such a hotly debated topic and establish checkpoints along the way..

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Advertisement

Oracle 12c – SQL for JSON (Part 3): Basic Joins

Having JSON support in a relational database means that the join operator is available not only for joining relational data or JSON data, but also for the mixed case: joining JSON and relational data. This opens up a whole new world of data modeling and query execution.

Running Example

This running example creates three tables, “demo”, “city” and “city_rel” and a sample data set in each table. The tables “city” and “city_rel” contain the same data set, once in JSON format, and once in relational format.

DROP TABLE demo;
CREATE TABLE demo
(
  id NUMBER,
  person CLOB 
    CONSTRAINT person_ensure_json 
    CHECK (person IS JSON (STRICT WITH UNIQUE KEYS)));
INSERT INTO demo VALUES
( 1, '{ "name": "Bob", "city": "SF"}' );
INSERT INTO demo VALUES
( 2, '{ "name": "Jake", "city": "PA"}' );
INSERT INTO demo VALUES
( 3, '{ "name": "Alice", "city": "NYC"}' );
INSERT INTO demo VALUES
( 4, '{ "name": "Jenn",  "city": {"name": "Tokyo"}}' );
INSERT INTO demo VALUES
( 5, '{ "name": "Jenn",  "city": ["Tokyo"]}' );
INSERT INTO demo VALUES
( 6, '{ "name": "Jenn",  "city": 66}' );
DROP TABLE city;
CREATE TABLE city
(
  id NUMBER,
  city CLOB 
    CONSTRAINT city_ensure_json 
    CHECK (city IS JSON (STRICT WITH UNIQUE KEYS)));
INSERT INTO city VALUES
( 101, '{"city": "SF", "state": "CA", 
  "country": "US"}' );
INSERT INTO city VALUES
( 102, '{"city": "PA", "state": "CA", 
  "country": "US"}' );
INSERT INTO city VALUES
( 103, '{"city": "NYC", "state": "NY", 
  "country": "US"}' );
INSERT INTO city VALUES
( 104, '{"city": {"name": "Tokyo"}, "state": null, 
  "country": "Japan"}' );
INSERT INTO city VALUES
( 105, '{"city": ["Tokyo"], "state": null, 
  "country": "Japan"}' );
INSERT INTO city VALUE
( 106, '{"city": 66, "state": null, 
  "country": "World"}' );
DROP TABLE city_rel;
CREATE TABLE city_rel
(
  id      NUMBER,
  city    VARCHAR(255),
  state   VARCHAR(255),
  country VARCHAR(255));
INSERT INTO city_rel VALUES
( 1001, 'SF', 'CA', 'US' );
INSERT INTO city_rel VALUES
( 1002, 'PA', 'CA', 'US' );
INSERT INTO city_rel VALUES
( 1003, 'NYC', 'NY', 'US' );
INSERT INTO city_rel VALUES
( 1004, '{"name": "Tokyo"}', NULL, 'World' );
INSERT INTO city_rel VALUES
( 1005, '["Tokyo"]', NULL, 'World' );
INSERT INTO city_rel VALUES
( 1006, '66', NULL, 'World' );

JSON Join

The following SQL statement is a simple join between JSON structures on the property “city”:

SELECT *
FROM demo d, city c
WHERE d.person.city = c.city.city;

This SQL statement projects in addition to joining JSON structures:

SELECT d.person, c.city
FROM demo d, city c
WHERE d.person.city = c.city.city;

The following SQL statement extends the projection:

SELECT 
  d.id,
  d.person.name,
  d.person.city,
  c.id,
  c.city.state,
  c.city.country
FROM demo d, city c
WHERE d.person.city = c.city.city;

JSON – Relational Join

 This SQL statement shows the join between JSON and relational data, combined with a projection:

SELECT 
  d.id,
  d.person.name,
  d.person.city,
  c_r.id,
  c_r.city,
  c_r.state,
  c_r.country
FROM demo d, city_rel c_r
WHERE d.person.city = c_r.city;

Significance of Pure and Mixed JSON Joins

As shown, the join operator is applied easily within JSON tables and across JSON and relational tables. When using Oracle 12c there is no restriction anymore when it comes to the join operator in conjunction of JSON documents.

Pure JSON joins are possible in context of Oracle 12c. This means that developers have a choice to model all data in a pure document form (trying to avoid the need for joins by creating sub-collections – which is almost impossible without denormalization), or to consciously model documents in such a way that the document nature is applied where applicable without having to necessarily de-normalize as the join operator is available.

The mixed case between JSON and relational tables goes a lot further as now data can be modeled according to its nature (not all data is exclusively document-oriented or relational) and its access path requirements without compromising either way.

In addition, the mixed case supports the situation where data is already present in the database in relational form and new data is added in JSON form. This means that even if data is available in relational form, additional data does not have to be in relational form, and the most appropriate representation can be chosen (and no separate document-oriented database has to be deployed, btw).

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Trending: Multi-Interface and Multi-Data-Model Databases

An interesting development, especially in the NoSQL database space, is the development towards multi-interface and multi-data-model databases, and sometimes both at the same time. While it provides flexibility, it also brings challenges.

Multi-Data-Model Support

In the relational database space, supporting different data models concurrently is not a novelty. Relational databases started off with the relational data model implementation, and later on some of the systems extended the relational model mainly by objects, XML or JSON.

Some databases in the NoSQL space are starting to evolve, too, in this manner and are providing more than one data model concurrently. Some interesting examples are discussed next.

One example in the NoSQL space is Oracle NoSQL [http://www.oracle.com/technetwork/database/database-technologies/nosqldb/]. This system supports a key/value data model whereby the key is used to identify values that are not interpreted by the database itself. In addition, values can be of complex types that are actually interpreted by the database, e.g., in secondary indexes.

Aerospike is another example in the NoSQL space [http://www.aerospike.com/]. Aerospike provides a data model consisting of basic and complex types. In addition, it supports language-native formats as well as large data types that have a specific operational characteristics and data type operations tuned for scale.

Like some relational databases extended data models over time to support specific use cases in a more direct way, some NoSQL databases are also going down that path to more directly support specific application developer needs.

Multi-Interface Support

From an application development perspective a single query API is certainly preferable that provides the complete query expressiveness required. However, especially in the new area of NoSQL databases, it is not clear yet what a good query API actually looks like. This can be observed by different systems providing different query API alternatives.

MongoDB [http://www.mongodb.org/] has a document query interface based on query patterns in form of JSON documents (“Query Documents”). In addition, it provides a map/reduce interface and aggregation pipelines. There are three different APIs that an application developer can choose, and, in addition, they overlap in their functionality. This means that, depending on the query, it can be expressed in all three of them.

Aerospike [http://www.aerospike.com/] provides different language drivers in addition to an Aerospike Query Language.

Cloudant [http://www.cloudant.com], in contrast, supports a REST-api as well as a query interface based on query documents (similar to MongoDB).

Not strictly an external interface, but very important for specific use cases, is the ability to add functionality dynamically to the database in order to move some processing from the application systems into the database itself: user defined functions. For example, MongoDB allows adding functionality through JavaScript functions, whereas Aerospike supports two different types of Lua functions: record user defined functions operate on single records, whereby stream user defined functions support distributed processing.

The Good

Unquestionably, the good part about multi-interface and multi-data-model databases is that an application developer can choose the best combination of data model and access interface for a particular development task. The impedance mismatch between the problem and the solution can be minimized with an appropriate choice.

This also means that developers need to understand the pros and cons of every combination and that requires to go through a learning curve. Going through that learning curve might pay off big time.

In addition, application development teams will have to manage a wider range of implementation variations in terms of application design and engineering, but also in terms of bug fixing and application code maintenance.

The Tricky

The tricky part of multi-interface and multi-data-model databases is that all combinations can be used concurrently, in production as well as post-production (e.g. analytics). Unit and functional tests as well as performance and scale tests become a lot more complicated as they have to test the concurrent execution of various combinations.

Furthermore, many queries can be expressed in different interfaces as those tend to overlap in query expressiveness. So an application developer needs to clearly understand the pros and cons of the query execution that underlies a specific query interface.

Hopefully the query semantics is the same for all combinations (meaning, for example, that predicates are evaluated the same way) and that concurrent use of data models and interfaces does not negatively impact the various clients in terms of concurrency, scale and performance. Any bug introduced through a discrepancy might be very difficult to reproduce and fix.

Summary

While multi-interface and multi-data-model databases are a powerful technology, there is considerable impact to the application system engineering activities in terms of knowledge acquisition, development, test and maintenance.

While database vendors certainly strive to have all combinations work in harmony, there might be edge cases where one combination does not give the same result as a different one. From an application development perspective test coverage should ensure semantic equivalence of the used combinations so that misinterpretations or wrong results are avoided.

Document-oriented NoSQL Databases: Oracle NoSQL’s Shift in March 2014

I believe Oracle NoSQL extended and expanded from a key/value store to a document-oriented NoSQL database in March 2014. If you concur, then this was a major shift in the NoSQL arena of document-oriented databases.

Document-Oriented NoSQL Databases

In a narrow interpretation, document-oriented NoSQL databases use JSON or BSON as the query interface data representation. In a wider interpretation, any scalar data type and any composite data type (like maps or array) is available at the query interface without those necessarily complying to the JSON encoding or interpretation.

Some of the known document-oriented database are MongoDB, CouchDB and RethinkDB, amongst many others. The site http://db-engines.com/en/ranking/document+store has a categorization and ranking readily available.

In addition to storing documents (aka, complex data structures with scalar, array and map data types), another important criteria is the ability to query based on the document content by e.g. selecting documents where specific properties have specific values or comply to specific complex predicates (in the general case).

Key/Value Stores

Key/value stores (k/v stores) are different from document-oriented databases. Key/value stores persist a value for a given key. Retrieval is based on the key and values cannot be used in query predicates. Some k/v stores have structured keys or composite keys that support querying a range or set of values.

Oracle NoSQL, for example, is supporting structured and composite keys (http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/singleget.html, http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/multiget.html).

Aerospike is another example of a k/v store (https://docs.aerospike.com/display/V3/Key-Value+Store+Guide).

K/V Stores: Major Shift

Oracle NoSQL extended its data model from a pure k/v data model to a document-oriented data model in March 2014.

The available simple data types are (cited from http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/tablesapi.html)

  • Double
  • Float
  • Integer
  • Long
  • Java byte array
  • Java String

And the available complex data types are (cited from http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/tablesapi.html)

  • Array (An array of values, all of the same type)
  • Enum (An enumeration, represented as an array of strings)
  • Fixed Binary (A fixed-sized binary type used to handle binary data where each record is the same size. It uses less storage than an unrestricted binary field, which requires the length to be stored with the data)
  • Map (An unordered map type where all entries are constrained by a single type)
  • Records

In addition, indexes can be created on the ‘value’ part, aka, the documents: http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/createindex.html. And based on indexes, the database can be queried without having to specify a key: http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/indexread.html.

The latter functionality extends the characteristics of the Oracle NoSQL database from k/v store into a document-oriented NoSQL database. Clearly secondary indexes are not yet a very expressive query language, however, they are a major first step towards it.

What does it mean?

The interesting aspect is that Oracle as a major database vendor moved with a separate and dedicated NoSQL product first into the k/v store space, and now additionally into the document-oriented NoSQL space. There are many ways to read this move, of course, but from a technical viewpoint this emphasizes the importance of a structured data model in the NoSQL space. In addition, OracleNoSQL allows a per-document Avro schema, making it a truly per-document schema database.

Aerospike, for example, also supports complex data types and secondary indexes, even though it is also widely seen as a k/v store (e.g., http://db-engines.com/en/ranking/key-value+store).

Summary

Clearly, the document-oriented segment of NoSQL databases gets more and more crowded and the attention of major players in the database space.

What hasn’t been done is a comparison of the data model and query interface expressiveness of the major players in the document-oriented NoSQL database segment. Once done, and in case that reveals that the expressiveness is the same or almost the same, then the systems are exchangeable from a functionality viewpoint and will distinguish themselves ‘only’ in terms of non-functional properties like latency, throughput, transactions, cost, support, and so on.

SQL: The Kingmaker of NoSQL Databases?

Here is an interesting question: what if all NoSQL and NewSQL databases would implement SQL?

Advanced SQL

“SQL” in this context does not refer to SQL 92 (http://en.wikipedia.org/wiki/SQL-92) or other versions of the purely relational model focused standard, but to a SQL that is extended properly in syntax and in semantics to deal with non-relational data models. Let’s name it “Advanced SQL” for the purpose of this blog.

I furthermore assume that all data in a database would be accessible, not just parts of it. For example, in key/value stores the values’ internal structure would be accessible as long as it contains data accessible in a structured way (e.g., a JSON representation).

Ideally, of course, all NoSQL and NewSQL databases would implement the same new form of Advanced SQL, but for this blog this is not really necessary or a precondition (but it would be nice, though).

If such an Advanced SQL would be available on all NoSQL and NewSQL databases, then

  • the typical SQL operators like select, project, join, and so on are available on all data in a database across all data models like relational, document-oriented, key/value, columnar, graph-oriented.
  • the same data set loaded into each NoSQL or NewSQL database would be accessible through Advanced SQL and the same query would return the same result, no matter which database (best case assumption wrt. SQL execution semantics).

Data Models

In order to be able to load all data sets into all types of databases the data models have to be equivalent in their structural expressiveness. This might not necessarily be the case today, however, the common trend appears to be that databases are either supporting JSON, or structured data containing array types and nested sub-structures alongside elementary types like string and integer.

For the sake of the blog, let’s assume the data models are equivalent in structural expressiveness.

Implication of Advanced SQL on NoSQL / NewSQL Databases

The biggest implication of Advanced SQL on NoSQL / NewSQL databases would be that all NoSQL / NewSQL databases would be equivalent from a functional perspective in terms of accessing data: a given data set could be stored in any of the databases and queried using SQL. A query would (ideally) return the same result in any of the databases storing the same data set.

The big question this bears is: what would distinguish the NoSQL/NewSQL databases in this case?

  • The non-functional aspects.

For example, to name a few:

  • Performance: the same query would take different execution time
  • Transactions: a set of queries would form a single transaction (or not)
  • Scale: the speed and extent a database can scale up/scale down with varying data set size
  • Backup/restore: how fast and how reliable can databases be re-created after disaster

This basically would mean that a user can select a database based on its non-functional characteristics. (Of course, history has shown that the particular SQL implementation of a database will have proprietary extensions also).

Benefits of Advanced SQL

There are several benefits if Advanced SQL like described above would be available.

  • A user can select not only one, but several databases for different needs, while not having to deal with vastly different query interfaces. For example, one for production system access speed and a different one for long-term storage or analytics.
  • Data can be distributed across different systems depending of how the data is used without having to deal with incompatible access interfaces.
  • NoSQL and NewSQL databases can be integrated alongside existing RDBMSs for mutual benefit.

In a nutshell, users are not forced anymore to select one databases for all their needs, but can select the appropriate one for the particular part of their system architecture.

Trending of Non-functional Features

There is an interesting trend ongoing. OldSQL and NewSQL databases start supporting JSON (e.g., Postgres, VoltDB); Informix has a compatible driver interface to the MongoDB wire format; Tokutek and GridGain implement non-functional replacements for MongoDB; Oracle NoSQL supports per-document Avro schemas for the values; Tokutek adds transaction support; and so on. This looks like the early stages of ‘sorting out’ on a data model level as well as non-functional level. Time will tell if this was the starting point of consolidation in the NoSQL/NewSQL space.

King Making

So why, if the differentiation is likely to happen on the non-functional level, is SQL a potential kingmaker? In my opinion, Advanced SQL will be the standard expectation and a database not supporting Advanced SQL will be relegated to being a niche system. So while the presence of Advanced SQL will be a distinguishing factor (possibly a small one), the absence of Advanced SQL will impact the success of a database significantly.

Relational Data in a Document-oriented NoSQL Database (Part 1): Universal Relation

Is there an equivalent to the Universal Relation in the document-oriented database world? There is: the Universal Collection.

Universal Relation

‘In relational databases, the universal relation assumption states that one can place all data attributes into a (possibly very wide) table, which may then be decomposed into smaller tables as needed.’ (from: http://en.wikipedia.org/wiki/Universal_relation_assumption).

There is a bit more behind that statement; for example, there is an assumption that the same type of data are stored in a column with the same name. E.g., if there is the concept of a ‘street name’, then all street names will be in a single column, probably named ‘street name’.

Another assumption is that if a data set does not have data for all columns of the Universal Relation (and hardly any does), then the value  in the column is ‘null’. So in general the Universal Relation is very sparse and has a lot of null values in its columns.

A comprehensive discussion can be found in http://www.informatik.uni-trier.de/~ley/db/books/dbtext/ullman89.html  (for further detailed exploration).

NoSQL Database Equivalent: Universal Collection

What is the equivalent of a Universal Relation in a document-oriented database? One organization scheme of document-oriented databases are collections of documents. In such a case a first approximation is that all documents are stored in the same collection, a Universal Collection.

The second step towards a Universal Collection is that all property names of the documents that contain the same type of data are named the same. This is in general possible as the naming is scoped by documents and sub-collections in documents.

In contrast to the Universal Relation, a document in a Universal Collection only has to have the properties that have values. Other properties do not have to be added with a ‘null’ value, they can simply be left out as the document model does not have a fixed schema; each document can only contain the properties it really requires. If this approach is taken, the Universal Collection is not sparse at all in comparison with the Universal Relation, but very compact.

Since the document model does not enforce a strictly typed schema it is possible that the same property is of different data types (in the sense of the JSON model). So it is possible that a property called ‘address’ can be a ‘string’ in one document and a sub-collection consisting of several strings in another document with both containing perfect address data. In contrast to the relational model, this is valid (and fine) in a document model (if at all, it causes problems during processing, but not from a data model perspective).

Discussion

On this level, it is certainly possible to define the equivalent of an Universal Relation in a document model: the Universal Collection. This is interesting as later on this will serve as a starting point for normalization.

In reality I have seen projects that actually store all documents in a single collection as it made it easier to query the system in comparison of distributing or partitioning documents across several collections. The question is, of course, are the same types of data stored in properties with the same property name to ensure the semantic equivalence or not. This topic will re-appear in a later blog post about document model definition.

Document-oriented Database (NoSQL Database Category)

‘Document-oriented database’ refers to databases that store their data in terms of documents. More specifically, structured documents and, by going with the times,  documents in JSON (http://www.json.org/) format at that.

Some examples are MongoDB (http://www.mongodb.org/) or RavenDB (http://ravendb.net/).

These databases belong to a category of databases called ‘NoSQL Databases’ (http://en.wikipedia.org/wiki/NoSQL). This category emerged over the last few years and in this blog site is used in that sense.

Databases that do not prescribe the content format or enforce a specific data model can store JSON by definition; so database systems that did not decide on a particular data representation upfront but left it open, can be considered document-oriented databases in case their payloads are documents.