Oracle 12c – SQL for JSON (Part 2): Basic Queries

October 27, 2014 by chbussler

This blog provides a small tour of basic SQL queries that operate on JSON in context of the Oracle Database 12c Release 1 (12.1.0.2.0).

Sample Data Set

The following very basic data set is used as an example in this blog. It is kept simple in order to be able to focus on the queries without having to deal with complex JSON objects at the same time.

First, a simple table is created that contains a JSON column. Next, some rows containing JSON objects are inserted.

DROP TABLE demo;

CREATE TABLE demo
( id NUMBER,
  player CLOB 
    CONSTRAINT player_ensure_json 
      CHECK (player IS JSON (STRICT WITH UNIQUE KEYS)));

INSERT INTO demo 
VALUES (1, '{"person": "Bob", "score": 10}');

INSERT INTO demo 
VALUES (2, '{"person": "Bob", "score": 20}');

INSERT INTO demo 
VALUES (3, '{"person": "Jake", "score": 100}');

INSERT INTO demo 
VALUES (4, '{"person": "Jake", "score": 200}');

INSERT INTO demo 
VALUES (5, '{"person": "Alice", "score": 1000}');

With the sample data set in place, we can now construct a complex query in several steps.

Selection and Projection

The most basic query selecting the complete data set is

SELECT * FROM demo d;

A basic projection extracting only the person from the JSON objects is

SELECT d.player.person FROM demo d;

A basic selection restricting the JSON objects is

SELECT d.player.person
FROM demo d
WHERE d.player.person IN ('Jake', 'Bob');

The syntax for accessing properties in JSON objects is in principle

<table alias>.<JSON column>.<path to JSON object key>

with variations on JSON array index references if required (http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246).

A more complex selection with an additional restriction is

SELECT d.player.person
FROM demo d
WHERE d.player.score > 0
  AND d.player.person IN ('Jake', 'Bob');

Ordering

Results can be ordered, for example, in the following way

SELECT d.player.person
FROM demo d
WHERE d.player.score > 0
AND d.player.person IN ('Jake', 'Bob');
ORDER BY d.player.person DESC;

Grouping

Results can be grouped also as a preparation for aggregation

SELECT d.player.person
FROM demo d
WHERE d.player.score > 0
  AND d.player.person IN ('Jake', 'Bob');
GROUP BY d.player.person
ORDER BY d.player.person DESC;

Aggregation

Different aggregation functions can be used to do some basic analysis

SELECT d.player.person,
  SUM(d.player.score),
  AVG(d.player.score),
  MIN(d.player.score),
  COUNT(*)
FROM demo d
WHERE d.player.score > 0
  AND d.player.person IN ('Jake', 'Bob');
GROUP BY d.player.person
ORDER BY d.player.person DESC;

Final Result

The final result is show here in table representation (copied from SQLDeveloper)

Inspiration

This example was inspired, in fact, by http://www.querymongo.com. There, the MySQL Query

is translated to one of MongoDB’s query interfaces to

(web site accessed on 10/21/2014).

Summary

In summary, SQL functionality is available not only for the relational model in the Oracle Database 12c, but also for JSON-based data.

This makes the Oracle database a quite powerful JSON processing environment as querying JSON data is possible through the declarative SQL language.

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c – SQL for JSON (Part 1): Introduction to Native Support

September 30, 2014 by chbussler

The Oracle Database 12c Release 1 (12.1.0.2.0) [http://www.oracle.com/technetwork/database/enterprise-edition/overview/index.html] introduces native JSON support and SQL access to JSON. This blog post gives a first introduction.

SQL access to JSON: Native Support

SQL access to JSON is a significant development in itself, but native support in context of a relational database it is actually quite huge and exciting. This blog post series will provide a discussion of the various aspects over several installments, including the mixed use of SQL on JSON with SQL on relational tables.

But first things first.

2-Second Overview

This is a 2-second overview showing how to create a table that can store JSON data, how to insert a row containing a JSON object and how to query it with a simple query.

CREATE TABLE supplier
( id NUMBER NOT NULL
    CONSTRAINT supplier_pk PRIMARY KEY,
  supplier_doc CLOB
    CONSTRAINT supplier_doc_ensure_json 
      CHECK (supplier_doc IS JSON));

INSERT INTO supplier
VALUES (125,
'{
  supplierId: 125,
  "supplierName": "FastSupplier"}');

SELECT * FROM supplier;
        ID SUPPLIER_DOC
---------- -------------------------------------------------
       125 {supplierId: 125, "supplierName": "FastSupplier"}

That was easy 🙂

Creating a Table storing JSON

JSON data are stored in columns of regular tables. A constraint placed on a JSON column enforces JSON compliance:

CREATE TABLE supplier
( id NUMBER NOT NULL
    CONSTRAINT supplier_pk PRIMARY KEY,
  supplier_doc CLOB
    CONSTRAINT supplier_doc_ensure_json 
      CHECK (supplier_doc IS JSON));

Any attempt to insert invalid JSON data fails because of this constraint. Other columns are regular relational columns and they can be defined and constrained as necessary.

A JSON column stores JSON objects as well as JSON arrays as both are valid top-level JSON structures. Trying to insert scalars will fail. The following two insert statements are valid:

INSERT INTO supplier
VALUES (125,
'{
  supplierId: 125,
  "supplierName": "FastSupplier"}');

INSERT INTO supplier
VALUES (128,
'["empty_list_of_supplier"]');

While the JSON [json.org] standard allows duplicate members (aka, keys), many implementations only tolerate or even outright reject it. To avoid storing JSON objects with duplicate keys, the constraint of a JSON column can be extended:

CREATE TABLE supplier
( id number NOT NULL
    CONSTRAINT supplier_pk PRIMARY KEY,
  supplier_doc CLOB
    CONSTRAINT supplier_doc_ensure_json 
      CHECK (supplier_doc IS JSON (WITH UNIQUE KEYS)));

The following insert will fail with the additional constraint “WITH UNIQUE KEYS” (because of a duplicate keys), but would succeed otherwise.

INSERT INTO supplier
VALUES (126,
'{
  "supplierId": 126,
  "supplierName": "FastSupplier",
  "supplierName": "FS"}');

JSON has a defined syntax, however, many implementations relax it by e.g. allowing to state member names without quotes. To enforce a strict syntax, the constraint on a JSON table can be extended:

CREATE TABLE supplier
( id number NOT NULL
    CONSTRAINT supplier_pk PRIMARY KEY,
  supplier_doc CLOB
    CONSTRAINT supplier_doc_ensure_json 
      CHECK (supplier_doc IS JSON (STRICT WITH UNIQUE KEYS))

The following insert will fail (because one key is not enclosed in quotes), but succeed without the “STRICT” constraint:

INSERT INTO supplier
VALUES (125,
'{
  supplierId: 125,
  "supplierName": "FastSupplier"}');

With the various constraints and their combinations it is possible to restrict the flavor of JSON that is stored in the database. From an architecture perspective this means that the database can be the central point of enforcing conformity.

Since JSON data is stored in colums it is possible to create several columns in a table that contain JSON data. Here is an example of a table with two columns:

CREATE TABLE supplier
( id number NOT NULL
    CONSTRAINT supplier_pk PRIMARY KEY,
  supplier_doc CLOB
    CONSTRAINT supplier_doc_ensure_json 
      CHECK (supplier_doc IS JSON (STRICT WITH UNIQUE KEYS)),
  history_doc CLOB
    CONSTRAINT history_doc_ensure_json 
      CHECK (history_doc IS JSON (STRICT WITH UNIQUE KEYS)));

From a data modeling perspective this opens up a whole new dimension. A brief discussion follows later in this blog post.

Inserting JSON into a Table

Inserting JSON data into a table that has at least one JSON column is rather straight forward, as the previous examples have shown.

The insert statement must have a valid JSON object or JSON array in the position of the JSON column(s), or SQL NULL as the value of a JSON column can be unknown.

Querying JSON with SQL

Oracle Database 12c supports a number of functions to query and to manipulate JSON data. In the following only a first impression is given and the full set will be discussed in additional separate blog posts.

The most basic query was already introduced:

SELECT * FROM supplier;

Selecting scalars from JSON:

SELECT s.supplier_doc.supplierName FROM supplier s;

The select clause is what one would expect: the table name followed by the column name followed by the key name using dot notation. The return value is a table with a single column containing strings.

This query selects scalar value across different columns:

SELECT s.id, s.supplier_doc.supplierName FROM supplier s;

In order to be able to construct more interesting queries, a more complex JSON object is used (see the end of this blog post for its details – it is not included here as it is quite large).

Querying a JSON object looks like this:

SELECT s.supplier_doc.businessAddress FROM supplier s;

This returns a table with one column containing a string representing a JSON object. For those objects that do not have a ‘businessAddress’, a SQL null is returned.

Querying a JSON array is done like this:

select s.supplier_doc.shippers from supplier s;

Like previously, the dot-notation path leads to the JSON member that contains an array as value.

The queries included so far give a first impression of how to query JSON data in Oracle 12c. Upcoming blog posts will provide much more details on the query capabilities, amongst other discussions.

Table Design

With the introduction of JSON it is now possible to have a data model design that combines relational and JSON data modeling instead of having just a relational data model or just a JSON data model. A single table supports the combination of relational and JSON data types.

Some important design questions are:

Should top level scalar data items be separate columns, or top-level keys in a JSON object, or both?
- In the above examples, a primary key column ‘id’ was designed, as well as a key ‘supplierId’ in the JSON object. The ‘id’ column enforces a primary key and supports access to the supplier identifier without accessing the JSON document. The JSON document, however, contains the supplier identifier to be self-contained.
Should all data be in one JSON document or is it more appropriate to have separate JSON objects in different columns?
- In one of the above table creation statements two JSON columns where defined, one containing the supplier data, and another one containing history about the supplier. This supports the separation of data that should not be combined, e.g., for security or privacy reasons. An internal supplier history is not part of the operational supplier data and should be kept separate.

More design principles and best practices will emerge over time in this context in addition to the ones mentioned here so far.

Documentation

The documentation can be found here: [http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246]

Installation Notes

The Oracle Database 12c Release 1 (12.1.0.2.0) is available on Windows as well as Linux and Solaris at the time of this blog post [http://www.oracle.com/technetwork/database/enterprise-edition/downloads/index.html].

In case you want to run the Linux version on Windows, install Virtual Box (version 4.3.15 build 95923 (!) [https://forums.virtualbox.org/viewtopic.php?f=6&t=62615]) and then create a virtual machine with Oracle Linux 6. Once the VM is up and running, install the Oracle Database 12c Release 1 (12.1.0.2.0) in the VM and you are ready to go.

Example Large JSON Object

INSERT INTO supplier
VALUES (123,
'{
  "supplierId": 123",
  "supplierName": "FastSupplier",
  "rating": 5,
  "shippers": [{
    "shipperName": "TopSpeed",
    "address": {
      "street": "Sunrise",
      "streeNumber": 17,
      "city": "Sun City",
      "state": "CA",
      "zip": 12347
      }
    },
    {
    "shipperName": "PerfectPack",
    "address": {
      "street": "Solid Road",
      "streeNumber": 1771,
      "city": "Granite City",
      "state": "CA",
      "zip": 12348
      }
    },
    {
    "shipperName": "EconomicWay",
    "address": {
    "street": "Narrow Bridge",
    "streeNumber": 1999,
    "city": "Central City",
    "state": "CA",
    "zip": 12345
    }
  }],
  "businessAddress": {
    "street": "Main Street",
    "streeNumber": 25,
    "city": "Central City",
    "state": "CA",
    "zip": 12345
  },
  "warehouseAddress": {
    "street": "Industrial Street",
    "streeNumber": 2552, 
    "city": "Central City",
    "state": "CA",
    "zip": 12346
    }
  }');

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Trending: Multi-Interface and Multi-Data-Model Databases

June 30, 2014 by chbussler

An interesting development, especially in the NoSQL database space, is the development towards multi-interface and multi-data-model databases, and sometimes both at the same time. While it provides flexibility, it also brings challenges.

Multi-Data-Model Support

In the relational database space, supporting different data models concurrently is not a novelty. Relational databases started off with the relational data model implementation, and later on some of the systems extended the relational model mainly by objects, XML or JSON.

Some databases in the NoSQL space are starting to evolve, too, in this manner and are providing more than one data model concurrently. Some interesting examples are discussed next.

One example in the NoSQL space is Oracle NoSQL [http://www.oracle.com/technetwork/database/database-technologies/nosqldb/]. This system supports a key/value data model whereby the key is used to identify values that are not interpreted by the database itself. In addition, values can be of complex types that are actually interpreted by the database, e.g., in secondary indexes.

Aerospike is another example in the NoSQL space [http://www.aerospike.com/]. Aerospike provides a data model consisting of basic and complex types. In addition, it supports language-native formats as well as large data types that have a specific operational characteristics and data type operations tuned for scale.

Like some relational databases extended data models over time to support specific use cases in a more direct way, some NoSQL databases are also going down that path to more directly support specific application developer needs.

Multi-Interface Support

From an application development perspective a single query API is certainly preferable that provides the complete query expressiveness required. However, especially in the new area of NoSQL databases, it is not clear yet what a good query API actually looks like. This can be observed by different systems providing different query API alternatives.

MongoDB [http://www.mongodb.org/] has a document query interface based on query patterns in form of JSON documents (“Query Documents”). In addition, it provides a map/reduce interface and aggregation pipelines. There are three different APIs that an application developer can choose, and, in addition, they overlap in their functionality. This means that, depending on the query, it can be expressed in all three of them.

Aerospike [http://www.aerospike.com/] provides different language drivers in addition to an Aerospike Query Language.

Cloudant [http://www.cloudant.com], in contrast, supports a REST-api as well as a query interface based on query documents (similar to MongoDB).

Not strictly an external interface, but very important for specific use cases, is the ability to add functionality dynamically to the database in order to move some processing from the application systems into the database itself: user defined functions. For example, MongoDB allows adding functionality through JavaScript functions, whereas Aerospike supports two different types of Lua functions: record user defined functions operate on single records, whereby stream user defined functions support distributed processing.

The Good

Unquestionably, the good part about multi-interface and multi-data-model databases is that an application developer can choose the best combination of data model and access interface for a particular development task. The impedance mismatch between the problem and the solution can be minimized with an appropriate choice.

This also means that developers need to understand the pros and cons of every combination and that requires to go through a learning curve. Going through that learning curve might pay off big time.

In addition, application development teams will have to manage a wider range of implementation variations in terms of application design and engineering, but also in terms of bug fixing and application code maintenance.

The Tricky

The tricky part of multi-interface and multi-data-model databases is that all combinations can be used concurrently, in production as well as post-production (e.g. analytics). Unit and functional tests as well as performance and scale tests become a lot more complicated as they have to test the concurrent execution of various combinations.

Furthermore, many queries can be expressed in different interfaces as those tend to overlap in query expressiveness. So an application developer needs to clearly understand the pros and cons of the query execution that underlies a specific query interface.

Hopefully the query semantics is the same for all combinations (meaning, for example, that predicates are evaluated the same way) and that concurrent use of data models and interfaces does not negatively impact the various clients in terms of concurrency, scale and performance. Any bug introduced through a discrepancy might be very difficult to reproduce and fix.

Summary

While multi-interface and multi-data-model databases are a powerful technology, there is considerable impact to the application system engineering activities in terms of knowledge acquisition, development, test and maintenance.

While database vendors certainly strive to have all combinations work in harmony, there might be edge cases where one combination does not give the same result as a different one. From an application development perspective test coverage should ensure semantic equivalence of the used combinations so that misinterpretations or wrong results are avoided.

Document-oriented NoSQL Databases: How many Joins will you have to implement?

April 30, 2014 by chbussler

One of the continuously debated items in context of NoSQL databases is the join operation. Let’s listen in a bit:

“joins suck”, “join within your application”, “lack of joins”: http://openmymind.net/2011/8/15/How-You-Should-Go-About-Learning-NoSQL/
“what’s missing … is a SQL-style join operation”: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
“doesn’t use SQL or JOINs, so it’s high-performance.”: http://www.mongodb-is-web-scale.com/ (on the lighter side)

and there can be many more variations found on the topic of joins on various levels of technical depth.

So, do we need joins in context of NoSQL databases? Do we do joins implemented by NoSQL databases? Are joins outdated concepts that we can live without in context of NoSQL databases? In this blog I try to rationalize the overarching question in principle. Some fact finding first:

(Database) Data Models and Database Management Systems

Data models, like the relational model, the document-model, the hierarchical model, key-value model, graph model, object-oriented model, XML model, etc., are implementations of data structures in a given database management system. Data models define possible data types and their construction rules for more complex types.

For example, the implementation of a relational model might restrict values in tables to be scalar. Another implementation might allow a table as a value, supporting NF2 relations. One system might support the document-model strictly following the JSON model, while others add additional data types in addition to what JSON defines. Some systems do support the notion of references, other so not. Each database implements a data model in any variation it likes to.

Schemata and Database Management Systems

A schema is a particular extension of a domain model, implemented in context of a data model. For example, a domain model might be suppliers, parts and their relationship. This can be implemented in a relational model, a document model or a graph model or any other supported data model.

There is no ‘best’ way of definition a schema. For the same domain, different schemata can be defined depending on the skill of the creator, the knowledge of query access patterns, the amount of restrictions that should be supervised by the database management system and other factors.

For example, in a document model, suppliers, parts and their relationships can be modeled as three separate documents, or in two documents (suppliers and their relationship to parts), or one document – and there are many more variations possible, of course.

Joins and Database Management Systems

Some database management systems implement the join operation in their query interface, some do not. For example, Oracle, MySQL and FoundationDB implement joins, MongoDB, Oracle NoSQL and Aerospike do not. So joins are not necessarily restricted to the relational data model.

Joins and Data Access Paths

With the fact finding under our belt, how many joins will you have to implement? In principle, this is a function of the required data access based on a specific schema. Different schemata of the same domain will require a different number of joins.

Let’s look at a few examples in the supplier – parts domain.

Example 1: No join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts":[
     {"part_name": "part_lowQual"}, 
     {"part_name": "part_hiQual"}]
}

The query: “find the names of all parts for a supplier” does not require a join as the data is already structured so that each supplier contains the set of all parts it supplies.

Example 2: One join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts": [1, 2]
}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}

The query: “find the identifiers and names of all parts for a supplier” requires a join as a supplier only has the identifiers of the parts it ships, not their names.

Example 3: Two joins required

The documents are structured like this:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}

The query: “find the identifiers and names of all parts for a supplier” requires two joins, one to find the objects for a supplier that relate the part identifier to the supplier identifier, and a second one to find the corresponding parts.

Analysis of Examples

The examples have shown empirically that the need for joins is not a function of the data model (document-oriented in this case), but a function of the data access, aka, the number of required data relationship traversals in context of a given schema. If the relationship to be traversed matches the way the data is structured as in Example 1, no join is necessary. As soon as the data is structured differently from the required traversal by the query, joins are necessary (Example 2 and 3).

So, as summary, it is fairly easy to avoid joins. If, and only if, you can structure your data (aka, build your schema) in such a way that it conforms structurally to the queries then you can avoid joins completely (Example 1). I am certain that there are special cases out there for which you can accomplish that, but in general, this is not possible. And, even if it is possible in production, as soon as analysts start analyzing the data sets, they will most likely query along different access paths.

Joins at Query Time vs. Joins at Insert/Update/Delete Time

Above examples clarified that joins are a function of the data access paths. Can joins at query time be avoided entirely by creating data access paths in a certain way?

Yes, it is possible, however, it is a basic trade-off between data query and data manipulation time: reducing the computational effort at run-time, and instead increasing it during insert / update / delete operations. In principle, joins at query time can be avoided if for each access path there is an equivalent data structure in place.

Example 4: Schema refactoring

The documents in this example look like:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}
{"shipper": "fastShipper", "shipper_id": "SH_01"}
{"part_id": 2, "shipper_id": "SH_01"}

Supplier supply parts, however, shippers ship not any part, but only specific parts (maybe for safety reasons). There can be several queries against this document set:

Find all parts supplied by a supplier with a given name
Find all parts shipped by a shipper with a given name
Find all suppliers and shippers for a part with a given name

Each of these queries requires at least one join. The documents can be restructured easily to avoid joins altogether:

{"supplier": "superQuality", "supplier_id": "S_55",
 "parts": [
     {"part_name": "part_lowQual", "part_id": 1}
]}

{"shipper": "fastShipper", "shipper_id": "SH_01",
 "parts": [
     {"part_name": "part_hiQual", "part_id": 2}
]}

{"part_name": "part_lowQual", "part_id": 1,
 "suppliers": [
     {"supplier": "superQuality", "supplier_id": "S_55"}
 ], 
 "shippers": []}

{"part_name": "part_hiQual", "part_id": 2,
 "suppliers": [],
 "shippers": [
     {"shipper": "fastShipper", "shipper_id": "SH_01"}
]}

The idea is clear: structure the data in such a way that a query can be satisfied with a simple selection. And, the consequence is clear, too: data is duplicated, possibly many times. Which means that an insert, update or delete has to know all the locations where to modify the data and has to modify the data consistently (and ideally within a single transaction).

As a side note, this is the situation that normalization tries to address by ensuring that each data item is only once in the database.

Of course, data duplication will have an impact on the size requirements of main memory an disk space. While there is a change in algorithm complexity, there is also a change in the storage and memory size requirements.

Pre-Joining Data

Pre-joining data allows to avoid joins at query time at the cost of duplicating data at data management time. Alternatively expressed, the implementation of duplication at management time is the cost of avoiding normalization combined with query-time joins.

Is there a way to quantify the effort? In principle, there are as many duplications necessary as joins are to be avoided. This is a rough estimate as many joins are the same except for selection and/or projection specifications. If all joins are abstracted to their join criteria (omitting projection and selection), then this is roughly the amount of duplication required.

The article written by Sarah Mei clearly shows the trade-off between data duplication and joins: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/. She clearly describes many of the issues in context of a specific use case.

“Wait a minute, I don’t have joins and it works anyway!”

But, where are the joins? NoSQL databases that do not implement the join operator in their query interface are in use and production.

If not expressed as query, joins are found either in the application system logic or the interface logic, depending on the design. Most likely these are nested-loop joins or hash-based joins (less likely) or a series of selections with the application logic combining the intermediary query results into the final result data set.

And they are not joins on the complete data set either, but usually have some selection criteria. So the application system logic roughly corresponds to the optimized operator tree of a database query sub-system and in all actuality there might be many joins implemented that way throughout the application logic.

The joins are in fact implemented, just not by using a join operator on the database interface, but inside the application logic. This means that the database cannot optimize the execution, plus there are several queries coming from the application logic putting load on the database system.

And this opens up yet another trade-off: data duplication vs. application logic complexity. If the data is structured in such a way that joins are avoided (at the cost of duplication), then the application logic complexity will be reduced also (from algorithms implementing joins to algorithms issuing queries with selections/projections).

Of course, while the application logic complexity is reduced, the data management logic complexity increased as it has to manage duplicate data consistently across the database.

Summary: Are joins required? Yes. Are joins implemented? Yes.

In my mind there is no question that joins are in general needed and actually implemented today, even if the database does not support a join operator directly and even if there are opinions that joins are not needed. I don’t really understand why there is a discussion about this in the first place as the need for a join is a function of the data schema, not the data model.

The fact that a relational database has the capability of joins does not mean you must use it. And the fact that a NoSQL database does not support joins at their query interface does not mean joins are not needed.

At the heart an architecture and engineering decision has to be made (implicitly or explicitly) of how many joins are implemented through data duplication and how many joins are implemented through algorithms in the application logic layer (if there is not join operator available at the database query interface).

It’s that easy.

Joins: (Almost) Impossible to Avoid in Document-oriented Databases

July 12, 2012 by chbussler

There is a lot of ‘chatter’ about the concept and support of joins in document-oriented databases. So what is the underling issue?

Joins in RDBMS

‘Join’ in the relational world is an operation on two relations that relates the tuples in these relations with each other based on some comparison criteria on the tuples’ attributes. For example, the comparison can be ‘R1.a = R2.b’ and so for each tuple from the first relation R1 all tuples from the second relation R2 are retrieved and combined that match the comparison, meaning, the attribute ‘a’ must match the attribute ‘b’. A detailed discussion can be found here: http://en.wikipedia.org/wiki/Join_%28SQL%29.

Joins allow to relate data from different relations and the join operator is supported by a relational database management system. A typical use case is to find all parts that a supplier supplies. And, for a given part, find all its suppliers. The suppliers and parts are usually stored in different relations and the data have an m:n relationship with each other.

Joins across Documents?

So why the chatter, then? If a document-oriented database stores data in different document collections and if the documents need to be related to each other, then a join is in order. The example of suppliers and parts applies here in the exact same way.

Now, if a document-oriented database does not support joins, what to do? Well, in reality the join will be performed in some layer above the database in a programming language. If all suppliers have to be displayed for a given part, then a program that computes this result effectively implements a join; it is not done in the database, though.

Pre-joined data in Documents?

Some optimization is possible. If the access pattern follows an 80-20 rule, then document-oriented databases allow some hard-coded optimization. If in 80% of the cases the suppliers for a part are requested, and only in 20% the opposite, then the designer of the document layout could create for each part document a sub-collection ‘supplier’ that contains the suppliers of this part. In 80% of the cases no join is necessary any more as the suppliers are ‘pre-joined’ with the parts they supply, only in the 20% of the cases a join is necessary.

However, this causes what in the relational world is called anomalies: If a supplier is removed, then all part documents have to be searched for this supplier. Or if a supplier is added, then all those part documents have to be updated that are supplied by this supplier. Updating supplier data also requires to search the part documents. Pre-joining is effectively a specific de-normalization activity for performance reasons.

Does the type of relationship matter?

Are there relationships that by their nature can be pre-joined without penalty? A very specific relationship, the part-of relationship, falls into this category. It is a ‘clean’ approach since the life time of the part-of objects are the exact same as the containing object.

Another relationship that feels as if pre-joining makes sense is the 1:1 relationship where two objects are exclusively related to each other. However, this is not really the case as one object would be a property of the other and that then could be done the other way around, too. So the 80-20 rule case applies here, too.

In reality, however, relationship between data are usually a lot more complex then just part-of relationships. This in turn means that joins will be necessary. The only real exception is if the 80-20 rule is really a 100-0 rule. This would mean that all access are the exact same and no joins are necessary.

Underlying Conceptual Foundation

Conceptually as soon as independent entities (i.e. objects in their own right) are related to each other, and if their relationship is traversed in both directions at some point in time during the execution of the application, a join is necessary and factually taking place.

Pre-joining is the materialization of the traversal in one direction. So two pre-joins, one for each direction, are possible. If the pre-join in both directions takes place, no join has to be performed upon retrieval; however, the join functionality was applied at time of update or insert in order to accomplish the pre-joins.

As soon as pre-joins exist, possible update, insert and delete anomalies have to be carefully taken care of as pre-joins are the equivalent to de-normalization and therefore data redundancy. At insert, update and delete time all redundant copies of the objects have to be found and the appropriate functionality applied.

Pre-joins are for read-performance reasons only; they are not a conceptual matter and in fact cause additional work at insert, update or delete time instead; so the computational work shifted, but is not avoided.

Note

‘Join’ is a database operator. The same functionality can be implemented in application code outside or ‘on top’ of the database. Most likely the method or function is not called ‘join’ even though it in fact implements that functionality. So be aware of the situation that a document-oriented database does not implement a join and the engineers claim not to need one. The functionality of a join might just be there under a different name.

Relational Database Management System (RDBMS)

April 25, 2012 by chbussler

Why a blog on Relational Database Management System (RDBMS)? Isn’t there enough said about this type of database? Yes, there is. This post is to call out a few aspects of the relational model that I want to refer to in context of RDBMSs for subsequent discussions. In addition, it re-establishes terminology.

Normalization

Briefly, normalization attempts to remove data duplication and data dependencies so that an update of a data item has to be done only once for the whole data set to be consistent again (http://en.wikipedia.org/wiki/Database_normalization).

Universal Relation

The Universal Relation is mostly an assumption that states: it is possible to store all data in one big, wide table. For a discussion, see Jeffrey D. Ullman: Principles of Database and Knowledge-Base Systems, Volume II. Computer Science Press 1989, ISBN 0-7167-8162-X.

In a way, the universal relation is the ‘opposite’ of a perfectly normalized relational schema. It also assumes that the semantically same attributes are named the same way in different relations. As a consequence, many columns might have null values. But for the thought experiment, this is perfectly valid.

Part-Of Relationship

The part-of relationship consists between an owning object and an owned one whereby the owned object’s life cycle is tightly bound to the owning one: if the owning object is deleted, the owned one will be deleted, too. The owned object exists in context of the owning object and does not have a life on its own.

In the relational model there is a choice how this can be modeled. In principle, the owned objects can be in their own relation, separate from the owning objects in their own relation. In addition, there is a foreign key relationship from the owning to the owned relationship.

Alternatively, the owned objects can be in the same relation as the owning objects. Both would be in the same row. This is basically the case when the owned object is a scalar domain value. In this case it is almost automatically in the same relation as the owning data.

De-Normalization

De-normalization is about combining data that would otherwise be in separate relations when being fully normalized (http://en.wikipedia.org/wiki/Denormalization). This introduces redundancy and dependencies in the model for the sake of access improvement and access speed.

In a sense, the universal relation is the ultimate form of de-normalization.

Schema

In the relational database management system world the data are stored as rows in tables. A table provides a fixed structure whereby every column defines one and only one data type. Any row stored in a table, therefore, must comply to the types of the columns. The only exception is ‘null’ since ‘null’ can be of any data type. The significance is that tables are well-defined in terms of data types and rows have to comply to the table definition. This makes it a strictly typed system: the set of all table definitions constitute the schema.

Schema changes are possible in most RDBMS implementations. It is possible to add, to change or to delete columns. One consequence is that all rows in the table have to change so that their values comply to the new column definitions. Adding and removing tables is possible, too. The ‘hard’ part about changing a table is that all rows have to be changed accordingly; it is not possible to only have the change apply to future rows or when rows are updated.

Summary

This blog recalled a few important concepts from the relational model and relational database management systems. These will be referred to and used in future blogs.