Oracle 12c – In-Memory Option (Part 3): JSON Support

It is quite natural to view Oracle’s In-Memory Option in context of the relational model; however, the In-Memory Option supports the JSON model at the same time as well.

Oracle JSON Support

As shown earlier in this blog, the Oracle 12c database supports JSON natively [http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246] and incorporates JSON access into SQL so that JSON structures can be accessed through SQL directly. In addition, in a single SQL statement, JSON structures as well as relational tables can be accessed at the same time. Please see the documentation and the other blog entries for more details.

Oracle In-Memory Option

The In-Memory Option [http://www.oracle.com/us/corporate/features/database-in-memory-option/] was introduced in the previous two blog entries and those discussed a relational example as well as queries that allow to introspect the meta data and status of the In-Memory Option processing. It gave an overview as well as the rationale why using the In-Memory Option for analytic queries is advantageous over regular relational processing (columnar representation of data in main memory).

Combination: JSON processing using In-Memory Option

Analytics in context of JSON structures would benefit from In-Memory Option support as well. Many applications based on JSON structures usually have to support some form of analytics. The In-Memory Option supports aggregation functionality on JSON structures as well as on relational tables.

In the following, an example is introduced that creates a table containing a JSON structure, enables it for In-Memory Option support and shows a few simple analytics queries.

Example Table with JSON Structure

The following declares a sample table with a column containing a JSON structure.

DROP TABLE js_players;
CREATE TABLE js_players
 (
   player_id NUMBER NOT NULL,
   player_name VARCHAR(255),
   games VARCHAR(4000) 
    CONSTRAINT games_ensure_json
    CHECK (games IS JSON (STRICT WITH UNIQUE KEYS)));

Table Population

In order to have a large data set the following block creates rows containing JSON structures and inserting them into the above declared table.

DECLARE
  counter NUMBER;
  player_name VARCHAR(128);
  games_value VARCHAR (256);
BEGIN
  counter := 0;
  LOOP
    counter := counter + 1;
    player_name := 'Divvon' || '_' || counter;
    games_value := '{ "games":
      [{"name":"tic-tac-toe", 
        "points":' || counter *10 || '},
       {"name":"one-two-three", 
        "points":' || counter * 100 || '}
      ]}';
    INSERT INTO js_players VALUES
      (counter, player_name, games_value);
  EXIT
    WHEN counter = 1000000;
  END LOOP;
  COMMIT;
  RETURN;
END;

An example JSON document looks like this:

{"games":
  [{"name":"tic-tac-toe", "points":750},
   {"name":"one-two-three", "points":7500}
  ]}

Enabling In-Memory Option

This statement enables the In-Memory Option for the table containing JSON structures:

ALTER TABLE js_players inmemory priority critical;

Aggregation Queries

The following queries show aggregation over elements of the JSON structures. Each query extracts all ‘points’ values from the JSON structure (since every document might have several ‘points’ fields in the array of ‘games’) and makes is accessible as ‘pts.points’. Then it aggregates over this structure. The first query aggregates over all rows, whereas the second query is selecting only a few rows based on ‘player_name’.

SELECT DISTINCT p.player_name,
       SUM(pts.points) sum_points,
       MIN(pts.points) min_points,
       MAX(pts.points) max_points,
       AVG(pts.points) avg_points,
       COUNT(*)
FROM js_players p,
     json_table(p.games, 
                '$.games[*]' COLUMNS (points VARCHAR2(32 CHAR) PATH '$.points')) 
         AS pts
GROUP BY p.player_name
ORDER BY sum_points DESC;

01_blog

SELECT DISTINCT p.player_name,
       SUM(pts.points) sum_points,
       MIN(pts.points) min_points,
       MAX(pts.points) max_points,
       AVG(pts.points) avg_points,
       COUNT(*)
FROM js_players p,
     json_table(p.games, 
                '$.games[*]' COLUMNS (points VARCHAR2(32 CHAR) PATH '$.points')) 
       AS pts
WHERE p.player_name = 'Divvon_204'
      OR p.player_name = 'Divvon_206'
GROUP BY p.player_name
ORDER BY sum_points DESC;

02_blog

Summary

Oracle’s In-Memory Option, because it not only supports the relational model, but also the JSON model, is an interesting alternative to analytics in context of JSON data sets. This blog has shown an example of how to combine JSON structures and the In-Memory Option in order to be able to run analytics queries.

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Advertisement

Oracle 12c – In-Memory Option (Part 2)

Part 1 was introducing an Oracle In-Memory Option example; in Part 2 I’ll introduce a few queries that show meta data and statistics in context of the Oracle In-Memory Option.

Table Status

One interesting set of metadata is the table’s in-memory status as well as the compression and loading priority. A table with enabled in-memory option is called “in-memory table” in the following for short. The following query retrieves these metadata for the in-memory table “IM_PLAYERS”:

SELECT table_name,
       inmemory,
       inmemory_compression,
       inmemory_priority
FROM dba_tables
WHERE table_name = 'IM_PLAYERS';

01

Loading of in-memory tables takes some time on initial load and their load status is important. Only when loaded the full capability of the in-memory functionality is available. The following query shows the load status by retrieving the number of bytes not populated into the in-memory area yet:

SELECT v.segment_name name,
       v.populate_status populate_status,
       v.bytes_not_populated bytes_not_populated
FROM v$im_segments v
ORDER BY 1;

02

The size of an in-memory table is important in order to determine if the memory allocation for the in-memory option is sufficient. The following query fetches various sizes:

SELECT v.owner,
       v.segment_name,
       v.populate_status,
       ROUND(v.bytes / 1000 / 1000, 2) 
           AS size_in_mb,
       ROUND(v.inmemory_size / 1000 / 1000, 2) 
           AS inmemory_size_in_mb,
       ROUND(v.inmemory_size / v.bytes, 2) * 100 
           AS im_size_percent
FROM v$im_segments v;

03

Basic Usage

When running queries against in-memory tables the in-memory option might or might not be triggered and used depending on the query optimizer’s decisions. The following two statements executed together show if the in-memory option was used:

EXEC dbms_feature_usage_internal.
         exec_db_usage_sampling(SYSDATE);
SELECT ul.name,
       ul.detected_usages
FROM dba_feature_usage_statistics ul
WHERE ul.version =
          (SELECT MAX(u2.version)
           FROM dba_feature_usage_statistics u2
           WHERE u2.name = ul.name
                 AND ul.name LIKE 'In-%');

04

A query to an in-memory table advances the value of detected_usages if the in-memory representation of that table was in fact accessed.

Statistics

If a query triggers the use of the columnar representation managed by the in-memory option it is important to understand the specific statistics to see how much optimization took place. The following query retrieves some of all available statistics:

SELECT t.display_name,
       s.value,
       s.sid,
       t.statistic#
FROM v$statname t, v$mystat s
WHERE t.display_name IN ('IM scan CUs columns accessed', 
                         'IM scan segments minmax eligible', 
                         'IM scan CUs pruned')
      AND s.statistic# = t.statistic#
ORDER BY s.sid DESC;

05

Being able to retrieve the statistics allows now to see the statistics before and after query execution. In the next section a selective as well as aggregate query is run and the statistics examined before and afterwards.

Example

The first example is a selective query that retrieves all columns of two players (rows).

SELECT * 
FROM cb.im_players 
WHERE player_id = 10 OR player_id = 20;

06

First, the statistics is shown before the query is executed and then the statistics after the query is executed.

05

07

The second example is a non-selective query that aggregates over all rows.

SELECT player_name,
       COUNT(*),
       MIN(player_score),
       MAX(player_score),
       AVG(player_score),
       SUM(player_score)
FROM cb.im_players
GROUP BY player_name;

08

Again, first the statistics before and after the query execution is shown.

07

09

Summary

This blog introduced a series of queries that allow to retrieve metadata as well as statistics on query execution in context of the Oracle In-Memory Option. The nice part is that these are regular SQL queries that can be run manually, but also as part of development or operations tools. It is therefore possible to examine the use of the in-memory tables and their behavior in a running application.

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c – In-Memory Option (Part 1)

The name ‘In-Memory Option’ does not mean Oracle turned into a general main memory database or has a new option to do so; it means something else entirely.

Aggregation

Columnar data representation is geared towards aggregation of columns in relational databases. In order to make aggregation fast, the rows of a column should be ‘near’ to each other in terms of memory layout; the best would be if the rows of the column would be next to each other. Access for aggregation is very efficient with this layout. A database implementing this approach is called a columnar database (http://en.wikipedia.org/wiki/Column-oriented_DBMS).

Specialized databases exist that structure the data on disk as well as in memory in a columnar format. Once the data is in such a database, aggregation can be fast. However, the data has to be in the database and chances are that the same data is in a separate OLTP database also (a second copy). This then means that a data integration problem has to be solved in order to ensure that data that is managed in the OLTP database is also available (with sufficiently minor delay) in the columnar database (or columnar store). And, since it changes in the OLTP database, the integration effort is an ongoing process.

In-Memory Option

Oracle 12c’s In-Memory Option [http://www.oracle.com/us/corporate/features/database-in-memory-option/] combines the columnar storage with OLTP storage in the same database. The columnar storage layout is only implemented in main memory, leaving the disk layout as it is. This means that OLTP and columnar access is possible in the same database, concurrently, and the data is always 100% transactionally consistent, as no delayed data integration has to take place. The ‘artificial’ separation of database functionality into separate databases to satisfy aggregation processing requirements is not necessary anymore in such a database management system.

Example

Since Oracle 12c combines the relational and the columnar representation, using the columnar functionality is fully declarative without any operational management overhead.

In the following a small example shows how to make use of the Oracle 12c In-Memory Option concurrently to the regular OLTP load.

Assume a regular table that has been in place for a while and is accessed by a regular application (OLTP use case). This table manages data and is changed according to the application logic.

DROP TABLE im_players;
CREATE TABLE im_players
  (
   player_id NUMBER,
   player_name VARCHAR2(128),
   player_score NUMBER);

Here is a way to populate the table with random content for testing purposes:

DECLARE
  player_id NUMBER;
  player_score NUMBER;
  player_name VARCHAR(128);
BEGIN
  player_id := 0;
  LOOP
    player_id := player_id + 1;
    player_name := 'player_' || player_id;
    player_score := player_id * 10;
    INSERT INTO im_players VALUES
      (player_id, player_name, player_score);
    player_score := player_id * 100;
    INSERT INTO im_players VALUES
      (player_id, player_name, player_score);
    player_score := player_id * 1000;
    INSERT INTO im_players VALUES
      (player_id, player_name, player_score);
  EXIT
    WHEN player_id = 1000000;
  END LOOP;
  COMMIT;
  RETURN;
END;

A new requirement has to be fulfilled based on business needs. The requirement is to analyze the existing table regularly while it is being used by the OLTP application. In the following this requirement is implemented with the Oracle 12c In-Memory Option.

First, memory has to be set aside for the columnar representation of the data in main memory:

ALTER system SET inmemory_size = 300M scope=spfile;

Restarting the database enables this change and this query confirms the setting:

SELECT name, value FROM v$sga;

01

Now the database has a main memory area (In-Memory Area) set aside for the columnar representation of relational tables.

Next, the existing table definition is modified to additionally be represented in columnar representation:

ALTER TABLE im_players inmemory priority critical;

There are different options that indicate how critical it is to have the rows in columnar format, how to compress depending on the access model, and actually which columns should be in columnar format as not all columns are necessarily required.

The alter statement above indicates that it is critical for the data to be in columnar format. The database will diligently load the data into the columnar memory section. Compression is enabled for low query access. All columns are enabled, based on the use case.

The following query shows how much data is not yet populated. Loading data is a process that takes time and the query supports checking the load status. Once the loading is completed, zero bytes are not yet populated.

SELECT 
  v.segment_name name,
  v.populate_status populate_status,
  v.bytes_not_populated bytes_not_populated
FROM v$im_segments v
ORDER BY 1;

02

The following query performs an aggregation.

SELECT
  player_name,
  COUNT(*),
  MIN(player_score),
  MAX(player_score),
  AVG(player_score),
  SUM(player_score)
FROM im_players
GROUP BY player_name;

??03

03

The query plan of this query shows that the columnar representation is being accessed.

----------------------------------------------------------------------------------------
| Id | Operation                  | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT           |            | 3025K | 227M  | 216 (38)   | 00:00:01 |
|  1 | HASH GROUP BY              |            | 3025K | 227M  | 216 (38)   | 00:00:01 |
|  2 | TABLE ACCESS INMEMORY FULL | IM_PLAYERS | 3025K | 227M  | 141 (5)    | 00:00:01 |
----------------------------------------------------------------------------------------

And this query shows access metrics indicating that the query actually executed against the columnar representation. The values after executing the aggregation query are different from the values before the aggregation.

Note on SQL

An important aspect of the Oracle 12c In-Memory Option is that the SQL specifying the required aggregation is regular SQL and no new or different language constructs had to be introduced. The underlying reason is that the query optimizer has sufficient knowledge to make the appropriate access choices based on the SQL presented and does not require hints from users. This is a huge engineering benefit since no new or different language has to be learned to make use of the columnar representation.

Summary

This was a super-brief introduction into the Oracle 12c In-Memory Option. The documentation has of course a lot more background [http://docs.oracle.com/database/121/ADMIN/memory.htm#ADMIN14257]. In addition, a nice blog series exists that introduces the various aspects step-by-step [https://blogs.oracle.com/In-Memory/entry/getting_started_with_oracle_database].

From a data management aspect an optimized representation targeted towards aggregation is enabled declaratively in context of the same database that holds the OLTP data. This is a huge benefit as one database supports the mixed workload and any data integration setup between different databases becomes unnecessary. With significantly increasing main memory sizes this option allows freeing up engineering cycles as well as operational resources while increasing data consistency and processing performance.

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c – SQL for JSON (Part 7): Creating JSON from Relational Data

Up to now the examples of previous blogs inserted JSON data into JSON enabled columns of tables. This blog shows how JSON data can be created or derived automatically, and, more importantly, how relational data can be transformed and represented automatically as hierarchical JSON data.

Virtual Columns

A virtual column of a table in Oracle is defined by referring to a function invocation. The value of the column is the function invocation result. Parameters (column values of the same table) can be passed to the function and so the functionality of the function can be dependent on the data content of the row.

A simple example is shown in the following. A virtual column is created that contains a name – value pair based on a non-virtual column of the same table (the function is string concatenation).

DROP TABLE vc;
CREATE TABLE vc
  (
   id NUMBER NOT NULL,
   name VARCHAR(255),
   value generated always AS (
     '"name"' || ':' || '"' || name || '"' ));
INSERT INTO vc (id, name) VALUES (1, 'abc');
INSERT INTO vc (id, name) VALUES (2, 'def');
SELECT * FROM vc;

1

Value Column with JSON Structure

The following example shows a similar example, however, in this case the virtual column is a JSON column and the contents is a JSON object containing the name – value pair. The table definition ensures that the column is a JSON column and that it can be accessed with the JSON functions.

DROP TABLE vc;
CREATE TABLE vc
 (
   id NUMBER NOT NULL,
   name VARCHAR(255),
   value generated always AS (
     '{' || '"name"' || ':' || '"' || name || '"' || '}') 
     virtual CONSTRAINT value_ensure_json 
     CHECK (value IS JSON (STRICT WITH UNIQUE KEYS)) );
INSERT INTO vc (id, name) VALUES (1, 'abc');
INSERT INTO vc (id, name) VALUES (2, 'def');
SELECT * FROM vc;

2

Separate Function creating JSON Structure

Functions can be created separately as first class citizens in the database (for the purpose of proper engineering practices as well as functionality reuse); they do not necessarily have to be inline invocations. The following example creates a separate function that creates a JSON object containing a name – value pair.

CREATE OR REPLACE FUNCTION 
  to_JSON(in_name IN VARCHAR2,
          in_value IN VARCHAR2)
  RETURN VARCHAR2 deterministic
  IS
    JSON_value VARCHAR(255);
  BEGIN
    JSON_value := 
      '{' || '"' || in_name || '"' || ':' || 
             '"' || in_value || '"' || '}';
  RETURN JSON_value;
END;
DROP TABLE vc;
CREATE TABLE vc
 (
   id NUMBER NOT NULL,
   name VARCHAR(255),
   value generated always AS (
     to_JSON('name', name)) 
     virtual CONSTRAINT value_ensure_json 
     CHECK (value IS JSON (STRICT WITH UNIQUE KEYS)));
INSERT INTO vc (id, name) VALUES (1, 'abc');
INSERT INTO vc (id, name) VALUES (2, 'def');
SELECT * FROM vc;

3

Function with Selection

Functions have the full range of SQL available and can execute queries in their implementation. The following example queries a second table and creates a JSON object from the result. Note that the query depends on the input parameter. 

CREATE OR REPLACE FUNCTION 
  from_dep(in_value IN VARCHAR2)
  RETURN VARCHAR2 deterministic
IS
  JSON_value VARCHAR(255);
  dep_value  VARCHAR(255);
BEGIN
  EXECUTE IMMEDIATE 'SELECT ' || ' adr ' || 
                    ' FROM ' || ' dep ' 
                    INTO dep_value;
  JSON_value := '{' || '"' || in_value || '"' || ':' || 
                       '"' || dep_value || '"' || '}';
  RETURN JSON_value;
END;
DROP TABLE vc;
CREATE TABLE vc
 (
   id NUMBER NOT NULL,
   name VARCHAR(255),
   value generated always AS 
     (from_dep(name)) 
     virtual CONSTRAINT value_ensure_json 
     CHECK (value IS JSON (STRICT WITH UNIQUE KEYS)) );
DROP TABLE dep;
CREATE TABLE dep
 (
    adr VARCHAR(255));
INSERT INTO dep VALUES ('yahoo');
INSERT INTO vc (id, name) VALUES (1, 'abc');
INSERT INTO vc (id, name) VALUES (2, 'def');
SELECT * FROM vc;

4

Hierarchical JSON Column

The last example shows how a function can create complex hierarchical JSON structures. This allows for denormalizing a relational model into a JSON document representation as needed. This is the use case where a client requires a self-contained JSON object that contains all necessary data so that the client does not have to perform repeated queries to obtain all data.

CREATE OR REPLACE FUNCTION 
  from_dep(v_id IN NUMBER)
  RETURN VARCHAR2 deterministic
IS
  JSON_value VARCHAR(255);
  dep_value  VARCHAR(255);
  CURSOR dep_cur
    IS
      SELECT * FROM dep d WHERE d.vc = v_id;
  l_dep_adr dep%ROWTYPE;
  is_first NUMBER;
BEGIN
  is_first := 0;
  JSON_value := '[';
  OPEN dep_cur;
  LOOP
    FETCH dep_cur INTO l_dep_adr;
    EXIT
      WHEN dep_cur%NOTFOUND;
    IF (is_first = 0) 
    THEN
      JSON_value := JSON_value || 
        '{' || '"' || 'adr' || '"' || ':' || 
               '"' || l_dep_adr.adr || '"' || '}';
      is_first := 1;
    ELSE
      JSON_value := JSON_value || ',' || 
        '{' || '"' || 'adr' || '"' || ':' || 
               '"' || l_dep_adr.adr || '"' || '}';
    END IF;
  END LOOP;
  CLOSE dep_cur;
  JSON_value := JSON_value || ']';
  RETURN JSON_value;
END;
DROP TABLE vc;
CREATE TABLE vc
 (
   id NUMBER NOT NULL,
   name VARCHAR(255),
   value generated always AS 
     (from_dep(id)) 
     virtual CONSTRAINT value_ensure_json 
     CHECK (value IS JSON (STRICT WITH UNIQUE KEYS)) );
DROP TABLE dep;
CREATE TABLE dep
 (
   vc NUMBER, adr VARCHAR(255));
INSERT INTO vc (id, name) VALUES (1, 'abc');
INSERT INTO vc (id, name) VALUES (2, 'def');
INSERT INTO dep VALUES (1, 'yahoo');
INSERT INTO dep VALUES (2, 'yahaa');
INSERT INTO dep VALUES (2, 'yahee');
SELECT * FROM vc;

5

Summary

Virtual columns are an interesting mechanism when it comes to denormalizing a relational data model into a hierarchical JSON document model. This blog gave some examples upon which more complex functionality can be built. It shows that Oracle 12c can not only query JSON, but also generate JSON structures from relational data supporting a large array of use cases.

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c – SQL for JSON (Part 6): Un-nesting

JSON supports nested structures and the particular shape of the nesting is decided by the application systems or by a data modeler for different reasons. How can nested JSON structures be queried with SQL?

Nested JSON Structures

Nested JSON structures are based on two types: JSON objects and JSON arrays. Objects can have several members; each member is either of scalar type, object type or array type. An array can contain scalar values, objects or arrays. This nesting of objects with objects and arrays provides the mechanism to build (theoretically unlimited deep) nested structures.

Nested structures are used in most cases to model a part-of relationship or an association relationship. As the nested structures are trees, there is a 1:n relationship between the higher and the lower nodes (without cycles as JSON types do not include reference types).

Folklore says that nesting is sufficient to ‘avoid’ joins in context of JSON databases, but professionals know a lot better.

In the following, an example data set is introduced and a few queries afterwards that specifically deal with nested structures in order to query those in context of SQL. A key JSON function in context of Oracle 12c is json_table() [https://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246] combined with un-nesting.

Example Data Set

DROP TABLE reseller;
CREATE TABLE reseller
  (
    reseller_id   NUMBER NOT NULL,
    reseller_name VARCHAR(255),
    suppliers CLOB CONSTRAINT suppliers_ensure_json 
    CHECK (suppliers IS JSON (STRICT WITH UNIQUE KEYS)));
INSERT INTO reseller VALUES
(1, 'abc corp.',
  '{ "suppliers": [
    {"supplier_name": "Sign Corp", "parts": [
      {"partNo": 1, "price": 100, "quantity": 2},
      {"partNo": 2, "price": 200, "quantity": 4}]},
    {"supplier_name": "Stand Corp.", "parts": [
      {"partNo": 10, "price": 1000, "quantity": 6},
      {"partNo": 11, "price": 1100, "quantity": 8},
      {"partNo": 12, "price": 1200, "quantity": 10}]}]}');
INSERT INTO reseller VALUES
(2, 'xyz corp.',
  '{ "suppliers": [
    {"supplier_name": "Sign Corp", "parts": [
      {"partNo": 1, "price": 100, "quantity": 12},
      {"partNo": 3, "price": 300, "quantity": 14}]},
    {"supplier_name": "Stand Corp.", "parts": [
      {"partNo": 10, "price": 1000, "quantity": 16},
      {"partNo": 11, "price": 1100, "quantity": 18},
      {"partNo": 12, "price": 1200, "quantity": 20},
      {"partNo": 13, "price": 1300, "quantity": 22},
      {"partNo": 14, "price": 1400, "quantity": 24}]}]}');

Simply Query: Listing all Suppliers

The first query is going to list the distinct set of supplier names. Supplier names are part of objects inside arrays. Therefore the query will reach into the tree and iterate over the suppliers (using the [*] notation that says to iterate over all array elements) in order to pick out the supplier name (accomplished by the PATH expression). A column is declared that captures all supplier names and a distinct projection ensures that each supplier name is included only once into the result set.

SELECT DISTINCT s.*
FROM reseller r,
     json_table(r.suppliers,
                '$.suppliers[*]'
                COLUMNS (supplier_name VARCHAR2(32 CHAR)
                         PATH '$.supplier_name')) AS s;

Result:
1
Changing [*] to [1] only retrieves the first supplier and since the first supplier is the same across the data set of JSON documents only one supplier name appears in the result set.

SELECT DISTINCT s.*
FROM reseller r,
     json_table(r.suppliers,
                '$.suppliers[1]'
                COLUMNS (supplier_name VARCHAR2(32 CHAR)
                         PATH '$.supplier_name')) AS s;

Result:
2

Deeper Reach: Listing all Parts and Prices

The next query lists all parts and their prices. It creates two columns, one for the part number and one for the price. However, since each supplier can supply more than one part, the query has to iterate over two arrays (using the [*] notation), once over suppliers, and once over all parts within each supplier. The result is ordered in order to provide a consistent layout. 

SELECT DISTINCT s.*
FROM reseller r,
     json_table(r.suppliers,
                '$.suppliers[*].parts[*]'
                COLUMNS (part_no NUMBER
                         PATH '$.partNo',
                         part_price NUMBER
                         PATH '$.price')) AS s
ORDER BY s.part_no;

Result:
3
It would be more interesting to see the supplier name also for each of the parts. This is where un-nesting is applied.

Un-nesting: Listing all Parts and Prices for each Supplier and for each Reseller

The next query lists the parts and their prices for each supplier, and for each reseller. This allows to see the reseller, the suppliers they use, as well as the parts that are supplied.

SELECT r.reseller_name,
       s.*
FROM reseller r,
     json_table(r.suppliers,
                '$.suppliers[*]'
                COLUMNS (supplier_name VARCHAR2(32 CHAR)
                         PATH '$.supplier_name',
                         NESTED path '$.parts[*]'
                         COLUMNS (part_no VARCHAR2(32 CHAR)
                                  PATH '$.partNo',
                                  part_price NUMBER
                                  PATH '$.price'))) AS s
ORDER BY r.reseller_name,
         s.supplier_name,
         s.part_no;

Result:
4
Note the keyword NESTED that transforms a hierarchical relationship into a relational representation by repeating the column value on the higher level. In this example, for each part_no and part_price, the supplier_name is provided.

Computation on Un-nested JSON Data

The following query shows the inventory and inventory value for each reseller and supplier. The value is dynamically computed by the query.

SELECT r.reseller_name,
       s.*,
       s.part_price * s.part_quantity AS inventory_value
FROM reseller r,
     json_table(r.suppliers,
                '$.suppliers[*]'
                COLUMNS (supplier_name VARCHAR2(32 CHAR)
                         PATH '$.supplier_name',
                         NESTED path '$.parts[*]'
                         COLUMNS (part_no VARCHAR2(32 CHAR)
                                  PATH '$.partNo',
                                  part_price NUMBER
                                  PATH '$.price',
                                  part_quantity NUMBER
                                  PATH '$.quantity'))) AS s
ORDER BY r.reseller_name,
         s.supplier_name,
         s.part_no;

Result:
5

Aggregation on Un-nested JSON Data

The following query shows the total inventory of each reseller by aggregation in the query.

SELECT r.reseller_name,
       SUM(s.part_price * s.part_quantity) AS inventory_value
FROM reseller r,
     json_table(r.suppliers,
                '$.suppliers[*]'
                COLUMNS (supplier_name VARCHAR2(32 CHAR)
                         PATH '$.supplier_name',
                         NESTED path '$.parts[*]'
                         COLUMNS (part_no VARCHAR2(32 CHAR)
                                  PATH '$.partNo',
                                  part_price NUMBER
                                  PATH '$.price',
                                  part_quantity NUMBER
                                  PATH '$.quantity'))) AS s
GROUP BY r.reseller_name
ORDER BY r.reseller_name;

Result:
6

Changing to Seller Perspective

The following query shows the quantity and sales of each type of part to resellers. The perspective changes from a reseller perspective so far to a seller perspective. In SQL this is straight-forward, however, in context of nested JSON structures this would mean processing of the 1:n relationship in the opposite direction.

SELECT s.supplier_name,
       r.reseller_name,
       s.part_no,
       s.part_price,
       s.part_quantity,
       s.part_price * s.part_quantity AS sales_value
FROM reseller r,
     json_table(r.suppliers,
                '$.suppliers[*]'
                COLUMNS (supplier_name VARCHAR2(32 CHAR)
                         PATH '$.supplier_name',
                         NESTED path '$.parts[*]'
                         COLUMNS (part_no VARCHAR2(32 CHAR)
                                  PATH '$.partNo',
                                  part_price NUMBER
                                  PATH '$.price',
                                  part_quantity NUMBER 
                                  PATH '$.quantity'))) AS s
ORDER BY s.supplier_name,
         r.reseller_name,
         s.part_no;

Result:
7

Aggregation of Seller Data

And, finally, this query shows the total sales by each supplier to each reseller.

SELECT s.supplier_name,
       r.reseller_name,
       SUM(s.part_price * s.part_quantity) AS sales_value
FROM reseller r,
     json_table(r.suppliers,
                '$.suppliers[*]'
                COLUMNS (supplier_name VARCHAR2(32 CHAR)
                         PATH '$.supplier_name',
                         NESTED path '$.parts[*]'
                         COLUMNS (part_no VARCHAR2(32 CHAR)
                                  PATH '$.partNo',
                                  part_price NUMBER
                                  PATH '$.price',
                                  part_quantity NUMBER
                                  PATH '$.quantity'))) AS s
GROUP BY s.supplier_name,
         r.reseller_name
ORDER BY s.supplier_name;

Result:
8

Summary

As shown, the JSON function json_table() supports transforming hierarchically structured data into relationally structured data so that it is possible to execute expressive SQL queries, not only retrieving data, but also aggregating data on different levels of the hierarchy.

Once the hierarchical structure is un-nested, the data can be queried from different perspectives, aka, the 1:n relationship in the nested structure can be processed in both directions.

This is another example of the quite powerful and convenient integration of JSON into the world of relational databases and SQL. From an application development perspective this means that JSON documents can be used and at the same time made accessible to the full power of SQL.

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c – SQL for JSON (Part 3): Basic Joins

Having JSON support in a relational database means that the join operator is available not only for joining relational data or JSON data, but also for the mixed case: joining JSON and relational data. This opens up a whole new world of data modeling and query execution.

Running Example

This running example creates three tables, “demo”, “city” and “city_rel” and a sample data set in each table. The tables “city” and “city_rel” contain the same data set, once in JSON format, and once in relational format.

DROP TABLE demo;
CREATE TABLE demo
(
  id NUMBER,
  person CLOB 
    CONSTRAINT person_ensure_json 
    CHECK (person IS JSON (STRICT WITH UNIQUE KEYS)));
INSERT INTO demo VALUES
( 1, '{ "name": "Bob", "city": "SF"}' );
INSERT INTO demo VALUES
( 2, '{ "name": "Jake", "city": "PA"}' );
INSERT INTO demo VALUES
( 3, '{ "name": "Alice", "city": "NYC"}' );
INSERT INTO demo VALUES
( 4, '{ "name": "Jenn",  "city": {"name": "Tokyo"}}' );
INSERT INTO demo VALUES
( 5, '{ "name": "Jenn",  "city": ["Tokyo"]}' );
INSERT INTO demo VALUES
( 6, '{ "name": "Jenn",  "city": 66}' );
DROP TABLE city;
CREATE TABLE city
(
  id NUMBER,
  city CLOB 
    CONSTRAINT city_ensure_json 
    CHECK (city IS JSON (STRICT WITH UNIQUE KEYS)));
INSERT INTO city VALUES
( 101, '{"city": "SF", "state": "CA", 
  "country": "US"}' );
INSERT INTO city VALUES
( 102, '{"city": "PA", "state": "CA", 
  "country": "US"}' );
INSERT INTO city VALUES
( 103, '{"city": "NYC", "state": "NY", 
  "country": "US"}' );
INSERT INTO city VALUES
( 104, '{"city": {"name": "Tokyo"}, "state": null, 
  "country": "Japan"}' );
INSERT INTO city VALUES
( 105, '{"city": ["Tokyo"], "state": null, 
  "country": "Japan"}' );
INSERT INTO city VALUE
( 106, '{"city": 66, "state": null, 
  "country": "World"}' );
DROP TABLE city_rel;
CREATE TABLE city_rel
(
  id      NUMBER,
  city    VARCHAR(255),
  state   VARCHAR(255),
  country VARCHAR(255));
INSERT INTO city_rel VALUES
( 1001, 'SF', 'CA', 'US' );
INSERT INTO city_rel VALUES
( 1002, 'PA', 'CA', 'US' );
INSERT INTO city_rel VALUES
( 1003, 'NYC', 'NY', 'US' );
INSERT INTO city_rel VALUES
( 1004, '{"name": "Tokyo"}', NULL, 'World' );
INSERT INTO city_rel VALUES
( 1005, '["Tokyo"]', NULL, 'World' );
INSERT INTO city_rel VALUES
( 1006, '66', NULL, 'World' );

JSON Join

The following SQL statement is a simple join between JSON structures on the property “city”:

SELECT *
FROM demo d, city c
WHERE d.person.city = c.city.city;

This SQL statement projects in addition to joining JSON structures:

SELECT d.person, c.city
FROM demo d, city c
WHERE d.person.city = c.city.city;

The following SQL statement extends the projection:

SELECT 
  d.id,
  d.person.name,
  d.person.city,
  c.id,
  c.city.state,
  c.city.country
FROM demo d, city c
WHERE d.person.city = c.city.city;

JSON – Relational Join

 This SQL statement shows the join between JSON and relational data, combined with a projection:

SELECT 
  d.id,
  d.person.name,
  d.person.city,
  c_r.id,
  c_r.city,
  c_r.state,
  c_r.country
FROM demo d, city_rel c_r
WHERE d.person.city = c_r.city;

Significance of Pure and Mixed JSON Joins

As shown, the join operator is applied easily within JSON tables and across JSON and relational tables. When using Oracle 12c there is no restriction anymore when it comes to the join operator in conjunction of JSON documents.

Pure JSON joins are possible in context of Oracle 12c. This means that developers have a choice to model all data in a pure document form (trying to avoid the need for joins by creating sub-collections – which is almost impossible without denormalization), or to consciously model documents in such a way that the document nature is applied where applicable without having to necessarily de-normalize as the join operator is available.

The mixed case between JSON and relational tables goes a lot further as now data can be modeled according to its nature (not all data is exclusively document-oriented or relational) and its access path requirements without compromising either way.

In addition, the mixed case supports the situation where data is already present in the database in relational form and new data is added in JSON form. This means that even if data is available in relational form, additional data does not have to be in relational form, and the most appropriate representation can be chosen (and no separate document-oriented database has to be deployed, btw).

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c – SQL for JSON (Part 2): Basic Queries

This blog provides a small tour of basic SQL queries that operate on JSON in context of the Oracle Database 12c Release 1 (12.1.0.2.0).

Sample Data Set

The following very basic data set is used as an example in this blog. It is kept simple in order to be able to focus on the queries without having to deal with complex JSON objects at the same time.

First, a simple table is created that contains a JSON column. Next, some rows containing JSON objects are inserted.

DROP TABLE demo;
CREATE TABLE demo
( id NUMBER,
  player CLOB 
    CONSTRAINT player_ensure_json 
      CHECK (player IS JSON (STRICT WITH UNIQUE KEYS)));
INSERT INTO demo 
VALUES (1, '{"person": "Bob", "score": 10}');
INSERT INTO demo 
VALUES (2, '{"person": "Bob", "score": 20}');
INSERT INTO demo 
VALUES (3, '{"person": "Jake", "score": 100}');
INSERT INTO demo 
VALUES (4, '{"person": "Jake", "score": 200}');
INSERT INTO demo 
VALUES (5, '{"person": "Alice", "score": 1000}');

With the sample data set in place, we can now construct a complex query in several steps.

Selection and Projection

The most basic query selecting the complete data set is

SELECT * FROM demo d;

A basic projection extracting only the person from the JSON objects is

SELECT d.player.person FROM demo d;

A basic selection restricting the JSON objects is

SELECT d.player.person
FROM demo d
WHERE d.player.person IN ('Jake', 'Bob');

The syntax for accessing properties in JSON objects is in principle

<table alias>.<JSON column>.<path to JSON object key>

with variations on JSON array index references if required (http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246).

A more complex selection with an additional restriction is

SELECT d.player.person
FROM demo d
WHERE d.player.score > 0
  AND d.player.person IN ('Jake', 'Bob');

Ordering

Results can be ordered, for example, in the following way

SELECT d.player.person
FROM demo d
WHERE d.player.score > 0
AND d.player.person IN ('Jake', 'Bob');
ORDER BY d.player.person DESC;

Grouping

Results can be grouped also as a preparation for aggregation

SELECT d.player.person
FROM demo d
WHERE d.player.score > 0
  AND d.player.person IN ('Jake', 'Bob');
GROUP BY d.player.person
ORDER BY d.player.person DESC;

Aggregation

Different aggregation functions can be used to do some basic analysis

SELECT d.player.person,
  SUM(d.player.score),
  AVG(d.player.score),
  MIN(d.player.score),
  COUNT(*)
FROM demo d
WHERE d.player.score > 0
  AND d.player.person IN ('Jake', 'Bob');
GROUP BY d.player.person
ORDER BY d.player.person DESC;

Final Result

The final result is show here in table representation (copied from SQLDeveloper)

result

Inspiration

This example was inspired, in fact, by http://www.querymongo.com. There, the MySQL Query

sql_query

is translated to one of MongoDB’s query interfaces to

mongo_query

(web site accessed on 10/21/2014).

Summary

In summary, SQL functionality is available not only for the relational model in the Oracle Database 12c, but also for JSON-based data.

This makes the Oracle database a quite powerful JSON processing environment as querying JSON data is possible through the declarative SQL language.

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c – SQL for JSON (Part 1): Introduction to Native Support

The Oracle Database 12c Release 1 (12.1.0.2.0) [http://www.oracle.com/technetwork/database/enterprise-edition/overview/index.html] introduces native JSON support and SQL access to JSON. This blog post gives a first introduction.

SQL access to JSON: Native Support

SQL access to JSON is a significant development in itself, but native support in context of a relational database it is actually quite huge and exciting. This blog post series will provide a discussion of the various aspects over several installments, including the mixed use of SQL on JSON with SQL on relational tables.

But first things first.

2-Second Overview

This is a 2-second overview showing how to create a table that can store JSON data, how to insert a row containing a JSON object and how to query it with a simple query.

CREATE TABLE supplier
( id NUMBER NOT NULL
    CONSTRAINT supplier_pk PRIMARY KEY,
  supplier_doc CLOB
    CONSTRAINT supplier_doc_ensure_json 
      CHECK (supplier_doc IS JSON));
INSERT INTO supplier
VALUES (125,
'{
  supplierId: 125,
  "supplierName": "FastSupplier"}');
SELECT * FROM supplier;
        ID SUPPLIER_DOC
---------- -------------------------------------------------
       125 {supplierId: 125, "supplierName": "FastSupplier"}

That was easy 🙂

Creating a Table storing JSON

JSON data are stored in columns of regular tables. A constraint placed on a JSON column enforces JSON compliance:

CREATE TABLE supplier
( id NUMBER NOT NULL
    CONSTRAINT supplier_pk PRIMARY KEY,
  supplier_doc CLOB
    CONSTRAINT supplier_doc_ensure_json 
      CHECK (supplier_doc IS JSON));

Any attempt to insert invalid JSON data fails because of this constraint. Other columns are regular relational columns and they can be defined and constrained as necessary.

A JSON column stores JSON objects as well as JSON arrays as both are valid top-level JSON structures. Trying to insert scalars will fail. The following two insert statements are valid:

INSERT INTO supplier
VALUES (125,
'{
  supplierId: 125,
  "supplierName": "FastSupplier"}');
INSERT INTO supplier
VALUES (128,
'["empty_list_of_supplier"]');

While the JSON [json.org] standard allows duplicate members (aka, keys), many implementations only tolerate or even outright reject it. To avoid storing JSON objects with duplicate keys, the constraint of a JSON column can be extended:

CREATE TABLE supplier
( id number NOT NULL
    CONSTRAINT supplier_pk PRIMARY KEY,
  supplier_doc CLOB
    CONSTRAINT supplier_doc_ensure_json 
      CHECK (supplier_doc IS JSON (WITH UNIQUE KEYS)));

The following insert will fail with the additional constraint “WITH UNIQUE KEYS” (because of a duplicate keys), but would succeed otherwise.

INSERT INTO supplier
VALUES (126,
'{
  "supplierId": 126,
  "supplierName": "FastSupplier",
  "supplierName": "FS"}');

JSON has a defined syntax, however, many implementations relax it by e.g. allowing to state member names without quotes. To enforce a strict syntax, the constraint on a JSON table can be extended:

CREATE TABLE supplier
( id number NOT NULL
    CONSTRAINT supplier_pk PRIMARY KEY,
  supplier_doc CLOB
    CONSTRAINT supplier_doc_ensure_json 
      CHECK (supplier_doc IS JSON (STRICT WITH UNIQUE KEYS))

The following insert will fail (because one key is not enclosed in quotes), but succeed without the “STRICT” constraint:

INSERT INTO supplier
VALUES (125,
'{
  supplierId: 125,
  "supplierName": "FastSupplier"}');

With the various constraints and their combinations it is possible to restrict the flavor of JSON that is stored in the database. From an architecture perspective this means that the database can be the central point of enforcing conformity.

Since JSON data is stored in colums it is possible to create several columns in a table that contain JSON data. Here is an example of a table with two columns:

CREATE TABLE supplier
( id number NOT NULL
    CONSTRAINT supplier_pk PRIMARY KEY,
  supplier_doc CLOB
    CONSTRAINT supplier_doc_ensure_json 
      CHECK (supplier_doc IS JSON (STRICT WITH UNIQUE KEYS)),
  history_doc CLOB
    CONSTRAINT history_doc_ensure_json 
      CHECK (history_doc IS JSON (STRICT WITH UNIQUE KEYS)));

From a data modeling perspective this opens up a whole new dimension. A brief discussion follows later in this blog post.

Inserting JSON into a Table

Inserting JSON data into a table that has at least one JSON column is rather straight forward, as the previous examples have shown.

The insert statement must have a valid JSON object or JSON array in the position of the JSON column(s), or SQL NULL as the value of a JSON column can be unknown.

Querying JSON with SQL

Oracle Database 12c supports a number of functions to query and to manipulate JSON data. In the following only a first impression is given and the full set will be discussed in additional separate blog posts.

The most basic query was already introduced:

SELECT * FROM supplier;

Selecting scalars from JSON:

SELECT s.supplier_doc.supplierName FROM supplier s;

The select clause is what one would expect: the table name followed by the column name followed by the key name using dot notation. The return value is a table with a single column containing strings.

This query selects scalar value across different columns:

SELECT s.id, s.supplier_doc.supplierName FROM supplier s;

In order to be able to construct more interesting queries, a more complex JSON object is used (see the end of this blog post for its details – it is not included here as it is quite large).

Querying a JSON object looks like this:

SELECT s.supplier_doc.businessAddress FROM supplier s;

This returns a table with one column containing a string representing a JSON object. For those objects that do not have a ‘businessAddress’, a SQL null is returned.

Querying a JSON array is done like this:

select s.supplier_doc.shippers from supplier s;

Like previously, the dot-notation path leads to the JSON member that contains an array as value.

The queries included so far give a first impression of how to query JSON data in Oracle 12c. Upcoming blog posts will provide much more details on the query capabilities, amongst other discussions.

Table Design

With the introduction of JSON it is now possible to have a data model design that combines relational and JSON data modeling instead of having just a relational data model or just a JSON data model. A single table supports the combination of relational and JSON data types.

Some important design questions are:

  • Should top level scalar data items be separate columns, or top-level keys in a JSON object, or both?
    • In the above examples, a primary key column ‘id’ was designed, as well as a key ‘supplierId’ in the JSON object. The ‘id’ column enforces a primary key and supports access to the supplier identifier without accessing the JSON document. The JSON document, however, contains the supplier identifier to be self-contained.
  • Should all data be in one JSON document or is it more appropriate to have separate JSON objects in different columns?
    • In one of the above table creation statements two JSON columns where defined, one containing the supplier data, and another one containing history about the supplier. This supports the separation of data that should not be combined, e.g., for security or privacy reasons. An internal supplier history is not part of the operational supplier data and should be kept separate.

More design principles and best practices will emerge over time in this context in addition to the ones mentioned here so far.

Documentation

The documentation can be found here: [http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246]

Installation Notes

The Oracle Database 12c Release 1  (12.1.0.2.0) is available on Windows as well as Linux and Solaris at the time of this blog post [http://www.oracle.com/technetwork/database/enterprise-edition/downloads/index.html].

In case you want to run the Linux version on Windows, install Virtual Box (version 4.3.15 build 95923 (!) [https://forums.virtualbox.org/viewtopic.php?f=6&t=62615]) and then create a virtual machine with Oracle Linux 6. Once the VM is up and running, install the Oracle Database 12c Release 1 (12.1.0.2.0) in the VM and you are ready to go.

Example Large JSON Object

INSERT INTO supplier
VALUES (123,
'{
  "supplierId": 123",
  "supplierName": "FastSupplier",
  "rating": 5,
  "shippers": [{
    "shipperName": "TopSpeed",
    "address": {
      "street": "Sunrise",
      "streeNumber": 17,
      "city": "Sun City",
      "state": "CA",
      "zip": 12347
      }
    },
    {
    "shipperName": "PerfectPack",
    "address": {
      "street": "Solid Road",
      "streeNumber": 1771,
      "city": "Granite City",
      "state": "CA",
      "zip": 12348
      }
    },
    {
    "shipperName": "EconomicWay",
    "address": {
    "street": "Narrow Bridge",
    "streeNumber": 1999,
    "city": "Central City",
    "state": "CA",
    "zip": 12345
    }
  }],
  "businessAddress": {
    "street": "Main Street",
    "streeNumber": 25,
    "city": "Central City",
    "state": "CA",
    "zip": 12345
  },
  "warehouseAddress": {
    "street": "Industrial Street",
    "streeNumber": 2552, 
    "city": "Central City",
    "state": "CA",
    "zip": 12346
    }
  }');

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

NoSQL Databases: Data First, Schema Second? Or Vice Versa?

When using NoSQL databases, the notion of ‘schema’ enters the picture sooner or later. But when is the best time? And what to do about it?

Global vs. Local vs. Mixed Document Schema

One of the first distinctions is a global vs. local vs. mixed schema. A global schema is a schema that is defined for a given set of documents, e.g., a collection or table of documents. Every document in the collection must comply to the schema defined for the collection.

A local schema is a schema for a single document. Every document can have its own schema. It is possible that several documents follow the same schema. However, those are in general not grouped based on their schema.

A mixed schema is in part a global schema, and in part a local schema. This means that a document must contain certain properties as defined by the global schema, and the local schema allows additional schema elements on a per-document basis.

In terms of system examples:

  • Oracle NoSQL [http://www.oracle.com/us/products/database/nosql/] follows the local schema approach.
    • Documents in Oracle NoSQL are grouped by keys. Each document can have its own schema, aka, a local schema.
  • MongoDB [http://www.mongodb.org/] follows the mixed schema approach.
    • There is one property that must be present: ‘_id’ in all documents across all MongoDB collections and it must be unique. In this sense, MongoDB does not follow a pure local schema approach as one mandatory property is specified globally.
  • FoundationDB [https://foundationdb.com/] follows the global schema approach.
    • FoundationDB implements the concept of ‘table groups’ and supports query results to be serialized as JSON objects. However, from the viewpoint of the data model, it is relational and the hierarchical structure (aka, sub-documents) comes into play through foreign keys and SQL extensions that have been explored a long time ago in context of NF2 relations.
  • Oracle 12c [http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246] follows the local schema approach.
    • A JSON document is stored in a column of a table. That column is completely schema-free so that JSON documents of any schema can be stored and hence the schema is local.

Explicit vs. Implicit Document Schema

A second important distinction is an explicit vs. an implicit schema (or extensional vs. intentional schema). An extensional schema is defined through a schema representation format (e.g., Avro [http://avro.apache.org/]) or SQL-style DDL statements. An intentional schema does not have a separate representation, but it can be derived from the structure of a document instance (not always unambiguously).

In terms of system examples:

  • Oracle NoSQL: supports both, explicit and implicit schemas
  • MongoDB: implicit schema
  • FoundationDB: explicit schema
  • Oracle 12c: implicit schema

From a different perspective an implicit schema means that a document can be stored as it is without having to define a schema for it and without checking that it conforms to a schema. An explicit schema requires the schema to be defined and that documents are compliant before they can be stored successfully.

The ‘edge case’ is MongoDB that enforces the property ‘_id’ in every document. If it is not present, it will be automatically added. While MongoDB supports mainly an implicit schema, ‘_id’ is the exception.

Mandatory Document Schema

A third important distinction is the requirement for a mandatory schema before documents can be stored. If the database requires a schema, documents cannot be inserted before the schema is specified. If a schema is not mandatory, documents can be stored without having a schema in place.

In terms of system examples:

  • Oracle NoSQL: no mandatory schema
  • MongoDB: no mandatory schema
  • FoundationDB: requires a mandatory schema
  • Oracle 12c: no mandatory schema

Roles of Database Users

Even though one could get the impression, not everybody using a database in his or her role necessarily likes a document database where every document can have its own schema (schema-less, or better, schema-varying database). Some users do, some users don’t. For the sake of discussion, let’s distinguish two roles in this blog:

  • Data Collector. A data collector is a role for collecting data initially. A data collector determines interesting data to store and that data might or might not be used downstream for further processing. However, it is important for some data to be collected in case it becomes important down the road. And it is important to store data unmodified as the rules of modification (e.g., cleansing, transformation) might not be known at the time of the data collection.
  • Data User. A data user fundamentally applies all CRUD (create, read, update, delete) operations on a data set and in most cases through an application system that implements the business logic as well as the business rules. The data user is familiar with the business logic and the business rules in context of the application domain, like a financial application or a forecasting tool.

A data user can be a data collector also as the C (create) function creates data. In this case a data user can be a data collector as well.

Schema First or Schema Second?

From the viewpoint of the two roles, schema management plays an important role. In a black-and-white categorization, the two roles have the following desires:

  • Data Collector. A data collector’s goal is to collect relevant or potentially relevant data. Depending on the data sources, there might not be time or opportunity to define a schema first, and it might not be feasible to maintain a schema in the long run (including schema migration). A data collector therefore likes
    • local schema – implicit schema – not mandatory schema (‘schema second’)
  • Data User. A data user has to accomplish work and prefers clear-cut business rules and business functionality so that the application semantics is clear. A data user therefore likes
    • global schema – explicit schema – mandatory schema (‘schema first’)

The data collector stores data into a data collector database, and the database of the data user is called data user database. Both databases can be the same, or they can be two different databases, as discussed later.

A data collector, after having collected data, might want to query it for e.g. statistical and analytical purposes (e.g., how much data was collected, how many documents have a reference to a product description, etc.). So a data collector would prefer to have a schema after the collection of data in support of query formulation; so schema second.

A data user rarely operates on data directly, however, software engineers have to implement the business logic and business rules. Software engineers, for sure, enjoy certainty when it comes to a schema as variations cause significant code complexity in the general case; so schema first.

Co-existence of Schema First and Schema Second?

In the ideal case, both approaches, schema first and schema second are supported at the same time. This would make both roles happy and support their particular use cases. Let’s explore a few options:

  • ETL (extract – transform – load) from data collector database to data user database
    • This approach suggests an explicit transformation step that extracts data from the data collector database and adds it to the data user database. Along the way data type transformations can take place as well as handling of null values, absent properties, and other data modeling specifics. The extraction can be partial so that only relevant data are extracted. However, data duplication (at least partial) is one downside, amongst others.
  • View on data collector database
    • This approach creates a view through which access is provided on a single database (data collector and data user combined). A view could deal with the various transformation tasks; however, updates and deletions might be difficult or even impossible. So this is a potential solution only for the case of read access.
  • Automatic schema extraction
    • A schema can be derived from an document. It is therefore possible that for the data collector database the set of all possible schema can be made available to the data user. If the data user creates a super-set then a global schema is available. Of course, for a given document the schema only partially applies and the business functionality and business rules have to be aware of this. In this case also, the data collector and data user database are the same.
  • Intelligent ORM Layer
    • An ORM layer could provide the impression of a fixed document structure towards the data user, while being able to deal with the heterogeneous document schemas internally. If the ORM layer is flexible enough, it can provide updates as well as delete functionality, and if necessary, an extension mechanism to add custom code in order to make the delete or update functionality specific to the given document set. In this case also, the data collector and data user database are the same.

From an implementation perspective an ORM layer seems to be a practical approach as it allows to separate the transformation and update/delete logic from the application logic, while operating on a single database. However, every data access has to execute some transformation logic in general.

If space is of secondary concern or if the data set for the data user is a lot smaller than that of the data collector, the ETL approach might be preferable as the transformation logic is separate from the data access logic of the application systems.

Automatic schema extraction is certainly helpful in all cases as the schema has to be known in order to implement the ORM layer or the ETL component. If NoSQL databases start implementing a view mechanism then this might be preferable for read-only access situations.

Schema-Varying Languages?

It is, of course, tempting to ask if there is an easy and elegant way to deal with local schemas in application systems? Is is possible to write an application system that does not require a schema in the first place?

While this is a huge topic on its own, server-side JavaScript might be a good place to start as the language is not based on a class/instance paradigm, but prototype approach. It’s type system is almost equivalent with JSON. The language, therefore, is able to represent documents with local schema easily and effortlessly. Since JavaScript can introspect objects and since it implements the prototype mechanism it is possible to represent local functionality for documents with a local schema. One of the bigger questions is how to represent this flexibility to the end user on user interfaces in an ergonomic way.

However, this is a discussion on its own and I’ll save it for a later blog.

Summary

When is a good time to deal with the notion of ‘schema’ in NoSQL database projects? As the discussion has shown, different databases provide different schema support and different users look for schema support at different point in the data life cycle, if at all.

So unless schema is completely irrelevant in your project (and will stay irrelevant for sure), the discussion cannot start early enough because depending on its importance it might influence the database selection as well as the overall product architecture and implementation effort around schema maintenance and enforcement.

Trending: Multi-Interface and Multi-Data-Model Databases

An interesting development, especially in the NoSQL database space, is the development towards multi-interface and multi-data-model databases, and sometimes both at the same time. While it provides flexibility, it also brings challenges.

Multi-Data-Model Support

In the relational database space, supporting different data models concurrently is not a novelty. Relational databases started off with the relational data model implementation, and later on some of the systems extended the relational model mainly by objects, XML or JSON.

Some databases in the NoSQL space are starting to evolve, too, in this manner and are providing more than one data model concurrently. Some interesting examples are discussed next.

One example in the NoSQL space is Oracle NoSQL [http://www.oracle.com/technetwork/database/database-technologies/nosqldb/]. This system supports a key/value data model whereby the key is used to identify values that are not interpreted by the database itself. In addition, values can be of complex types that are actually interpreted by the database, e.g., in secondary indexes.

Aerospike is another example in the NoSQL space [http://www.aerospike.com/]. Aerospike provides a data model consisting of basic and complex types. In addition, it supports language-native formats as well as large data types that have a specific operational characteristics and data type operations tuned for scale.

Like some relational databases extended data models over time to support specific use cases in a more direct way, some NoSQL databases are also going down that path to more directly support specific application developer needs.

Multi-Interface Support

From an application development perspective a single query API is certainly preferable that provides the complete query expressiveness required. However, especially in the new area of NoSQL databases, it is not clear yet what a good query API actually looks like. This can be observed by different systems providing different query API alternatives.

MongoDB [http://www.mongodb.org/] has a document query interface based on query patterns in form of JSON documents (“Query Documents”). In addition, it provides a map/reduce interface and aggregation pipelines. There are three different APIs that an application developer can choose, and, in addition, they overlap in their functionality. This means that, depending on the query, it can be expressed in all three of them.

Aerospike [http://www.aerospike.com/] provides different language drivers in addition to an Aerospike Query Language.

Cloudant [http://www.cloudant.com], in contrast, supports a REST-api as well as a query interface based on query documents (similar to MongoDB).

Not strictly an external interface, but very important for specific use cases, is the ability to add functionality dynamically to the database in order to move some processing from the application systems into the database itself: user defined functions. For example, MongoDB allows adding functionality through JavaScript functions, whereas Aerospike supports two different types of Lua functions: record user defined functions operate on single records, whereby stream user defined functions support distributed processing.

The Good

Unquestionably, the good part about multi-interface and multi-data-model databases is that an application developer can choose the best combination of data model and access interface for a particular development task. The impedance mismatch between the problem and the solution can be minimized with an appropriate choice.

This also means that developers need to understand the pros and cons of every combination and that requires to go through a learning curve. Going through that learning curve might pay off big time.

In addition, application development teams will have to manage a wider range of implementation variations in terms of application design and engineering, but also in terms of bug fixing and application code maintenance.

The Tricky

The tricky part of multi-interface and multi-data-model databases is that all combinations can be used concurrently, in production as well as post-production (e.g. analytics). Unit and functional tests as well as performance and scale tests become a lot more complicated as they have to test the concurrent execution of various combinations.

Furthermore, many queries can be expressed in different interfaces as those tend to overlap in query expressiveness. So an application developer needs to clearly understand the pros and cons of the query execution that underlies a specific query interface.

Hopefully the query semantics is the same for all combinations (meaning, for example, that predicates are evaluated the same way) and that concurrent use of data models and interfaces does not negatively impact the various clients in terms of concurrency, scale and performance. Any bug introduced through a discrepancy might be very difficult to reproduce and fix.

Summary

While multi-interface and multi-data-model databases are a powerful technology, there is considerable impact to the application system engineering activities in terms of knowledge acquisition, development, test and maintenance.

While database vendors certainly strive to have all combinations work in harmony, there might be edge cases where one combination does not give the same result as a different one. From an application development perspective test coverage should ensure semantic equivalence of the used combinations so that misinterpretations or wrong results are avoided.