SQL for JSON and Schema Support (Part 2): Where does the “Interesting” Code go?

The previous blog found that the “generic” indirect representation of JSON data is one way of supporting “schema-free” JSON objects or documents. Where does the “interesting” functional code live?

Indirect Representation

To recap, the indirect representation is a set of classes, functions, etc. (depending on programming language) that can manage JSON objects or JSON documents. All or most languages have libraries supporting JSON manipulation. For example, Jackson is such a library for Java.

These JSON libraries can manage any valid JSON structure; they require neither a schema nor homogeneous JSON objects. Two JSON objects representing the same concept, like an order, but with different attributes (as shown in the previous blog) can be managed by such libraries.

Structural Manipulation

Structural manipulation of JSON objects supports the addition, update or deletion of properties (members) as well as of JSON array elements. Property values can also be replaced, for example a JSON string with a JSON object.

Through structural manipulation it is possible to change a JSON object as needed, for example when new details appear in the form of additional properties.

Structural manipulation was demonstrated in a database context in the last blog: properties were added through the UPDATE statement. The same is possible with the indirect representation libraries in the various programming languages.

Computation

Structural manipulation is not the only code that is required, since it does not by itself compute any application-specific semantics. For example, in the context of orders, the total value of not-yet-shipped orders might be a value that needs to be computed.

In a database context this would be an aggregation query that sums up the amount of all orders that do not have the status ‘shipped’.
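As a hedged sketch, assuming a table ‘orders’ with a JSON column ‘order_doc’ that carries an ‘amount’ and a ‘status’ property (these names are illustrative, not taken from the previous blog), such an aggregation might look like this in Oracle 12c:

SELECT SUM(o.order_doc.amount) AS open_order_value
FROM orders o
WHERE o.order_doc.status <> 'shipped';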

In the context of a programming language it would require a function that iterates through all orders and, like the database aggregation, adds up the values of those orders that have not shipped yet.

It probably would be implemented as a set of cooperating functions, like

DollarAmount getValueOfOrdersNotShipped(JSONArray orders)
boolean hasOrderShipped(JSONObject order)
DollarAmount getValueOfOrder(JSONObject order)

JSONArray and JSONObject are examples of an indirect representation holding the order data as JSON structures.
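To make this concrete, here is a minimal sketch in Java using the org.json library as the indirect representation. The property names (‘status’, ‘amount’) and the use of BigDecimal in place of DollarAmount are assumptions for illustration only:

import java.math.BigDecimal;
import org.json.JSONArray;
import org.json.JSONObject;

public class OrderFunctions {

    // Sums the value of all orders that have not shipped yet.
    public static BigDecimal getValueOfOrdersNotShipped(JSONArray orders) {
        BigDecimal total = BigDecimal.ZERO;
        for (int i = 0; i < orders.length(); i++) {
            JSONObject order = orders.getJSONObject(i);
            if (!hasOrderShipped(order)) {
                total = total.add(getValueOfOrder(order));
            }
        }
        return total;
    }

    // Assumed convention: a property "status" holding the value "shipped".
    public static boolean hasOrderShipped(JSONObject order) {
        return order.has("status") && "shipped".equals(order.getString("status"));
    }

    // Assumed convention: a numeric property "amount".
    public static BigDecimal getValueOfOrder(JSONObject order) {
        return BigDecimal.valueOf(order.getDouble("amount"));
    }
}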

Note: of course, in the absence of a schema (which is assumed here), there is no assurance that the JSONArray or the JSONObject contains only orders, or that the orders are homogeneous in structure. There has to be “trust” that this is indeed the case.

If validation is desired, and no schema is available, then the only alternative is validating the values of one or more JSON object properties. For example, order identifiers might follow a specific structure that marks a value as an order identifier. This would require trust that the algorithms creating the identifiers are correct.

Separation of Manipulation and Computation

The JSON libraries supporting the indirect representation are separate from the functional code (like the summing up of order values). The software architecture and design have to structure this separation and ideally ensure that all functions concerned with orders are “close” from a code structure or software architecture perspective.

There might be functions that can be reused across different concepts (like orders, returns, shipments, etc.), and they can be refactored out, of course, as in “normal” functional code.

Given the above rationalization, how does the absence of a schema come into the picture?

Implication of Schema Free JSON Objects

Since there is no schema, JSON objects can have different structures even though they represent the same concept. In the context of orders, let’s look at two use cases:

  • An order does not have a shipping status
  • An order does have a value but in a variety of data types

In a world without a schema these are possible use cases, and the functional code needs to check for them.

Addressing the first use case can be accomplished by checking for existence. Code can check if a property is present and react accordingly. In the above example, the code designer can choose to have hasOrderShipped() return false or throw an error in case there is no shipping status.

The second use case can be addressed by checking the type of the order’s value. If possible, value transformations can be implemented in getValueOfOrder(), e.g., string to number; if a transformation is not possible, an error can be thrown.
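As a sketch of these two checks (again using org.json; the property names and the string-to-number fallback are assumptions), the two functions could be hardened like this:

import java.math.BigDecimal;
import org.json.JSONObject;

public class OrderChecks {

    // Use case 1: the shipping status may be missing entirely.
    // Design choice here: treat a missing or null status as "not shipped" instead of throwing.
    public static boolean hasOrderShipped(JSONObject order) {
        if (!order.has("status") || order.isNull("status")) {
            return false;
        }
        return "shipped".equals(order.getString("status"));
    }

    // Use case 2: the order value may arrive as a number or as a string.
    public static BigDecimal getValueOfOrder(JSONObject order) {
        Object value = order.get("amount");
        if (value instanceof Number) {
            return new BigDecimal(value.toString());
        }
        if (value instanceof String) {
            try {
                return new BigDecimal((String) value); // attempt the transformation
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("amount is not a number: " + value, e);
            }
        }
        throw new IllegalArgumentException("unsupported type for amount: " + value.getClass());
    }
}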

Summary

In a schema-free JSON context there are several aspects to consider from a code perspective: functional code implementing application semantics is separate from the code that manages the structure of JSON objects, and that separation must be carefully managed from an architectural perspective.

The functional code must anticipate non-homogeneous JSON objects and check for variation in order to be able to implement the functionality accurately.

But wait, there is more:-) The next blog will venture into more nuances.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c – In-Memory Option (Part 3): JSON Support

It is quite natural to view Oracle’s In-Memory Option in the context of the relational model; however, the In-Memory Option supports the JSON model at the same time as well.

Oracle JSON Support

As shown earlier in this blog, the Oracle 12c database supports JSON natively [http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246] and incorporates JSON access into SQL so that JSON structures can be accessed through SQL directly. In addition, in a single SQL statement, JSON structures as well as relational tables can be accessed at the same time. Please see the documentation and the other blog entries for more details.

Oracle In-Memory Option

The In-Memory Option [http://www.oracle.com/us/corporate/features/database-in-memory-option/] was introduced in the previous two blog entries, which discussed a relational example as well as queries that allow introspecting the metadata and the status of In-Memory Option processing. They gave an overview as well as the rationale for why the In-Memory Option (a columnar representation of data in main memory) is advantageous for analytic queries compared to regular relational processing.

Combination: JSON processing using In-Memory Option

Analytics in the context of JSON structures benefits from In-Memory Option support as well. Many applications based on JSON structures have to support some form of analytics, and the In-Memory Option supports aggregation functionality on JSON structures as well as on relational tables.

In the following, an example is introduced that creates a table containing a JSON structure, enables it for the In-Memory Option, and shows a few simple analytics queries.

Example Table with JSON Structure

The following declares a sample table with a column containing a JSON structure.

DROP TABLE js_players;
CREATE TABLE js_players
 (
   player_id NUMBER NOT NULL,
   player_name VARCHAR(255),
   games VARCHAR(4000) 
    CONSTRAINT games_ensure_json
    CHECK (games IS JSON (STRICT WITH UNIQUE KEYS)));

Table Population

In order to have a large data set, the following block creates rows containing JSON structures and inserts them into the table declared above.

DECLARE
  counter NUMBER;
  player_name VARCHAR(128);
  games_value VARCHAR (256);
BEGIN
  counter := 0;
  LOOP
    counter := counter + 1;
    player_name := 'Divvon' || '_' || counter;
    games_value := '{ "games":
      [{"name":"tic-tac-toe", 
        "points":' || counter *10 || '},
       {"name":"one-two-three", 
        "points":' || counter * 100 || '}
      ]}';
    INSERT INTO js_players VALUES
      (counter, player_name, games_value);
  EXIT
    WHEN counter = 1000000;
  END LOOP;
  COMMIT;
  RETURN;
END;

An example JSON document looks like this:

{"games":
  [{"name":"tic-tac-toe", "points":750},
   {"name":"one-two-three", "points":7500}
  ]}

Enabling In-Memory Option

This statement enables the In-Memory Option for the table containing JSON structures:

ALTER TABLE js_players inmemory priority critical;

Aggregation Queries

The following queries show aggregation over elements of the JSON structures. Each query extracts all ‘points’ values from the JSON structure (since every document may have several ‘points’ fields in the ‘games’ array) and makes them accessible as ‘pts.points’. It then aggregates over these values. The first query aggregates over all rows, whereas the second query selects only a few rows based on ‘player_name’.

SELECT DISTINCT p.player_name,
       SUM(pts.points) sum_points,
       MIN(pts.points) min_points,
       MAX(pts.points) max_points,
       AVG(pts.points) avg_points,
       COUNT(*)
FROM js_players p,
     json_table(p.games, 
                '$.games[*]' COLUMNS (points VARCHAR2(32 CHAR) PATH '$.points')) 
         AS pts
GROUP BY p.player_name
ORDER BY sum_points DESC;

[Screenshot: query result]

SELECT DISTINCT p.player_name,
       SUM(pts.points) sum_points,
       MIN(pts.points) min_points,
       MAX(pts.points) max_points,
       AVG(pts.points) avg_points,
       COUNT(*)
FROM js_players p,
     json_table(p.games, 
                '$.games[*]' COLUMNS (points VARCHAR2(32 CHAR) PATH '$.points')) 
       AS pts
WHERE p.player_name = 'Divvon_204'
      OR p.player_name = 'Divvon_206'
GROUP BY p.player_name
ORDER BY sum_points DESC;

[Screenshot: query result]

Summary

Because Oracle’s In-Memory Option supports not only the relational model but also the JSON model, it is an interesting option for analytics on JSON data sets. This blog has shown an example of how to combine JSON structures with the In-Memory Option in order to be able to run analytics queries.

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c – In-Memory Option (Part 2)

Part 1 introduced an Oracle In-Memory Option example; in Part 2 I’ll introduce a few queries that show metadata and statistics in the context of the Oracle In-Memory Option.

Table Status

One interesting set of metadata is a table’s in-memory status as well as its compression and loading priority. A table with the in-memory option enabled is called an “in-memory table” in the following, for short. The following query retrieves this metadata for the in-memory table “IM_PLAYERS”:

SELECT table_name,
       inmemory,
       inmemory_compression,
       inmemory_priority
FROM dba_tables
WHERE table_name = 'IM_PLAYERS';

[Screenshot: query result]

Loading an in-memory table takes some time initially, and its load status is important: only when the table is fully loaded is the full in-memory functionality available. The following query shows the load status by retrieving the number of bytes not yet populated into the in-memory area:

SELECT v.segment_name name,
       v.populate_status populate_status,
       v.bytes_not_populated bytes_not_populated
FROM v$im_segments v
ORDER BY 1;

[Screenshot: query result]

The size of an in-memory table is important in order to determine if the memory allocation for the in-memory option is sufficient. The following query fetches various sizes:

SELECT v.owner,
       v.segment_name,
       v.populate_status,
       ROUND(v.bytes / 1000 / 1000, 2) 
           AS size_in_mb,
       ROUND(v.inmemory_size / 1000 / 1000, 2) 
           AS inmemory_size_in_mb,
       ROUND(v.inmemory_size / v.bytes, 2) * 100 
           AS im_size_percent
FROM v$im_segments v;

[Screenshot: query result]

Basic Usage

When running queries against in-memory tables, the in-memory option might or might not be used, depending on the query optimizer’s decisions. The following two statements, executed together, show whether the in-memory option was used:

EXEC dbms_feature_usage_internal.exec_db_usage_sampling(SYSDATE);
SELECT ul.name,
       ul.detected_usages
FROM dba_feature_usage_statistics ul
WHERE ul.version =
          (SELECT MAX(u2.version)
           FROM dba_feature_usage_statistics u2
           WHERE u2.name = ul.name
                 AND ul.name LIKE 'In-%');

[Screenshot: query result]

A query to an in-memory table advances the value of detected_usages if the in-memory representation of that table was in fact accessed.

Statistics

If a query triggers the use of the columnar representation managed by the in-memory option, it is important to understand the specific statistics in order to see how much optimization took place. The following query retrieves some of the available statistics:

SELECT t.display_name,
       s.value,
       s.sid,
       t.statistic#
FROM v$statname t, v$mystat s
WHERE t.display_name IN ('IM scan CUs columns accessed', 
                         'IM scan segments minmax eligible', 
                         'IM scan CUs pruned')
      AND s.statistic# = t.statistic#
ORDER BY s.sid DESC;

[Screenshot: query result]

Being able to retrieve the statistics makes it possible to compare them before and after query execution. In the next section a selective query as well as an aggregation query is run, and the statistics are examined before and afterwards.

Example

The first example is a selective query that retrieves all columns of two players (rows).

SELECT * 
FROM cb.im_players 
WHERE player_id = 10 OR player_id = 20;

[Screenshot: query result]

First, the statistics are shown before the query is executed, and then after it is executed.

[Screenshot: statistics before query execution]

[Screenshot: statistics after query execution]

The second example is a non-selective query that aggregates over all rows.

SELECT player_name,
       COUNT(*),
       MIN(player_score),
       MAX(player_score),
       AVG(player_score),
       SUM(player_score)
FROM cb.im_players
GROUP BY player_name;

[Screenshot: query result]

Again, the statistics before and after the query execution are shown.

[Screenshot: statistics before query execution]

[Screenshot: statistics after query execution]

Summary

This blog introduced a series of queries that allow retrieving metadata as well as statistics on query execution in the context of the Oracle In-Memory Option. The nice part is that these are regular SQL queries that can be run manually, but also as part of development or operations tools. It is therefore possible to examine the use of in-memory tables and their behavior in a running application.

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c – In-Memory Option (Part 1)

The name ‘In-Memory Option’ does not mean Oracle turned into a general main memory database or has a new option to do so; it means something else entirely.

Aggregation

Columnar data representation is geared towards aggregation of columns in relational databases. In order to make aggregation fast, the values of a column should be ‘near’ each other in terms of memory layout; ideally, they are stored next to each other. Access for aggregation is very efficient with this layout. A database implementing this approach is called a columnar database (http://en.wikipedia.org/wiki/Column-oriented_DBMS).

Specialized databases exist that structure the data on disk as well as in memory in a columnar format. Once the data is in such a database, aggregation can be fast. However, the data has to be in that database, and chances are that the same data also lives in a separate OLTP database (a second copy). This means that a data integration problem has to be solved in order to ensure that data managed in the OLTP database is also available (with sufficiently small delay) in the columnar database (or columnar store). And, since the data changes in the OLTP database, the integration effort is an ongoing process.

In-Memory Option

Oracle 12c’s In-Memory Option [http://www.oracle.com/us/corporate/features/database-in-memory-option/] combines the columnar storage with OLTP storage in the same database. The columnar storage layout is only implemented in main memory, leaving the disk layout as it is. This means that OLTP and columnar access is possible in the same database, concurrently, and the data is always 100% transactionally consistent, as no delayed data integration has to take place. The ‘artificial’ separation of database functionality into separate databases to satisfy aggregation processing requirements is not necessary anymore in such a database management system.

Example

Since Oracle 12c combines the relational and the columnar representation, using the columnar functionality is fully declarative without any operational management overhead.

In the following, a small example shows how to make use of the Oracle 12c In-Memory Option concurrently with the regular OLTP load.

Assume a regular table that has been in place for a while and is accessed by a regular application (OLTP use case). This table manages data and is changed according to the application logic.

DROP TABLE im_players;
CREATE TABLE im_players
  (
   player_id NUMBER,
   player_name VARCHAR2(128),
   player_score NUMBER);

Here is a way to populate the table with generated content for testing purposes:

DECLARE
  player_id NUMBER;
  player_score NUMBER;
  player_name VARCHAR(128);
BEGIN
  player_id := 0;
  LOOP
    player_id := player_id + 1;
    player_name := 'player_' || player_id;
    player_score := player_id * 10;
    INSERT INTO im_players VALUES
      (player_id, player_name, player_score);
    player_score := player_id * 100;
    INSERT INTO im_players VALUES
      (player_id, player_name, player_score);
    player_score := player_id * 1000;
    INSERT INTO im_players VALUES
      (player_id, player_name, player_score);
  EXIT
    WHEN player_id = 1000000;
  END LOOP;
  COMMIT;
  RETURN;
END;

A new requirement has to be fulfilled based on business needs. The requirement is to analyze the existing table regularly while it is being used by the OLTP application. In the following this requirement is implemented with the Oracle 12c In-Memory Option.

First, memory has to be set aside for the columnar representation of the data in main memory:

ALTER system SET inmemory_size = 300M scope=spfile;

Restarting the database enables this change and this query confirms the setting:

SELECT name, value FROM v$sga;

[Screenshot: query result]

Now the database has a main memory area (In-Memory Area) set aside for the columnar representation of relational tables.

Next, the existing table definition is modified so that the table is additionally kept in a columnar representation:

ALTER TABLE im_players inmemory priority critical;

There are different options that indicate how critical it is to have the rows in columnar format, how to compress the data depending on the access pattern, and which columns should be in columnar format, as not all columns are necessarily required.

The ALTER statement above indicates that it is critical for the data to be in columnar format, so the database will diligently load the data into the columnar memory section. Compression stays at the default level, which is geared towards query access, and all columns are included, matching the use case.
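As a sketch of these clauses (the exact combinations should be verified against the Oracle documentation), compression levels and per-column settings could also be specified explicitly, for example:

ALTER TABLE im_players
  INMEMORY MEMCOMPRESS FOR QUERY LOW PRIORITY CRITICAL
  INMEMORY MEMCOMPRESS FOR CAPACITY HIGH (player_name);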

The following query shows how much data is not yet populated. Loading the data is a process that takes time, and the query supports checking the load status. Once loading is complete, zero bytes remain unpopulated.

SELECT 
  v.segment_name name,
  v.populate_status populate_status,
  v.bytes_not_populated bytes_not_populated
FROM v$im_segments v
ORDER BY 1;

[Screenshot: query result]

The following query performs an aggregation.

SELECT
  player_name,
  COUNT(*),
  MIN(player_score),
  MAX(player_score),
  AVG(player_score),
  SUM(player_score)
FROM im_players
GROUP BY player_name;

[Screenshot: query result]

The query plan of this query shows that the columnar representation is being accessed.

----------------------------------------------------------------------------------------
| Id | Operation                  | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT           |            | 3025K | 227M  | 216 (38)   | 00:00:01 |
|  1 | HASH GROUP BY              |            | 3025K | 227M  | 216 (38)   | 00:00:01 |
|  2 | TABLE ACCESS INMEMORY FULL | IM_PLAYERS | 3025K | 227M  | 141 (5)    | 00:00:01 |
----------------------------------------------------------------------------------------

Access statistics indicate that the query actually executed against the columnar representation: the values after executing the aggregation query differ from the values before the aggregation.

Note on SQL

An important aspect of the Oracle 12c In-Memory Option is that the SQL specifying the required aggregation is regular SQL and no new or different language constructs had to be introduced. The underlying reason is that the query optimizer has sufficient knowledge to make the appropriate access choices based on the SQL presented and does not require hints from users. This is a huge engineering benefit since no new or different language has to be learned to make use of the columnar representation.

Summary

This was a super-brief introduction into the Oracle 12c In-Memory Option. The documentation has of course a lot more background [http://docs.oracle.com/database/121/ADMIN/memory.htm#ADMIN14257]. In addition, a nice blog series exists that introduces the various aspects step-by-step [https://blogs.oracle.com/In-Memory/entry/getting_started_with_oracle_database].

From a data management perspective, an optimized representation targeted at aggregation is enabled declaratively within the same database that holds the OLTP data. This is a huge benefit, as one database supports the mixed workload and any data integration setup between different databases becomes unnecessary. With significantly increasing main memory sizes, this option frees up engineering cycles as well as operational resources while increasing data consistency and processing performance.

Go SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c – SQL for JSON (Part 2): Basic Queries

This blog provides a small tour of basic SQL queries that operate on JSON in the context of the Oracle Database 12c Release 1 (12.1.0.2.0).

Sample Data Set

The following very basic data set is used as an example in this blog. It is kept simple in order to be able to focus on the queries without having to deal with complex JSON objects at the same time.

First, a simple table is created that contains a JSON column. Next, some rows containing JSON objects are inserted.

DROP TABLE demo;
CREATE TABLE demo
( id NUMBER,
  player CLOB 
    CONSTRAINT player_ensure_json 
      CHECK (player IS JSON (STRICT WITH UNIQUE KEYS)));
INSERT INTO demo 
VALUES (1, '{"person": "Bob", "score": 10}');
INSERT INTO demo 
VALUES (2, '{"person": "Bob", "score": 20}');
INSERT INTO demo 
VALUES (3, '{"person": "Jake", "score": 100}');
INSERT INTO demo 
VALUES (4, '{"person": "Jake", "score": 200}');
INSERT INTO demo 
VALUES (5, '{"person": "Alice", "score": 1000}');

With the sample data set in place, we can now construct a complex query in several steps.

Selection and Projection

The most basic query selecting the complete data set is

SELECT * FROM demo d;

A basic projection extracting only the person from the JSON objects is

SELECT d.player.person FROM demo d;

A basic selection restricting the JSON objects is

SELECT d.player.person
FROM demo d
WHERE d.player.person IN ('Jake', 'Bob');

The syntax for accessing properties in JSON objects is in principle

<table alias>.<JSON column>.<path to JSON object key>

with variations on JSON array index references if required (http://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246).

A more complex selection with an additional restriction is

SELECT d.player.person
FROM demo d
WHERE d.player.score > 0
  AND d.player.person IN ('Jake', 'Bob');

Ordering

Results can be ordered, for example, in the following way

SELECT d.player.person
FROM demo d
WHERE d.player.score > 0
  AND d.player.person IN ('Jake', 'Bob')
ORDER BY d.player.person DESC;

Grouping

Results can also be grouped, for example as a preparation for aggregation

SELECT d.player.person
FROM demo d
WHERE d.player.score > 0
  AND d.player.person IN ('Jake', 'Bob')
GROUP BY d.player.person
ORDER BY d.player.person DESC;

Aggregation

Different aggregation functions can be used to do some basic analysis

SELECT d.player.person,
  SUM(d.player.score),
  AVG(d.player.score),
  MIN(d.player.score),
  COUNT(*)
FROM demo d
WHERE d.player.score > 0
  AND d.player.person IN ('Jake', 'Bob')
GROUP BY d.player.person
ORDER BY d.player.person DESC;

Final Result

The final result is shown here in table representation (copied from SQL Developer):

[Screenshot: query result]

Inspiration

This example was inspired, in fact, by http://www.querymongo.com. There, the MySQL Query

[Screenshot: SQL query]

is translated into the following query for one of MongoDB’s query interfaces:

[Screenshot: MongoDB query]

(web site accessed on 10/21/2014).

Summary

In summary, SQL functionality is available not only for the relational model in the Oracle Database 12c, but also for JSON-based data.

This makes the Oracle database a quite powerful JSON processing environment as querying JSON data is possible through the declarative SQL language.

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Schema-free Database (Part 2): Relational Database Management System (RDBMS)

As outlined in Part 1 of this series (https://realprogrammer.wordpress.com/2013/11/02/schema-free-database-part-1-an-oxymoron/), a ‘schema-free database’ is an oxymoron and in fact the notion of schema is changing from a more restrictive to a more flexible interpretation in context of NoSQL database technology.

So it is only consequential to ask the question the other way around (as a thought experiment): is it possible to build a relational database management system that does not enforce a schema, and if so, what would such a system look like on an abstract level?

Yes, it is possible to have a non-schema-enforcing RDBMS. Let’s discuss two variations next.

Definition of No-Schema-Enforcing Relational Database Management System

What functionality would have to be altered in order to provide a no-schema-enforcing RDBMS? If it were possible to create a table without specifying columns (i.e., only a table name), and then to insert, update and delete rows, then a ‘schema-free’ RDBMS would be in place. In detail, this would mean:

  • Rows do not have to comply with a schema when inserted into a table. Different rows in the same table could have different attributes (columns), and the same attribute in different rows could have different domains (a flexible type system).
  • By defining a table without specifying columns (names and domains), a table would not define a primary key, either (neither a simple nor a composite key). Applications inserting or updating rows can behave nicely and add properties with values that comply with primary key semantics, but the RDBMS would not be aware of it and consequently would not enforce primary key compliance.
  • By the same token, foreign keys would not be enforced by the RDBMS for the same reasons.
  • Since no primary key enforcement is in place, duplicate rows will not be prevented by the RDBMS and any supervision is left to the application systems.
  • Indexes are independent of the schema specification. Assuming that indexes are maintained on tables, not all rows might be present in an index, since a row might not contain the attributes the index is defined on.

Surprisingly (or not), defining a no-schema-enforcing RDBMS is pretty straightforward.

Variation on No-Schema-Enforcing RDBMS

An interesting variation of a no-schema-enforcing RDBMS could be that a schema, primary keys, foreign keys, etc., are specified as usual, however without being actively enforced; instead, warnings are given by the RDBMS. For example, a row not complying with the schema can actually be inserted, but the result would not be an ‘OK’, but a warning indicating a schema violation.

This can be described as a ‘middle ground’ in widening the schema interpretation where the RDBMS is aware of a schema and warns of violations without rejecting the various DML operations.

Characterization of No-Schema-Enforcing RDBMS

Could a no-schema-enforcing RDBMS (in any of the variations) be a useful database management system? Yes, as it would be the equivalent, based on the relational model, of NoSQL databases based on the JSON/BSON model or the key-value model.

For use cases where the flexible schema interpretation is key, such a no-schema-enforcing RDBMS could fit the bill (possibly better than a NoSQL database system) if the use case is fundamentally relational in nature (as opposed to, e.g., hierarchical or key/value) and if SQL as the query language is important.

Further Exploration

There are additional areas in an RDBMS that will have to change their behavior in a no-schema-enforcing implementation. Briefly (and not exhaustively), these are:

  • Triggers. Triggers are specified on tables and on state changes of rows. If particular attributes are referenced inside the trigger, then not every insert, update or delete will execute the trigger logic, since the referenced attributes might not be present in a given row.
  • Stored procedures. Stored procedures often have parameters of specific types and assume a specific set of attributes when processing rows. In a no-schema-enforcing situation the stored procedure has to be able to deal with variations of rows.
  • Functions and function extensions. Functions have to be changed similarly to stored procedures. Not only from the viewpoint of parameters, but also the processing logic.
  • Aggregation. Aggregation will have to change in various ways, as aggregation functions cannot assume that a given attribute has the same type in all rows. Neither can they assume that the attribute is actually present in all rows of a table.

In principle, every concept and every implementation aspect of an RDBMS needs to be re-examined with respect to a wider and more flexible interpretation of ‘schema’. NoSQL systems, by their definition and approach, started with a wider interpretation and consequently have already made these conceptual and implementation decisions; they are one source of guidance in this regard.

Contact Me

If you plan to explore or to build a no-schema-enforcing RDBMS, please contact me.

What’s in a JSON Document Property Name?

What is a good choice of property names in JSON documents? And what is a good (semantic) modeling approach?

The Issue

Property names are not just arbitrary names; to a large extent they indicate to humans the semantics of what a document contains and represents. Choosing the right property name, and especially the right property value, is very important for several reasons, like understanding and processing. Two examples will highlight the issues and trigger the discussion.

BadSupplier Example

This is a possible way to describe a supplier:

{"name":"BadSupplier",
 "rating": 0.1,
 "bolts": [{"steel": 25},
           {"aluminum": 55}],
 "nuts": [{"steel": 24},
          {"aluminum": 56}]
}

This document describes a supplier selling two types of parts, bolts and nuts. Each type is available in two different materials, and the number of available parts per material is maintained. A supplier has a rating, too.

This is a perfectly valid document, but it potentially has problems. Before discussing those, let’s look at another example.

GoodSupplier Example

{"name":"GoodSupplier",
 "rating": 1.0,
 "parts": [{"type": "bolt", "material":"steel", "availability":25},
           {"type": "bolt", "material":"aluminum", "availability":55},
           {"type": "nut", "material":"steel", "availability":24},
           {"type": "nut", "material":"aluminum", "availability":56}]
}

This is also a perfectly valid document. One obvious difference is that describing GoodSupplier uses a lot more properties than describing BadSupplier, almost twice as many. But there are other differences that become clear when looking at how the documents are queried.

BadSupplier vs. GoodSupplier: Discussion

Upfront: in reality it is not only about choosing good property names, it is also about data modeling (or information modeling), namely how the data is structured according to criteria for good structuring and modeling. Of course, ‘good’ is in the eye of the beholder, so the discussion here reflects my viewpoint.

The best way to have this discussion, in my opinion, is to look at queries that might be interesting.

  • What type of parts does a supplier sell?
    • GoodSupplier: find all distinct values of ‘type’ from each document in the sub-collection ‘parts’.
    • BadSupplier: find all distinct property names from each document, except ‘name’ and ‘rating’.

This query highlights the fact that in the GoodSupplier example the product types are values, whereas in the BadSupplier example the product types are property names. If different suppliers supply different parts, the query for the GoodSupplier example stays the same, and it also works across all suppliers. The only assumption is that there is a sub-collection ‘parts’ and that each part has a ‘type’.
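As a sketch in the MongoDB shell (the collection name ‘suppliers’ is an assumption), the two retrieval styles look quite different:

// GoodSupplier modeling: part types are property values, one query suffices
db.suppliers.distinct("parts.type")

// BadSupplier modeling: part types are property names, so the keys of each
// document have to be inspected on the client, filtering out known non-part properties
db.suppliers.find().forEach(function (doc) {
    var partTypes = [];
    for (var key in doc) {
        if (key !== "_id" && key !== "name" && key !== "rating") {
            partTypes.push(key);
        }
    }
    print(doc.name + ": " + partTypes);
});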

In the BadSupplier example, each supplier might supply different parts, and so all property names have to be collected from each document (except ‘name’ and ‘rating’). However, it can very well be the case that different suppliers have additional properties (like, e.g., address or customers), which means that those have to be known upfront and filtered out document by document.

Dynamic schema changes can also be very difficult. In the GoodSupplier example, the only two assumptions are that there is a property ‘type’ and a sub-collection ‘parts’; other properties are not relevant for this query. In the BadSupplier example, any new property that is added to a supplier must be made known and categorized as a part type (or not) so that queries can exclude it.

A real problem occurs if the same property name represents a product type for one supplier, but not for another. In this case the property names representing part types would have to be managed per document.

  • How many parts are available for each type?
    • GoodSupplier: find all documents with the same ‘type’ and add up the value of the property ‘availability’.
    • BadSupplier: for all values of all properties (except ‘name’ and ‘rating’), sum up the values of the properties ‘steel’ and ‘aluminum’.

This query highlights that in the GoodSupplier example the availability for all types can be added up with the same query, no matter what the part types are. In the BadSupplier example, not only the part-type property names but also all materials have to be known in order to add up only the numbers that represent part availability.
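A sketch of this aggregation with the aggregation framework (again assuming a ‘suppliers’ collection holding GoodSupplier-style documents) needs no knowledge of the individual part types or materials:

db.suppliers.aggregate(
    { $unwind : "$parts" },
    { $group : {
        _id : "$parts.type",
        available : { $sum : "$parts.availability" }
    }}
);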

  • Find all suppliers that have more than 1000 parts on hand.
    • GoodSupplier: for each supplier, add up ‘availability’ for each part and if sum is greater than 1000, emit the supplier name.
    • BadSupplier: for each supplier, determine the property names that represent parts, and for each of those, find the properties that represent materials. Add up the availability and determine if the sum is greater than 1000.

This query, like the previous ones, requires a single query across all suppliers in the GoodSupplier example, whereas the BadSupplier example requires understanding the part property names for each supplier.

What’s Going on?

In one sentence, and in relational terms: in the BadSupplier example there is one column for each part and material, whereas in the GoodSupplier example there is one row for each part.
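A relational sketch of the two shapes (table and column names are purely illustrative) makes the difference visible:

-- BadSupplier style: one column per part/material combination
CREATE TABLE supplier_parts_bad (
  supplier_name VARCHAR2(128),
  bolt_steel    NUMBER,
  bolt_aluminum NUMBER,
  nut_steel     NUMBER,
  nut_aluminum  NUMBER);

-- GoodSupplier style: one row per part/material combination
CREATE TABLE supplier_parts_good (
  supplier_name VARCHAR2(128),
  part_type     VARCHAR2(64),
  material      VARCHAR2(64),
  availability  NUMBER);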

In NoSQL terms: in the BadSupplier example, the query must analyze the schema of each supplier document in order to determine the appropriate property names and their values. In the GoodSupplier example, the query only has to access property values to determine the query result, since a schema assumption is in place.

Why did I use the labels ‘Bad’ and ‘Good’? Because the two are not equal alternatives from my viewpoint. Here are additional reasons aside from the query formulation:

  • Extensibility. Adding part types does not affect the queries in the GoodSupplier example, nor in the BadSupplier example. Adding more non-part properties, however, will break the queries in the BadSupplier example. So dynamic schema changes are a problem in the BadSupplier case.
  • Query Language Support. Query languages (such as the one MongoDB provides) use property names as selectors and retrieve property values. Retrieving property names and conditionally excluding property names is not directly supported. So any operation that requires property name management must be processed outside the database, in the client (increasing the complexity substantially).

Summary

While there is no hard and fast rule on how to model document properties, following the idea of the relational model seems to better support dynamic schema changes and the existing query capabilities.

By ‘following the relational model’ I mean storing instance data as property values, not as property names (i.e., in rows, not in columns).

Null, Undefined, NaN and Missing Property: Goto Considered Harmful (Part 2)

Wait a minute: ‘undefined’ and ‘NaN’ are not legal JSON! Did Part 1 of this blog make a mistake?

MongoDB, JSON and Javascript

The MongoDB shell operates on Javascript data types, not JSON data types. Therefore, the following is legal and works (MongoDB 2.2.0):

> use blog
switched to db blog
> db.blogColl.save({"amount":25})
> db.blogColl.save({"amount":null})
> db.blogColl.save({"amount":undefined})
> db.blogColl.save({"amount":NaN})
> db.blogColl.save({"balance":33})
>

So, using Javascript constants like ‘null’, ‘undefined’ and ‘NaN’ is possible and legal in the context of the MongoDB shell.

MongoDB: Transformation of Input

As discussed elsewhere, MongoDB does transform data types and values at the point of insertion. When retrieving the above, the following result is displayed:

> db.blogColl.find()
{ "_id" : ObjectId("50955b2eeb749e2e4a36d779"), "amount" : 25 }
{ "_id" : ObjectId("50955b38eb749e2e4a36d77a"), "amount" : null }
{ "_id" : ObjectId("50955b3feb749e2e4a36d77b"), "amount" : null }
{ "_id" : ObjectId("50955b47eb749e2e4a36d77c"), "amount" : NaN }
{ "_id" : ObjectId("50955b4eeb749e2e4a36d77d"), "balance" : 33 }

MongoDB decided to honor the Javascript ‘null’ and ‘NaN’, but ‘undefined’ is changed to ‘null’. From a data modeling perspective this is significant, as there is no difference in meaning between ‘undefined’ and ‘null’ in the context of MongoDB. An application that assigns distinct interpretations to those two values will run into errors, because MongoDB does not preserve the distinction.

MongoDB: Aggregation Results

Coming back to Part 1, the question is: how would MongoDB’s default aggregation operators interpret the above data set? The following is an aggregation framework query:

db.blogColl.aggregate(
    { $group : {
        _id : 1,
        noDocs: { $sum : 1},
        totalamount: { $sum : "$amount" },
        averageamount: { $avg : "$amount" },
        minamount: { $min : "$amount"},
        maxamount: { $max : "$amount"},
        totalbalance: { $sum : "$balance"},
        averagebalance: { $avg : "$balance" },
        minbalance: { $min : "$balance"},
        maxbalance: { $max : "$balance"}
    }}
);

The ‘group’ pipeline step ensures that all documents are seen as one group for the purpose of aggregation. The aggregation operators are applied to all documents in that group; in this example, to all documents.

The documents have either ‘amount’ or ‘balance’ as properties and for both some aggregate values are computed. The outcome when executing the above is:

{
        "result" : [
                {
                        "_id" : 1,
                        "noDocs" : 5,
                        "totalamount" : NaN,
                        "averageamount" : NaN,
                        "maxamount" : 25,
                        "totalbalance" : 33,
                        "averagebalance" : 33,
                        "maxbalance" : 33
                }
        ],
        "ok" : 1
}

There are a few things to call out:

  • The minimum aggregation operator was not successful and the result was left out (the reason is unclear to me).
  • The maximum operator ignored ‘null’ and ‘NaN’ and produced the correct values.
  • The sum and average operators do not ignore ‘null’ and ‘NaN’; if a non-number is present, they indicate that by returning NaN.

Implications to Modeling Considerations

Part 1 of this blog highlighted the fact that there needs to be a common understanding of dynamic schemas as well as of the interpretation of property values, especially constants, and of (aggregation) operations on those.

Unless your interpretation precisely matches MongoDB’s interpretation, you will have to implement those operators and their interpretation outside MongoDB, or in a different way using the capabilities that MongoDB provides.

The above example shows that this is an important design discussion that must take place before system construction in order to avoid bigger problems down the road.