SQL for JSON and Schema Support (Part 3): Intermezzo 1 – MongoDB’s $jsonschema

MongoDB introduced support for JSON Schema through $jsonschema. Let’s explore this new functionality a bit in this blog.

$jsonschema

The functionality is introduced here: https://docs.mongodb.com/master/reference/operator/query/jsonSchema/#op._S_jsonSchema It states “$jsonSchema can be used in a document validator, which enforces that inserted or updated documents are valid against the schema.”

A first item to note is that this approach is supporting BSON types (http://bsonspec.org/, https://docs.mongodb.com/master/reference/operator/query/type/), not just JSON structures (https://www.json.org/), using a specific property “bsonType” that is not part of the JSON Schema standard (http://json-schema.org/).

A second observation is that the schema specification is inline with the collection creation and cannot refer to a separate JSON schema file or JSON object representing a JSON schema.

JSON Schema Validator Example

Let’s use the example of the first blog in this series, create a schema for it and use that as a constraint for the “orders” collection. Then documents are added to the collection (and there seem to be errors as well). For reference the version used is: MongoDB server version: 3.6.0.

> mongo
> use schema_exploration
> db.createCollection("orders", {
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "string"
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  }
 })
{ "ok" : 1 }

A quick note: “bsonType” can be used in all levels in order to refer to BSON types, not just on the top level.

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": [{
   "itemId": 55,
   "numberOrdered": 30
  }, {
   "itemId": 56,
   "numberOrdered": 31
  }]
 })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

Along the way I ran into a validation issue as I constraint the array elements to strings, rather than objects, as used in the example of the first blog in this series. So I made a schema definition mistake.

To note is that the response on the shell does not indicate what the problem was making debugging hard, especially when large and complex schemas are to be debugged.

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": ["a", "b"]
 })
WriteResult({
 "nInserted": 1
})

Once I realized the mistake I made, I inserted a document complying to the schema in order to make sure I identified the issue correctly.

JSON Schema Validator Update

Obviously, after defining a wrong schema, the correct schema should be used as validator.

This is the correct schema:

{
 "bsonType": "object",
 "required": ["orderId", "orderDate", "orderLineItems"],
 "properties": {
  "orderId": {
   "bsonType": "int",
   "description": "Order Identifier: must be of 
                  type int and is required"
  },
  "orderDate": {
   "bsonType": "date",
   "description": "Order Date: must be of 
                  type date and is required"
  },
  "orderLineItems": {
   "bsonType": "array",
   "items": {
    "bsonType": "object",
    "properties": {
     "itemId": {
      "bsonType": "int"
     },
     "numberOrdered": {
      "bsonType": "int"
     }
    }
   },
   "description": "Order Line Items: must be of 
                  type array and is required"
   }
  }
 }

And this is the command to update the validator:

> db.runCommand({
  "collMod": "orders",
  "validator": {
   "$jsonSchema": {
    "bsonType": "object",
    "required": ["orderId", "orderDate", "orderLineItems"],
    "properties": {
     "orderId": {
      "bsonType": "int",
      "description": "Order Identifier: must be of 
                     type int and is required"
     },
     "orderDate": {
      "bsonType": "date",
      "description": "Order Date: must be of 
                     type date and is required"
     },
     "orderLineItems": {
      "bsonType": "array",
      "items": {
       "bsonType": "object",
       "properties": {
        "itemId": {
         "bsonType": "int"
        },
        "numberOrdered": {
         "bsonType": "int"
        }
       }
      },
      "description": "Order Line Items: must be of 
                     type array and is required"
     }
    }
   }
  },
  "validationLevel": "strict"
 })
{ "ok" : 1 }

Some background on the command used is here: https://docs.mongodb.com/master/reference/command/collMod/.

Following is an attempt to add one more of the (now mismatching) documents:

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": ["a", "b"]
 })
WriteResult({
 "nInserted": 0,
 "writeError": {
  "code": 121,
  "errmsg": "Document failed validation"
 }
})

As it should be, the insert fails.

And here the insert of a now correct document:

> db.orders.insert({
  "orderId": NumberInt(1),
  "orderDate": new Date("2017-09-30"),
  "orderLineItems": [{
   "itemId": NumberInt(55),
   "numberOrdered": NumberInt(20)
  }, {
   "itemId": NumberInt(56),
   "numberOrdered": NumberInt(21)
  }]
 });
WriteResult({
 "nInserted": 1
})

Collection Inconsistency: Mismatch of Schema and Documents

There is an interesting issue appearing at this point. The new schema does not match all existing documents in the collection. Or the other way around: the collection now contains documents that do not match that schema.

> db.orders.find()
{
 "_id": ObjectId("5a2022c3fb460d15db9ec73e"),
 "orderId": 1,
 "orderDate": ISODate("2017-09-30T00:00:00Z"),
 "orderLineItems": ["a", "b"]
} {
 "_id": ObjectId("5a202322fb460d15db9ec741"),
 "orderId": 1,
 "orderDate": ISODate("2017-09-30T00:00:00Z"),
 "orderLineItems": [{
  "itemId": 55,
  "numberOrdered": 20
 }, {
  "itemId": 56,
  "numberOrdered": 21
 }]
}

MongoDB did not flag that there are documents in the collection that will not match the new schema (even though the validation level strict was used).

Adding a validation action with value of “error” does not change the situation, either.

Implication to Semantics

Given that the schema of a collection can be changed at any time, and given that MongoDB does not fail the schema update based on mismatching documents already in the collection, examining the schema is insufficient to understand the structure of the documents in a collection.

So a collection with a schema does not ensure that all documents in that collection are schema compliant. It rather insures that from the point in time the schema was added or updated documents will have to comply. Previous documents in the collections are not affected.

Summary

The notion of “schema” in context of MongoDB is very different from the notion of “schema” in context of relational database management systems. In MongoDB the documents in the collection do not have to comply to the schema; they only do have to comply at time of insertion.

There will be more exploration coming up in the next blog on this topic in order to further understand the semantics of “schema” in context of MongoDB.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Advertisement

SQL for JSON and Schema Support (Part 2): Where does the “Interesting” Code go?

The previous blog found that the “generic” indirect representation of JSON data is one way of supporting “schema-free” JSON objects or documents. Where does the “interesting” functional code live?

Indirect Representation

To recap, the indirect representation is a set of classes, functions, etc. (depending on programming language) that can manage JSON objects or JSON documents. All or most languages have libraries supporting JSON manipulation. For example, Jackson is such a library for Java.

These JSON libraries can manage any valid JSON structure, and they do not require a schema or the JSON objects being homogeneous. Two JSON objects representing the same concept like an order with different attributes (as shown in the previous blog) can be managed by such JSON libraries.

Structural Manipulation

Structural manipulation of JSON objects supports the addition, update or deletion of properties (members) as well as JSON array elements. Property values can be replaced, for example, a JSON string with a JSON object.

Through structural manipulation it is possible to change a JSON object as needed, when e.g. new details appear in form of additional properties.

Structural manipulation was demonstrated in a database context in the last blog: properties were added through the update statement. The same is possible in the indirect representation libraries in the various programming languages.

Computation

Structural manipulation is not the only code that is required as structural manipulation does not allow to compute any specific application semantics. For example, in context of orders, the total value of not yet shipped orders might be a value that needs to be computed.

In a database context this would be an aggregation query that sums up the amount of all orders that do not have the status of shipped.

In context of a programming language it would require a function that iterates through all orders and, like in the database aggregation approach, adds up the sum of those orders that have not shipped yet.

It probably would be implemented as a set of cooperating functions, like

DollarAmount getValueOfOrdersNotShipped(JSONArray orders)
boolean hasOrderShipped(JSONObject order)
DollarAmount getValueOfOrder(JSONObject order)

JSONArray as well as JSONObject are an example of an indirect representation holding order data as a JSON structure.

Note: of course, in the absence of a schema (which is assumed here), there is no assurance that the JSONArray or the JSONObject contain only orders or that the orders are homogeneous in structure. There has to be “trust” that this is indeed the case.

If validation is desired, and if no schema is available, then the only alternative is validating values in one or more JSON object properties. For example, order identifiers might be of a specific structure that uniquely identifies an identifier being an order identifier. This would require trust that the algorithms creating identifiers are correct.

Separation of Manipulation and Computation

The JSON libraries supporting the indirect representation are separate from the functional code (like the summing up of order values). The software architecture and design has to structure this separation and ideally ensures that all functions concerned with orders are “close” from a code structure or software architecture perspective.

There might be functions that can be reused across different concepts (like orders, returns, shipments, etc.), and they can be refactored out, of course, as in “normal” functional code.

Given the above rationalization, how does the absence of a schema come into the picture?

Implication of Schema Free JSON Objects

Since there is no schema, JSON objects can have a different structure even though they represent the same concepts. In context of orders,  let’s look at two use cases:

  • An order does not have a shipping status
  • An order does have a value but in a variety of data types

In a world without schema these are possible use cases and the functional code needs to check for those.

Addressing the first use case can be accomplished by checking for existence. Code can check if a property is present and react accordingly. In the above example, the code designer can choose to have hasOrderShipped() return false or throw an error in case there is no shipping status.

The second use case can be addressed by checking for the type of the value of the order. If possible, value transformations can be implemented in getValueOfOrder(), e.g., string to number; if it is not possible to transform, an error can be thrown.

Summary

In a schema free JSON context there are several aspects from a code perspective: functional code implementing application semantics is separate from the code that manages the structure of JSON objects. That separation must be carefully managed from an architectural perspective.

The functional code must anticipate non-homogeneous JSON objects and check for variation in order to be able to implement the functionality accurately.

But wait, there is more:-) The next blog will venture into more nuances.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON and Schema Support (Part 1): Preliminaries

Missing schema support and schema enforcement is touted as a good thing: is it really?

“Schema Free”, “Flexible Schema”, “Schema Per Document”

What is meant by a database supporting the JSON data structure without providing support and enforcement for schemas? Such a database is often characterized as “schema free”, or supporting a “flexible schema” or “schema per document”. What does it mean?

No matter how such a database supporting JSON it is labelled, it does not provide an interface to define, to manage or to enforce schema(s) for the data, aka JSON documents, it is managing (“enforcing” is used in the semantics a relational database enforces a schema). This means that a client (e.g., application code) can store JSON documents that have any form as long as those comply to the JSON (syntax) standard (and possibly proprietary extensions by the database system).

It furthermore means that JSON documents representing instances of the same concepts (like e.g. orders or games or employees) do not have to have the same structure. Those JSON documents can be different from each other, not only in values, but also in structure.

An example follows of a possible scenario (using MongoDB).

Example

The example stores initially two documents that have the same structure, and subsequently their structure diverges through updates. No schema enforcement prevents the changes.

use blog;
db.blogColl.insert({
  "orderId": 1,
  "orderDate": "9/30/2017",
  orderLineItems: [{
      "itemId": 55,
      "numberOrdered": 20
    },
    {
      "itemId": 56,
      "numberOrdered": 21
  }]
});
db.blogColl.insert({
  "orderId": 2,
  "orderDate": "9/30/2017",
  orderLineItems: [{
      "itemId": 55,
      "numberOrdered": 30
    }, 
    {
      "itemId": 56,
      "numberOrdered": 31
  }]
});
db.blogColl.update({
    "orderId": 1
  }, {
    "$set": {
    "specialInstructions": 
      "Drop of in front, not back of location"
  }
});
db.blogColl.update({
    "orderId": 2,
    "orderLineItems.itemId": 55
  }, {
    $set: {
      "orderLineItems.$.color": "transparent"
  }
});

The ability to store different JSON documents with different structures, even if they represent (instances of) the same concept, can be seen as a powerful feature. It allows modifying the data as needed to represent changing requirements or specific representation needs. Data migration is easier, too, as data can be changed in place.

This flexibility also has downsides and (engineering) cost that need to be considered and dealt with in a concrete implementation.

Application Implementation

Application code accessing a database has a full or partial representation of the data it queries (or in general manages) in the type system of the deployed programming language.

There are basically two choices an application (short for application code) has to represent data:

  • Direct representation
  • Indirect representation

In a direct representation the concept as stored in the database is defined as data structure in the programming language. For example, using Java as the programming language example, an order is represented as a Java class “Order”. This class has all the methods required to access the various elements of an order (that might be implemented as Java classes themselves). In this approach an order stored in the database, when queried, will be managed as an instance of the Java class Order in the application code. Methods support access or modification to the instance of order, and the methods are order semantics specific, like getOrderDate() or updateOrderLineItem() or totalNumberInidividualItems().

In the indirect representation, an order would be represented not as instance of a Java class that reflects the concept, but an instance of a “meta” Java class. This “meta” class is able to store all data from the database, not just orders. Such a class would have methods like createInstance(), setIdentifier(), setType(), addAttribute(), etc. A type would be “Order”, an attribute would be “lineItem”, etc.

Consequences Of Choice

Given the two representations discussed earlier (direct and indirect), the choice seems to be clear. While the direct representation can capture the semantics of a concept directly (aka, a Java class “Order” can implement order specific methods), this approach would not be able to easily (or at all) deal with changes in the database representation of orders. For example, if an additional attribute is added to the JSON document representing an order (as shown above), the Java class would not be able to change dynamically and capture it.

The indirect representation, however, would not have any problems representing order JSON objects with different structure as e.g. attributes can be dynamically added. This means that instances of the “meta” Java class can represent any JSON document as stored in the database.

These “meta” Java classes are actually already available in form of JSON processing libraries. Such libraries support the creation of JSON structure representations and they can represent any JSON object (or JSON array) that implements a correct JSON syntax.

Summary

At a first glance it looks like databases that support JSON without enforcing a schema at the same time are a good choice for ease of data management. And, using the indirect representation approach applications can deal with dynamically changing JSON objects or JSON objects of different structure representing the same concept.

However, as always, there are more details to discuss and additional aspects are going to be examined in the next blog.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

SQL for JSON Rationalization Part 19: Implementation Notes

What was the decision process wrt. SQL syntax and execution semantics in context of JSON SQL?

Syntax

There are only two choices when adding JSON support to SQL: (a) design a new syntax (read: grammar) from scratch, or (b) extend an existing grammar.

Many approaches opted for the alternative (a). For JSON SQL that has been discussed in this blog series, I opted for the alternative (b).

The basic approach was to take an existing SQL grammar, in this case SQL 92 (https://ronsavage.github.io/SQL/sql-92.bnf.html) and implement it in ANTLR4. In the course of this implementation the JSON specific support required was added.

This comprised of only a few types of extensions or additions to the syntax. The main modifications are (all were discussed in individual blogs in the blog series):

  • Functions like exists_path() and is_of_type()
  • Constants like JSON true, JSON false and JSON null
  • Paths in order to refer to any property at any level into a JSON object
  • Constructors for JSON array and JSON object (like {} and []) asides from numbers and strings

In addition, the projection syntax was extended to support { and } in order to indicate that JSON objects are to be returned, and not tables.

Execution Semantics

Implementing the execution semantics for JSON SQL is based on the relational execution semantics. A direct mapping was implemented that maps SQL queries with JSON elements to SQL queries without JSON support.

This approach was chosen since re-implementing the relational semantics is not really necessary. It only needs to be extended for the JSON specific support and semantics. Existing relational databases put quite a bit of work in optimizing execution, and so a direct mapping allowed to take advantage of all the work that has been done.

Furthermore, mapping a SQL query to a SQL query allows easy semantic description and debugging as the semantics stays on a declarative level without going into specific algorithms.

In addition, this approach proves that JSON SQL does not require a different SQL execution semantics, only an extension.

Summary

In summary, JSON support in context of SQL can be a pure extension of the SQL syntax (language) as well as a pure extension of the relational semantics. It is unnecessary to define a special variant of SQL; extending a given (standard) syntax is possible and sufficient. The same applies to the JSON execution semantics – it can be a pure extension of the relational execution semantics.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

SQL for JSON Rationalization Part 18: Set Operators, Sorting, Grouping and Subqueries

Further operators like set operators, sorting, grouping and sub-queries apply to JSON SQL as well. This blog discusses some of the additional operators.

Union, Difference (Except), Intersection (Intersect)

The set operators union (“UNION”), difference (“EXCEPT”), and intersection (“INTERSECT”) are supported by Relational SQL and in order for those to apply the inputs to the set operators have to have the correct schema. In context of JSON SQL there are no schema requirements or restrictions and the set operators operate on sets of JSON documents and implement the usual semantics.

Set operators rely on JSON document equality and as discussed earlier equality is recursively defined on the properties of JSON documents. Two JSON documents are equal if they have the same set of paths with each pair of paths (one from each document) leading to the same scalar values.

An example query is the following, combining all parts available in the US as well as Europe.

select {*}
from us_parts
union
select {*}
from eur_parts

Sorting

Sorting of result sets can be supported by JSON SQL as well. Paths can be defined in the order by part of a JSON SQL query and sorting takes place on the values the paths are referring to (“sorting paths”). In context of JSON documents that do not have to comply to a fixed schema a special interpretation is necessary for a few cases:

  • A property that is absent (aka, the path specified in the sorting section of the query does not exist in a document) cannot be sorted on. One possible semantics is that the absence of a value is the largest or lowest value possible and the document is sorted accordingly.

    A more recent SQL standard introduced the clause “NULLS FIRST” and “NULLS LAST” in order to define where SQL NULL is placed in a sorted result. The same could be followed here with e.g. “ABSENT FIRST” or “ABSENT LAST”.
  • Another case is type heterogeneity, meaning, the same path in different documents refers to different JSON types. In this case a possible strategy is to sort within each type, and then order the types based on a predefined order, like, null, true, false, string, number, object, array (arbitrary, but fixed order).

    Following the same idea of “NULLS FIRST” and “NULLS LAST”, a clause could be added the defines the type order, like “TYPE ORDER JSON_NULL, JSON_TRUE, JSON_FALSE, JSON_STRING, JSON_NUMBER, JSON_OBJECT, JSON_ARRAY”.

Unless the sorting paths of all documents in a result set comply to the same schema a total order cannot solely established based on values, but required additional rules like those outlined above in the bullet list.

The following example sorts by shipper rating.

select {*}
from shipper sh
order by sh.rating desc 
         absent last 
         type order json_null, json_true, json_false, 
                    json_string, json_number, 
                    json_object, json_array

Grouping and Having

Grouping of result documents can be implemented in JSON SQL as in Relational SQL with the usual aggregation functions. The having construct can be applied as well to select from the groups. Grouping is defined by paths into the JSON documents and the same discussion wrt. missing values or type heterogeneity applies as in the sorting discussion.

The following lists all states and shipper rating averages where the shippers have an average rating about a certain threshold.

select {sh.state, avg(sh.rating)}
from shipper sh
group by sh.state
having avg(sh.rating) > 5

Subqueries

JSON SQL can support sub-queries like Relational SQL does. In principle, a JSON SQL query can return results as object as well as relations. In context of a sub-query results are only returned in form of JSON documents.

The following example lists shippers in states that has been shipped to in the past.

select {*}
from shipper
where shipper.state in
  (select s.state
   from states s
   where s.shipped_to = true)

Summary

This brief discussion selected a few additional Relational SQL operators and has shown how they can be interpreted in context of JSON SQL. A this point I am confident that the complete Relational SQL semantics can be extended for the JSON types without restriction and with possible semantic interpretation extension due to the possible absence of a fixed schema.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

SQL for JSON Rationalization Part 17: Cartesian Product with Restriction (Join) (Again!)

There is a lot more to be said about joins in context of JSON SQL beyond the introduction in the previous blog.

“Join Homogeneous” Schema

The previous blog’s sample data set was homogeneous in the sense that all paths used in join criteria had a value in all documents. There was never the case that a path did not have a value. This is analogous to Relational SQL where columns used in joins always have values by virtue of the existence of a schema.

Let’s explore “join heterogeneity” in this blog. As usual, the sample data set is introduced first.

Sample Data Set

select {*} from foo

results in

{"a":{"b":5},"n":null,"x":{"y":"foobar"}}
{"a":{"b":10},"n":false}

and

select {*} from bar

results in

{"a":{"b":5},"n":true,"x":{"y":"foobar"}}
{"a":{"b":11},"n":null,"x":"missing"}

Homogeneous Join

The following join is homogeneous as the paths involved in the join criteria all have a value.

select {*} 
from  foo as f, 
      bar as b 
where f.a = b.a

Results in

{"b":{"a":{"b":5},"n":true,"x":{"y":"foobar"}},
 "f":{"a":{"b":5},"n":null,"x":{"y":"foobar"}}}

Null vs. Absent Value

In the JSON standard JSON null is a value. Compared to Relational SQL, JSON null does not express “unknown”. The equivalent to Relational SQL NULL is the absence of the value in JSON SQL. Therefore, a join where the paths involved in a join criteria have the value JSON null are homogeneous joins.

select {*} 
from  foo as f, 
      bar as b 
where f.n = b.n

results in

{"b":{"a":{"b":11},"n":null,"x":"missing"},
 "f":{"a":{"b":5},"n":null,"x":{"y":"foobar"}}}

Heterogeneous Join

A heterogeneous join in context of JSON SQL has paths in the join criteria that do not exist in at least one document, aka, do not refer to values in this case.

For example, the path x.y does not refer to a value in all documents of the example data set.

The semantics is that if a document does not have a value at the path of the join criteria the document does not participate in the Cartesian product, and therefore does not provide a document to the result set.

select {*} 
from  foo as f, 
      bar as b 
where f.x.y = b.x.y

results in

{"b":{"a":{"b":5},"n":true,"x":{"y":"foobar"}},
 "f":{"a":{"b":5},"n":null,"x":{"y":"foobar"}}}

Check for Missing Values

JSON SQL provides a predicate that supports checking the presence (or absence) of values. This predicate can be used to check if a join is going to be a homogeneous join or a heterogeneous join.

select {*} 
from  foo 
where not exists_path x.y

results in

{"a":{"b":10},"n":false}

and

select {*} 
from  bar 
where not exists_path x.y

results in

{"a":{"b":11},"n":null,"x":"missing"}

These queries show that the previous query is a heterogeneous join as not all documents contain the join paths.

In the absence of schema support for JSON this allows to check for homogeneity in context of joins, like a dynamic schema check for a very specific purpose. During software development it can be determined if it is important to have a homogeneous join or if a heterogeneous join is sufficient. Depending on the requirement and outcome of the query checking for path existing appropriate error handling can take place.

Summary

JSON SQL supports homogeneous as well as heterogeneous joins without any extra syntax or special execution semantics. Furthermore, with the predicate for checking existence the developer is given a tool to determine if a join is going to be homogeneous or heterogeneous.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON Rationalization Part 16: Cartesian Product with Restriction (Join)

Restrictions can be added to a Cartesian Product and this is briefly discussed in this blog. It demonstrates the power of joins in context of JSON documents.

Example Data Set

As always, the sample data sets that are being used for queries in this blog are introduced first.

select {*} from jer

results in

{"a":1,"b":20,"c":true,"d":{"x":"y"}}
{"a":2,"b":21,"c":true,"d":{"x":[null,5]}}

and

select {*} from tom

results in

{"a":3,"b":20,"c":false,"d":{"x":"y"}}
{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}

This data set is used in the following to introduce restrictions in context of a Cartesian Product.

Join and Join Criteria

The following demonstrates a join where the join criteria (restriction) is based on a scalar type.

select {*} 
from   jer as j, tom as t 
where  j.b = t.b

results in

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

Remember, that the result documents from each of the collections are disambiguated by adding the root property “j” and “t” (aka, the correlation specifications).

A join can be empty if the join criteria do not derive to a result, as shown in the following.

select {*} 
from   jer as j, tom as t 
where  j.a = t.a

does not return a result.

Projection can be applied as well.

select {t.b} 
from   jer as j, tom as t 
where  j.b = t.b

results in

{"t":{"b":20}}
{"t":{"b":21}}

Using an AS clause in the projection allows to reshape the result.

select {t.b as tb} 
from   jer as j, tom as t 
where  j.b = t.b

results in

{"tb":20}
{"tb":21}

Join criteria can be defined not only on top level scalar properties, but on any JSON structure on any level. The following two queries illustrate this.

select {*} 
from   jer as j, tom as t 
where  j.d.x.[1] = t.d.x.q

results in

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

and

select {*} 
from   jer as j, tom as t 
where  j.d = t.d

results in

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

Of course, equality is not the only possible operator for join criteria.

select {*} 
from   jer as j, tom as t 
where  j.a < t.a

results in

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

and so does

select {*} 
from   jer as j, tom as t 
where  j.a <> t.a

Cartesian Product with Restriction

Cartesian products can be restricted with non-join criteria.

select {*} 
from   jer as j, tom as t 
where  j.c = true 
       or t.c = false

results in

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

Join with Join and Non-Join Criteria

And a mix of join and non-join criteria is possible as well.

select {*} 
from   jer as j, tom as t 
where  j.d = t.d 
       and j.b = t.b 
       and (j.c = true or t.c = false)

results in

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

Summary

Joins are a powerful feature of JSON SQL as demonstrated in this blog as it supports the combination of documents in different collections without having to foresee their combination when deciding on the document structures. Joins combine the power of JSON documents with the power of value-based correlation of documents.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON Rationalization Part 15: Cartesian Product and Projection

In part 14 of this blog series Cartesian Product queries were discussed that did have an Asterisk projection; this blog discusses specific paths as projection (non-Asterisk).

Example Data Set

As always, the sample data sets that are being used for queries in this blog are introduced first.

select {*} from ying

results in

{"a":3,"c":20}
{"a":4,"c":21}

and

select {*} from yang

results in

{"a":1,"b":10}
{"a":2,"b":11}

Projection

To recap, JSON SQL supports JSON projection as well as relational projection. JSON projection is specified by enclosing paths within a set of curly brackets: {}. This will cause the query result represented as JSON objects.

For example, the following query returns JSON objects.

select {a, b} from yang

results in

{"a":1,"b":10}
{"a":2,"b":11}

JSON SQL returns relational results when the set of curly brackets is omitted; the following query returns the result as table.

select a, b from yang

results in

|a                        |b                        |
+-------------------------+-------------------------+
|1                        |10                       |
|2                        |11                       |

Projection without AS in Joins

The following is a projection of a join resulting in JSON objects.

select {yi.a, ya.b} from ying as yi, yang as ya

results in

{"ya":{"b":10},"yi":{"a":3}}
{"ya":{"b":11},"yi":{"a":3}}
{"ya":{"b":10},"yi":{"a":4}}
{"ya":{"b":11},"yi":{"a":4}}

The same query with results represented as relation is specified as follows.

select yi.a, ya.b from ying as yi, yang as ya

results in

|yi_a                     |ya_b                     |
+-------------------------+-------------------------+
|3                        |10                       |
|3                        |11                       |
|4                        |10                       |
|4                        |11                       |

Observe that the results include the table correlation specifiers “yi” or “ya”. This is necessary since different collections might have documents with the same paths. The following query highlights this case.

select {yi.a, ya.a} from ying as yi, yang as ya

results in

{"ya":{"a":1},"yi":{"a":3}}
{"ya":{"a":2},"yi":{"a":3}}
{"ya":{"a":1},"yi":{"a":4}}
{"ya":{"a":2},"yi":{"a":4}}

This automatic result qualification using correlation specifications ensures that path duplicates are automatically resolved in the results.

Projection with AS in Joins

In many cases the automatic duplicate resolution is sufficient for clients. However, in some cases this is not desired. In those cases the AS clause allows the placement of result values into any place of JSON documents using the AS clause. In the relational result case the columns can be named as desired.

select {yi.a as b, ya.a as c} from ying as yi, yang as ya

results in

{"b":3,"c":1}
{"b":3,"c":2}
{"b":4,"c":1}
{"b":4,"c":2}

The above shows a simple renaming of the paths.

select {yi.a as x.b, ya.a as y.[0]} from ying as yi, yang as ya

results in

{"x":{"b":3},"y":[1]}
{"x":{"b":3},"y":[2]}
{"x":{"b":4},"y":[1]}
{"x":{"b":4},"y":[2]}

This query shows a more complex result object creation and goes beyond simple renaming of paths.

The following query shows how specific column names are specified.

select yi.a as x, ya.a as y from ying as yi, yang as ya

results in

|x                        |y                        |
+-------------------------+-------------------------+
|3                        |1                        |
|3                        |2                        |
|4                        |1                        |
|4                        |2                        |

Summary

In summary, defining projection in context of SQL JSON joins is straightforward and supports flexible renaming of columns in context of relational results as well as expressive result value positioning as paths in JSON object results.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Oracle 12c Release 2: A Complete JSON Database

Today’s blog focuses on Oracle 12c Release 2. With this release of the database additional new interesting functionality was introduced: sharding and analytics support for JSON. Very interesting in context of JSON processing – future blogs will continue the JSON SQL discussion.

Oracle as a JSON Database

With Oracle 12c Release 2’s functional additions Oracle 12c is providing all major functional areas of functionality in context of JSON processing. Non-functional areas (backup, restore, replication, HA/DR support,  etc.) are supported as well, and there is no need to discuss those here.

The major areas of functionality in this context are

  • JSON OLTP Processing. Oracle 12c supports general OLTP (online transaction processing) functionality. JSON documents are stored in tables and accessed through SQL (full DML support).
  • JSON Analytics. Oracle 12c supports JSON analytics processing in the Oracle In-Memory option providing columnar representation and columnar processing.
  • JSON Sharding. Oracle 12c supports managed sharding of data in tables of independently running databases (shards); up to 1000 shards currently.

Details of JSON OLTP, Analytics and Sharding

The following presentation contains a first level of details for the three areas above. I gave it during an Oracle Code event (https://go.oracle.com/oraclecode) in New York (https://developer.oracle.com/code/newyork) earlier this month. References to more detailed Oracle 12c documentation are provided in the presentation itself.

The presentation can be downloaded from here: Oracle Code NYC Presentation.

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON Rationalization Part 14: Cartesian Product of Asterisk Queries

After introducing projection and restriction in JSON SQL, the next blogs will discuss Cartesian product initially, and joins down the road.

Example Data Sets

Three collections form the example data set for this blog. The collections are cp_one, cp_two and cp_three.

cp_one:

{"a":"a-value","b":"b-value"}

cp_two:

{"a":{"x":true},"c":{"y":false}}
{"a":{"x":null}}

cp_three:

{"d":[],"e":[]}
{"f":[true],"g":[false]}
{"h":[null],"i":[null]}

Recap: JSON Query Result Representation

To recap, a result of a JSON SQL query can be represented in two different forms. One form is a set of JSON documents, and the other form is a set of rows in a relational table.

For example, the JSON SQL query

select {*} from cp_two

results in

{"a":{"x":true},"c":{"y":false}}
{"a":{"x":null}}

The projection syntax indicates through the use of “{” and “}” that the result of the query is to be represented as JSON documents.

In contrast, the JSON SQL query

select * from cp_two

results in

|a_x  |a          |c_y   |c           |
+-----+-----------+------+------------+
|true |{"x":true} |false |{"y":false} |
|null |{"x":null} |<>    |<>          |

Omitting the “{” and “}” in the projection indicates that the result should be in relational table form. Note that the result in a relational table contains a column for each path found in any of the result JSON documents.

Cartesian Product and Join

In short, the Cartesian product is the cross product of the documents of the collections named in the JSON SQL query. A JSON SQL query can reference two or more collections and the result is the cross product of all documents in all referenced collections.

Since the result of a JSON SQL query is a set of JSON documents, the result of a Cartesian product query must be a set of single JSON documents as well. Each JSON document, however, will be a combination of the JSON documents as produced by the Cartesian product.

For example, the Cartesian product of cp_one and cp_two is:

{"a":"a-value","b":"b-value"} {"a":{"x":true},"c":{"y":false}}
{"a":"a-value","b":"b-value"} {"a":{"x":null}}

The result of the query will be two JSON documents, each is the combination of the document pairs just shown.

A join is a Cartesian product with an applied restriction. The restriction can be simple or complex, depending on the client’s requirements. Joins are not the focus in the next few blogs, but will be front and center down the road.

The definition of the Cartesian product (or join for that matter) is fundamentally not different from the relational equivalent. Instead of creating the cross product of rows, the cross product of JSON documents is computed.

Asterisk Cartesian Product Query

For the benefit of the discussion this blog only discusses Asterisk Cartesian product queries that reference two or more collections. An example query is

select {*} from cp_one as one, cp_two as two

Inherent with Asterisk as projection is the possible duplication of paths in the combined JSON documents. For example, the collections cp_one and cp_two both contain documents with a path “a”.

A duplication of paths is not necessarily the case; if the documents of the referenced collections do not have paths in common, there will be no duplication. However, a duplication of paths is possible in general and this possibility needs to be addressed.

The approach to remove duplicate paths is called duplicate path resolution.

Duplicate Paths in Query Results

JSON document combinations that are the result of a Cartesian product might be disjoint in paths, or might have common partial or full paths. If the paths are not disjoint, the combination of the JSON documents might contain the same path twice.

The Cartesian product of cp_one and cp_two is (as shown above):

{"a":"a-value","b":"b-value"} {"a":{"x":true},"c":{"y":false}}
{"a":"a-value","b":"b-value"} {"a":{"x":null}}

Ignoring duplicate paths, the result of the equivalent JSON SQL query could be composed like this:

{"a":"a-value","b":"b-value","a":{"x":true},"c":{"y":false}}
{"a":"a-value","b":"b-value","a":{"x":null}}

This would simply be the combination of all properties into one JSON document. While the JSON standard does not prohibit duplicate properties in JSON documents, many implementations (e.g., languages, libraries, or storage systems) do not support maintaining duplicate properties consistently. Therefore, to be on the safe side, avoiding duplicate paths is prudent.

For reference, relational systems append e.g. “_1” or “_2” to duplicate column names in order to avoid duplication. However, JSON SQL takes a different approach in order to provide symmetry for the JSON result and the relational result case as both cases have to be addressed.

Automatic Duplicate Path Resolution

Since no schema is in place for any of the involved documents or collections, it is impossible to determine based on a schema if duplicate paths will exist (or not). This means that it is always assumed that there could be duplicate paths.

In order to consistently avoid duplicate paths, several steps are taken.

The first step is requiring a correlation specification for each collection referenced in a JSON SQL query referring to more than one collection. For example,

select * from cp_one as one, cp_two as two

specifies “one” as correlation specification for cp_one, and “two” for cp_two.

The second step is that the results are qualified by the correlation specification. For the result as JSON documents the documents from the Cartesian product become sub-documents where correlation specifications are the top level path.

For example,

select {*} from cp_one as one, cp_two as two

results in

{"one":{"a":"a-value","b":"b-value"},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "two":{"a":{"x":null}}}

As the results show, single documents are returned and are the combination of the corresponding documents coming from the Cartesian product. The correlation specifications are being used as top level paths and so the origin collections of the results become apparent.

The representation of the result in a relational table is analogous: the column names are prepended with the corresponding correlation specifications followed by an underscore “_”.

For example,

select * from cp_one as one, cp_two as two

results in

|one_a     |one_b     |two_a_x |two_a      |two_c_y |two_c       |
+----------+----------+--------+-----------+--------+------------+
|"a-value" |"b-value" |true    |{"x":true} |false   |{"y":false} |
|"a-value" |"b-value" |null    |{"x":null} |<>      |<>          |

Using the correlation specification as top level properties in the JSON document result format or as prefixes in the relational table result format achieves symmetry in avoiding duplication paths.

Role of Correlation Specification

Summarizing, the approach of mandatory correlation specifications combined with their use as prefix or top level properties achieves robust duplicate path resolution that is independent of the specific collections or the documents involved.

  • Since it is unknown if there will be a duplication of paths, queries representing a Cartesian product must have a correlation specification for each of the referenced tables
  • Every projection path is prefixed even if there is no duplication (since without schema it is not possible to know if there is going to be path duplication or not)
  • Selective use of a prefix is impossible due to a path element possibly being equivalent to a correlation specification

Example Cartesian Product Asterisk Queries

Some additional example queries are shown next.

select {*} from cp_two as two, cp_three as three

results in

{"three":{"d":[],"e":[]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"three":{"f":[true],"g":[false]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"three":{"h":[null],"i":[null]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"three":{"d":[],"e":[]},
 "two":{"a":{"x":null}}}
{"three":{"f":[true],"g":[false]},
 "two":{"a":{"x":null}}}
{"three":{"h":[null],"i":[null]},
 "two":{"a":{"x":null}}}

The query

select {*} from cp_one as one, cp_two as two, cp_three as three

results in

{"one":{"a":"a-value","b":"b-value"},
 "three":{"d":[],"e":[]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"f":[true],"g":[false]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"h":[null],"i":[null]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"d":[],"e":[]},
 "two":{"a":{"x":null}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"f":[true],"g":[false]},
 "two":{"a":{"x":null}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"h":[null],"i":[null]},
 "two":{"a":{"x":null}}}

Summary

Supporting Cartesian Product in JSON SQL is straight forward and follows the same approach and semantics as in Relational SQL. Aspects like duplicate paths are equivalent to duplicate columns – the same strategies for duplication resolution can be applied.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.