SQL for JSON Rationalization Part 14: Cartesian Product of Asterisk Queries

After introducing projection and restriction in JSON SQL, the next blogs will discuss Cartesian product initially, and joins down the road.

Example Data Sets

Three collections form the example data set for this blog. The collections are cp_one, cp_two and cp_three.

cp_one:

{"a":"a-value","b":"b-value"}

cp_two:

{"a":{"x":true},"c":{"y":false}}
{"a":{"x":null}}

cp_three:

{"d":[],"e":[]}
{"f":[true],"g":[false]}
{"h":[null],"i":[null]}

Recap: JSON Query Result Representation

To recap, a result of a JSON SQL query can be represented in two different forms. One form is a set of JSON documents, and the other form is a set of rows in a relational table.

For example, the JSON SQL query

select {*} from cp_two

results in

{"a":{"x":true},"c":{"y":false}}
{"a":{"x":null}}

The projection syntax indicates through the use of “{” and “}” that the result of the query is to be represented as JSON documents.

In contrast, the JSON SQL query

select * from cp_two

results in

|a_x  |a          |c_y   |c           |
+-----+-----------+------+------------+
|true |{"x":true} |false |{"y":false} |
|null |{"x":null} |<>    |<>          |

Omitting the “{” and “}” in the projection indicates that the result should be in relational table form. Note that the result in a relational table contains a column for each path found in any of the result JSON documents.

Cartesian Product and Join

In short, the Cartesian product is the cross product of the documents of the collections named in the JSON SQL query. A JSON SQL query can reference two or more collections and the result is the cross product of all documents in all referenced collections.

Since the result of a JSON SQL query is a set of JSON documents, the result of a Cartesian product query must be a set of single JSON documents as well. Each JSON document, however, will be a combination of the JSON documents as produced by the Cartesian product.

For example, the Cartesian product of cp_one and cp_two is:

{"a":"a-value","b":"b-value"} {"a":{"x":true},"c":{"y":false}}
{"a":"a-value","b":"b-value"} {"a":{"x":null}}

The result of the query will be two JSON documents, each is the combination of the document pairs just shown.

A join is a Cartesian product with an applied restriction. The restriction can be simple or complex, depending on the client’s requirements. Joins are not the focus in the next few blogs, but will be front and center down the road.

The definition of the Cartesian product (or join for that matter) is fundamentally not different from the relational equivalent. Instead of creating the cross product of rows, the cross product of JSON documents is computed.

Asterisk Cartesian Product Query

For the benefit of the discussion this blog only discusses Asterisk Cartesian product queries that reference two or more collections. An example query is

select {*} from cp_one as one, cp_two as two

Inherent with Asterisk as projection is the possible duplication of paths in the combined JSON documents. For example, the collections cp_one and cp_two both contain documents with a path “a”.

A duplication of paths is not necessarily the case; if the documents of the referenced collections do not have paths in common, there will be no duplication. However, a duplication of paths is possible in general and this possibility needs to be addressed.

The approach to remove duplicate paths is called duplicate path resolution.

Duplicate Paths in Query Results

JSON document combinations that are the result of a Cartesian product might be disjoint in paths, or might have common partial or full paths. If the paths are not disjoint, the combination of the JSON documents might contain the same path twice.

The Cartesian product of cp_one and cp_two is (as shown above):

{"a":"a-value","b":"b-value"} {"a":{"x":true},"c":{"y":false}}
{"a":"a-value","b":"b-value"} {"a":{"x":null}}

Ignoring duplicate paths, the result of the equivalent JSON SQL query could be composed like this:

{"a":"a-value","b":"b-value","a":{"x":true},"c":{"y":false}}
{"a":"a-value","b":"b-value","a":{"x":null}}

This would simply be the combination of all properties into one JSON document. While the JSON standard does not prohibit duplicate properties in JSON documents, many implementations (e.g., languages, libraries, or storage systems) do not support maintaining duplicate properties consistently. Therefore, to be on the safe side, avoiding duplicate paths is prudent.

For reference, relational systems append e.g. “_1” or “_2” to duplicate column names in order to avoid duplication. However, JSON SQL takes a different approach in order to provide symmetry for the JSON result and the relational result case as both cases have to be addressed.

Automatic Duplicate Path Resolution

Since no schema is in place for any of the involved documents or collections, it is impossible to determine based on a schema if duplicate paths will exist (or not). This means that it is always assumed that there could be duplicate paths.

In order to consistently avoid duplicate paths, several steps are taken.

The first step is requiring a correlation specification for each collection referenced in a JSON SQL query referring to more than one collection. For example,

select * from cp_one as one, cp_two as two

specifies “one” as correlation specification for cp_one, and “two” for cp_two.

The second step is that the results are qualified by the correlation specification. For the result as JSON documents the documents from the Cartesian product become sub-documents where correlation specifications are the top level path.

For example,

select {*} from cp_one as one, cp_two as two

results in

{"one":{"a":"a-value","b":"b-value"},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "two":{"a":{"x":null}}}

As the results show, single documents are returned and are the combination of the corresponding documents coming from the Cartesian product. The correlation specifications are being used as top level paths and so the origin collections of the results become apparent.

The representation of the result in a relational table is analogous: the column names are prepended with the corresponding correlation specifications followed by an underscore “_”.

For example,

select * from cp_one as one, cp_two as two

results in

|one_a     |one_b     |two_a_x |two_a      |two_c_y |two_c       |
+----------+----------+--------+-----------+--------+------------+
|"a-value" |"b-value" |true    |{"x":true} |false   |{"y":false} |
|"a-value" |"b-value" |null    |{"x":null} |<>      |<>          |

Using the correlation specification as top level properties in the JSON document result format or as prefixes in the relational table result format achieves symmetry in avoiding duplication paths.

Role of Correlation Specification

Summarizing, the approach of mandatory correlation specifications combined with their use as prefix or top level properties achieves robust duplicate path resolution that is independent of the specific collections or the documents involved.

  • Since it is unknown if there will be a duplication of paths, queries representing a Cartesian product must have a correlation specification for each of the referenced tables
  • Every projection path is prefixed even if there is no duplication (since without schema it is not possible to know if there is going to be path duplication or not)
  • Selective use of a prefix is impossible due to a path element possibly being equivalent to a correlation specification

Example Cartesian Product Asterisk Queries

Some additional example queries are shown next.

select {*} from cp_two as two, cp_three as three

results in

{"three":{"d":[],"e":[]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"three":{"f":[true],"g":[false]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"three":{"h":[null],"i":[null]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"three":{"d":[],"e":[]},
 "two":{"a":{"x":null}}}
{"three":{"f":[true],"g":[false]},
 "two":{"a":{"x":null}}}
{"three":{"h":[null],"i":[null]},
 "two":{"a":{"x":null}}}

The query

select {*} from cp_one as one, cp_two as two, cp_three as three

results in

{"one":{"a":"a-value","b":"b-value"},
 "three":{"d":[],"e":[]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"f":[true],"g":[false]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"h":[null],"i":[null]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"d":[],"e":[]},
 "two":{"a":{"x":null}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"f":[true],"g":[false]},
 "two":{"a":{"x":null}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"h":[null],"i":[null]},
 "two":{"a":{"x":null}}}

Summary

Supporting Cartesian Product in JSON SQL is straight forward and follows the same approach and semantics as in Relational SQL. Aspects like duplicate paths are equivalent to duplicate columns – the same strategies for duplication resolution can be applied.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.