SQL for JSON Rationalization Part 19: Implementation Notes

What was the decision process wrt. SQL syntax and execution semantics in context of JSON SQL?

Syntax

There are only two choices when adding JSON support to SQL: (a) design a new syntax (read: grammar) from scratch, or (b) extend an existing grammar.

Many approaches opted for the alternative (a). For JSON SQL that has been discussed in this blog series, I opted for the alternative (b).

The basic approach was to take an existing SQL grammar, in this case SQL 92 (https://ronsavage.github.io/SQL/sql-92.bnf.html) and implement it in ANTLR4. In the course of this implementation the JSON specific support required was added.

This comprised of only a few types of extensions or additions to the syntax. The main modifications are (all were discussed in individual blogs in the blog series):

  • Functions like exists_path() and is_of_type()
  • Constants like JSON true, JSON false and JSON null
  • Paths in order to refer to any property at any level into a JSON object
  • Constructors for JSON array and JSON object (like {} and []) asides from numbers and strings

In addition, the projection syntax was extended to support { and } in order to indicate that JSON objects are to be returned, and not tables.

Execution Semantics

Implementing the execution semantics for JSON SQL is based on the relational execution semantics. A direct mapping was implemented that maps SQL queries with JSON elements to SQL queries without JSON support.

This approach was chosen since re-implementing the relational semantics is not really necessary. It only needs to be extended for the JSON specific support and semantics. Existing relational databases put quite a bit of work in optimizing execution, and so a direct mapping allowed to take advantage of all the work that has been done.

Furthermore, mapping a SQL query to a SQL query allows easy semantic description and debugging as the semantics stays on a declarative level without going into specific algorithms.

In addition, this approach proves that JSON SQL does not require a different SQL execution semantics, only an extension.

Summary

In summary, JSON support in context of SQL can be a pure extension of the SQL syntax (language) as well as a pure extension of the relational semantics. It is unnecessary to define a special variant of SQL; extending a given (standard) syntax is possible and sufficient. The same applies to the JSON execution semantics – it can be a pure extension of the relational execution semantics.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

SQL for JSON Rationalization Part 18: Set Operators, Sorting, Grouping and Subqueries

Further operators like set operators, sorting, grouping and sub-queries apply to JSON SQL as well. This blog discusses some of the additional operators.

Union, Difference (Except), Intersection (Intersect)

The set operators union (“UNION”), difference (“EXCEPT”), and intersection (“INTERSECT”) are supported by Relational SQL and in order for those to apply the inputs to the set operators have to have the correct schema. In context of JSON SQL there are no schema requirements or restrictions and the set operators operate on sets of JSON documents and implement the usual semantics.

Set operators rely on JSON document equality and as discussed earlier equality is recursively defined on the properties of JSON documents. Two JSON documents are equal if they have the same set of paths with each pair of paths (one from each document) leading to the same scalar values.

An example query is the following, combining all parts available in the US as well as Europe.

select {*}
from us_parts
union
select {*}
from eur_parts

Sorting

Sorting of result sets can be supported by JSON SQL as well. Paths can be defined in the order by part of a JSON SQL query and sorting takes place on the values the paths are referring to (“sorting paths”). In context of JSON documents that do not have to comply to a fixed schema a special interpretation is necessary for a few cases:

  • A property that is absent (aka, the path specified in the sorting section of the query does not exist in a document) cannot be sorted on. One possible semantics is that the absence of a value is the largest or lowest value possible and the document is sorted accordingly.

    A more recent SQL standard introduced the clause “NULLS FIRST” and “NULLS LAST” in order to define where SQL NULL is placed in a sorted result. The same could be followed here with e.g. “ABSENT FIRST” or “ABSENT LAST”.
  • Another case is type heterogeneity, meaning, the same path in different documents refers to different JSON types. In this case a possible strategy is to sort within each type, and then order the types based on a predefined order, like, null, true, false, string, number, object, array (arbitrary, but fixed order).

    Following the same idea of “NULLS FIRST” and “NULLS LAST”, a clause could be added the defines the type order, like “TYPE ORDER JSON_NULL, JSON_TRUE, JSON_FALSE, JSON_STRING, JSON_NUMBER, JSON_OBJECT, JSON_ARRAY”.

Unless the sorting paths of all documents in a result set comply to the same schema a total order cannot solely established based on values, but required additional rules like those outlined above in the bullet list.

The following example sorts by shipper rating.

select {*}
from shipper sh
order by sh.rating desc 
         absent last 
         type order json_null, json_true, json_false, 
                    json_string, json_number, 
                    json_object, json_array

Grouping and Having

Grouping of result documents can be implemented in JSON SQL as in Relational SQL with the usual aggregation functions. The having construct can be applied as well to select from the groups. Grouping is defined by paths into the JSON documents and the same discussion wrt. missing values or type heterogeneity applies as in the sorting discussion.

The following lists all states and shipper rating averages where the shippers have an average rating about a certain threshold.

select {sh.state, avg(sh.rating)}
from shipper sh
group by sh.state
having avg(sh.rating) > 5

Subqueries

JSON SQL can support sub-queries like Relational SQL does. In principle, a JSON SQL query can return results as object as well as relations. In context of a sub-query results are only returned in form of JSON documents.

The following example lists shippers in states that has been shipped to in the past.

select {*}
from shipper
where shipper.state in
  (select s.state
   from states s
   where s.shipped_to = true)

Summary

This brief discussion selected a few additional Relational SQL operators and has shown how they can be interpreted in context of JSON SQL. A this point I am confident that the complete Relational SQL semantics can be extended for the JSON types without restriction and with possible semantic interpretation extension due to the possible absence of a fixed schema.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

 

SQL for JSON Rationalization Part 17: Cartesian Product with Restriction (Join) (Again!)

There is a lot more to be said about joins in context of JSON SQL beyond the introduction in the previous blog.

“Join Homogeneous” Schema

The previous blog’s sample data set was homogeneous in the sense that all paths used in join criteria had a value in all documents. There was never the case that a path did not have a value. This is analogous to Relational SQL where columns used in joins always have values by virtue of the existence of a schema.

Let’s explore “join heterogeneity” in this blog. As usual, the sample data set is introduced first.

Sample Data Set

select {*} from foo

results in

{"a":{"b":5},"n":null,"x":{"y":"foobar"}}
{"a":{"b":10},"n":false}

and

select {*} from bar

results in

{"a":{"b":5},"n":true,"x":{"y":"foobar"}}
{"a":{"b":11},"n":null,"x":"missing"}

Homogeneous Join

The following join is homogeneous as the paths involved in the join criteria all have a value.

select {*} 
from  foo as f, 
      bar as b 
where f.a = b.a

Results in

{"b":{"a":{"b":5},"n":true,"x":{"y":"foobar"}},
 "f":{"a":{"b":5},"n":null,"x":{"y":"foobar"}}}

Null vs. Absent Value

In the JSON standard JSON null is a value. Compared to Relational SQL, JSON null does not express “unknown”. The equivalent to Relational SQL NULL is the absence of the value in JSON SQL. Therefore, a join where the paths involved in a join criteria have the value JSON null are homogeneous joins.

select {*} 
from  foo as f, 
      bar as b 
where f.n = b.n

results in

{"b":{"a":{"b":11},"n":null,"x":"missing"},
 "f":{"a":{"b":5},"n":null,"x":{"y":"foobar"}}}

Heterogeneous Join

A heterogeneous join in context of JSON SQL has paths in the join criteria that do not exist in at least one document, aka, do not refer to values in this case.

For example, the path x.y does not refer to a value in all documents of the example data set.

The semantics is that if a document does not have a value at the path of the join criteria the document does not participate in the Cartesian product, and therefore does not provide a document to the result set.

select {*} 
from  foo as f, 
      bar as b 
where f.x.y = b.x.y

results in

{"b":{"a":{"b":5},"n":true,"x":{"y":"foobar"}},
 "f":{"a":{"b":5},"n":null,"x":{"y":"foobar"}}}

Check for Missing Values

JSON SQL provides a predicate that supports checking the presence (or absence) of values. This predicate can be used to check if a join is going to be a homogeneous join or a heterogeneous join.

select {*} 
from  foo 
where not exists_path x.y

results in

{"a":{"b":10},"n":false}

and

select {*} 
from  bar 
where not exists_path x.y

results in

{"a":{"b":11},"n":null,"x":"missing"}

These queries show that the previous query is a heterogeneous join as not all documents contain the join paths.

In the absence of schema support for JSON this allows to check for homogeneity in context of joins, like a dynamic schema check for a very specific purpose. During software development it can be determined if it is important to have a homogeneous join or if a heterogeneous join is sufficient. Depending on the requirement and outcome of the query checking for path existing appropriate error handling can take place.

Summary

JSON SQL supports homogeneous as well as heterogeneous joins without any extra syntax or special execution semantics. Furthermore, with the predicate for checking existence the developer is given a tool to determine if a join is going to be homogeneous or heterogeneous.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON Rationalization Part 16: Cartesian Product with Restriction (Join)

Restrictions can be added to a Cartesian Product and this is briefly discussed in this blog. It demonstrates the power of joins in context of JSON documents.

Example Data Set

As always, the sample data sets that are being used for queries in this blog are introduced first.

select {*} from jer

results in

{"a":1,"b":20,"c":true,"d":{"x":"y"}}
{"a":2,"b":21,"c":true,"d":{"x":[null,5]}}

and

select {*} from tom

results in

{"a":3,"b":20,"c":false,"d":{"x":"y"}}
{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}

This data set is used in the following to introduce restrictions in context of a Cartesian Product.

Join and Join Criteria

The following demonstrates a join where the join criteria (restriction) is based on a scalar type.

select {*} 
from   jer as j, tom as t 
where  j.b = t.b

results in

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

Remember, that the result documents from each of the collections are disambiguated by adding the root property “j” and “t” (aka, the correlation specifications).

A join can be empty if the join criteria do not derive to a result, as shown in the following.

select {*} 
from   jer as j, tom as t 
where  j.a = t.a

does not return a result.

Projection can be applied as well.

select {t.b} 
from   jer as j, tom as t 
where  j.b = t.b

results in

{"t":{"b":20}}
{"t":{"b":21}}

Using an AS clause in the projection allows to reshape the result.

select {t.b as tb} 
from   jer as j, tom as t 
where  j.b = t.b

results in

{"tb":20}
{"tb":21}

Join criteria can be defined not only on top level scalar properties, but on any JSON structure on any level. The following two queries illustrate this.

select {*} 
from   jer as j, tom as t 
where  j.d.x.[1] = t.d.x.q

results in

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

and

select {*} 
from   jer as j, tom as t 
where  j.d = t.d

results in

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

Of course, equality is not the only possible operator for join criteria.

select {*} 
from   jer as j, tom as t 
where  j.a < t.a

results in

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

and so does

select {*} 
from   jer as j, tom as t 
where  j.a <> t.a

Cartesian Product with Restriction

Cartesian products can be restricted with non-join criteria.

select {*} 
from   jer as j, tom as t 
where  j.c = true 
       or t.c = false

results in

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

{"j":{"a":2,"b":21,"c":true,"d":{"x":[null,5]}},
 "t":{"a":4,"b":21,"c":false,"d":{"x":{"p":null,"q":5}}}}

Join with Join and Non-Join Criteria

And a mix of join and non-join criteria is possible as well.

select {*} 
from   jer as j, tom as t 
where  j.d = t.d 
       and j.b = t.b 
       and (j.c = true or t.c = false)

results in

{"j":{"a":1,"b":20,"c":true,"d":{"x":"y"}},
 "t":{"a":3,"b":20,"c":false,"d":{"x":"y"}}}

Summary

Joins are a powerful feature of JSON SQL as demonstrated in this blog as it supports the combination of documents in different collections without having to foresee their combination when deciding on the document structures. Joins combine the power of JSON documents with the power of value-based correlation of documents.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON Rationalization Part 15: Cartesian Product and Projection

In part 14 of this blog series Cartesian Product queries were discussed that did have an Asterisk projection; this blog discusses specific paths as projection (non-Asterisk).

Example Data Set

As always, the sample data sets that are being used for queries in this blog are introduced first.

select {*} from ying

results in

{"a":3,"c":20}
{"a":4,"c":21}

and

select {*} from yang

results in

{"a":1,"b":10}
{"a":2,"b":11}

Projection

To recap, JSON SQL supports JSON projection as well as relational projection. JSON projection is specified by enclosing paths within a set of curly brackets: {}. This will cause the query result represented as JSON objects.

For example, the following query returns JSON objects.

select {a, b} from yang

results in

{"a":1,"b":10}
{"a":2,"b":11}

JSON SQL returns relational results when the set of curly brackets is omitted; the following query returns the result as table.

select a, b from yang

results in

|a                        |b                        |
+-------------------------+-------------------------+
|1                        |10                       |
|2                        |11                       |

Projection without AS in Joins

The following is a projection of a join resulting in JSON objects.

select {yi.a, ya.b} from ying as yi, yang as ya

results in

{"ya":{"b":10},"yi":{"a":3}}
{"ya":{"b":11},"yi":{"a":3}}
{"ya":{"b":10},"yi":{"a":4}}
{"ya":{"b":11},"yi":{"a":4}}

The same query with results represented as relation is specified as follows.

select yi.a, ya.b from ying as yi, yang as ya

results in

|yi_a                     |ya_b                     |
+-------------------------+-------------------------+
|3                        |10                       |
|3                        |11                       |
|4                        |10                       |
|4                        |11                       |

Observe that the results include the table correlation specifiers “yi” or “ya”. This is necessary since different collections might have documents with the same paths. The following query highlights this case.

select {yi.a, ya.a} from ying as yi, yang as ya

results in

{"ya":{"a":1},"yi":{"a":3}}
{"ya":{"a":2},"yi":{"a":3}}
{"ya":{"a":1},"yi":{"a":4}}
{"ya":{"a":2},"yi":{"a":4}}

This automatic result qualification using correlation specifications ensures that path duplicates are automatically resolved in the results.

Projection with AS in Joins

In many cases the automatic duplicate resolution is sufficient for clients. However, in some cases this is not desired. In those cases the AS clause allows the placement of result values into any place of JSON documents using the AS clause. In the relational result case the columns can be named as desired.

select {yi.a as b, ya.a as c} from ying as yi, yang as ya

results in

{"b":3,"c":1}
{"b":3,"c":2}
{"b":4,"c":1}
{"b":4,"c":2}

The above shows a simple renaming of the paths.

select {yi.a as x.b, ya.a as y.[0]} from ying as yi, yang as ya

results in

{"x":{"b":3},"y":[1]}
{"x":{"b":3},"y":[2]}
{"x":{"b":4},"y":[1]}
{"x":{"b":4},"y":[2]}

This query shows a more complex result object creation and goes beyond simple renaming of paths.

The following query shows how specific column names are specified.

select yi.a as x, ya.a as y from ying as yi, yang as ya

results in

|x                        |y                        |
+-------------------------+-------------------------+
|3                        |1                        |
|3                        |2                        |
|4                        |1                        |
|4                        |2                        |

Summary

In summary, defining projection in context of SQL JSON joins is straightforward and supports flexible renaming of columns in context of relational results as well as expressive result value positioning as paths in JSON object results.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON Rationalization Part 14: Cartesian Product of Asterisk Queries

After introducing projection and restriction in JSON SQL, the next blogs will discuss Cartesian product initially, and joins down the road.

Example Data Sets

Three collections form the example data set for this blog. The collections are cp_one, cp_two and cp_three.

cp_one:

{"a":"a-value","b":"b-value"}

cp_two:

{"a":{"x":true},"c":{"y":false}}
{"a":{"x":null}}

cp_three:

{"d":[],"e":[]}
{"f":[true],"g":[false]}
{"h":[null],"i":[null]}

Recap: JSON Query Result Representation

To recap, a result of a JSON SQL query can be represented in two different forms. One form is a set of JSON documents, and the other form is a set of rows in a relational table.

For example, the JSON SQL query

select {*} from cp_two

results in

{"a":{"x":true},"c":{"y":false}}
{"a":{"x":null}}

The projection syntax indicates through the use of “{” and “}” that the result of the query is to be represented as JSON documents.

In contrast, the JSON SQL query

select * from cp_two

results in

|a_x  |a          |c_y   |c           |
+-----+-----------+------+------------+
|true |{"x":true} |false |{"y":false} |
|null |{"x":null} |<>    |<>          |

Omitting the “{” and “}” in the projection indicates that the result should be in relational table form. Note that the result in a relational table contains a column for each path found in any of the result JSON documents.

Cartesian Product and Join

In short, the Cartesian product is the cross product of the documents of the collections named in the JSON SQL query. A JSON SQL query can reference two or more collections and the result is the cross product of all documents in all referenced collections.

Since the result of a JSON SQL query is a set of JSON documents, the result of a Cartesian product query must be a set of single JSON documents as well. Each JSON document, however, will be a combination of the JSON documents as produced by the Cartesian product.

For example, the Cartesian product of cp_one and cp_two is:

{"a":"a-value","b":"b-value"} {"a":{"x":true},"c":{"y":false}}
{"a":"a-value","b":"b-value"} {"a":{"x":null}}

The result of the query will be two JSON documents, each is the combination of the document pairs just shown.

A join is a Cartesian product with an applied restriction. The restriction can be simple or complex, depending on the client’s requirements. Joins are not the focus in the next few blogs, but will be front and center down the road.

The definition of the Cartesian product (or join for that matter) is fundamentally not different from the relational equivalent. Instead of creating the cross product of rows, the cross product of JSON documents is computed.

Asterisk Cartesian Product Query

For the benefit of the discussion this blog only discusses Asterisk Cartesian product queries that reference two or more collections. An example query is

select {*} from cp_one as one, cp_two as two

Inherent with Asterisk as projection is the possible duplication of paths in the combined JSON documents. For example, the collections cp_one and cp_two both contain documents with a path “a”.

A duplication of paths is not necessarily the case; if the documents of the referenced collections do not have paths in common, there will be no duplication. However, a duplication of paths is possible in general and this possibility needs to be addressed.

The approach to remove duplicate paths is called duplicate path resolution.

Duplicate Paths in Query Results

JSON document combinations that are the result of a Cartesian product might be disjoint in paths, or might have common partial or full paths. If the paths are not disjoint, the combination of the JSON documents might contain the same path twice.

The Cartesian product of cp_one and cp_two is (as shown above):

{"a":"a-value","b":"b-value"} {"a":{"x":true},"c":{"y":false}}
{"a":"a-value","b":"b-value"} {"a":{"x":null}}

Ignoring duplicate paths, the result of the equivalent JSON SQL query could be composed like this:

{"a":"a-value","b":"b-value","a":{"x":true},"c":{"y":false}}
{"a":"a-value","b":"b-value","a":{"x":null}}

This would simply be the combination of all properties into one JSON document. While the JSON standard does not prohibit duplicate properties in JSON documents, many implementations (e.g., languages, libraries, or storage systems) do not support maintaining duplicate properties consistently. Therefore, to be on the safe side, avoiding duplicate paths is prudent.

For reference, relational systems append e.g. “_1” or “_2” to duplicate column names in order to avoid duplication. However, JSON SQL takes a different approach in order to provide symmetry for the JSON result and the relational result case as both cases have to be addressed.

Automatic Duplicate Path Resolution

Since no schema is in place for any of the involved documents or collections, it is impossible to determine based on a schema if duplicate paths will exist (or not). This means that it is always assumed that there could be duplicate paths.

In order to consistently avoid duplicate paths, several steps are taken.

The first step is requiring a correlation specification for each collection referenced in a JSON SQL query referring to more than one collection. For example,

select * from cp_one as one, cp_two as two

specifies “one” as correlation specification for cp_one, and “two” for cp_two.

The second step is that the results are qualified by the correlation specification. For the result as JSON documents the documents from the Cartesian product become sub-documents where correlation specifications are the top level path.

For example,

select {*} from cp_one as one, cp_two as two

results in

{"one":{"a":"a-value","b":"b-value"},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "two":{"a":{"x":null}}}

As the results show, single documents are returned and are the combination of the corresponding documents coming from the Cartesian product. The correlation specifications are being used as top level paths and so the origin collections of the results become apparent.

The representation of the result in a relational table is analogous: the column names are prepended with the corresponding correlation specifications followed by an underscore “_”.

For example,

select * from cp_one as one, cp_two as two

results in

|one_a     |one_b     |two_a_x |two_a      |two_c_y |two_c       |
+----------+----------+--------+-----------+--------+------------+
|"a-value" |"b-value" |true    |{"x":true} |false   |{"y":false} |
|"a-value" |"b-value" |null    |{"x":null} |<>      |<>          |

Using the correlation specification as top level properties in the JSON document result format or as prefixes in the relational table result format achieves symmetry in avoiding duplication paths.

Role of Correlation Specification

Summarizing, the approach of mandatory correlation specifications combined with their use as prefix or top level properties achieves robust duplicate path resolution that is independent of the specific collections or the documents involved.

  • Since it is unknown if there will be a duplication of paths, queries representing a Cartesian product must have a correlation specification for each of the referenced tables
  • Every projection path is prefixed even if there is no duplication (since without schema it is not possible to know if there is going to be path duplication or not)
  • Selective use of a prefix is impossible due to a path element possibly being equivalent to a correlation specification

Example Cartesian Product Asterisk Queries

Some additional example queries are shown next.

select {*} from cp_two as two, cp_three as three

results in

{"three":{"d":[],"e":[]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"three":{"f":[true],"g":[false]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"three":{"h":[null],"i":[null]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"three":{"d":[],"e":[]},
 "two":{"a":{"x":null}}}
{"three":{"f":[true],"g":[false]},
 "two":{"a":{"x":null}}}
{"three":{"h":[null],"i":[null]},
 "two":{"a":{"x":null}}}

The query

select {*} from cp_one as one, cp_two as two, cp_three as three

results in

{"one":{"a":"a-value","b":"b-value"},
 "three":{"d":[],"e":[]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"f":[true],"g":[false]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"h":[null],"i":[null]},
 "two":{"a":{"x":true},"c":{"y":false}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"d":[],"e":[]},
 "two":{"a":{"x":null}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"f":[true],"g":[false]},
 "two":{"a":{"x":null}}}
{"one":{"a":"a-value","b":"b-value"},
 "three":{"h":[null],"i":[null]},
 "two":{"a":{"x":null}}}

Summary

Supporting Cartesian Product in JSON SQL is straight forward and follows the same approach and semantics as in Relational SQL. Aspects like duplicate paths are equivalent to duplicate columns – the same strategies for duplication resolution can be applied.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON Rationalization Part 13: Comparison Operators for JSON Object and JSON Array

As promised in a previous blog the discussion of comparison operators in context of JSON object and JSON array is following in this blog.

Comparison Operators = and <>

Equality and inequality are very straightforward comparison operators and are discussed first. Both are defined on the paths to properties as well as the JSON types of properties.

JSON array equality is (recursively) defined as follows. Two JSON arrays are equal if

  • They have the same number of array indexes starting at index 0
  • The value of each array element is equal for the same index in each of the two JSON arrays

Implicitly this means that order matters in the sense that array elements are compared according to their index position.

JSON object equality is (recursively) defined as follows. Two JSON objects are equal if

  • They have the exact same set of paths
  • The same path in each document leads to the same value and the same JSON type

Implicitly this means that the order of properties in JSON objects does not matter. It is “only” necessary that both objects have the exact same set of paths in any order.

There is no implicit type conversion supported in JSON SQL. The JSON string “15” is considered different from the JSON number 15 as both are of different JSON type.

Sample Data Set

To illustrate equality the following collection compColl is introduced:

{"a":[15,true,{"p":"q"}],"b":[15,true,{"p":"q"}]}
{"a":[15,true,{"p":"q"}],"b":[15,true,{"p":"q"},null]}
{"a":[15,true,{"p":"q"}],"b":["15",true,{"p":"q"}]}
{"x":{"r":15,"s":[true,false]},"z":{"r":15,"s":[true,false]}}
{"x":{"r":15,"s":[true,false]},"z":{"r":15,"s":[[true,false]]}}
{"x":{"r":15,"s":[true,false]},"z":{"r":"15","s":[true,false]}}
{"e":15,"f":[14,15,16]}
{"e":15,"f":[16,15]}
{"e":15}

Sample Queries

An example query for equal JSON arrays is as follows.

select {*} from compColl where a = b

returns

{"a":[15,true,{"p":"q"}],"b":[15,true,{"p":"q"}]}

An example query for equal JSON objects is as follows.

select {*} from compColl where x = z

returns

{"x":{"r":15,"s":[true,false]},"z":{"r":15,"s":[true,false]}}

Inequality is defined as negation of equality. The following queries demonstrate this:

select {*} from compColl where a <> b

returns

{"a":[15,true,{"p":"q"}],"b":[15,true,{"p":"q"},null]}
{"a":[15,true,{"p":"q"}],"b":["15",true,{"p":"q"}]}
select {*} from compColl where x <> z

returns

{"x":{"r":15,"s":[true,false]},"z":{"r":15,"s":[[true,false]]}}
{"x":{"r":15,"s":[true,false]},"z":{"r":"15","s":[true,false]}}

Undefined Comparison Operators <, >, <= and >=

Several comparison operators are undefined for JSON array and JSON object: <, >, <= and >=. If during query processing these comparison operators are used in combination with JSON array and/or JSON object, then the JSON documents will not participate in the comparison and will not add any result document to the result set.

The following query demonstrates that only like JSON types are compared:

select {*} from compColl where a.[0] <= b.[0]

returns

{"a":[15,true,{"p":"q"}],"b":[15,true,{"p":"q"}]}
{"a":[15,true,{"p":"q"}],"b":[15,true,{"p":"q"},null]}

The following query demonstrates that the <= comparison on JSON array is not defined:

select {*} from compColl where a <= b

returns the empty result.

The reason that those four comparison operators are not implemented is that not all JSON types can be compared with each other. For example, a JSON Boolean and a JSON number cannot be compared and consequently the comparison of JSON array or JSON object might fail and return an undefined result. In fact, across all JSON types, only JSON string can be compared to JSON string and JSON number to JSON number by >, <, <= and >=; all other JSON type cannot be compared with each other or other JSON types with these operators.

In context of query processing a failing comparison operator would not be desirable as the query would fail. As a consequence, JSON SQL does not implement the four comparison operations <, >, <= and >= on JSON array and JSON object (actually, on any JSON type except JSON number and JSON string).

However, a user can compare JSON arrays and JSON objects by comparing their array elements or properties individually where applicable or necessary. This is called user-defined comparison and is based on individual restrictions.

User-defined Comparison

A user defines comparison by means of predicates. This supports the user in comparing only those JSON array elements or JSON object properties that need to be compared for the use case at hand and make sense in this context: a user is not forced to compare all JSON array elements or all JSON object properties by can do so selectively.

select {*} from compColl where a.[2].p >= b.[2].p

returns

{"a":[15,true,{"p":"q"}],"b":[15,true,{"p":"q"}]}
{"a":[15,true,{"p":"q"}],"b":[15,true,{"p":"q"},null]}
{"a":[15,true,{"p":"q"}],"b":["15",true,{"p":"q"}]}

As the example shows, there are JSON array elements or JSON object properties that cannot be compared, e.g., a.[2] or b.[2] (except for equal and not-equal).

Since JSON SQL supports JSON documents with varying schema, a user can ensure the presence and JSON type of certain properties that are relevant for comparison with the predicates exists_path and is_of_type. The former ensures the presence, the latter type compatibility.

The following query shows all documents where the property f does not have a second array index but has a property e. If a query compares e with f.[1] then this query shows which documents will not participate in the query.

select {*} 
from   compColl 
where  exists_path e 
       and not exists_path f.[1]

returns

{"e":15}

The analogous is possible with the is_of_type predicate that would show which documents are excluded because of type incompatibility.

Missing Paths

As the example shows, only those JSON documents are participating in the comparison that fulfill the constraints wrt. existence and type compatibility.

A missing path does not falsify the result, as this example shows: the document is simply not participating in the comparison:

The query

select {*} from compColl where e = f.[1]

returns

{"e":15,"f":[14,15,16]}
{"e":15,"f":[16,15]}

Total Order

A total order across all document in a collection is only possible if each document can be compared with every other document in the same collection. With the predicates exists_path and is_of_type it is possible to determine if any documents will be left out of a comparison and hence the documents of a collection cannot be totally ordered with the given predicates.

Summary

Even though the operators >, <, >= and <= cannot be implemented for several JSON types, clients can implement partial comparison of documents with combinations of individual restrictions. The predicates exists_path and is_of_type allow to determine the set of documents included in (or excluded from) the query.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON Rationalization Part 12: SERVER-831

There is an almost infamous bug in context of MongoDB, called SERVER-831. This blog puts SERVER-831 in context of JSON SQL.

Upfront

In order to not loose the context, the bug and one of the use cases were captured as PDFs and are attached here:

Anatomy of SERVER-831

Server 831 has two main aspects: a selection combined with an update. In the following only the selection is discussed; the update part follows down the road when update in context of JSON SQL will be introduced.

Basically, the selection selects based on the value of array elements without knowing the precise array index where the values are located. The array index is not part of the selection criteria, only the content of array elements.

Data Set

These are the documents in the collection server831 used for illustration in this blog:

{"a":[5,4,3,2,1]}
{"a":[5,10,15,20,25]}
{"a":[1,2,3,4,5]}
{"a":[{"_id":7},{"_id":8}]}
{"a":[{"_id":8},{"_id":7}]}
{"a":[null,[0,0,7],null]}
{"a":[true,false],"b":true}
{"a":[true,false],"b":false}
{"a":[true,[null],false]}

Predicating Array Elements by Array Index

So far it was possible to create a selection based on the array index. For example, the following query returns all documents that contain an array “a” where the value at index 0 is 5.

select {*} from server831 where a.[0] = 5

returns

{"a":[5,4,3,2,1]}
{"a":[5,10,15,20,25]}

Predicating Array Elements by Content

Sometimes there is the situation where the array index is unknown and the client wants to retrieve all documents where at least one array element has a specific value. Referring to every array element is denoted as [*] in the JSON SQL syntax (instead of referring to a specific array index).

For example, the following query selects all documents where at least one of the array elements of “a” has the value 5.

select {*} from server831 where a.[*] = 5

results in

{"a":[5,4,3,2,1]}
{"a":[5,10,15,20,25]}
{"a":[1,2,3,4,5]}

More interesting queries are the following examples that reach into the value(s) of array elements:

select {*} from server831 where a.[*]._id = 7

returns

{"a":[{"_id":7},{"_id":8}]}
{"a":[{"_id":8},{"_id":7}]}

Not only scalar literals can be used, but non-scalar literals as well.

select {*} from server831 where a.[*] = [0,0,7]

returns

{"a":[null,[0,0,7],null]}

Selection does not have to be based on literals:

select {*} from server831 where a.[*] = b

returns

{"a":[true,false],"b":true}
{"a":[true,false],"b":false}

The latter allows comparison of values inside documents.

Any number of levels can be referenced, for example

select {*} from server831 where a.[*].[*] = null

returns

{"a":[true,[null],false]}

SERVER-831 Use Case

The document provided in one of the use cases in SERVER-831 is as follows (with cleaned up syntax errors):

{ "_id": 1, 
  "name": "Dave Gahan", 
  "medications": [ 
  { "_id": 23, 
    "name": "Dilaudid", 
    "type": "Rx", 
    "prescriptions": [ 
    { "_id": 13, 
      "quantity": 60, 
      "started": "2009-01-01" }, 
    { "_id": 77, 
      "quantity": 45, 
      "started": "2009-02-01" } ] }, 
  { "_id": 41, 
    "name": "Oxycodone", 
    "type": "Rx" } ]}

The corresponding query in form of JSON SQL is (assuming the data is stored in collection uc831):

select {*} 
from uc831 
where _id = 1 
      and medications.[*]._id = 23 
      and medications.[*].prescriptions.[*]._id = 77

Summary

Selection based on array element values without the knowledge of the array index is an extremely powerful query feature of JSON SQL and is most likely useful in many different use cases.

The syntax of JSON SQL had to be modified only minimally without any restriction or loss of query expressiveness.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON Rationalization Part 11: JSON Specific Predicates

JSON SQL cannot assume collections containing only homogeneous documents; instead, it must support schemaless documents. This requires JSON specific predicates that are discussed in this blog.

Example Collection

The documents of the following collection “predColl” are used in this blog as examples.

{"a":"b"}
{"a":{"c":1,"d":2},"e":[77,{"x":"eightyeight"}]}
{"a":{"c":1,"d":2},"e":["seventyseven",{"x":88}]}

Definition: Path

A (complete) path in a JSON document is the list of property names and array indexes from the root all the way to a leaf separated by “.”. For example, a complete path is

e.[1].x

A partial path in a JSON document is the list of property names and array indexes from the root to an intermediary property (that is not a leaf property) separated by “.”. For example, a partial path is

e.[1]

It is possible that a path is a complete path in one document and a partial path in another document. For example, a path that is complete and partial at the same time is

a

Each path leads up to a property that is of a specific JSON type. For example,

  • the type of e.[1].x is JSON_STRING in one of the example documents and JSON_NUMBER in another one.
  • the type of e.[1] is JSON_OBJECT in both cases.

Document: Set of Paths

Based on the above definition, a JSON document can be represented as the set of its paths.

The document

{"a":{"c":1,"d":2},"e":["seventyseven",{"x":88}]}

can be represented as

a
a.c
a.d
e.[0]
e.[1]
e.[1].x

The corresponding types are

JSON_OBJECT
JSON_NUMBER
JSON_NUMBER
JSON_STRING
JSON_OBJECT
JSON_NUMBER

Schemaless Documents / Schema-per-Document

Traditionally, databases enforced the definition of a schema and many continue to do so. This means that data can only be stored successfully in a database if the data complies with the schema at that point in time.

Some database systems do not enforce a schema and as a consequence the data is not constrained by a schema. In context of JSON documents, if a document is not constrained by a schema, then it can have any valid structure (in terms of the rules the JSON standard defines).

These documents are termed “schemaless” documents. In reality there is a schema, an implicit  (intentional) one, for each document, termed “schema-per-document”.

In concrete terms “schemaless” in context of a database system means:

  • Any pair of documents within a collection of a database can potentially have different sets of paths
  • The same (complete or partial) path in two different documents can lead to a value of a different JSON type

As a consequence, queries must be able to test for the existence of paths as well as for the existence of a specific JSON type as these cannot be assumed (in contrast to a database that enforces schemas).

JSON Specific Predicates: exists_path and is_of_type

Two predicates are needed in order to check for the existence of paths or JSON types of values.

  • exists_path <path>
  • <path> is_of_type <JSON type>

The query

select {*} from predColl where exists_path a.d

returns

{"a":{"c":1,"d":2},"e":[77,{"x":"eightyeight"}]}
{"a":{"c":1,"d":2},"e":["seventyseven",{"x":88}]}

The query

select {*} from predColl where e.[1].x is_of_type JSON_number

returns

{"a":{"c":1,"d":2},"e":["seventyseven",{"x":88}]}

Negation

Negation of the predicates is useful to determine if there are document that do not have specific paths or paths do not refer to properties of specific JSON types.

select {*} from predColl where not exists_path e.[1].x

returns

{"a":"b"}

The query

select {*} from predColl where not a is_of_type JSON_object

returns

{"a":"b"}

Definition: (Partial) Homogeneity

Queries using the predicates exists_path or is_of_type can be used to determine if all documents have the required or expected paths or if there are documents that are missing specific paths or have paths leading to unexpected JSON types.

With these predicates it is possible to determine if the documents in a collection are homogeneous wrt. given set of paths.

Dynamic Schema Check

A developer can now implement first checking path existence or JSON type compliance before executing business logic related queries. If the required paths or JSON types are missing in specific documents, appropriate error handling can be implemented for these.

From a different viewpoint, a developer now has the tools to dynamically check or select all documents complying to a schema that is “imagined” at the type of querying.

Summary

While schemaless documents are convenient from a development perspective, this requires in general the ability to check for the existence or absence of paths as well as JSON type compliance.

The predicates exists_path and is_of_type provide the necessary querying tools in order to test for the variations in schemaless JSON documents.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

SQL for JSON Rationalization Part 10: Complex Restriction

Until this blog only single restrictions were used; here is the discussion of complex restrictions in JSON SQL.

Example Data Set

In the following the collection “comrescoll” (for complex restriction collection) is used and it contains the following JSON documents:

{"a":{"c":"foo"},"b":[true,false,null]}
{"a":{"c":"foo"}}
{"b":[true,false,null]}

Single Restriction Semantics

As defined in a previous blog, a document only participates in a single restriction if the referenced properties are present.

For example,

select {*} 
from comrescoll 
where a.c = 'foo'

returns

{"a":{"c":"foo"},"b":[true,false,null]}
{"a":{"c":"foo"}}

as only those two object have the property “a.c” defined.

As the blog will show, a complex restriction applies to a document if all properties requiring evaluation are present.

Note that this previous sentence was really carefully worded. The following would not be correct: a complex restriction applies to a document if all properties that are referenced in the complex restriction are present.

The reason will be discussed later in detail.

AND

In an ANDed complex restriction, all properties that require evaluation have to be present in the document. A document will be included into the result set if the complex restriction evaluates to true for the document. The following query shows this case:

select {*} 
from comrescoll 
where a.c = 'foo' and b.[1] = false

returns

{"a":{"c":"foo"},"b":[true,false,null]}

… WHERE true, … WHERE false

A restriction always resulting in true can be implemented like this:

select {*} 
from comrescoll 
where 5 = 5

No complex restriction is required to accomplish this.

A restriction always resulting in false can be implemented like this:

select {*} 
from comrescoll 
where a.c = ‘foo’ and a.c = ‘bar’

However, an easier way is

select {*} 
from comrescoll 
where 5 = 6

This case does not require a complex restriction either.

OR

The OR operator behaves as expected:

select {*} 
from comrescoll 
where a.c = 'foo' or b.[1] = false

returns

{"a":{"c":"foo"},"b":[true,false,null]}
{"a":{"c":"foo"}}
{"b":[true,false,null]}

In case of a disjunction not all paths have to actually be present for a correct execution:

select {*} 
from comrescoll 
where d.[1] = false or a.c = 'foo'

returns

{"a":{"c":"foo"},"b":[true,false,null]}
{"a":{"c":"foo"}}

Even though the path “d.[1]” is not present, the evaluation takes place and is correct as the absence of a path does not necessarily fail the execution in a disjunction if this part of the complex restriction is unnecessary.

To force the presence of the path “d.[1]”, a predicate like “exists_path” would have to be added like this:

select {*} 
from comrescoll 
where d.[1] = false or a.c = 'foo' and exists_path d.[1]

This is not implemented yet, but it will be down the road at some point in time.

NOT

The operator NOT behaves as expected. The query

select {*} 
from comrescoll 
where not a.c = 'foo' and b.[1] = false

results in

{"b":[true,false,null]}

This might be surprising initially as the result contains a document that does not contain the path “a.c”. However, the absence of the path “a.c” means that “a.c” does not have the value ‘foo’.

Like above, the presence of “a.c” would be enforced by a conjunction of exists_path “a.c”.

Combination of AND, OR, NOT

The operators AND, OR and NOT can be combined as expected according to the SQL 92 standard. For example, the query

select {*} 
from comrescoll 
where not a.c = '' and not b.[0] = 0 or not b.[1] = 1

results in

{"a":{"c":"foo"},"b":[true,false,null]}
{"a":{"c":"foo"}}
{"b":[true,false,null]}

Caveat

Remember the “careful wording” from earlier? Consider the following query:

select {*} 
from comrescoll 
where a.c = 'foo' and b.[1] = false or d = 6 or d <> 6

On the surface of it, the paths “a.c”, “b.[1]” and “d” need to be present for the complex restriction to be evaluated. However, this query returns the following result:

{"a":{"c":"foo"},"b":[true,false,null]}

Upon closer inspection, the two restrictions involving the path “d” are insignificant. The path “d” is not needed for the evaluation even though the path “d” is referred to in the complex restriction.

Another example is

select {*} 
from comrescoll 
where a.c = 'foo' and b.[1] = false or (d = 6 and d <> 6)

resulting in

{"a":{"c":"foo"},"b":[true,false,null]}

What if the path “d” should be present? In this case a predicate like exists_path is necessary (to be discussed in one of the future blogs). With the additional predicate exists_path only those documents are included in the evaluation that actually contain that path.

Summary

Complex restrictions in JSON SQL are very similar to Relational SQL in their syntax and semantics except for the difference that paths that are not required for the evaluation of a complex restriction do not have to be present in the documents even though they are mentioned in the complex restriction.

Go [ JSON | Relational ] SQL!

Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.