Trending: Multi-Interface and Multi-Data-Model Databases

An interesting trend, especially in the NoSQL database space, is the move towards multi-interface and multi-data-model databases, sometimes both in the same system. While this provides flexibility, it also brings challenges.

Multi-Data-Model Support

In the relational database space, supporting different data models concurrently is not a novelty. Relational databases started off with the relational data model implementation, and later on some of the systems extended the relational model with objects, XML, or JSON.

Some databases in the NoSQL space are starting to evolve in the same manner, providing more than one data model concurrently. Some interesting examples are discussed next.

One example in the NoSQL space is Oracle NoSQL [http://www.oracle.com/technetwork/database/database-technologies/nosqldb/]. This system supports a key/value data model whereby the key is used to identify values that are not interpreted by the database itself. In addition, values can be of complex types that are actually interpreted by the database, e.g., in secondary indexes.

Aerospike is another example in the NoSQL space [http://www.aerospike.com/]. Aerospike provides a data model consisting of basic and complex types. In addition, it supports language-native formats as well as large data types that have specific operational characteristics and data type operations tuned for scale.

Just as some relational databases extended their data models over time to support specific use cases more directly, some NoSQL databases are going down the same path to more directly support specific application developer needs.

Multi-Interface Support

From an application development perspective, a single query API that provides the complete required query expressiveness is certainly preferable. However, especially in the new area of NoSQL databases, it is not yet clear what a good query API actually looks like. This can be observed in different systems providing different query API alternatives.

MongoDB [http://www.mongodb.org/] has a document query interface based on query patterns in the form of JSON documents (“Query Documents”). In addition, it provides a map/reduce interface and aggregation pipelines. These are three different APIs that an application developer can choose from, and, in addition, they overlap in their functionality: depending on the query, it can be expressed in all three of them.
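To illustrate the overlap, here is a sketch of the same selection expressed through two of the three APIs; the collection name “orders” and the property “status” are hypothetical:

// Query document interface: the selection is a JSON pattern
db.orders.find({status: "A"})

// Aggregation pipeline: the same selection expressed as a $match stage
db.orders.aggregate([{$match: {status: "A"}}])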

Aerospike [http://www.aerospike.com/] provides different language drivers in addition to an Aerospike Query Language.

Cloudant [http://www.cloudant.com], in contrast, supports a REST API as well as a query interface based on query documents (similar to MongoDB).

Not strictly an external interface, but very important for specific use cases, is the ability to add functionality dynamically to the database in order to move some processing from the application systems into the database itself: user-defined functions. For example, MongoDB allows adding functionality through JavaScript functions, whereas Aerospike supports two different types of Lua functions: record user-defined functions operate on single records, while stream user-defined functions support distributed processing.
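As a sketch of what such user-defined functionality looks like in MongoDB, the map/reduce interface accepts JavaScript functions that execute inside the database; the collection “orders” and its properties are hypothetical:

// map and reduce are user-supplied JavaScript functions run by the database
var map = function() { emit(this.status, 1); };
var reduce = function(key, values) { return Array.sum(values); };
db.orders.mapReduce(map, reduce, {out: {inline: 1}})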

The Good

Unquestionably, the good part about multi-interface and multi-data-model databases is that an application developer can choose the best combination of data model and access interface for a particular development task. The impedance mismatch between the problem and the solution can be minimized with an appropriate choice.

This also means that developers need to understand the pros and cons of every combination, and that requires going through a learning curve. Going through that learning curve might pay off big time.

In addition, application development teams will have to manage a wider range of implementation variations in terms of application design and engineering, but also in terms of bug fixing and application code maintenance.

The Tricky

The tricky part of multi-interface and multi-data-model databases is that all combinations can be used concurrently, in production as well as post-production (e.g. analytics). Unit and functional tests as well as performance and scale tests become a lot more complicated as they have to test the concurrent execution of various combinations.

Furthermore, many queries can be expressed in different interfaces as those tend to overlap in query expressiveness. So an application developer needs to clearly understand the pros and cons of the query execution that underlies a specific query interface.

Hopefully the query semantics are the same for all combinations (meaning, for example, that predicates are evaluated the same way) and the concurrent use of data models and interfaces does not negatively impact the various clients in terms of concurrency, scale, and performance. Any bug introduced through a discrepancy might be very difficult to reproduce and fix.

Summary

While multi-interface and multi-data-model databases are a powerful technology, there is considerable impact to the application system engineering activities in terms of knowledge acquisition, development, test and maintenance.

While database vendors certainly strive to have all combinations work in harmony, there might be edge cases where one combination does not give the same result as a different one. From an application development perspective test coverage should ensure semantic equivalence of the used combinations so that misinterpretations or wrong results are avoided.

Document-oriented NoSQL Databases: Oracle NoSQL’s Shift in March 2014

I believe Oracle NoSQL extended and expanded from a key/value store to a document-oriented NoSQL database in March 2014. If you concur, then this was a major shift in the document-oriented NoSQL arena.

Document-Oriented NoSQL Databases

In a narrow interpretation, document-oriented NoSQL databases use JSON or BSON as the query interface data representation. In a wider interpretation, any scalar data type and any composite data type (like maps or arrays) are available at the query interface, without necessarily complying with the JSON encoding or interpretation.

Some of the known document-oriented databases are MongoDB, CouchDB and RethinkDB, amongst many others. The site http://db-engines.com/en/ranking/document+store has a categorization and ranking readily available.

In addition to storing documents (aka, complex data structures with scalar, array and map data types), another important criterion is the ability to query based on the document content, e.g., by selecting documents where specific properties have specific values or comply with specific complex predicates (in the general case).

Key/Value Stores

Key/value stores (k/v stores) are different from document-oriented databases. Key/value stores persist a value for a given key. Retrieval is based on the key and values cannot be used in query predicates. Some k/v stores have structured keys or composite keys that support querying a range or set of values.

Oracle NoSQL, for example, supports structured and composite keys (http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/singleget.html, http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/multiget.html).

Aerospike is another example of a k/v store (https://docs.aerospike.com/display/V3/Key-Value+Store+Guide).

K/V Stores: Major Shift

Oracle NoSQL extended its data model from a pure k/v data model to a document-oriented data model in March 2014.

The available simple data types are (cited from http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/tablesapi.html):

  • Double
  • Float
  • Integer
  • Long
  • Java byte array
  • Java String

And the available complex data types are (cited from http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/tablesapi.html):

  • Array (An array of values, all of the same type)
  • Enum (An enumeration, represented as an array of strings)
  • Fixed Binary (A fixed-sized binary type used to handle binary data where each record is the same size. It uses less storage than an unrestricted binary field, which requires the length to be stored with the data)
  • Map (An unordered map type where all entries are constrained by a single type)
  • Records

In addition, indexes can be created on the ‘value’ part, aka, the documents: http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/createindex.html. And based on indexes, the database can be queried without having to specify a key: http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/indexread.html.

The latter functionality extends the characteristics of the Oracle NoSQL database from a k/v store into a document-oriented NoSQL database. Clearly, secondary indexes do not yet constitute a very expressive query language; however, they are a major first step towards one.

What does it mean?

The interesting aspect is that Oracle, as a major database vendor, moved with a separate and dedicated NoSQL product first into the k/v store space, and now additionally into the document-oriented NoSQL space. There are many ways to read this move, of course, but from a technical viewpoint it emphasizes the importance of a structured data model in the NoSQL space. In addition, Oracle NoSQL allows a per-document Avro schema, making it a truly per-document schema database.

Aerospike, for example, also supports complex data types and secondary indexes, even though it is also widely seen as a k/v store (e.g., http://db-engines.com/en/ranking/key-value+store).

Summary

Clearly, the document-oriented segment of NoSQL databases is getting more and more crowded and is attracting the attention of major players in the database space.

What hasn’t been done yet is a comparison of the data model and query interface expressiveness of the major players in the document-oriented NoSQL database segment. Once done, and in case it reveals that the expressiveness is the same or almost the same, the systems are exchangeable from a functionality viewpoint and will distinguish themselves ‘only’ in terms of non-functional properties like latency, throughput, transactions, cost, support, and so on.

Document-oriented NoSQL Databases: How many Joins will you have to implement?

One of the continuously debated items in the context of NoSQL databases is the join operation. Many discussions of joins, at various levels of technical depth, can be found across the community.

So, do we need joins in the context of NoSQL databases? Do NoSQL databases need to implement joins? Are joins outdated concepts that we can live without in the context of NoSQL databases? In this blog I try to rationalize the overarching question in principle. Some fact finding first:

(Database) Data Models and Database Management Systems

Data models, like the relational model, the document-model, the hierarchical model, key-value model, graph model, object-oriented model, XML model, etc., are implementations of data structures in a given database management system. Data models define possible data types and their construction rules for more complex types.

For example, the implementation of a relational model might restrict values in tables to be scalar. Another implementation might allow a table as a value, supporting NF2 relations. One system might support the document model strictly following the JSON model, while others add data types beyond what JSON defines. Some systems support the notion of references, others do not. Each database implements a data model in whatever variation it chooses.

Schemata and Database Management Systems

A schema is a particular extension of a domain model, implemented in context of a data model. For example, a domain model might be suppliers, parts and their relationship. This can be implemented in a relational model, a document model or a graph model or any other supported data model.

There is no ‘best’ way of defining a schema. For the same domain, different schemata can be defined depending on the skill of the creator, the knowledge of query access patterns, the amount of restrictions that should be supervised by the database management system, and other factors.

For example, in a document model, suppliers, parts and their relationships can be modeled as three separate documents, or in two documents (suppliers and their relationship to parts), or one document – and there are many more variations possible, of course.

Joins and Database Management Systems

Some database management systems implement the join operation in their query interface, some do not. For example, Oracle, MySQL and FoundationDB implement joins, MongoDB, Oracle NoSQL and Aerospike do not. So joins are not necessarily restricted to the relational data model.

Joins and Data Access Paths

With the fact finding under our belt, how many joins will you have to implement? In principle, this is a function of the required data access based on a specific schema. Different schemata of the same domain will require a different number of joins.

Let’s look at a few examples in the supplier – parts domain.

Example 1: No join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts":[
     {"part_name": "part_lowQual"}, 
     {"part_name": "part_hiQual"}]
}

The query: “find the names of all parts for a supplier” does not require a join as the data is already structured so that each supplier contains the set of all parts it supplies.
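In MongoDB syntax, for example, this query is a simple selection with a projection; the collection name “suppliers” is hypothetical:

db.suppliers.find({supplier: "superQuality"}, {"parts.part_name": 1})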

Example 2: One join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts": [1, 2]
}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}

The query: “find the identifiers and names of all parts for a supplier” requires a join as a supplier only has the identifiers of the parts it ships, not their names.

Example 3: Two joins required

The documents are structured like this:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}

The query: “find the identifiers and names of all parts for a supplier” requires two joins, one to find the objects for a supplier that relate the part identifier to the supplier identifier, and a second one to find the corresponding parts.
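As a sketch, the joins of Examples 2 and 3 implemented in application logic on top of a MongoDB-style interface could look as follows; the collection names are hypothetical and error handling is omitted:

// Example 2: one join, implemented as a lookup per part identifier
var supplier = db.suppliers.findOne({supplier: "superQuality"});
var partsOfSupplier = supplier.parts.map(function(id) {
    return db.parts.findOne({part_id: id});
});

// Example 3: two joins, chained through the relationship documents
var s = db.suppliers.findOne({supplier: "superQuality"});
var partsOfS = db.relationships.find({supplier_id: s.supplier_id}).toArray()
    .map(function(rel) {
        return db.parts.findOne({part_id: rel.part_id});
    });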

Analysis of Examples

The examples have shown empirically that the need for joins is not a function of the data model (document-oriented in this case), but a function of the data access, aka, the number of required data relationship traversals in context of a given schema. If the relationship to be traversed matches the way the data is structured as in Example 1, no join is necessary. As soon as the data is structured differently from the required traversal by the query, joins are necessary (Example 2 and 3).

So, in summary, the recipe for avoiding joins is easy to state: if, and only if, you can structure your data (aka, build your schema) in such a way that it conforms structurally to the queries, then you can avoid joins completely (Example 1). I am certain that there are special cases out there for which you can accomplish that, but in general this is not possible. And even if it is possible in production, as soon as analysts start analyzing the data sets, they will most likely query along different access paths.

Joins at Query Time vs. Joins at Insert/Update/Delete Time

The above examples clarified that joins are a function of the data access paths. Can joins at query time be avoided entirely by creating data access paths in a certain way?

Yes, it is possible; however, it is a basic trade-off between data query time and data manipulation time: reducing the computational effort at query time, and instead increasing it during insert / update / delete operations. In principle, joins at query time can be avoided if for each access path there is an equivalent data structure in place.

Example 4: Schema refactoring

The documents in this example look like:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}
{"shipper": "fastShipper", "shipper_id": "SH_01"}
{"part_id": 2, "shipper_id": "SH_01"}

Suppliers supply parts; shippers, however, do not ship just any part, but only specific parts (maybe for safety reasons). Several queries can be run against this document set:

  • Find all parts supplied by a supplier with a given name
  • Find all parts shipped by a shipper with a given name
  • Find all suppliers and shippers for a part with a given name

Each of these queries requires at least one join. The documents can be restructured easily to avoid joins altogether:

{"supplier": "superQuality", "supplier_id": "S_55",
 "parts": [
     {"part_name": "part_lowQual", "part_id": 1}
]}
{"shipper": "fastShipper", "shipper_id": "SH_01",
 "parts": [
     {"part_name": "part_hiQual", "part_id": 2}
]}
{"part_name": "part_lowQual", "part_id": 1,
 "suppliers": [
     {"supplier": "superQuality", "supplier_id": "S_55"}
 ], 
 "shippers": []}
{"part_name": "part_hiQual", "part_id": 2,
 "suppliers": [],
 "shippers": [
     {"shipper": "fastShipper", "shipper_id": "SH_01"}
]}

The idea is clear: structure the data in such a way that a query can be satisfied with a simple selection. And, the consequence is clear, too: data is duplicated, possibly many times. Which means that an insert, update or delete has to know all the locations where to modify the data and has to modify the data consistently (and ideally within a single transaction).
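A sketch of what such a consistent modification would look like against the restructured documents above, e.g., renaming part 1; the collection name “docs” is hypothetical, and a database without multi-document transactions cannot guarantee atomicity across the two statements:

// rename part 1 in the standalone part documents
db.docs.update({part_id: 1, part_name: {$exists: true}},
               {$set: {part_name: "part_lowQual_v2"}}, {multi: true});
// rename part 1 in every embedded "parts" array as well
db.docs.update({"parts.part_id": 1},
               {$set: {"parts.$.part_name": "part_lowQual_v2"}}, {multi: true});
// a failure between the two statements leaves the duplicates inconsistent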

As a side note, this is the situation that normalization tries to address by ensuring that each data item is stored only once in the database.

Of course, data duplication will have an impact on the size requirements of main memory and disk space. While there is a change in algorithm complexity, there is also a change in the storage and memory size requirements.

Pre-Joining Data

Pre-joining data makes it possible to avoid joins at query time at the cost of duplicating data at data management time. Alternatively expressed, implementing duplication at data management time is the cost of avoiding normalization combined with query-time joins.

Is there a way to quantify the effort? In principle, as many duplications are necessary as there are joins to be avoided. This is a rough estimate, as many joins are the same except for their selection and/or projection specifications. If all joins are abstracted to their join criteria (omitting projection and selection), the number of distinct join criteria is roughly the amount of duplication required.

The article written by Sarah Mei shows the trade-off between data duplication and joins clearly: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/. She describes many of the issues in the context of a specific use case.

“Wait a minute, I don’t have joins and it works anyway!”

But where are the joins? NoSQL databases that do not implement the join operator in their query interface are in use and in production.

If not expressed as a query, joins are found either in the application system logic or the interface logic, depending on the design. Most likely these are nested-loop joins, hash-based joins (less likely), or a series of selections whose intermediary query results the application logic combines into the final result data set.

And they are not joins on the complete data set either, but usually involve some selection criteria. So the application system logic roughly corresponds to the optimized operator tree of a database query sub-system, and in all actuality there might be many joins implemented that way throughout the application logic.
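A sketch of such an application-side nested-loop join over two intermediary result sets; collection and property names are hypothetical:

// two selections pushed down to the database
var rels  = db.docs.find({supplier_id: "S_55", part_id: {$exists: true}}).toArray();
var parts = db.docs.find({part_name: {$exists: true}, part_id: {$exists: true}}).toArray();

// nested-loop join on part_id performed in application logic
var result = [];
rels.forEach(function(r) {
    parts.forEach(function(p) {
        if (p.part_id === r.part_id) {
            result.push({supplier_id: r.supplier_id, part_name: p.part_name});
        }
    });
});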

The joins are in fact implemented, just not by using a join operator on the database interface, but inside the application logic. This means that the database cannot optimize the execution, plus there are several queries coming from the application logic putting load on the database system.

And this opens up yet another trade-off: data duplication vs. application logic complexity. If the data is structured in such a way that joins are avoided (at the cost of duplication), then the application logic complexity will be reduced also (from algorithms implementing joins to algorithms issuing queries with selections/projections).

Of course, while the application logic complexity is reduced, the data management logic complexity increases, as it has to manage duplicate data consistently across the database.

Summary: Are joins required? Yes. Are joins implemented? Yes.

In my mind there is no question that joins are in general needed and actually implemented today, even if the database does not support a join operator directly and even if there are opinions that joins are not needed. I don’t really understand why there is a discussion about this in the first place as the need for a join is a function of the data schema, not the data model.

The fact that a relational database has the capability of joins does not mean you must use it. And the fact that a NoSQL database does not support joins at their query interface does not mean joins are not needed.

At the heart, an architecture and engineering decision has to be made (implicitly or explicitly) about how many joins are implemented through data duplication and how many joins are implemented through algorithms in the application logic layer (if there is no join operator available at the database query interface).

It’s that easy.


SQL: The Kingmaker of NoSQL Databases?

Here is an interesting question: what if all NoSQL and NewSQL databases implemented SQL?

Advanced SQL

“SQL” in this context does not refer to SQL-92 (http://en.wikipedia.org/wiki/SQL-92) or other versions of the purely relational-model-focused standard, but to a SQL that is properly extended in syntax and semantics to deal with non-relational data models. Let’s name it “Advanced SQL” for the purpose of this blog.

I furthermore assume that all data in a database would be accessible, not just parts of it. For example, in key/value stores the values’ internal structure would be accessible as long as it contains data accessible in a structured way (e.g., a JSON representation).

Ideally, of course, all NoSQL and NewSQL databases would implement the same new form of Advanced SQL, but for this blog this is not really necessary or a precondition (but it would be nice, though).

If such an Advanced SQL would be available on all NoSQL and NewSQL databases, then

  • the typical SQL operators like select, project, join, and so on are available on all data in a database across all data models like relational, document-oriented, key/value, columnar, graph-oriented.
  • the same data set loaded into each NoSQL or NewSQL database would be accessible through Advanced SQL and the same query would return the same result, no matter which database (best case assumption wrt. SQL execution semantics).

Data Models

In order to be able to load all data sets into all types of databases, the data models have to be equivalent in their structural expressiveness. This might not necessarily be the case today; however, the common trend appears to be that databases either support JSON or structured data containing array types and nested sub-structures alongside elementary types like string and integer.

For the sake of the blog, let’s assume the data models are equivalent in structural expressiveness.

Implication of Advanced SQL on NoSQL / NewSQL Databases

The biggest implication of Advanced SQL on NoSQL / NewSQL databases would be that all NoSQL / NewSQL databases would be equivalent from a functional perspective in terms of accessing data: a given data set could be stored in any of the databases and queried using SQL. A query would (ideally) return the same result in any of the databases storing the same data set.

The big question this raises is: what would distinguish the NoSQL/NewSQL databases in this case?

  • The non-functional aspects.

For example, to name a few:

  • Performance: the same query would take different execution time
  • Transactions: a set of queries would form a single transaction (or not)
  • Scale: the speed and extent a database can scale up/scale down with varying data set size
  • Backup/restore: how fast and how reliable can databases be re-created after disaster

This basically would mean that a user can select a database based on its non-functional characteristics. (Of course, history has shown that the particular SQL implementation of a database will have proprietary extensions also).

Benefits of Advanced SQL

There are several benefits if an Advanced SQL as described above were available.

  • A user can select not only one, but several databases for different needs, while not having to deal with vastly different query interfaces. For example, one for production system access speed and a different one for long-term storage or analytics.
  • Data can be distributed across different systems depending on how the data is used, without having to deal with incompatible access interfaces.
  • NoSQL and NewSQL databases can be integrated alongside existing RDBMSs for mutual benefit.

In a nutshell, users are no longer forced to select one database for all their needs, but can select the appropriate one for each particular part of their system architecture.

Trending of Non-functional Features

There is an interesting trend ongoing. OldSQL and NewSQL databases start supporting JSON (e.g., Postgres, VoltDB); Informix has a compatible driver interface to the MongoDB wire format; Tokutek and GridGain implement non-functional replacements for MongoDB; Oracle NoSQL supports per-document Avro schemas for the values; Tokutek adds transaction support; and so on. This looks like the early stages of ‘sorting out’ on a data model level as well as non-functional level. Time will tell if this was the starting point of consolidation in the NoSQL/NewSQL space.

King Making

So why, if the differentiation is likely to happen on the non-functional level, is SQL a potential kingmaker? In my opinion, Advanced SQL will be the standard expectation and a database not supporting Advanced SQL will be relegated to being a niche system. So while the presence of Advanced SQL will be a distinguishing factor (possibly a small one), the absence of Advanced SQL will impact the success of a database significantly.

Document Projection (Part 2): Definition

What does projection in the context of JSON structures or documents actually mean? What should the definition of “projection” be? Several possibilities are discussed next.

Document Projection: Complete Branch

Projection in JSON is projecting a branch of the JSON data structure tree, not projecting a single object or value. To request a projection, a property (projection) path in dot notation could be used (and actually is in many systems). The result of a projection is a valid JSON document containing the specified branch(es).

An example document is

{"a": {"b": {"c": 3, "d": 4, "e": 5}}}

Let’s go through a series of projections in the following.

  • Projection path: “a.b.c”
  • Result: {“a”: {“b”: {“c”: 3}}}
  • Projection path: “a.b”
  • Result: {“a”: {“b”: {“c”: 3, “d”: 4, “e”: 5}}}
  • Projection path: “a.e”
  • Result: {}

The result contains the full path of the projection (or more, but not less). If the requested projection path does not exist, the result is the empty document as none of its properties matches the projection path. The empty projection path “” is not defined, meaning, a projection must name at least one property, and that will be a top-level property in a document.

Several branches can be projected concurrently.

  • Projection paths: “a.b.c”, “a.b.d”
  • Result: {“a”: {“b”: {“c”: 3, “d”: 4}}}

The resulting document contains the combination of all of the branches that result in a valid projection. Redundant projection path specification is possible if one of the projection paths is a sub-path of another one. However, the result document is the same whether redundancy is present or absent.
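A minimal sketch of this complete-branch projection semantics for a single path, written in JavaScript; the function name is hypothetical, and combining several paths would amount to a deep merge of the single-path results:

// project one dot-notation path; returns the full branch, or {} if the path is absent
function projectPath(doc, path) {
    var props = path.split(".");
    var src = doc, result = {}, target = result;
    for (var i = 0; i < props.length; i++) {
        if (src === null || typeof src !== "object" || !(props[i] in src)) {
            return {};                         // path does not exist: empty document
        }
        if (i === props.length - 1) {
            target[props[i]] = src[props[i]];  // last property: copy the whole branch
        } else {
            target = target[props[i]] = {};    // rebuild the full path in the result
            src = src[props[i]];
        }
    }
    return result;
}

projectPath({"a": {"b": {"c": 3, "d": 4, "e": 5}}}, "a.b.c");  // {"a": {"b": {"c": 3}}}
projectPath({"a": {"b": {"c": 3, "d": 4, "e": 5}}}, "a.e");    // {}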

Document Projection: Partial Branch

It is possible that the whole projection path does not exist, but a part of it does. In this case, one possibility is to return the existing part of the path (MongoDB follows this approach). This results in partial paths whereby the value of the last property is the empty document.

For example, “a.b.f” would result in {“a”: {“b”: {}}}. “a” and “b” exist in the example document, “f”, however, does not.

While possibly useful in some cases, in my opinion this should not be the default or standard definition, as the returned result is incomplete; one could even argue that it is incorrect, since the value of “b” is not the empty document. (I could envision a configuration setting that provides these partial branches if needed.)

Document Projection: Value

Wait a minute, why does the result document have to consist of full paths?

The reason is based on the implicit restriction on JSON documents that there can be only one property of a given name on the same level in a document. “Implicit” because the JSON definition (http://json.org/) does not mandate the restriction, but many implementations do: property names on the same level of embedding have to be unique.

For example:

{"x": {"b": {"c": 3, "d": 4}}, 
 "y": {"e": {"c": 3, "d": 4}}}

is a perfectly valid document where the property names are unique on every level. So let’s get back to projection and let’s for a moment assume that projection actually returns the value at the end of the path, omitting the path to the property value itself. So,

  • Projection path: “x.b.c”
  • Result: {“c”: 3}

So far so good.

  • Projection paths: “x.b.c”, “y.e.c”
  • Result: ?

What should the result be? The basic assumption is that a projection on a document returns a document. But “x.b.c” and “y.e.c” each return {“c”: 3} as separate documents, not as one document.

  • One possible result could be an array with two documents. But arrays are in general not considered valid top level documents (again, the JSON definition would allow that).
  • To mitigate that, the array could be the value of a property: {“result”: [{"c": 3}, {"c": 3}]}. But this would conflict with a document that happens to have a property “result” of type array with two same documents in it.
  • Or, the two documents would have to be embedded in a third one with special names, like {“1”: {“c”: 3}, “2”: {“c”: 3}}. But then, the original document does not have the properties “1” or “2”.

Based on this discussion, having projection results be full paths is simpler and more direct.

Projection – Result Correspondence Principle

There is also another argument from the user’s viewpoint. If a user wants to project “x.b.c”, then the user might want to access the result document after the query returns with “x.b.c” as the access path. From this viewpoint, the path in the query and the path in the result document should match and not require access path transformation.

Array Projection: Complete Access Path

Documents may contain arrays as well as arrays of arrays, arrays of objects containing arrays, etc.; in principle, any of these composite types can appear on any level of the document. Projection therefore has to be defined on arrays also, not just documents.

The principle of projection paths is extended to include array index specification. For example, let’s consider this document:

{"a": [{"a1": 1}, {"a2": 2}], 
 "b": {"c": [{"c1": 3}, {"c2": 4}, {"c3": 5}]}, 
 "d": [6, 7]}

Let’s do a few projections (arrays are 0-index based):

  • Projection path: a[0]
  • Result: {“a”: [{"a1": 1}]}
  • Projection path: b.c[1]
  • Result: {“b”: {“c”: [{"c2": 4}]}}
  • Projection paths: a[1], b.c[2].c3
  • Result: {“a”: [{"a2": 2}], “b”: {“c”: [{"c3": 5}]}}
  • Projection path: a[7]
  • Result: {}

Like in the case of documents, full paths are requested and full paths are returned, with several paths possible. A projection path referring to a non-existing property will not contribute to the result.

So far, so good, except that the results do not yet conform to the “Projection – Result Correspondence” principle from above: the projection “a[1]” resulted in a correct document, but that result document cannot be accessed with “a[1]” to obtain the value.

Array Projection: Padding

In order to support the “Projection – Result Correspondence” principle, array results can be padded with the value “null”. For example:

  • Projection paths: a[1], b.c[2].c3
  • Result: {“a”: [null, {"a2": 2}], “b”: {“c”: [null, null, {"c3": 5}]}}

Now it is possible to access the result with “a[1]” or “b.c[2].c3” in order to obtain the proper results. From a user’s perspective this is great as again the paths used to specify the projection can be used to retrieve the values.
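A small sketch of the padding rule: a projected element is placed at its original index and shorter arrays are filled up with null; the function name is hypothetical:

// place a projected value at its original index, padding with null
function placeAt(arr, index, value) {
    while (arr.length < index) { arr.push(null); }
    arr[index] = value;
    return arr;
}

placeAt([], 1, {"a2": 2});  // [null, {"a2": 2}]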

Array Projection: Scalar Values

Scalar values in arrays do not pose a specific problem:

  • Projection paths: a[1], d[1], d[2]
  • Result: {“a”: [null, {"a2": 2}], “d”: [null, 7]}

And their access can be accomplished using the projection paths.

Summary

Initially I thought projection was a straightforward function and not worth a discussion in the context of document-oriented databases; but then it turned out to be not that clear-cut. Nevertheless, the above is a starting point for a strict rationalization of projection in document-oriented databases based on the JSON data model.

Document Projection (Part 1): MongoDB

This blog reviews some of the projection functionality that the MongoDB query interface provides. The emphasis is on comparing projection of embedded object properties with projection of embedded array elements. The two are not symmetric, as the examples will show, and that is rather unexpected.

Test Data Set

The initial test data set contains three documents:

{}
{"a": 1, "b": 2}
{"c": {"d": 3, "e": 4, "f": {"g": 5, "h": 6}}}

The empty document is the control document; the other two documents have properties, one of them with several levels of embedding. The test data set, contained in a file “td.txt”, is loaded into MongoDB as follows:

mongoimport -d projection -c proj td.txt

Projecting Document Properties

Let’s observe a few projection queries:

> db.proj.find({}, {_id: 0, a: 1})
{  }
{ "a" : 1 }
{  }

In general, to make the result less verbose, the property “_id” is suppressed. The query asked for property “a” and for each document the property “a” is returned (whereby two documents do not contain “a”, so the resulting documents are empty).

> db.proj.find({}, {_id: 0, "c.e": 1})
{  }
{  }
{ "c" : { "e" : 4 } }

Reaching into documents is done by using the dot notation. “c.e” asks for the property “e” in “c”. The result contains the whole document structure starting at “c” all the way to “e”. Alternatively MongoDB could have returned {“e”: 4} only, but then the result would not correspond to the dot notation in the query.

> db.proj.find({}, {_id: 0, "c.f": 1})
{  }
{  }
{ "c" : { "f" : { "g" : 5, "h" : 6 } } }

No surprise here as the result contains the property “c.f”.

> db.proj.find({}, {_id: 0, "c.f.h": 1})
{  }
{  }
{ "c" : { "f" : { "h" : 6 } } }

Reaching further into the document works as expected also.

> db.proj.find({}, {_id: 0, a: 1, c: 1})
{  }
{ "a" : 1 }
{ "c" : { "d" : 3, "e" : 4, "f" : { "g" : 5, "h" : 6 } } }

Asking for different properties returns all of those for each document.

> db.proj.find({}, {_id:0, "c.e":1, "c.f.h":1})
{  }
{  }
{ "c" : { "e" : 4, "f" : { "h" : 6 } } }

Asking for different properties in the same embedded document returns a combination of the properties, not separate properties. This makes sense as in each document a property with a given name can only appear once.

> db.proj.find({}, {_id: 0, "c.g": 1})
{  }
{  }
{ "c" : {  } }

This query asks for an embedded property that does not exist. However, the properties on the path to that missing property are included in the result. This is surprising to me, as I would expect that if a property does not exist, no property is included in the result at all, especially not a partial property, so to speak. In this case, the path in the query “c.g” does not match any path in the result.

Projecting Array Element Properties

First of all, another document is added to the test data set; it contains arrays and a nested document that in turn contains an array:

{"x": [{"y": 10}, {"w": [{"z": 11}, {"v": 12}]}]}

So a total of 4 documents are in the test data set now.

Projection of arrays is done using the “$slice” operator described here: http://docs.mongodb.org/manual/reference/operator/projection/slice/ (in addition to predicate-driven selection/projection using “$” or “$elemMatch”). Let’s try.

> db.proj.find({}, {_id: 0, "x": {$slice: 1}})
{  }
{ "a" : 1, "b" : 2 }
{ "c" : { "d" : 3, "e" : 4, "f" : { "g" : 5, "h" : 6 } } }
{ "x" : [ { "y" : 10 } ] }

This asks for one element from the array “x”, and indeed, the first element is returned. However, all properties of every document not containing “x” are returned also. This is surprising and in contrast to the behavior when projecting properties of embedded documents.

> db.proj.find({x: {$exists: true}}, {_id: 0, "x": {$slice: 1}})
{ "x" : [ { "y" : 10 } ] }

The “$exists” operator limits the documents to only those that contain “x”. This is still not the same semantics as in the document property projection, but closer.

> db.proj.find({x: {$exists: true}}, {_id: 0, "x": {$slice: [1, 1]}})
{ "x" : [ { "w" : [ { "z" : 11 }, { "v" : 12 } ] } ] }

This selects the second array element (the first “1” indicates the number of elements to skip, and the second “1” indicates how many elements to select).

In my opinion this is different from projecting the second property of a document: when projecting the second property, it would be important to see in the result that the second property was projected. This is definitely debatable, but in analogy to projecting embedded document properties, the result would have to reflect the query.

Let’s try to select the second element of “w”. This requires reaching into the array on the first level.

> db.proj.find({x: {$exists: true}}, {_id: 0, "x.1.w": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] } ] }

The approach using dot notation fails. The query does not honor “x.1.w”, which specifies: project from “w”, the second array element “x.1”. However, the interface does not give an error either, as it probably should.

> db.proj.find({x: {$exists: true}}, {_id: 0, "x.w": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "v" : 12 } ] } ] }

This works; MongoDB seems to interpret this correctly automatically. However, the first array element of “x” is returned also, again in contrast to the approach for document properties, where properties not on the path are not returned.

But what if “x” contained an additional array element with property “w”? Let’s add this document:

{"x": [{"y": 10}, {"w": [{"z": 11}, {"v": 12}]}, {"w": [{"z": 13}, {"v": 14}]}]}

Now 5 documents are in the test data set.

> db.proj.find({x:{$exists: true}}, {_id: 0, "x.w": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "v" : 12 } ] } ] }
{ "x" : [ { "y" : 10 }, { "w" : [ { "v" : 12 } ] }, { "w" : [ { "v" : 14 } ] } ] }

Turns out, MongoDB selects all documents that contain a property “w”, and from those the second element. This looks reasonable.

But how is the second element of the first “w” selected? I think multi-level projection in the context of arrays is not possible at this point: https://jira.mongodb.org/browse/SERVER-831.

> db.proj.find({x:{$exists: true}}, {_id: 0, "x.q": {$slice: [1,1]}})
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] } ] }
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] }, { "w" : [ { "z" : 13}, { "v" : 14 } ] } ] }
{ "x" : [ { "y" : 10 }, { "w" : [ { "z" : 11 }, { "v" : 12 } ] }, { "a" : 20 } ] }

Asking for a non-existing property returns the complete array, again in contrast to the analogous query on embedded documents.

I’ll stop examining array element projection here, as there is clearly limited support for it at this point in time.

Summary

Surprisingly, at least for me, MongoDB does not follow the same design rules when projecting properties from embedded documents as when projecting elements from embedded arrays. Not only is the behavior different

  1. Properties of documents not containing the requested array projection are returned
  2. Properties not on the path to the projected element are returned

but also multi-level projection does not work in a straightforward way: it is not possible to use dot notation to reach into nested arrays for projection.

Querying for non-existing properties results in partial or incorrect results, in my opinion. Of course, there are different viewpoints possible on this behavior, and for sure it warrants further discussion.

Schema-free Database (Part 2): Relational Database Management System (RDBMS)

As outlined in Part 1 of this series (http://realprogrammer.wordpress.com/2013/11/02/schema-free-database-part-1-an-oxymoron/), a ‘schema-free database’ is an oxymoron and in fact the notion of schema is changing from a more restrictive to a more flexible interpretation in context of NoSQL database technology.

So it is only consequential to ask the question the other way around (as a thought experiment): is it possible to build a relational database management system that does not enforce a schema, and if so, what would such a system look like on an abstract level?

Yes, it is possible to have a non-schema-enforcing RDBMS. Let’s discuss two variations next.

Definition of No-Schema-Enforcing Relational Database Management System

What functionality would be altered in order to provide a no-schema-enforcing RDBMS? If it were possible to create a table without specifying columns (aka, only a table name), and then to insert, update and delete rows, then a ‘schema-free’ RDBMS would be in place. This would mean in detail:

  • Rows do not have to comply with a schema when inserted into a table. Different rows in the same table could have different attributes (columns), and the same attributes of different rows could have different domains (flexible type system).
  • By defining a table without specifying columns (names and domains), a table would not define a primary key, either (neither a simple nor a composite key). Applications inserting or updating rows can behave nicely and add properties with values that comply with primary key semantics, but the RDBMS would not be aware of it and consequently would not enforce primary key compliance.
  • By the same token, foreign keys would not be enforced by the RDBMS for the same reasons.
  • Since no primary key enforcement is in place, duplicate rows will not be prevented by the RDBMS and any supervision is left to the application systems.
  • Indexes are independent of schema specification; assuming indexes are maintained on tables, rows that do not contain the attributes defined by an index will not be present in that index.

Surprisingly (or not), defining a no-schema-enforcing RDBMS is pretty straightforward.

Variation on No-Schema-Enforcing RDBMS

An interesting variation of a no-schema-enforcing RDBMS could be that a schema, primary keys, foreign keys, etc., are specified as usual, however without being actively enforced; instead, warnings are given by the RDBMS. For example, a row not complying with the schema can actually be inserted, but the result would not be an ‘OK’, but a warning indicating a schema violation.

This can be described as a ‘middle ground’ in widening the schema interpretation, where the RDBMS is aware of a schema and warns of violations without rejecting the various DML operations.
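As a thought-experiment sketch (in JavaScript rather than inside an actual RDBMS kernel), the warning variation could behave as follows; all names are hypothetical:

// a 'table' created without columns accepts rows of any shape
var table = {schema: null, rows: []};

function insertRow(t, row) {
    var warnings = [];
    if (t.schema) {                          // a schema is known: check it, warn, insert anyway
        Object.keys(t.schema).forEach(function(col) {
            if (!(col in row)) {
                warnings.push("schema violation: missing column '" + col + "'");
            }
        });
    }
    t.rows.push(row);                        // the row is stored regardless
    return {status: "OK", warnings: warnings};
}

insertRow(table, {a: 1});                    // no schema defined: plain OK
table.schema = {a: "integer", b: "string"};
insertRow(table, {a: 2});                    // OK, with a warning about the missing column b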

Characterization of No-Schema-Enforcing RDBMS

Could a no-schema-enforcing RDBMS (in any of these variations) be a useful database management system? Yes, as it would be the equivalent, on the relational model, of NoSQL databases on the JSON/BSON or key/value model.

For use cases where the flexible schema interpretation is key, such a no-schema-enforcing RDBMS could fit the bill, possibly better than a NoSQL database system, if the use case is fundamentally relational in nature (as opposed to, e.g., hierarchical or key/value) and if SQL as the query language is important.

Further Exploration

There are additional areas in an RDBMS that would have to change their behavior in a no-schema-enforcing implementation. Briefly (and not exhaustively), these are:

  • Triggers. Triggers are specified on tables and state changes of rows. If particular attributes are referenced inside the trigger, then not every update, insert, read or delete will execute the trigger logic.
  • Stored procedures. Stored procedures often have parameters of specific types and assume a specific set of attributes when processing rows. In a no-schema-enforcing situation the stored procedure has to be able to deal with variations of rows.
  • Functions and function extensions. Functions have to be changed similarly to stored procedures. Not only from the viewpoint of parameters, but also the processing logic.
  • Aggregation. Aggregation will have to change in various ways as the various aggregation functions cannot assume that all attributes are of the same type. Neither can they assume that all attributes are actually present in all rows of a table.

In principle, every concept and every implementation aspect of an RDBMS needs to be re-examined wrt. a wider and more flexible interpretation of ‘schema’. NoSQL systems, by their definition and approach, started with a wider interpretation and consequently made all the conceptual and implementation decisions accordingly. They are one source of guidance in this regard.

Contact Me

If you plan to explore or to build a no-schema-enforcing RDBMS, please contact me.