Document-oriented NoSQL Databases: Oracle NoSQL’s Shift in March 2014

I believe Oracle NoSQL expanded from a key/value store to a document-oriented NoSQL database in March 2014. If you concur, then this was a major shift in the document-oriented NoSQL arena.

Document-Oriented NoSQL Databases

In a narrow interpretation, document-oriented NoSQL databases use JSON or BSON as the query interface data representation. In a wider interpretation, any scalar data type and any composite data type (like maps or arrays) is available at the query interface, without necessarily complying with the JSON encoding or interpretation.

Some of the well-known document-oriented databases are MongoDB, CouchDB and RethinkDB, amongst many others. The site http://db-engines.com/en/ranking/document+store has a categorization and ranking readily available.

In addition to storing documents (aka, complex data structures with scalar, array and map data types), another important criterion is the ability to query based on document content, e.g., by selecting documents where specific properties have specific values or, in the general case, satisfy complex predicates.

Key/Value Stores

Key/value stores (k/v stores) are different from document-oriented databases. Key/value stores persist a value for a given key. Retrieval is based on the key and values cannot be used in query predicates. Some k/v stores have structured keys or composite keys that support querying a range or set of values.

Oracle NoSQL, for example, supports structured and composite keys (http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/singleget.html, http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/multiget.html).
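
To make the key-based access concrete, here is a minimal sketch in plain Python (an illustration only, not the actual Oracle NoSQL API): keys consist of a major and a minor path, a single get retrieves one value by its full key, and a multi get retrieves all values sharing a major path.

# Minimal illustration of structured keys in a k/v store (plain Python,
# not the Oracle NoSQL API): a key has a major path and a minor path.
store = {
    ("user/alice", "profile"): {"age": 31},
    ("user/alice", "settings"): {"theme": "dark"},
    ("user/bob", "profile"): {"age": 45},
}

# Single get: the full key (major + minor path) must be known.
print(store[("user/alice", "profile")])

# Multi get: retrieve all values that share the major path "user/alice".
alice_records = {k: v for k, v in store.items() if k[0] == "user/alice"}
print(alice_records)

# Note: retrieval is driven by the key; the values themselves cannot be
# used in query predicates.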

Aerospike is another example of a k/v store (https://docs.aerospike.com/display/V3/Key-Value+Store+Guide).

K/V Stores: Major Shift

Oracle NoSQL extended its data model from a pure k/v data model to a document-oriented data model in March 2014.

The available simple data types are (cited from http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/tablesapi.html):

  • Double
  • Float
  • Integer
  • Long
  • Java byte array
  • Java String

And the available complex data types are (cited from http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/tablesapi.html):

  • Array (An array of values, all of the same type)
  • Enum (An enumeration, represented as an array of strings)
  • Fixed Binary (A fixed-sized binary type used to handle binary data where each record is the same size. It uses less storage than an unrestricted binary field, which requires the length to be stored with the data)
  • Map (An unordered map type where all entries are constrained by a single type)
  • Records

In addition, indexes can be created on the ‘value’ part, aka, the documents: http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/createindex.html. And based on indexes, the database can be queried without having to specify a key: http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/indexread.html.
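
To illustrate what such an index adds, here is a minimal sketch in plain Python (an illustration only, not the Oracle NoSQL table API): a secondary index over a value property allows documents to be retrieved without specifying their key.

# Sketch of a secondary index over the 'value' part of a k/v store
# (plain Python illustration, not the Oracle NoSQL table API).
store = {
    "emp:1": {"name": "Jane", "dept": "engineering"},
    "emp:2": {"name": "Joe", "dept": "sales"},
    "emp:3": {"name": "Jim", "dept": "engineering"},
}

# Build an index on the 'dept' property of the values.
dept_index = {}
for key, doc in store.items():
    dept_index.setdefault(doc["dept"], []).append(key)

# Query by document content without specifying a primary key.
engineering_docs = [store[k] for k in dept_index.get("engineering", [])]
print(engineering_docs)   # documents of Jane and Jim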

The latter functionality extends the Oracle NoSQL database from a k/v store into a document-oriented NoSQL database. Clearly, secondary indexes are not yet a very expressive query language; however, they are a major first step towards one.

What does it mean?

The interesting aspect is that Oracle, as a major database vendor, moved with a separate and dedicated NoSQL product first into the k/v store space, and now additionally into the document-oriented NoSQL space. There are many ways to read this move, of course, but from a technical viewpoint this emphasizes the importance of a structured data model in the NoSQL space. In addition, Oracle NoSQL allows a per-document Avro schema, making it a truly per-document schema database.

Aerospike, for example, also supports complex data types and secondary indexes, even though it is also widely seen as a k/v store (e.g., http://db-engines.com/en/ranking/key-value+store).

Summary

Clearly, the document-oriented segment of NoSQL databases is getting more and more crowded and is attracting the attention of major players in the database space.

What hasn’t been done yet is a comparison of the data model and query interface expressiveness of the major players in the document-oriented NoSQL database segment. Once done, and in case it reveals that the expressiveness is the same or almost the same, the systems are interchangeable from a functionality viewpoint and will distinguish themselves ‘only’ in terms of non-functional properties like latency, throughput, transactions, cost, support, and so on.

Document-oriented NoSQL Databases: How many Joins will you have to implement?

One of the continuously debated items in the context of NoSQL databases is the join operation, and many variations on the topic of joins can be found at various levels of technical depth.

So, do we need joins in the context of NoSQL databases? Do we need joins implemented by NoSQL databases? Are joins outdated concepts that we can live without in the context of NoSQL databases? In this blog I try to rationalize the overarching question in principle. Some fact finding first:

(Database) Data Models and Database Management Systems

Data models, like the relational model, the document-model, the hierarchical model, key-value model, graph model, object-oriented model, XML model, etc., are implementations of data structures in a given database management system. Data models define the possible data types and the rules for constructing more complex types from them.

For example, the implementation of a relational model might restrict values in tables to be scalar. Another implementation might allow a table as a value, supporting NF2 relations. One system might support the document-model strictly following JSON, while others add data types beyond what JSON defines. Some systems support the notion of references, others do not. Each database implements a data model in whatever variation it chooses.

Schemata and Database Management Systems

A schema is a particular representation of a domain model, implemented in the context of a data model. For example, a domain model might consist of suppliers, parts and their relationship. This can be implemented in a relational model, a document model, a graph model or any other supported data model.

There is no ‘best’ way of defining a schema. For the same domain, different schemata can be defined depending on the skill of the creator, the knowledge of query access patterns, the amount of restrictions that should be supervised by the database management system, and other factors.

For example, in a document model, suppliers, parts and their relationships can be modeled as three separate documents, or in two documents (suppliers and their relationship to parts), or one document – and there are many more variations possible, of course.

Joins and Database Management Systems

Some database management systems implement the join operation in their query interface, some do not. For example, Oracle, MySQL and FoundationDB implement joins; MongoDB, Oracle NoSQL and Aerospike do not. So joins are not necessarily restricted to the relational data model.

Joins and Data Access Paths

With the fact finding under our belt, how many joins will you have to implement? In principle, this is a function of the required data access based on a specific schema. Different schemata of the same domain will require a different number of joins.

Let’s look at a few examples in the supplier – parts domain.

Example 1: No join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts":[
     {"part_name": "part_lowQual"}, 
     {"part_name": "part_hiQual"}]
}

The query: “find the names of all parts for a supplier” does not require a join as the data is already structured so that each supplier contains the set of all parts it supplies.
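
As a minimal sketch (plain Python over an in-memory document list, not any particular database API), the query reduces to a selection with a projection:

# Example 1 as a selection in application code: no join needed, because
# each supplier document already embeds its parts.
suppliers = [
    {"supplier": "superQuality",
     "parts": [{"part_name": "part_lowQual"},
               {"part_name": "part_hiQual"}]}
]

part_names = [p["part_name"]
              for s in suppliers if s["supplier"] == "superQuality"
              for p in s["parts"]]
print(part_names)   # ['part_lowQual', 'part_hiQual']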

Example 2: One join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts": [1, 2]
}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}

The query: “find the identifiers and names of all parts for a supplier” requires a join as a supplier only has the identifiers of the parts it ships, not their names.
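
Sketched again in plain Python over in-memory collections (an illustration of a hash-based application-side join, not any particular database API):

# Example 2: one join between suppliers and parts on part_id.
suppliers = [{"supplier": "superQuality", "parts": [1, 2]}]
parts = [{"part_name": "part_lowQual", "part_id": 1},
         {"part_name": "part_hiQual", "part_id": 2}]

# Hash-based join: build a lookup table on the join key first.
parts_by_id = {p["part_id"]: p for p in parts}

result = [(pid, parts_by_id[pid]["part_name"])
          for s in suppliers if s["supplier"] == "superQuality"
          for pid in s["parts"]]
print(result)   # [(1, 'part_lowQual'), (2, 'part_hiQual')]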

Example 3: Two joins required

The documents are structured like this:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}

The query: “find the identifiers and names of all parts for a supplier” requires two joins, one to find the objects for a supplier that relate the part identifier to the supplier identifier, and a second one to find the corresponding parts.
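
The same kind of sketch for the two-join case, again in plain Python over hypothetical in-memory collections:

# Example 3: two joins -- supplier -> relationship documents -> parts.
suppliers = [{"supplier": "superQuality", "supplier_id": "S_55"}]
parts = [{"part_name": "part_lowQual", "part_id": 1},
         {"part_name": "part_hiQual", "part_id": 2}]
supplies = [{"part_id": 1, "supplier_id": "S_55"}]

parts_by_id = {p["part_id"]: p for p in parts}

result = []
for s in suppliers:
    if s["supplier"] != "superQuality":
        continue
    # Join 1: supplier_id -> relationship documents.
    for rel in supplies:
        if rel["supplier_id"] == s["supplier_id"]:
            # Join 2: part_id -> part document.
            part = parts_by_id[rel["part_id"]]
            result.append((part["part_id"], part["part_name"]))
print(result)   # [(1, 'part_lowQual')]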

Analysis of Examples

The examples have shown empirically that the need for joins is not a function of the data model (document-oriented in this case), but a function of the data access, aka, the number of required data relationship traversals in context of a given schema. If the relationship to be traversed matches the way the data is structured as in Example 1, no join is necessary. As soon as the data is structured differently from the required traversal by the query, joins are necessary (Example 2 and 3).

So, as a summary, it is conceptually easy to avoid joins: if, and only if, you can structure your data (aka, build your schema) in such a way that it conforms structurally to the queries, then you can avoid joins completely (Example 1). I am certain that there are special cases out there for which you can accomplish that, but in general this is not possible. And even if it is possible in production, as soon as analysts start analyzing the data sets, they will most likely query along different access paths.

Joins at Query Time vs. Joins at Insert/Update/Delete Time

The above examples clarified that joins are a function of the data access paths. Can joins at query time be avoided entirely by creating data access paths in a certain way?

Yes, it is possible; however, this is a basic trade-off between query time and data manipulation time: the computational effort is reduced at query time and instead increased during insert/update/delete operations. In principle, joins at query time can be avoided if for each access path there is an equivalent data structure in place.

Example 4: Schema refactoring

The documents in this example look like:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}
{"shipper": "fastShipper", "shipper_id": "SH_01"}
{"part_id": 2, "shipper_id": "SH_01"}

Suppliers supply parts; shippers, however, do not ship just any part, but only specific parts (maybe for safety reasons). Several queries can be run against this document set:

  • Find all parts supplied by a supplier with a given name
  • Find all parts shipped by a shipper with a given name
  • Find all suppliers and shippers for a part with a given name

Each of these queries requires at least one join. The documents can be restructured easily to avoid joins altogether:

{"supplier": "superQuality", "supplier_id": "S_55",
 "parts": [
     {"part_name": "part_lowQual", "part_id": 1}
]}
{"shipper": "fastShipper", "shipper_id": "SH_01",
 "parts": [
     {"part_name": "part_hiQual", "part_id": 2}
]}
{"part_name": "part_lowQual", "part_id": 1,
 "suppliers": [
     {"supplier": "superQuality", "supplier_id": "S_55"}
 ], 
 "shippers": []}
{"part_name": "part_hiQual", "part_id": 2,
 "suppliers": [],
 "shippers": [
     {"shipper": "fastShipper", "shipper_id": "SH_01"}
]}

The idea is clear: structure the data in such a way that a query can be satisfied with a simple selection. And the consequence is clear, too: data is duplicated, possibly many times. This means that an insert, update or delete has to know all the locations where the data has to be modified and has to modify the data consistently (and ideally within a single transaction).
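
As a short sketch of this write-side cost (plain Python over the hypothetical document structures of Example 4), renaming a part has to touch every document that embeds a copy of it:

# Write-side cost of pre-joining: renaming part 1 must update every
# document that carries a duplicated copy of the part.
part_docs = [{"part_name": "part_lowQual", "part_id": 1,
              "suppliers": [{"supplier": "superQuality", "supplier_id": "S_55"}],
              "shippers": []}]
supplier_docs = [{"supplier": "superQuality", "supplier_id": "S_55",
                  "parts": [{"part_name": "part_lowQual", "part_id": 1}]}]

def rename_part(part_id, new_name):
    # Ideally all of these modifications happen within a single transaction.
    for doc in part_docs:
        if doc["part_id"] == part_id:
            doc["part_name"] = new_name
    for doc in supplier_docs:
        for part in doc["parts"]:
            if part["part_id"] == part_id:
                part["part_name"] = new_name

rename_part(1, "part_mediumQual")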

As a side note, this is the situation that normalization tries to address by ensuring that each data item is stored only once in the database.

Of course, data duplication will have an impact on main memory and disk space requirements. While there is a change in algorithmic complexity, there is also a change in storage and memory size requirements.

Pre-Joining Data

Pre-joining data makes it possible to avoid joins at query time at the cost of duplicating data at data management time. Alternatively expressed, implementing duplication at data management time is the cost of avoiding normalization combined with query-time joins.

Is there a way to quantify the effort? In principle, as many duplications are necessary as there are joins to be avoided. This is a rough estimate, as many joins are the same except for their selection and/or projection specifications. If all joins are abstracted to their join criteria (omitting projection and selection), then the number of distinct join criteria roughly corresponds to the amount of duplication required.
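
As a rough back-of-the-envelope illustration (plain Python with a hypothetical join workload), abstracting joins to their join criteria and counting the distinct criteria gives an approximation of the required duplications:

# Rough estimate: the number of distinct join criteria approximates the
# number of duplications needed to avoid joins at query time.
joins_in_workload = [
    ("supplier.supplier_id", "supplies.supplier_id"),
    ("supplies.part_id", "part.part_id"),
    ("supplier.supplier_id", "supplies.supplier_id"),  # same criteria, different projection
]
distinct_join_criteria = set(joins_in_workload)
print(len(distinct_join_criteria))   # 2 -> roughly two duplications required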

The article written by Sarah Mei shows the trade-off between data duplication and joins: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/. She clearly describes many of the issues in the context of a specific use case.

“Wait a minute, I don’t have joins and it works anyway!”

But where are the joins? NoSQL databases that do not implement the join operator in their query interface are in use and in production.

If not expressed as a query, joins are found either in the application system logic or the interface logic, depending on the design. Most likely these are nested-loop joins, hash-based joins (less likely), or a series of selections with the application logic combining the intermediate query results into the final result data set.

And they are not joins on the complete data set either, but usually have some selection criteria. So the application system logic roughly corresponds to the optimized operator tree of a database query sub-system, and in actuality there might be many joins implemented this way throughout the application logic.

The joins are in fact implemented, just not by using a join operator at the database interface, but inside the application logic. This means that the database cannot optimize the execution, and the several queries coming from the application logic put additional load on the database system.
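
A sketch of what such application-side logic typically looks like (plain Python; the query function below is a stand-in for whatever selection call a database driver offers, not a real API): a nested-loop join that issues one additional selection per identifier, which is exactly the extra load mentioned above.

# Application-side nested-loop join against a database that only offers
# selections. 'query' is a placeholder for the driver's selection call.
def query(collection, predicate):
    # Stand-in for a database round trip: a simple selection.
    return [doc for doc in collection if predicate(doc)]

suppliers = [{"supplier": "superQuality", "parts": [1, 2]}]
parts = [{"part_name": "part_lowQual", "part_id": 1},
         {"part_name": "part_hiQual", "part_id": 2}]

result = []
for s in query(suppliers, lambda d: d["supplier"] == "superQuality"):
    for pid in s["parts"]:
        # One additional selection per part identifier: the join runs
        # in the application logic, not in the database.
        for p in query(parts, lambda d, pid=pid: d["part_id"] == pid):
            result.append((pid, p["part_name"]))
print(result)   # [(1, 'part_lowQual'), (2, 'part_hiQual')]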

And this opens up yet another trade-off: data duplication vs. application logic complexity. If the data is structured in such a way that joins are avoided (at the cost of duplication), then the application logic complexity will be reduced also (from algorithms implementing joins to algorithms issuing queries with selections/projections).

Of course, while the application logic complexity is reduced, the data management logic complexity increases, as it has to manage duplicate data consistently across the database.

Summary: Are joins required? Yes. Are joins implemented? Yes.

In my mind there is no question that joins are in general needed and actually implemented today, even if the database does not support a join operator directly and even if there are opinions that joins are not needed. I don’t really understand why there is a discussion about this in the first place as the need for a join is a function of the data schema, not the data model.

The fact that a relational database has the capability of joins does not mean you must use it. And the fact that a NoSQL database does not support joins at its query interface does not mean joins are not needed.

At the heart, an architecture and engineering decision has to be made (implicitly or explicitly) about how many joins are implemented through data duplication and how many joins are implemented through algorithms in the application logic layer (if there is no join operator available at the database query interface).

It’s that easy.

 

Relational Data in a Document-oriented NoSQL Database (Part 4): Part-Of Relationship

The part-of relationship is an interesting relationship in the context of document-oriented data models and requires a separate, albeit short, discussion.

Part-Of Relationship

The part-of relationship is between two objects. There is the owning object (owner), and the object that is part of the owner (part). A part is tightly linked to the owner, meaning that the part cannot exist on its own, but only while the owner is in place at the same time. An owner can have many parts, including zero. From an object model perspective this means that

  • a part can only be created and related to an object (owner) if that object exists (either it already exists or it comes into existence simultaneously with the part)
  • a part can be deleted without the owner having to be deleted
  • if an owner is deleted, all its parts will be deleted, too (parts cannot exist on their own)

The relationship between a part and its owner is therefore not symmetrical. A part cannot exist on its own, and therefore this asymmetric relationship must be properly enforced in order to ensure data consistency.

(Of course, a variant is possible where the part can exist on its own. In this case the part has an identity of its own. On top of that, if the part can only be part of one owner, there must be a reservation model that reserves/releases the part so that it cannot be used twice. This variant is often also modeled as a role: ‘an object plays the role of a part in another object’. But that variant is not special in this context, so it will not be discussed further.)

Relational Model

If a part is scalar then it is usually represented as a separate column in the row representing the owner. If a part is represented in a separate table (or set of tables) then the row that represents the owner has a foreign key referring to the part. Representing parts in their own tables is usually necessary when the part is complex and cannot be represented as a scalar value.

Removing a scalar part is done by setting it to ‘null’ (or some other specifically called out value representing absence). If the owner is deleted, the scalar parts are automatically deleted as part of the same row.

If a part is a relation (or several relations) then the owner refers to the part through a foreign key. However, if the owner is deleted the system has to ensure that its part(s) will be deleted also. If a relational database management system does not have the functionality to supervise and enforce this (cascading deletion) automatically, the application systems have to ensure this data consistency.

Document-oriented Model

In a document-oriented model complex parts can be implemented analogously to the relational approach. A part can be a separate document from the owner document and the owner refers to the part (by document identifier or other unique identifier).

However, in the document-model it is also possible to add the part (no matter how complex it is) as a property or collection inside the document representing the owner. In this case, if the owner is removed, the part will be removed with it automatically. The supervision and enforcement is therefore guaranteed.
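
A short sketch in plain Python (hypothetical collections, not a specific database API) contrasts the two options: with embedded parts, deleting the owner removes the parts automatically; with referenced parts, the application has to cascade the delete itself.

# Embedded parts: deleting the owner document removes its parts with it.
owners_embedded = {
    "order:1": {"customer": "C_7",
                "items": [{"sku": "A"}, {"sku": "B"}]}  # parts live inside
}
del owners_embedded["order:1"]          # items are gone automatically

# Referenced parts: the application must cascade the delete itself.
owners = {"order:1": {"customer": "C_7", "item_ids": ["item:1", "item:2"]}}
items = {"item:1": {"sku": "A"}, "item:2": {"sku": "B"}}

def delete_owner(owner_id):
    for item_id in owners[owner_id]["item_ids"]:
        items.pop(item_id, None)        # cascade done in application logic
    del owners[owner_id]

delete_owner("order:1")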

Summary

While it is possible to implement the part-of relationship similarly in the relational and document-oriented model, the document-model also provides the option to have complex parts inside the document representing the owner. From a modeling perspective and from a design perspective this is a great option to have.

From a processing perspective this means that accessing the owner automatically provides all the parts. This can be really useful if the parts are needed whenever the owner is accessed. However, if the parts are always needed separately, then making them part of the owner document might cause too much data movement upon access. In this case the alternative modeling approach of keeping the parts and the owners separate is more advantageous.