Document-oriented NoSQL Databases: Oracle NoSQL’s Shift in March 2014

I believe Oracle NoSQL extended and expanded from a key/value store to a document-oriented NoSQL database in March 2014. If you concur, then this was a major shift in the NoSQL arena of document-oriented databases.

Document-Oriented NoSQL Databases

In a narrow interpretation, document-oriented NoSQL databases use JSON or BSON as the query interface data representation. In a wider interpretation, any scalar data type and any composite data type (like maps or array) is available at the query interface without those necessarily complying to the JSON encoding or interpretation.

Some of the known document-oriented database are MongoDB, CouchDB and RethinkDB, amongst many others. The site http://db-engines.com/en/ranking/document+store has a categorization and ranking readily available.

In addition to storing documents (aka, complex data structures with scalar, array and map data types), another important criteria is the ability to query based on the document content by e.g. selecting documents where specific properties have specific values or comply to specific complex predicates (in the general case).

Key/Value Stores

Key/value stores (k/v stores) are different from document-oriented databases. Key/value stores persist a value for a given key. Retrieval is based on the key and values cannot be used in query predicates. Some k/v stores have structured keys or composite keys that support querying a range or set of values.

Oracle NoSQL, for example, is supporting structured and composite keys (http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/singleget.html, http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/multiget.html).

Aerospike is another example of a k/v store (https://docs.aerospike.com/display/V3/Key-Value+Store+Guide).

K/V Stores: Major Shift

Oracle NoSQL extended its data model from a pure k/v data model to a document-oriented data model in March 2014.

The available simple data types are (cited from http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/tablesapi.html)

  • Double
  • Float
  • Integer
  • Long
  • Java byte array
  • Java String

And the available complex data types are (cited from http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/tablesapi.html)

  • Array (An array of values, all of the same type)
  • Enum (An enumeration, represented as an array of strings)
  • Fixed Binary (A fixed-sized binary type used to handle binary data where each record is the same size. It uses less storage than an unrestricted binary field, which requires the length to be stored with the data)
  • Map (An unordered map type where all entries are constrained by a single type)
  • Records

In addition, indexes can be created on the ‘value’ part, aka, the documents: http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/createindex.html. And based on indexes, the database can be queried without having to specify a key: http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuideTables/indexread.html.

The latter functionality extends the characteristics of the Oracle NoSQL database from k/v store into a document-oriented NoSQL database. Clearly secondary indexes are not yet a very expressive query language, however, they are a major first step towards it.

What does it mean?

The interesting aspect is that Oracle as a major database vendor moved with a separate and dedicated NoSQL product first into the k/v store space, and now additionally into the document-oriented NoSQL space. There are many ways to read this move, of course, but from a technical viewpoint this emphasizes the importance of a structured data model in the NoSQL space. In addition, OracleNoSQL allows a per-document Avro schema, making it a truly per-document schema database.

Aerospike, for example, also supports complex data types and secondary indexes, even though it is also widely seen as a k/v store (e.g., http://db-engines.com/en/ranking/key-value+store).

Summary

Clearly, the document-oriented segment of NoSQL databases gets more and more crowded and the attention of major players in the database space.

What hasn’t been done is a comparison of the data model and query interface expressiveness of the major players in the document-oriented NoSQL database segment. Once done, and in case that reveals that the expressiveness is the same or almost the same, then the systems are exchangeable from a functionality viewpoint and will distinguish themselves ‘only’ in terms of non-functional properties like latency, throughput, transactions, cost, support, and so on.

Advertisement

Document-oriented NoSQL Databases: How many Joins will you have to implement?

One of the continuously debated items in context of NoSQL databases is the join operation. Let’s listen in a bit:

and there can be many more variations found on the topic of joins on various levels of technical depth.

So, do we need joins in context of NoSQL databases? Do we do joins implemented by NoSQL databases? Are joins outdated concepts that we can live without in context of NoSQL databases? In this blog I try to rationalize the overarching question in principle. Some fact finding first:

(Database) Data Models and Database Management Systems

Data models, like the relational model, the document-model, the hierarchical model, key-value model, graph model, object-oriented model, XML model, etc., are implementations of data structures in a given database management system. Data models define possible data types and their construction rules for more complex types.

For example, the implementation of a relational model might restrict values in tables to be scalar. Another implementation might allow a table as a value, supporting NF2 relations. One system might support the document-model strictly following the JSON model, while others add additional data types in addition to what JSON defines. Some systems do support the notion of references, other so not. Each database implements a data model in any variation it likes to.

Schemata and Database Management Systems

A schema is a particular extension of a domain model, implemented in context of a data model. For example, a domain model might be suppliers, parts and their relationship. This can be implemented in a relational model, a document model or a graph model or any other supported data model.

There is no ‘best’ way of definition a schema. For the same domain, different schemata can be defined depending on the skill of the creator, the knowledge of query access patterns, the amount of restrictions that should be supervised by the database management system and other factors.

For example, in a document model, suppliers, parts and their relationships can be modeled as three separate documents, or in two documents (suppliers and their relationship to parts), or one document – and there are many more variations possible, of course.

Joins and Database Management Systems

Some database management systems implement the join operation in their query interface, some do not. For example, Oracle, MySQL and FoundationDB implement joins, MongoDB, Oracle NoSQL and Aerospike do not. So joins are not necessarily restricted to the relational data model.

Joins and Data Access Paths

With the fact finding under our belt, how many joins will you have to implement? In principle, this is a function of the required data access based on a specific schema. Different schemata of the same domain will require a different number of joins.

Let’s look at a few examples in the supplier – parts domain.

Example 1: No join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts":[
     {"part_name": "part_lowQual"}, 
     {"part_name": "part_hiQual"}]
}

The query: “find the names of all parts for a supplier” does not require a join as the data is already structured so that each supplier contains the set of all parts it supplies.

Example 2: One join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts": [1, 2]
}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}

The query: “find the identifiers and names of all parts for a supplier” requires a join as a supplier only has the identifiers of the parts it ships, not their names.

Example 3: Two joins required

The documents are structured like this:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}

The query: “find the identifiers and names of all parts for a supplier” requires two joins, one to find the objects for a supplier that relate the part identifier to the supplier identifier, and a second one to find the corresponding parts.

Analysis of Examples

The examples have shown empirically that the need for joins is not a function of the data model (document-oriented in this case), but a function of the data access, aka, the number of required data relationship traversals in context of a given schema. If the relationship to be traversed matches the way the data is structured as in Example 1, no join is necessary. As soon as the data is structured differently from the required traversal by the query, joins are necessary (Example 2 and 3).

So, as summary, it is fairly easy to avoid joins. If, and only if, you can structure your data (aka, build your schema) in such a way that it conforms structurally to the queries then you can avoid joins completely (Example 1). I am certain that there are special cases out there for which you can accomplish that, but in general, this is not possible. And, even if it is possible in production, as soon as analysts start analyzing the data sets, they will most likely query along different access paths.

Joins at Query Time vs. Joins at Insert/Update/Delete Time

Above examples clarified that joins are a function of the data access paths. Can joins at query time be avoided entirely by creating data access paths in a certain way?

Yes, it is possible, however, it is a basic trade-off between data query and data manipulation time: reducing the computational effort at run-time, and instead increasing it during insert / update / delete operations. In principle, joins at query time can be avoided if for each access path there is an equivalent data structure in place.

Example 4: Schema refactoring

The documents in this example look like:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}
{"shipper": "fastShipper", "shipper_id": "SH_01"}
{"part_id": 2, "shipper_id": "SH_01"}

Supplier supply parts, however, shippers ship not any part, but only specific parts (maybe for safety reasons). There can be several queries against this document set:

  • Find all parts supplied by a supplier with a given name
  • Find all parts shipped by a shipper with a given name
  • Find all suppliers and shippers for a part with a given name

Each of these queries requires at least one join. The documents can be restructured easily to avoid joins altogether:

{"supplier": "superQuality", "supplier_id": "S_55",
 "parts": [
     {"part_name": "part_lowQual", "part_id": 1}
]}
{"shipper": "fastShipper", "shipper_id": "SH_01",
 "parts": [
     {"part_name": "part_hiQual", "part_id": 2}
]}
{"part_name": "part_lowQual", "part_id": 1,
 "suppliers": [
     {"supplier": "superQuality", "supplier_id": "S_55"}
 ], 
 "shippers": []}
{"part_name": "part_hiQual", "part_id": 2,
 "suppliers": [],
 "shippers": [
     {"shipper": "fastShipper", "shipper_id": "SH_01"}
]}

The idea is clear: structure the data in such a way that a query can be satisfied with a simple selection. And, the consequence is clear, too: data is duplicated, possibly many times. Which means that an insert, update or delete has to know all the locations where to modify the data and has to modify the data consistently (and ideally within a single transaction).

As a side note, this is the situation that normalization tries to address by ensuring that each data item is only once in the database.

Of course, data duplication will have an impact on the size requirements of main memory an disk space. While there is a change in algorithm complexity, there is also a change in the storage and memory size requirements.

Pre-Joining Data

Pre-joining data allows to avoid joins at query time at the cost of duplicating data at data management time. Alternatively expressed, the implementation of duplication at management time is the cost of avoiding normalization combined with query-time joins.

Is there a way to quantify the effort? In principle, there are as many duplications necessary as joins are to be avoided. This is a rough estimate as many joins are the same except for selection and/or projection specifications. If all joins are abstracted to their join criteria (omitting projection and selection), then this is roughly the amount of duplication required.

The article written by Sarah Mei clearly shows the trade-off between data duplication and joins: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/. She clearly describes many of the issues in context of a specific use case.

“Wait a minute, I don’t have joins and it works anyway!”

But, where are the joins? NoSQL databases that do not implement the join operator in their query interface are in use and production.

If not expressed as query, joins are found either in the application system logic or the interface logic, depending on the design. Most likely these are nested-loop joins or hash-based joins (less likely) or a series of selections with the application logic combining the intermediary query results into the final result data set.

And they are not joins on the complete data set either, but usually have some selection criteria. So the application system logic roughly corresponds to the optimized operator tree of a database query sub-system and in all actuality there might be many joins implemented that way throughout the application logic.

The joins are in fact implemented, just not by using a join operator on the database interface, but inside the application logic. This means that the database cannot optimize the execution, plus there are several queries coming from the application logic putting load on the database system.

And this opens up yet another trade-off: data duplication vs. application logic complexity. If the data is structured in such a way that joins are avoided (at the cost of duplication), then the application logic complexity will be reduced also (from algorithms implementing joins to algorithms issuing queries with selections/projections).

Of course, while the application logic complexity is reduced, the data management logic complexity increased as it has to manage duplicate data consistently across the database.

Summary: Are joins required? Yes. Are joins implemented? Yes.

In my mind there is no question that joins are in general needed and actually implemented today, even if the database does not support a join operator directly and even if there are opinions that joins are not needed. I don’t really understand why there is a discussion about this in the first place as the need for a join is a function of the data schema, not the data model.

The fact that a relational database has the capability of joins does not mean you must use it. And the fact that a NoSQL database does not support joins at their query interface does not mean joins are not needed.

At the heart an architecture and engineering decision has to be made (implicitly or explicitly) of how many joins are implemented through data duplication and how many joins are implemented through algorithms in the application logic layer (if there is not join operator available at the database query interface).

It’s that easy.

 

JSON is strictly By-Value

JSON is an externalization format, not a programming language data type implementation. Why is this relevant?

JSON

JSON (http://www.json.org) is an ASCII representation of data. It provides base data type representations and a grammar about how to structure JSON structures properly.

As for terminology, here are some synonyms often used:

  • JSON “object”: document
  • JSON “members”: properties
  • JSON “pair”: property
  • JSON “string” in “pair”: property name
  • JSON “value” in “pair”: property value

JSON is Syntax

It is syntax only. There is not data type semantics attached to it; all programming languages that process JSON define their own interpretation of the meaning of the syntax JSON defines.

JSON Semantics as such is Undefined

The semantics of JSON is not defined by JSON itself. It is undefined by the standard. Programming languages and databases have to define their (!) semantics of JSON.

For example, the JSON standard does not make a statement about unique property names in a document. In JSON terminology, an object (aka document) can have several pairs (aka properties), each having a string and a value (aka property name and property value). JSON does not constrain the property names to be unique. According to JSON it would be valid for a document to have several properties with the same name.

JSON Semantics Implementation

How is this implemented? Do systems actually allow several properties with the same name in a document? Let’s look at two of those.

MongoDB (version 2.0.2): it is possible to save a document that has the same property twice:

MongoDB shell version: 2.0.2
connecting to: test
> use json
switched to db json
> db.test.save({"a":1, "a":2})
> db.test.findOne()
{ "_id" : ObjectId("5006d58e92cca1a32772df6a"), "a" : 2 }
>

So, clearly MongoDB does not complain about the fact that a property name is stated twice. However, it opts to make a selection itself and uses the second of the two and actually stores only one. Of course, this bears the question if MongoDB on input applies other modifications as well.

JsonLint (http://jsonlint.org/) exposes the same behavior. When validating {“a”:1, “a”:2} then two things are happening: it deems the input as ‘Valid JSON’, but then it modifies the input to only {“a”:2}.  This is bad in two ways. First, it is not clear which version is ‘valid’, and second, it is not a read-only lint! This means that if you paste in a JSON document, always check if JsonLint modified it; your document might not be valid, but JsonLint’s modification of it.

So it seems like that some systems actually do not support having two properties with the same name. But then I would suggest they flag this as error and not modify the document.

JSON and References

How does a JSON document refer to a second one? For example, an object representing a user referring to an object representing an address?

JSON does not have the notion of reference or pointer nor the notion of address or unique identifier (both necessary to make referencing work). In order to uniquely identify a document, the author of the document has to add a property that by convention is deemed to be a unique identifier (see e.g. MongoDB above, the database adds a ‘_id’ property). Secondly, a reference to such a unique identifier is a regular property that by convention has to be understood as being a reference. For example, the value 5006d58e92cca1a32772df6a above could be a value. By convention the interpretation of this property would know that the number is actually a unique identifier.

Side note: this is the same situation relational databases are in, they don’t have references either. But, in their case there is the concept of key and foreign key supervised by the database, both relying on data values. Except for the supervision functionality, JSON could be interpreted this way, too.

Coming back to the semantics for a moment. Assume a property “ref” is used to indicate that its value is a reference to another document. How is the absence of link established, meaning, the notion that a given object does not refer to another one? On possible way is to set “ref” to the value null. Alternatively “ref” could be left out. Either (or both ways) are possible, but that needs to be established.

JSON and Linked Structures

In summary, in order to establish linked structures using JSON, special properties have to be called out in order to establish addressing and referencing. JSON itself does not provide concepts for this. In addition, the JSON semantics and interpretation has to be established (either explicit by rules or implicit by the programming code).

Engineers and architectus be aware: JSON is an ASCII syntax for externalization, not a programming language data type system.

Relational Data in a Document-oriented NoSQL Database (Part 3): Normalization

Relational database systems support the normalization of relational schemas; what about document-oriented databases?

Normalization

Normalization is a data/schema modeling activity that (if fully applied) results in a schema that has no redundancies and the number of dependencies is minimal (see http://en.wikipedia.org/wiki/Database_normalization for a brief discussion).

In very abstract terms, any change in data is only done in one relation and through means of relationships the rest of the schema stays consistent. This means that in order to support a well-normalized schema queries will contain joins to combine data that is spread out over several tables due to the normalization process.

Document Normalization

Given the concept of normalization in relational systems: is is possible or necessary to normalize documents, too?

First of all, it is possible to normalize data that is organized in documents. The naive approach is to build up the documents so that only scalar properties are stored, roughly equating to tables. In addition, in order to establish references between documents, either foreign-key type properties or documents denoting relationships between documents are created. The process of normalization can be followed like in a relational system.

However, in general document-oriented systems lack the join query operator and so combining documents when querying is not possible. Therefore the combination of data has to be done in the application layer (usually resulting in several queries). This is definitely a down-side of trying to normalize data in document-oriented systems.

Furthermore, the idea behind using document-oriented databases is for storing data that ‘naturally’ consists of hierarchical structures that are fetched together (the 80% case) as complete hierarchies and not normalized chunks of data. And example is a blog and its comments. Comments belong to the blog and without the blog they do not make a lot of sense. In general, when a blog is queried, its comments often are required, too, and so having the blog and its comments in one document makes a lot of sense. The blog can be retrieved in one query.

However, there are cases where separating data into different documents might make sense also. For example, a company might build an e-commerce web site and provides a shopping cart. Users add items to shopping carts. A shopping cart might be represented as one document. In this case it does not make a lot of sense to store the complete item description in every shopping cart that has that item in it. It might be better to have the item name and identifier in the cart with a reference to the item description residing in a different document.

Discussion

While in the relational world it is generally advised to normalize a schema, in the document-oriented world the separation is probably driven by the 80/20 rule of data access. Meaning, to structure documents in such a way that 80% of the queries are fast and fetch all required data in one round trip. Over time, more elaborate rules might be established.

In context of MongoDB a very interesting initial discussion can be found in ’50 Tips and Tricks for MongoDB Developers’ (http://shop.oreilly.com/product/0636920019893.do). This book has a few more elaborated rules and it clearly hints that normalization is for sure an important aspect of document-oriented databases.

Relational Data in a Document-oriented NoSQL Database (Part 1): Universal Relation

Is there an equivalent to the Universal Relation in the document-oriented database world? There is: the Universal Collection.

Universal Relation

‘In relational databases, the universal relation assumption states that one can place all data attributes into a (possibly very wide) table, which may then be decomposed into smaller tables as needed.’ (from: http://en.wikipedia.org/wiki/Universal_relation_assumption).

There is a bit more behind that statement; for example, there is an assumption that the same type of data are stored in a column with the same name. E.g., if there is the concept of a ‘street name’, then all street names will be in a single column, probably named ‘street name’.

Another assumption is that if a data set does not have data for all columns of the Universal Relation (and hardly any does), then the value  in the column is ‘null’. So in general the Universal Relation is very sparse and has a lot of null values in its columns.

A comprehensive discussion can be found in http://www.informatik.uni-trier.de/~ley/db/books/dbtext/ullman89.html  (for further detailed exploration).

NoSQL Database Equivalent: Universal Collection

What is the equivalent of a Universal Relation in a document-oriented database? One organization scheme of document-oriented databases are collections of documents. In such a case a first approximation is that all documents are stored in the same collection, a Universal Collection.

The second step towards a Universal Collection is that all property names of the documents that contain the same type of data are named the same. This is in general possible as the naming is scoped by documents and sub-collections in documents.

In contrast to the Universal Relation, a document in a Universal Collection only has to have the properties that have values. Other properties do not have to be added with a ‘null’ value, they can simply be left out as the document model does not have a fixed schema; each document can only contain the properties it really requires. If this approach is taken, the Universal Collection is not sparse at all in comparison with the Universal Relation, but very compact.

Since the document model does not enforce a strictly typed schema it is possible that the same property is of different data types (in the sense of the JSON model). So it is possible that a property called ‘address’ can be a ‘string’ in one document and a sub-collection consisting of several strings in another document with both containing perfect address data. In contrast to the relational model, this is valid (and fine) in a document model (if at all, it causes problems during processing, but not from a data model perspective).

Discussion

On this level, it is certainly possible to define the equivalent of an Universal Relation in a document model: the Universal Collection. This is interesting as later on this will serve as a starting point for normalization.

In reality I have seen projects that actually store all documents in a single collection as it made it easier to query the system in comparison of distributing or partitioning documents across several collections. The question is, of course, are the same types of data stored in properties with the same property name to ensure the semantic equivalence or not. This topic will re-appear in a later blog post about document model definition.

Relational Data in a Document-oriented NoSQL Database: Overview

This blog starts a series of discussions around the topic of storing relational data in a document-oriented NoSQL database. Each discussion is going to be a separate blog post.

The idea is not to promote storing relational data in a document-oriented database as such. The goal of this series of blogs is to rationalize (to large extent on a blog-level granularity) the relationship between relational and document data and how a relational world can meet a document world (and vice versa).

The starting point of the discussion are the topics in the blog https://realprogrammer.wordpress.com/2012/04/25/relational-database-management-system-rdbms/:

  1. Universal Relation: https://realprogrammer.wordpress.com/2012/05/16/relational-data-in-a-document-oriented-nosql-database-part-1-universal-relation/
  2. Schema: https://realprogrammer.wordpress.com/2012/05/23/relational-data-in-a-document-oriented-nosql-database-part-2-schema/
  3. Normalization: https://realprogrammer.wordpress.com/2012/05/30/relational-data-in-a-document-oriented-nosql-database-part-3-normalization/
  4. Part-Of Relationship: https://realprogrammer.wordpress.com/2012/06/06/relational-data-in-a-document-oriented-nosql-database-part-4-part-of-relationship/
  5. De-Normalization: https://realprogrammer.wordpress.com/2012/06/13/relational-data-in-a-document-oriented-nosql-database-part-5-de-normalization/

More topics might be added as the discussion unfolds.

Based on these discussions I expect that some guidelines will emerge of how to ‘model’ documents in document-oriented databases ‘properly’ and useful criteria for this modeling task. It is also going to be interesting to see if there is a way to determine the relationship between a relational model and a document model in a consistent way.

Modeling documents in a schema-less/dynamic schema world sounds like an oxymoron; however, in the end, transactional applications and analysis software have to access documents and a ‘good’ modeling practice will certainly help those systems in their design and operation.

Document-oriented Database (NoSQL Database Category)

‘Document-oriented database’ refers to databases that store their data in terms of documents. More specifically, structured documents and, by going with the times,  documents in JSON (http://www.json.org/) format at that.

Some examples are MongoDB (http://www.mongodb.org/) or RavenDB (http://ravendb.net/).

These databases belong to a category of databases called ‘NoSQL Databases’ (http://en.wikipedia.org/wiki/NoSQL). This category emerged over the last few years and in this blog site is used in that sense.

Databases that do not prescribe the content format or enforce a specific data model can store JSON by definition; so database systems that did not decide on a particular data representation upfront but left it open, can be considered document-oriented databases in case their payloads are documents.