Document-oriented NoSQL Databases: How many Joins will you have to implement?

One of the continuously debated items in the context of NoSQL databases is the join operation. Let’s listen in a bit:

and there can be many more variations found on the topic of joins on various levels of technical depth.

So, do we need joins in the context of NoSQL databases? Are joins implemented by NoSQL databases? Are joins an outdated concept that we can live without in the context of NoSQL databases? In this blog post I try to rationalize the overarching question in principle. Some fact finding first:

(Database) Data Models and Database Management Systems

Data models, like the relational model, the document-model, the hierarchical model, key-value model, graph model, object-oriented model, XML model, etc., are implementations of data structures in a given database management system. Data models define possible data types and their construction rules for more complex types.

For example, one implementation of the relational model might restrict values in tables to be scalar. Another implementation might allow a table as a value, supporting NF2 relations. One system might support the document model strictly following the JSON model, while others add data types beyond what JSON defines. Some systems support the notion of references, others do not. Each database implements a data model in whatever variation it likes.

Schemata and Database Management Systems

A schema is a particular extension of a domain model, implemented in the context of a data model. For example, a domain model might be suppliers, parts and their relationship. This can be implemented in a relational model, a document model, a graph model or any other supported data model.

There is no ‘best’ way of defining a schema. For the same domain, different schemata can be defined depending on the skill of the creator, the knowledge of query access patterns, the number of restrictions that should be supervised by the database management system, and other factors.

For example, in a document model, suppliers, parts and their relationships can be modeled as three separate documents, or in two documents (suppliers and their relationship to parts), or one document – and there are many more variations possible, of course.

Joins and Database Management Systems

Some database management systems implement the join operation in their query interface, some do not. For example, at the time of writing, Oracle, MySQL and FoundationDB implement joins; MongoDB, Oracle NoSQL and Aerospike do not. So joins are not necessarily restricted to the relational data model.

Joins and Data Access Paths

With the fact finding under our belt, how many joins will you have to implement? In principle, this is a function of the required data access based on a specific schema. Different schemata of the same domain will require a different number of joins.

Let’s look at a few examples in the supplier – parts domain.

Example 1: No join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts":[
     {"part_name": "part_lowQual"}, 
     {"part_name": "part_hiQual"}]
}

The query: “find the names of all parts for a supplier” does not require a join as the data is already structured so that each supplier contains the set of all parts it supplies.
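As a minimal Python sketch of this query, assume the supplier document has already been loaded into a dict. The function name `part_names` is made up for illustration and is not part of any database API:

```python
# Supplier document from Example 1, with the parts embedded.
supplier = {
    "supplier": "superQuality",
    "parts": [
        {"part_name": "part_lowQual"},
        {"part_name": "part_hiQual"},
    ],
}


def part_names(supplier_doc):
    """Return the names of all parts embedded in one supplier document."""
    # A plain projection over the embedded array; no second document is read.
    return [part["part_name"] for part in supplier_doc["parts"]]


print(part_names(supplier))
```

The query touches exactly one document, which is why no join is involved.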

Example 2: One join required

The documents are structured like this:

{"supplier": "superQuality",
 "parts": [1, 2]
}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}

The query: “find the identifiers and names of all parts for a supplier” requires a join as a supplier only has the identifiers of the parts it ships, not their names.
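A rough Python sketch of that single join, assuming the documents have been loaded into two in-memory lists. The list names `suppliers` and `parts` and the function `parts_for_supplier` are illustrative assumptions:

```python
# Documents from Example 2, split into two hypothetical in-memory lists.
suppliers = [{"supplier": "superQuality", "parts": [1, 2]}]
parts = [
    {"part_name": "part_lowQual", "part_id": 1},
    {"part_name": "part_hiQual", "part_id": 2},
]


def parts_for_supplier(supplier_name):
    """Join suppliers to parts on part_id; return (part_id, part_name) pairs."""
    parts_by_id = {part["part_id"]: part for part in parts}
    result = []
    for supplier in suppliers:
        if supplier["supplier"] != supplier_name:
            continue  # selection on the supplier name
        for part_id in supplier["parts"]:
            part = parts_by_id.get(part_id)  # the join step
            if part is not None:
                result.append((part_id, part["part_name"]))
    return result


print(parts_for_supplier("superQuality"))
```

The lookup in `parts_by_id` is the join: it combines data from the supplier document with data from the part documents.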

Example 3: Two joins required

The documents are structured like this:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}

The query: “find the identifiers and names of all parts for a supplier” requires two joins, one to find the objects for a supplier that relate the part identifier to the supplier identifier, and a second one to find the corresponding parts.
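The two joins can be sketched in Python as nested loops over three in-memory lists. The list names (`suppliers`, `parts`, `supplies`) and the function name are assumptions made for illustration only:

```python
# Documents from Example 3, split into three hypothetical in-memory lists.
suppliers = [{"supplier": "superQuality", "supplier_id": "S_55"}]
parts = [
    {"part_name": "part_lowQual", "part_id": 1},
    {"part_name": "part_hiQual", "part_id": 2},
]
supplies = [{"part_id": 1, "supplier_id": "S_55"}]


def parts_for_supplier(supplier_name):
    """Return (part_id, part_name) pairs for a supplier, using two joins."""
    result = []
    for supplier in suppliers:
        if supplier["supplier"] != supplier_name:
            continue  # selection on the supplier name
        # Join 1: supplier -> relationship documents on supplier_id.
        for rel in supplies:
            if rel["supplier_id"] != supplier["supplier_id"]:
                continue
            # Join 2: relationship documents -> parts on part_id.
            for part in parts:
                if part["part_id"] == rel["part_id"]:
                    result.append((part["part_id"], part["part_name"]))
    return result


print(parts_for_supplier("superQuality"))
```

With the documents shown, only part 1 is returned, because only part 1 has a relationship document for supplier S_55.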

Analysis of Examples

The examples have shown empirically that the need for joins is not a function of the data model (document-oriented in this case), but a function of the data access, that is, the number of data relationship traversals required in the context of a given schema. If the relationship to be traversed matches the way the data is structured, as in Example 1, no join is necessary. As soon as the data is structured differently from the traversal required by the query, joins are necessary (Examples 2 and 3).

So, in summary, it is fairly easy to avoid joins if, and only if, you can structure your data (that is, build your schema) in such a way that it conforms structurally to the queries (Example 1). I am certain that there are special cases out there for which you can accomplish that, but in general, this is not possible. And, even if it is possible in production, as soon as analysts start analyzing the data sets, they will most likely query along different access paths.

Joins at Query Time vs. Joins at Insert/Update/Delete Time

The examples above clarified that joins are a function of the data access paths. Can joins at query time be avoided entirely by creating data access paths in a certain way?

Yes, it is possible; however, there is a basic trade-off between data query time and data manipulation time: the computational effort at query time is reduced, and the effort during insert / update / delete operations increases instead. In principle, joins at query time can be avoided if for each access path there is an equivalent data structure in place.

Example 4: Schema refactoring

The documents in this example look like:

{"supplier": "superQuality", "supplier_id": "S_55"}
{"part_name": "part_lowQual", "part_id": 1}
{"part_name": "part_hiQual", "part_id": 2}
{"part_id": 1, "supplier_id": "S_55"}
{"shipper": "fastShipper", "shipper_id": "SH_01"}
{"part_id": 2, "shipper_id": "SH_01"}

Suppliers supply parts; shippers, however, do not ship just any part, but only specific parts (maybe for safety reasons). There can be several queries against this document set:

  • Find all parts supplied by a supplier with a given name
  • Find all parts shipped by a shipper with a given name
  • Find all suppliers and shippers for a part with a given name

Each of these queries requires at least one join. The documents can be restructured easily to avoid joins altogether:

{"supplier": "superQuality", "supplier_id": "S_55",
 "parts": [
     {"part_name": "part_lowQual", "part_id": 1}
]}
{"shipper": "fastShipper", "shipper_id": "SH_01",
 "parts": [
     {"part_name": "part_hiQual", "part_id": 2}
]}
{"part_name": "part_lowQual", "part_id": 1,
 "suppliers": [
     {"supplier": "superQuality", "supplier_id": "S_55"}
 ], 
 "shippers": []}
{"part_name": "part_hiQual", "part_id": 2,
 "suppliers": [],
 "shippers": [
     {"shipper": "fastShipper", "shipper_id": "SH_01"}
]}

The idea is clear: structure the data in such a way that a query can be satisfied with a simple selection. And, the consequence is clear, too: data is duplicated, possibly many times. Which means that an insert, update or delete has to know all the locations where to modify the data and has to modify the data consistently (and ideally within a single transaction).
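To make the update cost concrete, here is a small Python sketch that renames one part across the restructured documents of Example 4. The helper `rename_part` and the list `documents` are hypothetical, and the sketch works on in-memory dicts with no transaction handling:

```python
# Restructured (pre-joined) documents from Example 4.
documents = [
    {"supplier": "superQuality", "supplier_id": "S_55",
     "parts": [{"part_name": "part_lowQual", "part_id": 1}]},
    {"shipper": "fastShipper", "shipper_id": "SH_01",
     "parts": [{"part_name": "part_hiQual", "part_id": 2}]},
    {"part_name": "part_lowQual", "part_id": 1,
     "suppliers": [{"supplier": "superQuality", "supplier_id": "S_55"}],
     "shippers": []},
    {"part_name": "part_hiQual", "part_id": 2,
     "suppliers": [],
     "shippers": [{"shipper": "fastShipper", "shipper_id": "SH_01"}]},
]


def rename_part(docs, part_id, new_name):
    """Rename a part in every place it is duplicated; return copies changed."""
    changed = 0
    for doc in docs:
        # The part's own top-level document.
        if doc.get("part_id") == part_id and "part_name" in doc:
            doc["part_name"] = new_name
            changed += 1
        # Copies embedded in supplier and shipper documents.
        for embedded in doc.get("parts", []):
            if embedded["part_id"] == part_id:
                embedded["part_name"] = new_name
                changed += 1
    return changed


copies_changed = rename_part(documents, 1, "part_midQual")
print(copies_changed)
```

Even in this tiny data set, a single logical change has to be applied in more than one document, and the update code has to know every location where a copy can live.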

As a side note, this is the situation that normalization tries to address by ensuring that each data item is stored only once in the database.

Of course, data duplication will have an impact on the size requirements for main memory and disk space. While there is a change in algorithm complexity, there is also a change in the storage and memory size requirements.

Pre-Joining Data

Pre-joining data makes it possible to avoid joins at query time at the cost of duplicating data at data management time. Alternatively expressed, the implementation of duplication at management time is the cost of avoiding normalization combined with query-time joins.

Is there a way to quantify the effort? In principle, as many duplications are necessary as there are joins to be avoided. This is a rough estimate, as many joins are the same except for their selection and/or projection specifications. If all joins are abstracted to their join criteria (omitting projection and selection), then the number of distinct join criteria is roughly the amount of duplication required.

The article written by Sarah Mei clearly shows the trade-off between data duplication and joins: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/. She describes many of the issues in the context of a specific use case.

“Wait a minute, I don’t have joins and it works anyway!”

But, where are the joins? NoSQL databases that do not implement the join operator in their query interface are in use in production.

If not expressed as a query, joins are found either in the application system logic or in the interface logic, depending on the design. Most likely these are nested-loop joins, or (less likely) hash-based joins, or a series of selections with the application logic combining the intermediary query results into the final result data set.
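For illustration, both join styles can be sketched in a few lines of Python as they might appear in application logic. The function names and the sample lists are assumptions for this sketch, not code taken from any particular application:

```python
def nested_loop_join(left, right, left_key, right_key):
    """Compare every left document with every right document."""
    return [(l, r) for l in left for r in right if l[left_key] == r[right_key]]


def hash_join(left, right, left_key, right_key):
    """Build a hash index on the right side once, then probe it per left document."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [(l, r) for l in left for r in index.get(l[left_key], [])]


# Hypothetical intermediate results, as if returned by two separate queries.
supplies = [{"part_id": 1, "supplier_id": "S_55"}]
parts = [
    {"part_name": "part_lowQual", "part_id": 1},
    {"part_name": "part_hiQual", "part_id": 2},
]

nested = nested_loop_join(supplies, parts, "part_id", "part_id")
hashed = hash_join(supplies, parts, "part_id", "part_id")
print(nested == hashed)
```

Both functions return the same pairs. The nested-loop version does work proportional to the product of the two input sizes, while the hash version trades memory for a single pass over each input.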

And they are not joins on the complete data set either, but usually have some selection criteria. So the application system logic roughly corresponds to the optimized operator tree of a database query sub-system and in all actuality there might be many joins implemented that way throughout the application logic.

The joins are in fact implemented, just not by using a join operator on the database interface, but inside the application logic. This means that the database cannot optimize the execution, plus there are several queries coming from the application logic putting load on the database system.

And this opens up yet another trade-off: data duplication vs. application logic complexity. If the data is structured in such a way that joins are avoided (at the cost of duplication), then the application logic complexity will be reduced also (from algorithms implementing joins to algorithms issuing queries with selections/projections).

Of course, while the application logic complexity is reduced, the data management logic complexity increases, as it has to manage duplicate data consistently across the database.

Summary: Are joins required? Yes. Are joins implemented? Yes.

In my mind there is no question that joins are in general needed and actually implemented today, even if the database does not support a join operator directly and even if there are opinions that joins are not needed. I don’t really understand why there is a discussion about this in the first place as the need for a join is a function of the data schema, not the data model.

The fact that a relational database has the capability of joins does not mean you must use it. And the fact that a NoSQL database does not support joins at its query interface does not mean joins are not needed.

At heart, an architecture and engineering decision has to be made (implicitly or explicitly) about how many joins are implemented through data duplication and how many joins are implemented through algorithms in the application logic layer (if there is no join operator available at the database query interface).

It’s that easy.

 


JSON is strictly By-Value

JSON is an externalization format, not a programming language data type implementation. Why is this relevant?

JSON

JSON (http://www.json.org) is a textual representation of data. It provides representations for base data types and a grammar for how to structure JSON documents properly.

As for terminology, here are some synonyms often used:

  • JSON “object”: document
  • JSON “members”: properties
  • JSON “pair”: property
  • JSON “string” in “pair”: property name
  • JSON “value” in “pair”: property value

JSON is Syntax

It is syntax only. There are no data type semantics attached to it; all programming languages that process JSON define their own interpretation of the meaning of the syntax JSON defines.

JSON Semantics as such is Undefined

The semantics of JSON is not defined by JSON itself. It is undefined by the standard. Programming languages and databases have to define their (!) semantics of JSON.

For example, the JSON standard does not make a statement about unique property names in a document. In JSON terminology, an object (aka document) can have several pairs (aka properties), each having a string and a value (aka property name and property value). JSON does not constrain the property names to be unique. According to JSON it would be valid for a document to have several properties with the same name.

JSON Semantics Implementation

How is this implemented? Do systems actually allow several properties with the same name in a document? Let’s look at two of those.

MongoDB (version 2.0.2): it is possible to save a document that has the same property twice:

MongoDB shell version: 2.0.2
connecting to: test
> use json
switched to db json
> db.test.save({"a":1, "a":2})
> db.test.findOne()
{ "_id" : ObjectId("5006d58e92cca1a32772df6a"), "a" : 2 }
>

So, clearly MongoDB does not complain about the fact that a property name is stated twice. However, it opts to make a selection itself: it uses the second of the two and actually stores only one. Of course, this raises the question of whether MongoDB applies other modifications on input as well.

JsonLint (http://jsonlint.org/) exhibits the same behavior. When validating {"a":1, "a":2}, two things happen: it deems the input ‘Valid JSON’, but it then modifies the input to only {"a":2}. This is bad in two ways. First, it is not clear which version is ‘valid’, and second, it is not a read-only lint! This means that if you paste in a JSON document, always check whether JsonLint modified it; what was validated might not be your document, but JsonLint’s modification of it.

So it seems that some systems actually do not support having two properties with the same name. But then I would suggest they flag this as an error and not modify the document.
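As a third data point, and as a sketch of the suggested "flag it as an error" behavior, here is how Python's standard `json` module treats the same input. By default it keeps only the last value for a repeated name; the `object_pairs_hook` parameter of `json.loads` can be used to reject duplicates instead. The hook function `reject_duplicates` is a made-up name for this example:

```python
import json


def reject_duplicates(pairs):
    """object_pairs_hook that raises instead of silently dropping a duplicate."""
    seen = {}
    for name, value in pairs:
        if name in seen:
            raise ValueError(f"duplicate property name: {name!r}")
        seen[name] = value
    return seen


text = '{"a": 1, "a": 2}'

# Default behavior: the later value replaces the earlier one.
lenient = json.loads(text)

# Strict behavior: the duplicate is reported and nothing is modified.
try:
    json.loads(text, object_pairs_hook=reject_duplicates)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

print(lenient, strict_error)
```

The hook sees every name/value pair in document order, so the decision about duplicates is made by the caller rather than hidden inside the parser.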

JSON and References

How does a JSON document refer to a second one? For example, an object representing a user referring to an object representing an address?

JSON does not have the notion of a reference or pointer, nor the notion of an address or unique identifier (both necessary to make referencing work). In order to uniquely identify a document, the author of the document has to add a property that by convention is deemed to be a unique identifier (see e.g. MongoDB above, where the database adds an ‘_id’ property). Secondly, a reference to such a unique identifier is a regular property that by convention has to be understood as being a reference. For example, the value 5006d58e92cca1a32772df6a above could be the value of such a property. By convention, whoever interprets this property would know that the value is actually a unique identifier.

Side note: this is the same situation relational databases are in, they don’t have references either. But, in their case there is the concept of key and foreign key supervised by the database, both relying on data values. Except for the supervision functionality, JSON could be interpreted this way, too.

Coming back to the semantics for a moment. Assume a property “ref” is used to indicate that its value is a reference to another document. How is the absence of a link established, meaning the notion that a given object does not refer to another one? One possible way is to set “ref” to the value null. Alternatively, “ref” could be left out. Either way (or both) is possible, but the choice needs to be established.
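A small Python sketch of one such convention, in which both a null “ref” and a missing “ref” mean "no link". The identifiers, the in-memory `store` and the function `resolve` are all hypothetical; JSON itself prescribes none of this:

```python
# Hypothetical document store keyed by the conventional "_id" property.
store = {
    "u1": {"_id": "u1", "name": "user with address", "ref": "a1"},
    "u2": {"_id": "u2", "name": "user with null ref", "ref": None},
    "u3": {"_id": "u3", "name": "user without ref"},
    "a1": {"_id": "a1", "street": "Main Street"},
}


def resolve(doc, documents):
    """Follow the "ref" convention; return None when there is no link."""
    # dict.get returns None both for a missing "ref" and for "ref": null,
    # so this convention treats the two cases as equivalent.
    target_id = doc.get("ref")
    if target_id is None:
        return None
    # A dangling reference raises KeyError; nothing supervises references.
    return documents[target_id]


print(resolve(store["u1"], store))
print(resolve(store["u2"], store), resolve(store["u3"], store))
```

An application that wants to distinguish "explicitly no link" from "not stated" would have to check for the presence of the property separately; that is exactly the kind of semantics that has to be agreed on outside of JSON.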

JSON and Linked Structures

In summary, in order to establish linked structures using JSON, special properties have to be called out in order to establish addressing and referencing. JSON itself does not provide concepts for this. In addition, the JSON semantics and interpretation have to be established (either explicitly by rules or implicitly by the programming code).

Engineers and architects be aware: JSON is a text syntax for externalization, not a programming language data type system.

Relational Data in a Document-oriented NoSQL Database (Part 1): Universal Relation

Is there an equivalent to the Universal Relation in the document-oriented database world? There is: the Universal Collection.

Universal Relation

‘In relational databases, the universal relation assumption states that one can place all data attributes into a (possibly very wide) table, which may then be decomposed into smaller tables as needed.’ (from: http://en.wikipedia.org/wiki/Universal_relation_assumption).

There is a bit more behind that statement; for example, there is an assumption that the same type of data are stored in a column with the same name. E.g., if there is the concept of a ‘street name’, then all street names will be in a single column, probably named ‘street name’.

Another assumption is that if a data set does not have data for all columns of the Universal Relation (and hardly any does), then the value in the column is ‘null’. So in general the Universal Relation is very sparse and has a lot of null values in its columns.

A comprehensive discussion can be found in http://www.informatik.uni-trier.de/~ley/db/books/dbtext/ullman89.html (for further detailed exploration).

NoSQL Database Equivalent: Universal Collection

What is the equivalent of a Universal Relation in a document-oriented database? One organization scheme of document-oriented databases is the collection of documents. In such a case, a first approximation is that all documents are stored in the same collection, a Universal Collection.

The second step towards a Universal Collection is that all property names of the documents that contain the same type of data are named the same. This is in general possible as the naming is scoped by documents and sub-collections in documents.

In contrast to the Universal Relation, a document in a Universal Collection only has to have the properties that have values. Other properties do not have to be added with a ‘null’ value; they can simply be left out, as the document model does not have a fixed schema, so each document needs to contain only the properties it really requires. If this approach is taken, the Universal Collection is not sparse at all in comparison with the Universal Relation, but very compact.
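The difference in sparseness can be sketched in Python, assuming a Universal Relation whose columns are the property names of the supplier, part and shipper documents used in the earlier post. The conversion functions are hypothetical helpers written for this sketch:

```python
# Column names of a hypothetical Universal Relation.
columns = ["supplier", "supplier_id", "part_name", "part_id",
           "shipper", "shipper_id"]

# Universal Collection: each document carries only the properties it needs.
collection = [
    {"supplier": "superQuality", "supplier_id": "S_55"},
    {"part_name": "part_lowQual", "part_id": 1},
    {"shipper": "fastShipper", "shipper_id": "SH_01"},
]


def to_universal_relation(docs, column_names):
    """One row per document; every missing column becomes None (null)."""
    return [[doc.get(name) for name in column_names] for doc in docs]


def to_universal_collection(rows, column_names):
    """Drop null cells so each document keeps only properties with values."""
    return [
        {name: value for name, value in zip(column_names, row) if value is not None}
        for row in rows
    ]


relation = to_universal_relation(collection, columns)
null_cells = sum(cell is None for row in relation for cell in row)
total_cells = len(relation) * len(columns)
print(null_cells, total_cells)
```

Note that this round trip cannot distinguish a property that is absent from one that is explicitly null, so the sketch assumes the collection contains no explicit null values.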

Since the document model does not enforce a strictly typed schema it is possible that the same property is of different data types (in the sense of the JSON model). So it is possible that a property called ‘address’ can be a ‘string’ in one document and a sub-collection consisting of several strings in another document with both containing perfect address data. In contrast to the relational model, this is valid (and fine) in a document model (if at all, it causes problems during processing, but not from a data model perspective).
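The processing problem mentioned in parentheses can be sketched as follows: code reading ‘address’ has to handle both shapes. The function `address_lines` and the sample documents are invented for illustration:

```python
def address_lines(doc):
    """Return the address as a list of strings, whichever shape was stored."""
    address = doc.get("address")
    if address is None:
        return []                 # property absent (or null)
    if isinstance(address, str):
        return [address]          # address stored as a single string
    return list(address)          # address stored as a sub-collection of strings


doc_a = {"name": "A", "address": "1 Main Street, Springfield"}
doc_b = {"name": "B", "address": ["1 Main Street", "Springfield"]}
doc_c = {"name": "C"}

print(address_lines(doc_a), address_lines(doc_b), address_lines(doc_c))
```

The data model accepts all three documents as they are; the burden of reconciling the shapes moves into the code that processes them.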

Discussion

On this level, it is certainly possible to define the equivalent of a Universal Relation in a document model: the Universal Collection. This is interesting, as later on this will serve as a starting point for normalization.

In reality I have seen projects that actually store all documents in a single collection, as it made it easier to query the system compared with distributing or partitioning documents across several collections. The question is, of course, whether the same types of data are stored in properties with the same property name to ensure semantic equivalence. This topic will re-appear in a later blog post about document model definition.