Document Collections: Size and Processing Complexity

“My collection is 100 GB in size; can you process it?” What does that actually mean, and what about it is really significant?

Document Count and Sizes

A collection of 100 GB can mean many things. It could be a collection holding a single document of 100 GB, a collection of 100 documents of 1 GB each, or a collection of 100 million documents of roughly 1 KB each.

MongoDB, for example, has a document size upper limit of 16 MB (http://www.mongodb.org/display/DOCS/Documents). Documents larger than 16 MB have to be split into several documents (or stored outside the document model).
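As an illustration of the second option, here is a minimal sketch of keeping oversized content outside the document model with MongoDB's GridFS; the connection string, database name, and file name are placeholders:

```python
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
db = client["example_db"]                            # placeholder database name

# GridFS stores the payload as many small chunk documents plus a metadata
# document, so content well above the 16 MB limit can still live in MongoDB.
fs = gridfs.GridFS(db)
with open("large_payload.bin", "rb") as f:           # placeholder file
    file_id = fs.put(f, filename="large_payload.bin")
```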

Does any of this matter? From a storage perspective, probably not much: the storage layer simply has to hold the data (in this case 100 GB). If the 100 GB is broken down into many smaller documents, a bit more space is needed for the additional per-document management data, but that overhead is insignificant.

From a processing perspective, however, it definitely matters.

Processing Complexity

One measure of processing complexity is the number of operations executed per document. For example, if a property is read from each document, it takes O(n) operations (one per document), meaning the complexity is linear in the number of documents.
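A minimal sketch of such a linear scan, using pymongo against a hypothetical “orders” collection with an “amount” field:

```python
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["example_db"]["orders"]  # placeholders

# Reading one property from every document: the server has to touch all n
# documents, so the cost grows linearly with the number of documents.
total = 0
for doc in coll.find({}, {"amount": 1}):   # projection: only return the "amount" field
    total += doc.get("amount", 0)
```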

One possible optimization is to index that particular property. In that case the complexity can become sub-linear, because the database does not have to access every document for the property value but can use the index instead.
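Sketched against the same hypothetical collection, an index on the property lets the server answer matching queries from the index rather than by touching every document:

```python
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["example_db"]["orders"]  # placeholders

# Create a single-field index on "amount" (a one-time, amortized cost).
coll.create_index("amount")

# A query that only filters on the indexed field can be answered largely from
# the index, without fetching every document's full content.
expensive_orders = coll.count_documents({"amount": {"$gt": 100}})
```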

If more complex operations have to be executed for each document (like counting the documents in a sub-collection, or adding up values within a sub-collection), then the work per document grows and the overall cost increases by a constant factor; it remains linear in the number of documents.
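For illustration, a possible aggregation sketch that sums an embedded “line_items.price” array per document (field names are assumptions); it still visits every document, just doing a bit more work on each:

```python
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["example_db"]["orders"]  # placeholders

# For each order, add up the prices of its embedded line items.
pipeline = [
    {"$project": {"order_total": {"$sum": "$line_items.price"}}}
]
for doc in coll.aggregate(pipeline):
    print(doc["_id"], doc["order_total"])
```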

If documents have been divided up and have to be joined together again before they can be processed (e.g. because of size limits or normalization), then the processing complexity can increase significantly.
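As a sketch of that case, assume each order's details were split off into a separate “order_details” collection; re-assembling the pieces before processing requires a join, here with MongoDB's $lookup stage (collection and field names are assumptions):

```python
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["example_db"]["orders"]  # placeholders

# Re-combine each order with its detail documents before processing.
pipeline = [
    {"$lookup": {
        "from": "order_details",      # hypothetical second collection
        "localField": "_id",
        "foreignField": "order_id",   # hypothetical reference field
        "as": "details",
    }}
]
for doc in coll.aggregate(pipeline):
    print(doc["_id"], len(doc["details"]))
```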

What to Ask For?

The storage size matters, but it is not a good measure of the processing effort. What matters for processing effort is the number of documents, the existing indexes, and any required “combination” of documents due to normalization, in addition to the actual computation to be performed on each document.

Next time a collection's size is characterized only by a storage measure, ask also for the measures that determine the processing complexity.
