


What can Big Data mean to healthcare?

Eric Baldeschwieler:

If you had such a system, learning and hypothesis testing in medicine would accelerate tremendously! Some examples: You could do quality of care of outcome metrics on all procedures over time and ID best practices; You could quickly look for interesting correlations between infected patients to ID causes of medical conditions or at least correlations; You could datamine all reported illnesses against all unusual gene sequences; You could correlate activity, demographics, diet with outcomes and ID best practices; Etc, Etc, Etc. You would see an amazing acceleration of learning, quality of care, patient-specific therapies, …

Unfortunately, on the receiving side of all this extremely private data are entities that, at least in some parts of the world, don’t have the best reputation and are very focused on their own profit.

Original title and link: What can Big Data mean to healthcare? (NoSQL database©myNoSQL)


Pig cheat sheet

Cheat sheet? Check. Pig? Check. Where do I get it?


Some of my favorite data visualization resources

Pretty much everything that contains the words visualization and data in the title gets my attention. Even more so if it promises a list of resources that could help me learn a bit of the art of visualization. Aaron Cordova’s list contains books, sites, and tools:

Visualization is more art than science at this point, although some have used it enough to be able to identify successful techniques for various purposes. Most successful visualizations are perhaps more dependent upon the decision or task at hand than the actual original structure of the data.



The birth and road ahead of TokuMX, the alternative MongoDB engine

While not a MongoDB user (or expert), I find Tokutek’s work on TokuMX, their alternative engine for MongoDB, quite interesting from both a technical point of view (what is currently broken in MongoDB?) and a business one: Is the InnoDB model possible in the NoSQL space? What are some possible outcomes of the “alternative core technology for free products” business model? Would a new product that adds MongoDB’s missing features while keeping MongoDB’s “friendliness” and product marketing still be successful?

Zardosht Kasheff’s post about the history of TokuMX and how the decision was made to pursue this direction sheds some light on both of these areas.

But really, the BIGGEST benefit to this approach was the following: we could innovate on more of the MongoDB core server stack in ways the other approaches would not allow. Prior to TokuMX 1.4, such innovations include (but are not limited to):

  • Document level locking
  • Multi-statement transactions (on non-sharded clusters)
  • MVCC snapshot query semantics
  • Clustering indexes (although, to be fair, this was possible in other approaches)
  • Dramatically reduced I/O utilization on secondaries (which we will elaborate on in a future post)
  • Fast bulk loading
  • Enterprise hot backup

For these reasons, we chose this option, and after some hard work, TokuMX was born.



The evolution of scalable NoSQL at Viber [sponsor]

The story of Viber’s NoSQL expedition that took them from MongoDB to Couchbase Server, told in this sponsored post by Couchbase:

As one of the fastest growing VoIP services in the world, the challenge at Viber has been to build and maintain a scalable architecture that is capable of sustaining exponential growth. The first-generation architecture was built on top of a custom, in-memory database. However, within months, the database could no longer keep up with the growth the company was experiencing.

Viber DB architecture - 1st generation

The second-generation architecture was built on top of MongoDB shards. Next, Redis was added as a cache on top of the MongoDB shards to increase throughput. Still, MongoDB was unable to meet the high throughput requirements. Finally, a second Redis cluster was added independent of the MongoDB shards. The second-generation architecture was composed of 150 MongoDB nodes and over 100 Redis nodes.

Viber DB architecture - 2nd generation

The third-generation architecture had to support 100,000+ operations per second in the short term and 1,000,000+ operations per second in the long term. Viber chose to build their third-generation architecture on top of Couchbase Server. The third-generation architecture is composed of 100 to 120 Couchbase Server nodes.

Viber DB architecture - 3rd generation

Viber has been able to reduce the number of database nodes required while increasing the throughput with their third-generation architecture. For example, one of their Couchbase clusters (a ten node cluster) handles 100,000 to 200,000 operations per second with over 4.5 terabytes of data.

See the full story on the Viber switch.


The Hadoop as ETL part in migrating from MongoDB to Cassandra at FullContact

While I found the whole post very educational, and very balanced considering the topic, the part that I’m linking to is about integrating MongoDB with Hadoop. After reading the story of integrating MongoDB and Hadoop at Foursquare, there were quite a few questions bugging me. This post doesn’t answer any of them, but it brings in some more details about existing tools, a completely different solution, and what seems to be an overarching theme whenever Hadoop and MongoDB are used in the same sentence:

We’re big users of Hadoop MapReduce and tend to lean on it whenever we need to make large scale migrations, especially ones with lots of transformation. That fact along with our existing conversion project from before, we used 10gen’s mongo-hadoop project which has input and output formats for Hadoop. We immediately realized that the InputFormat which connected to a MongoDB cluster was ill-suited to our usage. We had 3TB of partially-overlapping data across 2 clusters. After calculating input splits for a few hours, it began pulling documents at an uncomfortably slow pace. It was slow enough, in fact, that we developed an alternative plan.

You’ll have to read the post to learn how they accomplished their goal, but as a spoiler, it was once again more of an ETL process than an integration.

✚ The corresponding HN thread; it focuses mostly on the MongoDB-to-Cassandra parts.



Aerospike CTO on benefits of Flash and Open sourcing [sponsor]

Brian Bulkowski, CTO of Aerospike, supporters of myNoSQL, talks on theCUBE about in-memory and Flash based databases, the benefits of open source, and some other interesting topics:

In this 16-minute interview with theCUBE, Brian Bulkowski talks about:

  1. Big Data, and the benefits and trends of in-memory and Flash databases
  2. the recently released Aerospike driver for Node.js
  3. a soon-to-be-open-sourced tool for recommendations based on non-contextual systems
  4. the benefits of, and the customer push for, open source


HBase block caches - Optimizing for random reads

Great post by Nick Dimiduk [1] covering the whats, whys, and hows of caching data blocks in HBase, the mechanism through which HBase optimizes random reads [2]:

There is a single BlockCache instance in a region server, which means all data from all regions hosted by that server share the same cache pool. The BlockCache is instantiated at region server startup and is retained for the entire lifetime of the process. Traditionally, HBase provided only a single BlockCache implementation: the LruBlockCache. The 0.92 release introduced the first alternative in HBASE-4027: the SlabCache. HBase 0.96 introduced another option via HBASE-7404, called the BucketCache.

  1. Nick Dimiduk works at Hortonworks and is the co-author of HBase in Action

  2. For optimizing recent edits, HBase has another mechanism, the MemStore
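
To make the idea of a single, shared, least-recently-used block cache more concrete, here is a minimal Python sketch. It is not HBase code: the class name `LruBlockCacheSketch`, the `(file name, block offset)` key, and the byte-based capacity are all assumptions made for illustration, and the real LruBlockCache is more elaborate than this.

```python
from collections import OrderedDict


class LruBlockCacheSketch:
    """Toy LRU cache for data blocks (hypothetical, not the HBase class)."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        # Ordered from least recently used (first) to most recently used (last).
        self._blocks = OrderedDict()

    def get(self, key):
        """Return the cached block, or None on a miss."""
        block = self._blocks.get(key)
        if block is not None:
            self._blocks.move_to_end(key)  # mark as most recently used
        return block

    def put(self, key, block):
        """Cache a block, evicting least recently used blocks if needed."""
        old = self._blocks.pop(key, None)
        if old is not None:
            self.used_bytes -= len(old)
        self._blocks[key] = block
        self.used_bytes += len(block)
        while self.used_bytes > self.capacity_bytes and self._blocks:
            _, evicted = self._blocks.popitem(last=False)
            self.used_bytes -= len(evicted)


# One cache instance is shared by every region on the (imaginary) server,
# so blocks from different regions compete for the same pool.
cache = LruBlockCacheSketch(capacity_bytes=8)
cache.put(("region-a/hfile-1", 0), b"aaaa")
cache.put(("region-b/hfile-7", 0), b"bbbb")
cache.get(("region-a/hfile-1", 0))           # touch region-a's block
cache.put(("region-b/hfile-7", 4), b"cccc")  # pool is full: evicts the LRU block
```

In this sketch the block evicted by the last `put` belongs to region-b, because region-a's block was read more recently. That is the practical consequence of all regions sharing one pool: a hot region can push another region's blocks out of the cache.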



NoSQL Shouldn’t Mean NoDBA

Nick Heudecker (Gartner):

The results were largely what I expected, except for the respondent profile. Database administrators (DBAs) appear to be significantly underrepresented in the NoSQL space, representing only 5.5% of respondents.

The question here is: why is this happening? Keeping in mind that the survey’s audience is “NoSQL adopters”, I’m wondering what combination of the following explains the results:

  1. DBAs see no value in NoSQL
  2. DBAs see no job security
  3. DBAs see a drop in their revenue with NoSQL
  4. DBAs are misinformed
  5. DBAs are change resistant (putting them in the later phases of adoption)

I’d go with a combination of 5 (explained mostly by 3) and 4.



Storage technologies at HipChat - CouchDB, ElasticSearch, Redis, RDS

As the list below shows, HipChat’s storage layer combines several different technologies:

  • Hosting: AWS EC2 East with 75 instances, currently all running Ubuntu 12.04 LTS
  • Database: CouchDB currently for Chat History, transitioning to ElasticSearch. MySQL-RDS for everything else
  • Caching: Redis
  • Search: ElasticSearch
  1. This post made me wonder what led the HipChat team to use CouchDB in the first place. I’m tempted to say that it was the master-master replication and the early integration with Lucene.
  2. This is only the 2nd time in quite a while that I’m reading an article mentioning CouchDB (the first was the February “no-releases-but-we’re-still-merging-BigCouch” report for the ASF). And according to the story, CouchDB is on the way out.



Behind our databases - The illustrated history of programming languages

A masterpiece:


1983 - Bjarne Stroustrup bolts everything he’s ever heard of onto C to create C++. The resulting language is so complex that programs must be sent to the future to be compiled by the Skynet artificial intelligence. Build times suffer. Skynet’s motives for performing the service remain unclear but spokespeople from the future say “there is nothing to be concerned about, baby,” in an Austrian-accented monotone. There is some speculation that Skynet is nothing more than a pretentious buffer overrun.



The cloud landscape described, categorized, and compared

Fantastic article by Johan Den Haan:

In this article I will explain this framework. I will also explain how I constructed this framework, give some example technologies/solutions for each cell of the framework, and show how this framework can be used to compare some of the popular cloud platforms (e.g. OpenStack, AWS, Heroku, CloudFoundry).


Not only does it provide good details about Database-as-a-Service and even Business Analytics Platform-as-a-Service, but it also shows how these are higher-level building blocks that sit on top of object storage and software-defined storage.
