


MPI: All content tagged as MPI in NoSQL databases and polyglot persistence

Hadoop Is the Best Thing Since Sliced Bread, Even if Doomed

I’m not the only one confused by Michael Stonebraker’s “Hadoop is dead” theme. Edward Capriolo:

Let me tell you a story of how I got into Hadoop and Hive. I was following advice like Stonebraker’s that said parallel DBs are the way to go. But I quickly found out parallel databases are too rich for my blood. Now, I am not telling you or anyone else that you should not spend money on parallel DBs, because maybe you have the money, or maybe you need some of the things a parallel database provides. But for the things I need to do:

  • store tons of data
  • process it reasonably fast
  • be LOW on the cost scale

Hadoop and hive work fine for me.

Original title and link: Hadoop Is the Best Thing Since Sliced Bread, Even if Doomed (NoSQL database©myNoSQL)


What other popular paradigms/architectures can handle large scale computational problems?

Interesting answers on Quora mostly expanding on Krishna Sankar’s short answer:

There are two ways one can address large scale computational problems:

  • Task Parallelism: this is where MPI and its kin fit in
  • Data Parallelism: this is the sweet spot for map/reduce
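The distinction can be illustrated with a toy sketch in plain Python (illustrative only; the `word_count_mapreduce` name and the word-count task are my own, not from the Quora thread):

```python
from collections import Counter
from functools import reduce

# Data parallelism (map/reduce style): the SAME operation is applied
# independently to each partition of the data, then partial results
# are merged in a reduce step.
def word_count_mapreduce(partitions):
    mapped = [Counter(p.split()) for p in partitions]     # map phase
    return reduce(lambda a, b: a + b, mapped, Counter())  # reduce phase

partitions = ["a b a", "b c", "a c c"]
print(word_count_mapreduce(partitions))  # Counter({'a': 3, 'c': 3, 'b': 2})
```

Task parallelism, by contrast, would hand *different* operations to different workers (one computes sums while another sorts, say), coordinating via explicit messages, which is where MPI earns its keep.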

Original title and link: What other popular paradigms/architectures can handle large scale computational problems? (NoSQL database©myNoSQL)

Combining Hadoop MapReduce and MPI for Terascale Learning

Trying to combine MPI and Hadoop MapReduce to eliminate the drawbacks of each:

  1. MPI: The Allreduce function. The starting state for AllReduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers.
  2. MapReduce: Conceptual simplicity. One easy to understand function is enough.
  3. MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
  4. MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
  5. MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
  6. MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
  7. MPI: Fast optimization approaches remain within the conceptual scope. AllReduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be worked around, but that’s annoying.
  8. MapReduce: Robustness. We don’t capture all the robustness of MapReduce, which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution to deal with the slow-node problem, and delayed initialization to get around startup failures, giving you something with a >99% success rate and a running time reliable to within a factor of 2.
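The AllReduce semantics from point 1 can be sketched in a few lines of Python (a single-process simulation of what MPI’s `MPI_Allreduce` does across real nodes, not MPI code):

```python
# Simulated AllReduce: n "nodes" each hold one value; afterwards
# every node holds the combination (here: the sum) of all values.
def allreduce(node_values, op=sum):
    combined = op(node_values)            # reduce: fold all node values together
    return [combined] * len(node_values)  # broadcast: every node gets the result

# Start state: node i holds i + 1. End state: every node holds 1+2+3+4 = 10.
print(allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

In the hybrid scheme described above, each mapper would run the learner on its local data and call AllReduce to combine gradients or weights across nodes, using Hadoop only for scheduling and data locality.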

Original title and link: Combining Hadoop MapReduce and MPI for Terascale Learning (NoSQL database©myNoSQL)