I’m not the only one confused by Michael Stonebraker’s “Hadoop is dead” theme. Edward Capriolo:
Let me tell you a story of how I got into Hadoop and Hive. I was following advice like Stonebraker’s that said parallel DBs are the way to go. But I quickly found out parallel databases are too rich for my blood. Now, I am not telling you or anyone else that you should not spend money on parallel DBs, because maybe you have the money, or maybe you need some of the things a parallel database provides. But for the things I need to do:
- store tons of data
- process it reasonably fast
- be LOW on the cost scale
Hadoop and Hive work fine for me.
Original title and link: Hadoop Is the Best Thing Since Sliced Bread, Even if Doomed ( ©myNoSQL)
Interesting answers on Quora mostly expanding on Krishna Sankar’s short answer:
There are two ways one can address large scale computational problems:
- Task Parallelism: this is where MPI and similar frameworks fit in
- Data Parallelism: this is the sweet spot for map/reduce
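To make the data-parallelism point concrete, here is a minimal pure-Python sketch of the map/reduce pattern (not actual Hadoop code; the function names are illustrative): the same map function is applied independently to each record, and the reduce step combines values by key.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper independently to every record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def reduce_phase(pairs, reducer):
    """Group values by key, then reduce each group independently."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# Classic word count: the work is partitioned by data (records), not by task.
lines = ["hadoop and hive", "hive and mpi"]
mapper = lambda line: ((word, 1) for word in line.split())
counts = reduce_phase(map_phase(lines, mapper), sum)
# counts == {"hadoop": 1, "and": 2, "hive": 2, "mpi": 1}
```

Because each mapper call touches only its own record, the map phase parallelizes trivially across however many nodes hold the data.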
Original title and link: What other popular paradigms/architectures can handle large scale computational problems? ( ©myNoSQL)
Trying to combine MPI and Hadoop MapReduce for eliminating the drawbacks in each of them:
- MPI: The AllReduce function. The starting state for AllReduce is n nodes each holding a number; the end state is every node holding the sum of all the numbers.
- MapReduce: Conceptual simplicity. One easy to understand function is enough.
- MPI: No need to refactor code. You just sprinkle AllReduce calls in a few locations in your single-machine code.
- MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
- MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
- MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
- MPI: Fast optimization approaches remain within the conceptual scope. AllReduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be worked around, but that’s annoying.
- MapReduce: Robustness. We don’t capture all the robustness of MapReduce, which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution to deal with the slow-node problem, and delayed initialization to get around startup failures, giving you something with a >99% success rate and a running time reliable to within a factor of 2.
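The AllReduce semantics described above can be shown with a toy single-process sketch (pure Python, not an MPI API; the function name is illustrative): each “node” starts with a local value, and after the call every node holds the global sum.

```python
def allreduce_sum(local_values):
    """Simulate AllReduce(sum): every node ends up holding the total.

    In real MPI the sum is computed via a tree or ring exchange among
    nodes; here we just model the before/after states on one machine.
    """
    total = sum(local_values)
    return [total] * len(local_values)

# Four "nodes", each holding one partial statistic (e.g. a local gradient sum).
before = [1.0, 2.0, 3.0, 4.0]
after = allreduce_sum(before)
# after == [10.0, 10.0, 10.0, 10.0] -- all nodes now agree on the sum
```

This is why sprinkling AllReduce into single-machine learning code works: each node computes its local contribution, one call synchronizes the global statistic, and the surrounding code never needs restructuring into map and reduce phases.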
Original title and link: Combining Hadoop MapReduce and MPI for Terascale Learning ( ©myNoSQL)