Applying Amdahl’s law (or, in this case, a self-deduced variant of it) to Hadoop provisioning might give you good answers to questions like the following (a worked sketch follows the list):
(Figure: Amdahl’s law; image credits to Wikipedia)
- Why doesn’t the speed of my workflow double when I double the amount of processing power?
- Why does a 10% failure rate cause my runtime to go up by 300%?
- How can optimizing away 30% of my workflow’s runtime cause the total runtime to decrease by 80%?
- How many machines should I have in my cluster to be adequately performant and fault-tolerant?
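To make these questions concrete, here is a minimal sketch of Amdahl’s law applied to cluster sizing. The parallel fraction `p = 0.9` and the node counts are assumed, illustrative values, not figures from the post or from Digg’s setup.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Overall speedup when a fraction p of the work parallelizes
    perfectly across n workers, per Amdahl's law."""
    return 1.0 / ((1.0 - p) + p / n)


if __name__ == "__main__":
    p = 0.9  # assumption: 90% of the workflow parallelizes across the cluster
    for n in (10, 20, 40):
        print(f"{n:>3} nodes -> speedup {amdahl_speedup(p, n):.2f}x")
```

With 90% of the work parallelizable, going from 10 to 20 nodes only lifts the speedup from roughly 5.3x to 6.9x, because the serial 10% increasingly dominates; that is exactly why doubling the processing power does not double the speed of the workflow.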
The last Hadoop use case we read about is Digg’s data migration from MySQL to Cassandra. At the time I wondered why they weren’t using a less complex solution for this migration (e.g. Scribe). Arin has been kind enough to provide an explanation:
> we use Hadoop for legacy data and/or when the data needs to go through “big” transformation or denormalization.