A paper authored by a Microsoft Research team:
In the last decade we have seen a huge deployment of cheap
clusters to run data analytics workloads. The conventional wisdom
in industry and academia is that scaling out using a cluster of
commodity machines is better
for these workloads than scaling up by adding more resources to a single server.
Popular analytics infrastructures such as Hadoop are aimed at such a cluster
scale-out environment. Is this the right approach?
The main premise of the paper is based on different reports that show “the majority of analytics jobs do not process huge data sets”. The authors are citing different publications from production clusters at Microsoft, Yahoo, and Facebook that put the median input size under 14GB (for MS and Yahoo) and respectively 100GB for 90% of the jobs run. Obviously, this working hypothesis is critical for the rest of the paper.
Another important part for understanding and interpreting the results of this paper is the section on Optimizing Storage:
Storage bottlenecks can easily be removed either by using SSDs or by
using one of many scalable back-end solutions (SAN or NAS
in the enterprise scenario, e.g. , or Amazon S3/Windows Azure
in the cloud scenario). In our experimental setup which is a
small cluster we use SSDs for both
the scale-up and the scale-out machines.
First, the common knowledge in the Hadoop community is to always avoid using SAN and NAS (for ensuring data locality). I’m not referring to Hadoop reference architectures coming from storage vendors. Still in the scale-up scenario, NAS/SAN can make sense for accomodating storage needs that would overpass the capacity and resilience requirements of the scaled-up machine. But I expect that using such storage would change aspects related to total costs and unfortunately the paper does not provide an analysis for it.
The other option, of using SSDs for storage, implies that when processing data, the input size is either the same as the total size of stored data or that the costs of moving and loading data to be processed is close to zero. Neither of these are true.