This guest post by Mike Wendt from Accenture Technology provides some very good answers to the questions I had about the recently announced Hadoop connector for Google Cloud Storage: how does it behave compared to local storage (data locality), what is the performance of accessing Google Cloud Storage directly from Hadoop, and, last but essential for cloud setups, what are the cost implications:
From our study, we can see that remote storage powered by the Google Cloud
Storage connector for Hadoop actually performs better than local storage.
The increased performance can be seen in all three of our workloads to
varying degrees based on their access patterns. […] Availability of the
files, and their chunks, is no longer limited to three copies within the
cluster, which eliminates the dependence on the three nodes that contain the
data to process the file or to transfer the file to an available node for
[…] This availability of
remote storage on the scale and size provided by Google Cloud Storage
unlocks a unique way of moving and storing large amounts of data that is not
available with bare-metal deployments.
If you are looking just for the conclusions:
First, cloud-based Hadoop deployments offer better price-performance ratios than
bare-metal clusters. Second, the benefit of performance tuning is so huge
that cloud’s virtualization layer overhead is a worthy investment as it
expands performance-tuning opportunities. Third, despite the sizable
benefit, the performance-tuning process is complex and time-consuming and
thus requires automated tuning tools.
✚ Keep in mind, though, that this study was posted on the Google Cloud Platform, so you should expect results that beat the competition.
Original title and link: Performance advantages of the new Google Cloud Storage Connector for Hadoop