On why storing small files in HDFS is inefficient and how to solve this issue using Hadoop Archive:
When there are many small files stored in the system, these small files occupy a large portion of the namespace. As a consequence, the disk space is underutilized because of the namespace limitation. In one of our production clusters, there are 57 millions files of sizes less than 128 MB, which means that these files contain only one block. These small files use up 95% of the namespace but only occupy 30% of the disk space.
This blog is called myNoSQL and it is written by me, Alex Popescu, a software architect with a passion for open source and communities.
It records my readings, learnings, and opinions on NoSQL databases, polyglot persistence, and distributed systems -- subjects that I'm passionate about.
The opinions expressed here are my own, and no other party necessarily agrees with them.
If you feel I'm biased, I probably am.