On why storing small files in HDFS is inefficient and how to solve this issue using Hadoop Archive:
When a system stores many small files, those files consume a large portion of the namespace. As a consequence, disk space is underutilized because the namespace, not the disks, becomes the limiting factor. In one of our production clusters there are 57 million files smaller than 128 MB, which means each of these files occupies only a single block. These small files use up 95% of the namespace but occupy only 30% of the disk space.
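The fix the post points to is Hadoop Archive (HAR), which packs many small files into a single archive so they consume only a handful of namespace entries. A minimal sketch using the standard `hadoop archive` tool follows; the paths `/user/hadoop`, `input`, and the archive name `foo.har` are placeholders, and the commands assume a running HDFS cluster:

```shell
# Pack everything under /user/hadoop/input into a single archive named
# foo.har, written to /user/hadoop/archives. The -p flag sets the parent
# path against which the source directories are resolved.
hadoop archive -archiveName foo.har -p /user/hadoop input /user/hadoop/archives

# The archive is exposed as a read-only filesystem via the har:// scheme,
# so existing jobs can still list and read the original files through it.
hdfs dfs -ls har:///user/hadoop/archives/foo.har
```

Internally a HAR stores the original files concatenated into a few large part files plus an index, so the NameNode tracks those few objects instead of millions of per-file and per-block entries.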
Hadoop: The Problem of Many Small Files originally posted on the NoSQL blog: myNoSQL