Storing small files is a problem that many (file) systems have tried to solve, with varying degrees of success. Hadoop had to tackle it too and came up with Hadoop Archives (HAR). Pomegranate is a new distributed file system that focuses on improving the performance of storing and accessing small files:
- It handles billions of small files efficiently, even in one directory;
- It provides a separate and scalable caching layer, which can be snapshotted;
- The storage layer uses a log-structured store to absorb small-file writes and make better use of disk bandwidth;
- It builds a global namespace for both small files and large files;
- Columnar storage to exploit temporal and spatial locality;
- Distributed extendible hash to index metadata;
- Snapshot-able and reconfigurable caching to increase parallelism and tolerate failures;
- Pomegranate should be the first file system that is built over tabular storage, and the building experience should be worthy for file system community.
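To make the "distributed extendible hash" bullet a bit more concrete, here is a minimal single-node sketch of extendible hashing in Python. It is not Pomegranate's code: the class names, the bucket capacity, and the use of SHA-256 as the hash function are all assumptions made for illustration, and a real distributed version would also have to place buckets on different servers.

```python
import hashlib


class Bucket:
    """A fixed-capacity bucket that remembers how many hash bits it covers."""

    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.items = {}


class ExtendibleHash:
    """Illustrative extendible hash table (hypothetical, not Pomegranate's API)."""

    def __init__(self, bucket_capacity=4):
        self.global_depth = 1
        self.capacity = bucket_capacity
        self.directory = [Bucket(1, bucket_capacity), Bucket(1, bucket_capacity)]

    @staticmethod
    def _hash(key):
        # Assumption: any well-mixed hash works; SHA-256 is used for convenience.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _slot(self, key):
        # The low `global_depth` bits of the hash select a directory slot.
        return self._hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._slot(key)].items.get(key)

    def put(self, key, value):
        while True:
            bucket = self.directory[self._slot(key)]
            if key in bucket.items or len(bucket.items) < bucket.capacity:
                bucket.items[key] = value
                return
            # Only the overflowing bucket is split; the others are untouched.
            self._split(bucket)

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory; the new half mirrors the existing slots.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Slots whose newly significant bit is set now point at the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # Redistribute the old entries between the bucket and its sibling.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._slot(k)].items[k] = v


table = ExtendibleHash(bucket_capacity=4)
for n in range(1000):
    table.put(f"file-{n}", {"inode": n})
```

The property that matters for metadata indexing is that growth is incremental: an overflow splits one bucket and, at most, doubles a small directory of pointers, so existing entries elsewhere never have to be rehashed.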
A diagram of the Pomegranate architecture:
Make sure you also read the ☞ post on Pomegranate by Jeff Darcy, who graciously answered my call for comments:
- I can see how the Pomegranate scheme efficiently supports looking up a single file among billions, even in one directory (though the actual efficacy of the approach seems unproven). What’s less clear is how well it handles listing all those files, which is kind of a separate problem similar to range queries in a distributed K/V store.
- Another thing I wonder about is the scalability of Pomegranate’s approach to complex operations like rename. There’s some mention of a “reliable multisite update service” but without details it’s hard to reason further. This is a very important issue because this is exactly where several efforts to distribute metadata in other projects – notably Lustre – have foundered. It’s a very very hard problem, so if one’s goal is to create something “worthy for [the] file system community” then this would be a great area to explore further.
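Jeff's first point, that lookup and listing are separate problems, can be illustrated with a toy model. The sketch below assumes directory entries are spread over a fixed number of partitions by a plain hash of the file name; `PartitionedDirectory`, the partition count, and the `partitions_touched` counter are all hypothetical and say nothing about how Pomegranate actually handles listing.

```python
import hashlib
import heapq


class PartitionedDirectory:
    """Toy model of one directory whose entries are hash-partitioned."""

    def __init__(self, num_partitions=8):
        self.partitions = [dict() for _ in range(num_partitions)]
        self.partitions_touched = 0  # crude stand-in for network round trips

    def _partition_for(self, name):
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.partitions)

    def create(self, name, meta):
        self.partitions[self._partition_for(name)][name] = meta

    def lookup(self, name):
        # A point lookup hashes the name and contacts exactly one partition.
        self.partitions_touched += 1
        return self.partitions[self._partition_for(name)].get(name)

    def list_sorted(self):
        # Hashing destroys name order, so a listing must ask every partition
        # and merge the per-partition results.
        self.partitions_touched += len(self.partitions)
        runs = [sorted(p) for p in self.partitions]
        return list(heapq.merge(*runs))


d = PartitionedDirectory(num_partitions=8)
for n in range(100):
    d.create(f"img-{n:04d}.jpg", {"size": n})

d.partitions_touched = 0
d.lookup("img-0042.jpg")
lookup_cost = d.partitions_touched

d.partitions_touched = 0
names = d.list_sorted()
listing_cost = d.partitions_touched
```

In this model a lookup costs one partition visit no matter how many files exist, while a listing costs one visit per partition plus a merge, which is the same shape as a range query over a hash-partitioned K/V store.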
Original title and link for this post: Pomegranate: A Solution for Storing Tiny Little Files (published on the NoSQL blog: myNoSQL)