



Comparing Filesystem Performance in Virtual Machines

In the previous post I linked to, How not to benchmark Cassandra, Jonathan Ellis writes about testing on virtual machines:

This one is actually defensible: if you deploy on VMs, then benchmarking on VMs is the most relevant scenario for you. So as long as you understand the performance hit you’re taking, this can be a reasonable choice. However, care must be taken: noisy neighbors can skew results, especially when using smaller instance sizes. Even with larger instances, it’s much more difficult than you think to get consistent results.

This reminded me of a blog post I read earlier this year, authored by Mitchell Hashimoto, in which he compares the performance of filesystems in VirtualBox and VMware under different settings:

For years, the primary bottleneck for virtual machine based development environments with Vagrant has been filesystem performance. CPU differences are minimal and barely noticeable, and RAM only becomes an issue when many virtual machines are active. I spent the better part of yesterday benchmarking and analyzing common filesystem mechanisms, and now share those results here with you.
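Hashimoto's post has the full methodology; as a rough illustration of the kind of measurement involved, a micro-benchmark can be as simple as timing fsync'd small-file writes inside the guest. This is a toy sketch of my own, not his actual harness:

```python
import os
import tempfile
import time

def bench_small_writes(root, n, size=4096):
    """Time n fsync'd writes of small files under `root`; shared-folder
    filesystems in VMs tend to do far worse here than native disks."""
    payload = os.urandom(size)
    start = time.perf_counter()
    for i in range(n):
        with open(os.path.join(root, f"f{i}.dat"), "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write through the page cache
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    elapsed = bench_small_writes(d, n=200)
    print(f"200 writes in {elapsed:.3f}s ({200 / elapsed:.0f} files/s)")
```

Run the same script natively, in a VM on its virtual disk, and in a VM on a shared folder, and the gap Hashimoto describes becomes visible immediately.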

Original title and link: Comparing Filesystem Performance in Virtual Machines (NoSQL database©myNoSQL)


Counterpoint: Why Some Hadoop Adapters Make *Perfect* Sense

Jeff Darcy’s reply to Daniel Abadi’s “Why Database-To-Hadoop Connectors Are Fundamentally Flawed and Entirely Unnecessary“:

Going back to Daniel’s argument, RDBMS-to-Hadoop connectors are indeed silly because they incur a migration cost without adding semantic value. Moving from one structured silo to another structured silo really is a waste of time. That is also exactly why filesystem-to-Hadoop connectors do make sense, because they flip that equation on its head — they do add semantic value, and they avoid a migration cost that would otherwise exist when importing data into HDFS. Things like GlusterFS’s UFO or MapR’s “direct access NFS” decrease total time to solution vs. the HDFS baseline.

Original title and link: Counterpoint: Why Some Hadoop Adapters Make *Perfect* Sense (NoSQL database©myNoSQL)


Disks From the Perspective of a File System

Marshall Kirk McKusick prefacing an ACM article about the complexity of dealing with hard disks:

Disks lie. And the controllers that run them are partners in crime.

The article focuses on some of the traps a file system may fall into if it doesn't know all the details of the underlying hardware: fsync notifications arriving while the data is still only in the disk controller's buffer, ATA disks not correctly implementing Tagged Command Queuing, etc.

Original title and link: Disks From the Perspective of a File System (NoSQL database©myNoSQL)


Log-Structured File Systems: There's One in Every SSD

An article from 2009 by Valerie Aurora:

When you say “log-structured file system,” most storage developers will immediately think of Ousterhout and Rosenblum’s classic paper, The Design and Implementation of a Log-structured File System - and the nearly two decades of subsequent work attempting to solve the nasty segment cleaner problem (see below) that came with it. Linux developers might think of JFFS2, NILFS, or LogFS, three of several modern log-structured file systems specialized for use with solid state devices (SSDs). Few people, however, will think of SSD firmware. The flash translation layer in a modern, full-featured SSD resembles a log-structured file system in several important ways. Extrapolating from log-structured file systems research lets us predict how to get the best performance out of an SSD. In particular, full support for the TRIM command, at both the SSD and file system levels, will be key for sustaining long-term peak performance for most SSDs.
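For readers who haven't met the segment cleaner problem: a log-structured store never overwrites data in place, so stale versions accumulate until a cleaner copies the live data forward, which is exactly what an SSD's garbage collector does and why TRIM (telling the device which blocks are dead) matters. A toy in-memory sketch of my own, not any real FTL:

```python
class LogStructuredStore:
    """Toy log-structured store: every write appends to the log and an
    index maps each key to the offset of its latest entry. Overwritten
    entries become garbage that only the cleaner reclaims."""

    def __init__(self):
        self.log = []      # append-only list of (key, value) entries
        self.index = {}    # key -> offset of the live entry

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        return self.log[self.index[key]][1]

    def clean(self):
        """Segment cleaner: copy live entries into a fresh log, dropping
        obsolete versions (analogous to SSD garbage collection)."""
        live = [(k, self.log[off][1]) for k, off in self.index.items()]
        self.log, self.index = [], {}
        for k, v in live:
            self.put(k, v)

store = LogStructuredStore()
store.put("a", 1)
store.put("a", 2)   # the first version of "a" is now garbage
store.put("b", 3)
store.clean()
print(len(store.log), store.get("a"))  # 2 2
```

The hard part, as two decades of research showed, is doing `clean()` cheaply and at the right time under real workloads.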

Original title and link: Log-Structured File Systems: There’s One in Every SSD (NoSQL database©myNoSQL)


Scaling Filesystems vs. Other Things

Before moving back to NoSQL databases, I wanted to stay in the land of file systems for a conversation between Jeff Darcy and David Strauss about using file systems at large scale and for high availability:

As I see it, aggregating local filesystems to provide a single storage pool with a filesystem interface and aggregating local filesystems to provide a single storage pool with another interface (such as a column-oriented database) aren’t even different enough to say that one is definitely preferable to the other. The same fundamental issues, and many of the same techniques, apply to both. Saying that filesystems are the wrong way to address scale is like saying that a magnetic #3 Phillips screwdriver is the wrong way to turn a screw. Sometimes it is exactly the right tool, and other times the “right” tool isn’t as different from the “wrong” tool as its makers would have you believe.

Original title and link: Scaling Filesystems vs. Other Things (NoSQL database©myNoSQL)


ReFS: The Next Generation File System for Windows

Triggered by the last two podcasts of John Siracusa1, over the last few days I've read quite a bit about ZFS and about Microsoft's new file system, ReFS. The article I'm linking to contains quite a few interesting bits about ReFS, but the following parts caught my attention:

  1. ReFS is not a log structured file system due to torn writes:

    One of the approaches we considered and rejected was to implement a log structured file system. This approach is unsuitable for the type of general-purpose file system required by Windows. NTFS relies on a journal of transactions to ensure consistency on the disk. That approach updates metadata in-place on the disk and uses a journal on the side to keep track of changes that can be rolled back on errors and during recovery from a power loss. One of the benefits of this approach is that it maintains the metadata layout in place, which can be advantageous for read performance. The main disadvantages of a journaling system are that writes can get randomized and, more importantly, the act of updating the disk can corrupt previously written metadata if power is lost at the time of the write, a problem commonly known as torn write.

  2. ReFS uses B+ trees instead:

    On-disk structures and their manipulation are handled by the on-disk storage engine. This exposes a generic key-value interface, which the layer above leverages to implement files, directories, etc. For its own implementation, the storage engine uses B+ trees exclusively. In fact, we utilize B+ trees as the single common on-disk structure to represent all information on the disk. Trees can be embedded within other trees (a child tree’s root is stored within the row of a parent tree). On the disk, trees can be very large and multi-level or really compact with just a few keys and embedded in another structure. This ensures extreme scalability up and down for all aspects of the file system. Having a single structure significantly simplifies the system and reduces code. The new engine interface includes the notion of “tables” that are enumerable sets of key-value pairs. Most tables have a unique ID (called the object ID) by which they can be referenced. A special object table indexes all such tables in the system.
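The second point is easier to picture with a toy model. Dicts stand in for the on-disk B+ trees here; the shape of the design is what matters: tables of key-value rows, an object table indexing all tables, and child tables embedded as row values.

```python
class Table(dict):
    """An enumerable set of key-value rows."""

object_table = Table()                  # object ID -> table

def create_table(object_id):
    table = Table()
    object_table[object_id] = table
    return table

# A directory is a table mapping names to file records; a file
# record is itself a table, embedded as a row value in its parent.
root_dir = create_table("root")
root_dir["readme.txt"] = Table(size=4096, attrs="archive")

print(sorted(object_table))             # ['root']
print(root_dir["readme.txt"]["size"])   # 4096
```

Files, directories, and metadata all reduce to one structure, which is exactly the simplification the ReFS team is describing.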

Even if you are not a file systems expert, this is an interesting read.

  1. Podcasts were a pleasant companion while I was sick. 

Original title and link: ReFS: The Next Generation File System for Windows (NoSQL database©myNoSQL)


NetApp Hadoop Shared DAS

In preparation for the EMC Hadoop-related announcement:

Shared DAS addresses the inevitable storage capacity growth requirements of Hadoop nodes in a cluster by placing disks in an external shelf shared by multiple directly attached hosts (aka Hadoop compute nodes). The connectivity from host to disk can be SATA, SAS, SCSI or even Ethernet, but always in a direct rather than networked storage configuration.


Therefore the three dimensions of Shared DAS benefit are:

  1. NetApp E-Series Shared DAS solutions can dramatically reduce the amount of background replication tasks by employing highly efficient RAID configurations to offload post-disk failure reconstruction tasks from the Hadoop cluster compute nodes and cluster network,
  2. When compared against single disk I/O configuration of regular Hadoop nodes, NetApp E-Series Shared DAS enables significantly higher disk I/O bandwidth at lower latency due to wide striping within the shelf, and finally,
  3. NetApp E-Series Shared DAS improves storage efficiency by reducing the number of object replicas within a rack using low-overhead high-performance RAID. Fewer replicas mean fewer disks to buy or more objects stored within the same infrastructure.

But it can also be connected to DataStax Brisk.

Original title and link: NetApp Hadoop Shared DAS (NoSQL databases © myNoSQL)


An Introduction to Distributed Filesystems

Jeff Darcy obliged:

[…] when should one consider using a distributed filesystem instead of an oh-so-fashionable key/value or table/document store for one’s scalable data needs? First, when the data and API models fit. Filesystems are good at hierarchical naming and at manipulating data within large objects (beyond the whole-object GET and PUT of S3-like systems), but they’re not so good for small objects and don’t offer the indices or querying of databases (SQL or otherwise). Second, it’s necessary to consider the performance/cost curve of a particular workload on a distributed filesystem vs. that on some other type of system. If there’s a fit for data model and API and performance, though, I’d say a distributed filesystem should often be preferred to other options. The advantage of having something that’s accessible from every scripting language and command-line tool in the world, without needing special libraries, shouldn’t be taken lightly. Getting data in and out, or massaging it in any of half a million ways, is a real problem that isn’t well addressed by any storage system with a “unique” API (including REST-based ones) no matter how cool that storage system might be otherwise.

Original title and link: An Introduction to Distributed Filesystems (NoSQL databases © myNoSQL)


Pomegranate: A Solution for Storing Tiny Little Files

Storing small files is a problem that many (file) systems have tried to solve with varying degrees of success. Hadoop had to tackle this problem too and came up with Hadoop Archive. Pomegranate is a new distributed file system that focuses on improving the performance of storing and accessing small files:

  • It handles billions of small files efficiently, even in a single directory;
  • It provides a separate, scalable caching layer, which can be snapshotted;
  • The storage layer uses a log-structured store to absorb small file writes and utilize disk bandwidth;
  • It builds a global namespace for both small and large files;
  • Columnar storage exploits temporal and spatial locality;
  • A distributed extendible hash indexes the metadata;
  • Snapshot-able and reconfigurable caching increases parallelism and tolerates failures;
  • Pomegranate may be the first file system built on top of tabular storage, and the experience of building it should be valuable to the file system community.
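The distributed extendible hash bullet is the key to the one-huge-directory claim: hash-partitioning directory entries spreads even a single directory across many metadata servers. A toy sketch of the idea (the server names and the count of four are made up for illustration):

```python
import hashlib

# Each entry's metadata server is chosen by hashing its full name,
# so no single server holds a whole directory.
SERVERS = [f"mds{i}" for i in range(4)]

def metadata_server(dirname, filename):
    digest = hashlib.md5(f"{dirname}/{filename}".encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# 1000 entries in one directory spread over the whole cluster:
placements = {metadata_server("/huge", f"f{i}") for i in range(1000)}
print(sorted(placements))
```

The flip side, as Jeff Darcy points out below, is that listing the directory now means querying every server, the same tension range queries create in a distributed key-value store.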

A diagram of the Pomegranate architecture:


Make sure you also read Jeff Darcy's post on Pomegranate; he graciously answered my call for comments:

  • I can see how the Pomegranate scheme efficiently supports looking up a single file among billions, even in one directory (though the actual efficacy of the approach seems unproven). What’s less clear is how well it handles listing all those files, which is kind of a separate problem similar to range queries in a distributed K/V store.
  • Another thing I wonder about is the scalability of Pomegranate’s approach to complex operations like rename. There’s some mention of a “reliable multisite update service” but without details it’s hard to reason further. This is a very important issue because this is exactly where several efforts to distribute metadata in other projects – notably Lustre – have foundered. It’s a very very hard problem, so if one’s goal is to create something “worthy for [the] file system community” then this would be a great area to explore further.

Original title and link for this post: Pomegranate: A Solution for Storing Tiny Little Files (published on the NoSQL blog: myNoSQL)


Hadoop: The Problem of Many Small Files

On why storing small files in HDFS is inefficient and how to solve this issue using Hadoop Archive:

When there are many small files stored in the system, these small files occupy a large portion of the namespace. As a consequence, the disk space is underutilized because of the namespace limitation. In one of our production clusters, there are 57 million files smaller than 128 MB, which means that these files contain only one block. These small files use up 95% of the namespace but only occupy 30% of the disk space.
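A back-of-the-envelope calculation shows why this hurts. Each file and each block is commonly estimated to cost on the order of 150 bytes of NameNode heap; that is a rough rule of thumb, not an exact figure:

```python
# Approximate NameNode heap cost per namespace object (file or block).
BYTES_PER_OBJECT = 150

small_files = 57_000_000                 # files < 128 MB: one block each
objects = small_files * 2                # one inode + one block per file
namespace_bytes = objects * BYTES_PER_OBJECT

print(f"~{namespace_bytes / 2**30:.1f} GiB of NameNode memory")  # ~15.9 GiB
```

Packing those files into Hadoop Archives collapses millions of namespace objects into a handful, which is the whole point of the feature.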

Hadoop: The Problem of Many Small Files originally posted on the NoSQL blog: myNoSQL


Redis-based Replication-friendly Filesystem

A new project to add to the list of filesystem interfaces on top of NoSQL stores:

So with a little creative thought you end up with a filesystem which is entirely stored in Redis. At this point you’re thinking “Oooh shiny. Fast and shiny”. But then you think “Redis has built in replication support…”

Meanwhile, the creator of Redis is trying to put a Redis API on top of a filesystem.
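The mapping is easy to picture: one Redis key per file's contents and one set per directory listing. Here is a toy sketch with a dict-based stand-in for a Redis client (the key-naming scheme is mine, not the project's; with redis-py you would call set/get/sadd/smembers the same way):

```python
class FakeRedis:
    """Minimal in-memory stand-in for the few Redis commands used below."""
    def __init__(self):
        self.kv, self.sets = {}, {}
    def set(self, k, v): self.kv[k] = v
    def get(self, k): return self.kv.get(k)
    def sadd(self, k, m): self.sets.setdefault(k, set()).add(m)
    def smembers(self, k): return self.sets.get(k, set())

r = FakeRedis()

def write_file(path, data):
    directory, _, name = path.rpartition("/")
    r.sadd(f"dir:{directory or '/'}", name)   # directory entry
    r.set(f"file:{path}", data)               # file contents

write_file("/etc/motd", b"hello")
print(r.smembers("dir:/etc"), r.get("file:/etc/motd"))
```

And because every read and write is just a Redis command, built-in replication comes along for free, which is exactly the "Redis has built in replication support" realization in the quote.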


BIGDIS: A File System Backed Key-Value Store for Large Values

Salvatore Sanfilippo (@antirez), creator and main developer of Redis, has shared a new project named BIGDIS:

Bigdis is a weekend experiment about writing a Redis “clone” implementing a very strict subset of Redis commands (plain key-value basically) and using the file system as back end, that is: every key is represented as a file.

What is the goal of such a monster you may ask? Short answer: storing very large values.

Many kinds of DBs are not well suited for storing large “files” as values. I mean things like images, or videos, and so forth. Still, in the web-scale era it is very convenient to be able to access this kind of object in a distributed fashion, with a networking layer, possibly with a protocol that already has a large number of tested implementations.
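To make the "every key is represented as a file" idea concrete, here is a toy sketch of a file-backed key-value store. This is my own illustration, not BIGDIS's actual code or protocol:

```python
import hashlib
import os
import tempfile

class FileBackedKV:
    """Toy key-value store where each key is backed by one file on disk."""

    def __init__(self, root):
        self.root = root

    def _path(self, key):
        # Hash the key so arbitrary key strings map to safe filenames.
        return os.path.join(self.root, hashlib.sha1(key.encode()).hexdigest())

    def set(self, key, value):
        with open(self._path(key), "wb") as f:
            f.write(value)

    def get(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

with tempfile.TemporaryDirectory() as d:
    kv = FileBackedKV(d)
    kv.set("video:42", b"\x00" * 1024)   # large blobs are the target use case
    blob = kv.get("video:42")
print(len(blob))  # 1024
```

The filesystem does what it is good at (streaming large values), while the Redis protocol layer provides the networked, widely implemented access interface.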

While the goal is clearly stated in the description above, I'm not sure what scenarios this new tool is targeting. For example, what are the advantages of using such a tool over, say, Amazon S3?

Another thing worth pointing out is that BIGDIS goes in the opposite direction from filesystem interfaces to NoSQL databases: BIGDIS offers a simplified Redis API on top of the filesystem, while the latter aim to provide a filesystem interface on top of NoSQL solutions.