


GridFS: All content tagged as GridFS in NoSQL databases and polyglot persistence

MongoDB GridFS Over HTTP With mod_gridfs

Aristarkh Zagordnikov wrote me an email describing the reasons that led his company to create and open source mod_gridfs.

Some time ago we were looking for a way to serve files to the web straight from a GridFS database. We considered several options: an IIS handler (we use .NET on Windows as a backend), which requires a Windows machine to serve files (we planned to use Windows for the backend only); and nginx-gridfs, which was too slow (it is synchronous while nginx isn’t, and it uses a not-very-up-to-date MongoDB C driver that doesn’t do connection pooling, etc.) and does not support slaveOk (horizontal sharding).

At last I decided to roll our own method: a module for Apache 2.2 or higher that uses MongoDB’s own C++ driver. It supports replica sets, slaveOk reads, proper output caching headers (Last-Modified, Etag, Cache-Control, Expires), properly responds to conditional requests (If-Modified-Since/If-None-Match), and uses Apache brigade API to serve large files with less in-memory copying.
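To make the conditional-request part concrete, here is a minimal Python sketch of the decision a server takes before answering 304 Not Modified. It is not mod_gridfs code: the helper names `make_validators` and `is_not_modified`, and the idea of using the stored file's MD5 as the ETag, are assumptions made for illustration only.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def make_validators(md5_hex, upload_date):
    """Build ETag and Last-Modified values (upload_date must be UTC-aware)."""
    return {
        "ETag": '"%s"' % md5_hex,  # assumption: MD5 of the file used as ETag
        "Last-Modified": format_datetime(upload_date, usegmt=True),
    }


def is_not_modified(request_headers, md5_hex, upload_date):
    """Return True when the request's validators still match the file."""
    etag = '"%s"' % md5_hex
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = []
        for token in if_none_match.split(","):
            token = token.strip()
            if token.startswith("W/"):  # weak comparison ignores the prefix
                token = token[2:]
            candidates.append(token)
        return "*" in candidates or etag in candidates
    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            since_dt = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            return False  # unparseable date: serve the full response
        if since_dt.tzinfo is None:
            since_dt = since_dt.replace(tzinfo=timezone.utc)
        # HTTP dates only have one-second resolution.
        return upload_date.replace(microsecond=0) <= since_dt
    return False
```

When `is_not_modified` returns True the server can send 304 with no body, so a front-end cache such as nginx with proxy_cache can revalidate its copy without transferring the file again.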

While Apache isn’t the most resource-friendly server for a high-load environment (it consumes too much memory per connection and does not yet support production-quality event-based I/O), it really shines as a backend for something like nginx+proxy_cache with optional SSD as proxy_cache storage that does the heavy lifting.

Serving a 4KiB file over a gigabit network on modern hardware, 100 concurrent requests, MongoDB replica set of 3 machines as a backend:

  • NGINX + nginx-gridfs: 1.2kr/s
  • Apache + mod_gridfs: 6.6kr/s
  • Apache + mod_gridfs with slaveOk: 12.1kr/s

I didn’t test with larger files, because that way I’d be benchmarking OS I/O performance instead of user-mode code.

The public Mercurial repo is here. It uses the Simplified (2-clause) BSD license, and contains installation instructions and docs in the README file (building might seem hard, but once it is built, mass deployment only requires installing dependent libraries like boost and copying the file around).

Original title and link: MongoDB GridFS Over HTTP With Mod_gridfs (NoSQL database©myNoSQL)

MongoDB Replica Sets and Sharding for GridFS as a Distributed File System

Contrary to many MongoDB deployments, we primarily use it for storing files in GridFS. We switched over to MongoDB after searching for a good distributed file system for years. Prior to MongoDB we used a regular NFS share, sitting on top of a HAST-device. That worked great, but it didn’t allow us to scale horizontally the way a distributed file system allows.

No doubt GridFS is a useful feature of MongoDB, but I’m pretty sure the experts in distributed file systems have better solutions for this—I just hope they’ll share it with us.

Update: Jeff Darcy[1]:

Yes, we do have better solutions for this particular kind of use case.  So do object/blob stores like Swift.  

Honestly, I don’t think the “searching for a good distributed filesystem” part is even credible. How can someone be that bad at finding readily available information? For example, it’s easier to set up sharding and replication with GlusterFS than with MongoDB and GridFS, plus you’ll get striping and RDMA and generally better performance for this type of workload. On top of all that, you won’t need to use special libraries to interface with it because it’s a regular POSIX filesystem. Lastly, it’s not like there hasn’t been a lot of press about it. Even considering their obvious FreeBSD bias and the fact that FreeBSD is weak in this area, the second item for “FreeBSD distributed filesystem” points to GlusterFS. If they didn’t find it, they just didn’t look very hard before they reached for the New Shiny.

It’s not just GlusterFS, either.  MogileFS might not be a real filesystem but it’s user space so it would probably run just fine in their environment - as would the aforementioned Swift.  I have more of a problem with the anti-Mongo haters than with Mongo itself, it’s wonderful that these guys found a Mongo-based solution that works for them, but it seems like a bit of an odd choice nonetheless.

  1. Jeff Darcy is a member of the advisory board, and works on GlusterFS full time at Red Hat. He’s also the person to whom I direct all my questions about distributed file systems (and more).

Original title and link: MongoDB Replica Sets and Sharding for GridFS as a Distributed File System (NoSQL database©myNoSQL)


MongoDB GridFS

Did you know that files read from GridFS are streamed rather than loaded entirely into memory?

GridFS splits a file into small chunks and stores them in a special chunks collection. Each file also has metadata (filename, content type, and custom metadata) stored in a files collection.

GridFS permits range operations, thus one could retrieve only specific ranges of bytes from the file. (nb: I couldn’t find the API for this operation though, so maybe this is not exposed as API in the drivers).
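To illustrate why chunked storage makes streaming and byte-range reads cheap, here is a small in-memory Python sketch. It does not talk to MongoDB or use any driver API: the `files` and `chunks` dictionaries stand in for the two collections, and the helper names `put_file` and `read_range`, the field names, and the tiny 4-byte chunk size are all assumptions chosen for readability.

```python
import math

CHUNK_SIZE = 4  # illustrative only; real GridFS chunks are far larger


def put_file(files, chunks, file_id, data, filename, content_type,
             chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and keep metadata separately."""
    files[file_id] = {
        "filename": filename,
        "contentType": content_type,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    for n in range(math.ceil(len(data) / chunk_size)):
        chunks[(file_id, n)] = data[n * chunk_size:(n + 1) * chunk_size]


def read_range(files, chunks, file_id, start, end):
    """Yield the bytes in [start, end), touching only overlapping chunks."""
    meta = files[file_id]
    end = min(end, meta["length"])
    if start >= end:
        return
    size = meta["chunkSize"]
    first, last = start // size, (end - 1) // size
    for n in range(first, last + 1):
        chunk = chunks[(file_id, n)]
        lo = start - n * size if n == first else 0
        hi = end - n * size if n == last else len(chunk)
        yield chunk[lo:hi]  # one chunk at a time, never the whole file


files, chunks = {}, {}
put_file(files, chunks, "f1", b"hello gridfs!", "hello.txt", "text/plain")
part = b"".join(read_range(files, chunks, "f1", 3, 10))
```

Because `read_range` is a generator, a caller can write each piece to a socket as it arrives, which is the same reason a reader over real chunks never needs the whole file in memory.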

Official GridFS documentation:

Original title and link: MongoDB GridFS (NoSQL databases © myNoSQL)

Harmony Migration to Using GridFS

When we switched to MongoDB over a year ago, we decided it would be all or nothing. Everything in Harmony is stored in Mongo and that includes users, accounts, sites, content, stylesheets, javascripts, and yes, even assets.

The promise of polyglot persistence comes from using the right tool for each job, not from dropping X in favor of Y. Unfortunately, the post doesn’t provide enough details to explain why GridFS was chosen.

Original title and link: Harmony Migration to Using GridFS (NoSQL databases © myNoSQL)


3 Reasons to Use MongoDB

Ryan Angilly:

MongoDB is teh awesome because of a simple query syntax, the ability to shard data across machines easily, and the ability to store files in GridFS while taking advantage of replication & sharding.

Indeed, I think the combination of query syntax and GridFS makes MongoDB unique.

Sharding is supported by many other NoSQL databases, and for many of them things are even simpler than having mongod, mongos, etc. Among document databases, CouchDB has recently got BigCouch to address the scaling issue[1].

As regards querying, one could say that having MapReduce around would get you similar functionality to MongoDB queries. But from users’ familiarity with queries versus programmatic querying all the way to execution behavior, MongoDB queries and MapReduce are quite different.

  1. Even before BigCouch, there were different solutions for scaling CouchDB.

Original title and link for this post: 3 Reasons to Use MongoDB (published on the NoSQL blog: myNoSQL)


PHP, MongoDB and GridFS

The PHP annotations remind me of doing the same in Java pre-1.5. Still useful to separate metadata from your core code:

You can easily set up a Document that is stored using the MongoDB GridFS by using the @File annotation:

Update: there’s another post showing how to use PHP annotations for different types of embedded documents in MongoDB.


Serving files out of GridFS

Very interesting results from a test of serving files using Apache, nginx, and GridFS.

Solution                   Requests/sec  % Apache FS  % nginx FS  % nginx GridFS  % Apache Ruby
FS via Apache                   2625.37         100%      40.03%         242.22%      4,878.96%
FS via nginx                    6559.31      249.84%        100%         605.17%     12,189.76%
GridFS via nginx module         1083.88       41.28%      16.52%            100%      2,014.27%
Rails via Passenger               53.81        2.05%       0.82%           4.96%           100%
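
The percentage columns in the table above are simply each solution's requests/sec divided by a baseline's requests/sec. The short Python sketch below recomputes them from the Requests/sec column; the function name `relative_table` and the dictionary layout are assumptions for illustration, while the rates are copied from the table.

```python
def relative_table(rates):
    """Express each solution's throughput as a percentage of every baseline."""
    return {
        name: {
            base: round(100.0 * rate / base_rate, 2)
            for base, base_rate in rates.items()
        }
        for name, rate in rates.items()
    }


# Requests/sec as reported in the table.
rates = {
    "FS via Apache": 2625.37,
    "FS via nginx": 6559.31,
    "GridFS via nginx module": 1083.88,
    "Rails via Passenger": 53.81,
}

table = relative_table(rates)
for name, row in table.items():
    print(name, row)
```

Reading a row then answers "how fast is this solution relative to that one": for example, `table["GridFS via nginx module"]["FS via nginx"]` is the GridFS module's throughput as a percentage of nginx serving from the file system.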


GridFS: The MongoDB Storage for Large Files

A quick intro from John Nunemaker on GridFS, the MongoDB storage for large files:

The good news is that the API for storing files in GridFS using Ruby is nearly identical to using Ruby’s File class. Unfortunately, that is also the bad news, in my opinion, as I find Ruby’s File open, read and close a bit awkward.

Also make sure you check out the GridFS libraries for more cool projects.