A three-part series from Hortonworks showing how Pig can be used with MongoDB, HBase, and Cassandra:
Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.
- Part 1: Pig, MongoDB and Node.js
- Part 2: Pig, HBase, JRuby and Sinatra
- Part 3: TF-IDF Topics with Cassandra, Python Streaming and Flask
Original title and link: Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra ( ©myNoSQL)
It’s unfortunate that the post focuses mostly on the usage of Spring and RabbitMQ, and that the slide deck doesn’t dive deeper into the architecture, data flows, and data stores, but the diagrams below should give you an idea of this truly polyglot persistence architecture:
The slide deck, presenting architecture principles and numbers about the platform, is after the break.
Aristarkh Zagordnikov wrote me an email describing the reasons that led his company to create and open-source mod_gridfs.
Some time ago we were looking for a way to serve files to the web straight from GridFS. We considered different options. An IIS handler (we use .NET on Windows as a backend) would require a Windows machine to serve files, and we planned to use Windows only for the backend. nginx-gridfs was too slow, because it’s synchronous while nginx isn’t, and it uses a rather outdated MongoDB C driver that doesn’t do connection pooling and doesn’t support slaveOk (reads from replica-set secondaries).
At last I decided to roll our own: a module for Apache 2.2 or higher that uses MongoDB’s own C++ driver. It supports replica sets, slaveOk reads, proper output caching headers (Last-Modified, ETag, Cache-Control, Expires), properly responds to conditional requests (If-Modified-Since/If-None-Match), and uses Apache’s bucket brigade API to serve large files with less in-memory copying.
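To make the conditional-request behavior concrete, here is a minimal Python sketch of the validator logic described above — not mod_gridfs’s actual code, just the RFC-style comparison of the client’s If-None-Match/If-Modified-Since headers against a file’s ETag and modification time:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond(file_etag, file_mtime, request_headers):
    """Return (status, headers) for a GET against a file with the given validators."""
    headers = {
        "ETag": file_etag,
        "Last-Modified": format_datetime(file_mtime, usegmt=True),
        "Cache-Control": "public, max-age=3600",
    }
    # If-None-Match takes precedence over If-Modified-Since.
    if request_headers.get("If-None-Match") == file_etag:
        return 304, headers
    ims = request_headers.get("If-Modified-Since")
    if ims is not None and parsedate_to_datetime(ims) >= file_mtime:
        return 304, headers
    # Validators don't match: serve the full body with caching headers.
    return 200, headers
```

A client that sends back the ETag it previously received gets a 304 Not Modified instead of the file body, which is exactly what makes a caching front end effective.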
While Apache isn’t the most resource-friendly server for a high-load environment (it consumes too much memory per connection and does not yet support production-quality event-based I/O), it really shines as a backend behind something like nginx + proxy_cache (optionally with an SSD as the proxy_cache storage) that does the heavy lifting.
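A hypothetical nginx front end for such a setup might look like the fragment below; the ports, paths, and zone name are illustrative assumptions, not taken from the project’s docs:

```nginx
# Cache on an SSD-backed path; tune sizes to your working set.
proxy_cache_path /ssd/nginx-cache levels=1:2 keys_zone=gridfs:64m max_size=10g inactive=1h;

server {
    listen 80;
    location / {
        proxy_pass http://127.0.0.1:8080;  # Apache + mod_gridfs backend (assumed port)
        proxy_cache gridfs;
        proxy_cache_valid 200 1h;          # or rely on the backend's Cache-Control/Expires
    }
}
```

With this in place, nginx absorbs the bulk of repeat traffic and only cache misses reach Apache and MongoDB.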
Serving a 4KiB file over a gigabit network on modern hardware, 100 concurrent requests, MongoDB replica set of 3 machines as a backend:
- nginx + nginx-gridfs: 1.2k req/s
- Apache + mod_gridfs: 6.6k req/s
- Apache + mod_gridfs with slaveOk: 12.1k req/s
I didn’t test with larger files, because that would benchmark the OS’s I/O performance rather than the user-mode code.
The public Mercurial repo is here. It is released under the simplified 2-clause BSD license and contains installation instructions and docs in the README file. (Building might seem hard, but once built, if you have to mass-deploy you just install the dependent libraries, like Boost, and copy the mod_gridfs.so file around.)
Original title and link: MongoDB GridFS Over HTTP With Mod_gridfs ( ©myNoSQL)
Just found a slide deck (embedded below) describing the data workflow at Klout. Their architecture includes many interesting pieces, combining NoSQL and relational databases with Hadoop, Hive, Pig, and traditional BI. Even Excel gets a mention in the slides:
- Pig and Hive
- Elasticsearch
A two-part MongoDB tutorial for PHP developers: the first part covers installation and the second gives a brief intro to the MongoDB PHP API for queries and indexes. While not a tutorial that will get you fully up and running, it’s still a good resource for beginners.
Original title and link: MongoDB for PHP Developers ( ©myNoSQL)