mongodb: All content tagged as mongodb in NoSQL databases and polyglot persistence
I’ve spent most of my career in startups or small companies that sometimes interacted with large corporations. I’ve also worked a couple of years within a large corporation, but I’ve never been through the transition from startup to corporation.
This is the phase 10gen, the company behind MongoDB, is in right now and they are hiring positions like VP of business development (Ed Albanese, ex-Cloudera), VP of corporate strategy (Matt Asay, ex-Nodeable, Alfresco, Canonical), and VP of services and product management (Ron Avnur, ex-MarkLogic).
In his first post for 10gen, Matt Asay cites 10gen president Max Schireson:
By far our most important competitor is Oracle. After that it’s Oracle, Oracle and Oracle. I see other NoSQL players such as DataStax [distributor of Apache’s Cassandra] and CouchDB as comrades in arms in the battle to persuade people that the answer does not have to be Oracle.
Original title and link: 10gen Transitioning From Startup to Corporation ( ©myNoSQL)
A three part article from Hortonworks showing how Pig can be used with MongoDB, HBase, and Cassandra:
Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.
- Part 1: Pig, MongoDB and Node.js
- Part 2: Pig, HBase, JRuby and Sinatra
- Part 3: TF-IDF Topics with Cassandra, Python Streaming and Flask
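To make the “duct tape” idea concrete, here is a hypothetical sketch of what such a Pig script can look like: loading flat data and storing it straight into MongoDB via the mongo-hadoop connector’s `MongoStorage` (the jar path, input path, and field names are made up; the connector jars must be registered first):

```pig
-- Hypothetical sketch: move data from HDFS into MongoDB with mongo-hadoop.
-- Register the connector and MongoDB Java driver jars (paths are assumptions).
REGISTER /path/to/mongo-hadoop-pig.jar;

emails = LOAD '/data/emails.csv' USING PigStorage(',')
         AS (from:chararray, to:chararray, subject:chararray);

STORE emails INTO 'mongodb://localhost/mail.messages'
      USING com.mongodb.hadoop.pig.MongoStorage();
```

The same few-lines pattern applies in reverse (loading from a store into Pig) and to the HBase and Cassandra parts of the series, each with its own storage class.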
Original title and link: Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra ( ©myNoSQL)
It’s unfortunate that the post focuses mostly on the usage of Spring and RabbitMQ and that the slide deck doesn’t dive deeper into the architecture, data flows, and data stores, but the diagrams below should give you an idea of this truly polyglot persistence architecture:
The slide deck presenting architecture principles and numbers about the platform after the break.
Aristarkh Zagordnikov wrote me an email describing the reasons that led his company to create and open source mod_gridfs.
Some time ago we were looking for a way to serve files to the web straight from the GridFS database. We considered different options: an IIS handler (we use .NET on Windows as a backend), which would require a Windows machine to serve files (we planned to use Windows only for the backend), and nginx-gridfs, which was too slow (it’s synchronous while nginx isn’t, and it uses an outdated MongoDB C driver that doesn’t do connection pooling, etc.) and doesn’t support slaveOk reads (horizontal read scaling).
In the end I decided to roll our own: a module for Apache 2.2 or higher that uses MongoDB’s own C++ driver. It supports replica sets, slaveOk reads, proper output caching headers (Last-Modified, ETag, Cache-Control, Expires), responds correctly to conditional requests (If-Modified-Since/If-None-Match), and uses the Apache brigade API to serve large files with less in-memory copying.
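As an illustration of the conditional-request handling described above (this is not mod_gridfs’s actual code, which is C++; it is a minimal Python sketch of the standard If-None-Match / If-Modified-Since decision):

```python
from email.utils import parsedate_to_datetime


def is_not_modified(req_headers, etag, last_modified):
    """Decide whether a conditional GET can be answered with 304 Not Modified.

    `req_headers` is a dict of request headers, `etag` the resource's ETag
    (quoted string), `last_modified` an aware datetime. If-None-Match takes
    precedence over If-Modified-Since, as in HTTP's conditional-request rules.
    """
    inm = req_headers.get("If-None-Match")
    if inm is not None:
        return inm == etag or inm == "*"

    ims = req_headers.get("If-Modified-Since")
    if ims is not None:
        try:
            # Resource unchanged if it was last modified at or before the
            # date the client already has.
            return parsedate_to_datetime(ims) >= last_modified
        except (TypeError, ValueError):
            return False  # unparsable date: fall back to a full response

    return False
```

A server returning 304 from this check skips the GridFS read entirely, which is what makes these headers matter for a file-serving module.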
While Apache isn’t the most resource-friendly server for a high-load environment (it consumes too much memory per connection and does not yet support production-quality event-based I/O), it really shines as a backend for something like nginx+proxy_cache with optional SSD as proxy_cache storage that does the heavy lifting.
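The nginx front described above can be sketched roughly like this (a hypothetical fragment: the cache path, zone size, and backend address are assumptions, and mod_gridfs itself is configured on the Apache side, not here):

```nginx
# Hypothetical nginx front caching responses from an Apache+mod_gridfs backend.
proxy_cache_path /ssd/nginx_cache levels=1:2 keys_zone=gridfs:64m max_size=10g;

server {
    listen 80;

    location /files/ {
        proxy_pass http://127.0.0.1:8080;  # Apache running mod_gridfs
        proxy_cache gridfs;
        proxy_cache_valid 200 10m;         # cache successful responses briefly
    }
}
```

With this setup nginx absorbs most of the connection load and repeated reads, while Apache only sees cache misses.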
Serving a 4KiB file over a gigabit network on modern hardware, 100 concurrent requests, MongoDB replica set of 3 machines as a backend:
- nginx + nginx-gridfs: 1.2k req/s
- Apache + mod_gridfs: 6.6k req/s
- Apache + mod_gridfs with slaveOk: 12.1k req/s
I didn’t test with larger files, because that would be benchmarking OS I/O performance instead of user-mode code.
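For a sense of scale, a quick check of the relative gains implied by the figures quoted above:

```python
# Requests/second from the benchmark quoted above.
nginx_gridfs = 1_200
mod_gridfs = 6_600
mod_gridfs_slaveok = 12_100

speedup_over_nginx = mod_gridfs / nginx_gridfs        # mod_gridfs vs nginx-gridfs
gain_from_slaveok = mod_gridfs_slaveok / mod_gridfs   # extra gain from slaveOk reads

print(round(speedup_over_nginx, 1))  # 5.5
print(round(gain_from_slaveok, 2))   # 1.83
```

So mod_gridfs is roughly 5.5× faster than nginx-gridfs on this workload, and allowing reads from secondaries nearly doubles throughput again.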
The public Mercurial repo is here. It’s released under the Simplified (2-clause) BSD license and contains installation instructions and docs in the README file. (Building might seem hard, but once built, mass-deploying only requires installing the dependent libraries, such as Boost, and copying the mod_gridfs.so file around.)
Original title and link: MongoDB GridFS Over HTTP With Mod_gridfs ( ©myNoSQL)
Just found a slide deck (embedded below) describing the data workflow at Klout. Their architecture includes many interesting pieces, combining NoSQL and relational databases with Hadoop, Hive, Pig, and traditional BI. Even Excel gets a mention in the slides:
- Pig and Hive
- Elasticsearch