


Apache: All content tagged as Apache in NoSQL databases and polyglot persistence

Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop


  • Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
  • Rudimentary ETL that transforms one data format into another.
  • Support for various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
  • Command line interface to allow users to submit SQL queries
  • Java API to enable clients to submit SQL queries to Tajo

Just another example of the way of the future.

Original title and link: Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop (NoSQL database©myNoSQL)


Open Source “Purity,” Hadoop, and Market Realities

Merv Adrian (Gartner):

The question is whether it is somehow inappropriate, even “evil,” for EMC to enter the market without having “enough” committers to open source Apache projects. More broadly, it’s about whether other people can use, incorporate, add to and profit from Apache Hadoop.

After reading a lot of reactions to EMC’s announcement, the question floating in my head was: how many similar complaints have I read about IBM, Amazon, and all the other companies that either distribute Hadoop or offer services around it without contributing directly to the Apache Hadoop project? None.

I love open source and I would love it if every business using an open source project found a way to contribute back. But the reality today is different. There are many businesses making use of open source and contributing nothing back. There are also numerous companies making money from open source and contributing back almost nothing. There are very few companies making money directly from their open source projects. And there are very few open source projects that receive any sort of funds to support their communities. Maybe things will change. Or maybe we should take another look at how the open source market works and come up with a different, more sustainable approach.

Original title and link: Open Source “Purity,” Hadoop, and Market Realities (NoSQL database©myNoSQL)


Using Apache to Save Data in Redis

Using a bash script and redis-cli to write Apache stats directly into Redis:

In one of my projects, I was using a Redis database to collect some statistics, and I thought of saving data into it at the Apache level. This would considerably speed up saving data, as it would not require Grails to intercept the request.
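
The linked post uses a bash script with redis-cli; the same idea can be sketched in Python for illustration. This is a minimal sketch, assuming Apache writes Common Log Format lines; the key names (stats:hits:total and so on) are hypothetical and not taken from the original script. It only builds the Redis commands, so it runs without a Redis server:

```python
import re

# Apache Common Log Format: host ident user [time] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def commands_for(line):
    """Return the Redis commands (as argument lists) that record one hit.

    The key names are hypothetical; pick whatever layout suits your stats.
    """
    m = CLF.match(line)
    if m is None:
        return []  # skip lines that are not in Common Log Format
    return [
        ["INCR", "stats:hits:total"],                      # overall hit counter
        ["INCR", "stats:status:" + m.group("status")],     # per-status counter
        ["ZINCRBY", "stats:paths", "1", m.group("path")],  # ranking of paths
    ]


sample = ('203.0.113.7 - - [10/Oct/2011:13:55:36 -0700] '
          '"GET /index.html HTTP/1.1" 200 2326')
for cmd in commands_for(sample):
    print(" ".join(cmd))
```

Each printed line is a command that could be piped to redis-cli or sent through a Redis client, which keeps the application framework out of the write path.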

Original title and link: Using Apache to Save Data in Redis (NoSQL database©myNoSQL)


MongoDB GridFS Over HTTP With mod_gridfs

Aristarkh Zagordnikov wrote me an email describing the reasons that led his company to create and open source mod_gridfs.

Some time ago we were looking for a way to serve files to the web right from the GridFS database. We considered different options, including IIS handler (we use .NET on Windows as a backend) that requires a Windows machine to serve files (we planned to use Windows as backend only), nginx-gridfs that was too slow (because it’s synchronous and nginx isn’t, and uses the not-very-much-up-to-date MongoDB C driver that doesn’t do connection pooling, etc.) and does not support slaveOk (horizontal sharding).

At last I decided to roll our own method: a module for Apache 2.2 or higher that uses MongoDB’s own C++ driver. It supports replica sets, slaveOk reads, proper output caching headers (Last-Modified, Etag, Cache-Control, Expires), properly responds to conditional requests (If-Modified-Since/If-None-Match), and uses Apache brigade API to serve large files with less in-memory copying.

While Apache isn’t the most resource-friendly server for a high-load environment (it consumes too much memory per connection and does not yet support production-quality event-based I/O), it really shines as a backend for something like nginx+proxy_cache with optional SSD as proxy_cache storage that does the heavy lifting.

Serving a 4KiB file over a gigabit network on modern hardware, 100 concurrent requests, MongoDB replica set of 3 machines as a backend:

  • NGINX + nginx-gridfs: 1.2kr/s
  • Apache + mod_gridfs: 6.6kr/s
  • Apache + mod_gridfs with slaveOk: 12.1kr/s

I didn’t test with larger files, because that way I would be benchmarking OS I/O performance instead of user-mode code.

The public Mercurial repo is here. It uses Simplified 2-clause BSD license, and contains installation instructions and docs in the README file (building might seem hard, but after building if you have to mass-deploy, you just install dependent libraries like boost and copy the file around).
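
The conditional-request handling mentioned in the email (If-None-Match and If-Modified-Since) can be illustrated with a short sketch. This is not mod_gridfs code, which is a C++ Apache module; it is a simplified Python illustration of the usual HTTP rule that If-None-Match takes precedence over If-Modified-Since. The function name and arguments are assumptions made for the example:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop the weak-validator prefix so W/"x" compares equal to "x"."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(headers, etag, last_modified):
    """Return 304 if the client's cached copy is still valid, else 200.

    headers: dict of request headers.
    etag: the stored file's entity tag, quotes included.
    last_modified: timezone-aware datetime of the stored file.
    """
    inm = headers.get("If-None-Match")
    if inm is not None:
        # When If-None-Match is present, If-Modified-Since is ignored.
        if inm.strip() == "*":
            return 304
        candidates = [_strip_weak(t) for t in inm.split(",")]
        return 304 if _strip_weak(etag) in candidates else 200

    ims = headers.get("If-Modified-Since")
    if ims is not None:
        try:
            since = parsedate_to_datetime(ims)
        except (TypeError, ValueError):
            return 200  # unparsable date: treat the request as unconditional
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop sub-second precision.
        if last_modified.replace(microsecond=0) <= since:
            return 304
    return 200


uploaded = datetime(2011, 10, 29, 19, 43, 31, tzinfo=timezone.utc)
print(conditional_status({"If-None-Match": '"abc"'}, '"abc"', uploaded))
```

A 304 response lets a caching proxy in front of Apache revalidate its copy without the file being read from GridFS again.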

Original title and link: MongoDB GridFS Over HTTP With Mod_gridfs (NoSQL database©myNoSQL)

Apache Mod_redis


This Apache module uses a rule-based engine (based on a regular expression parser) to map URLs to REDIS commands on the fly. It supports an unlimited number of rules and can match on the full URL and the request method (GET, POST, PUT or DELETE) to provide a very flexible option for defining a RESTful interface to REDIS.
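
To make the rule-matching idea concrete, here is a minimal Python sketch of mapping a request method and URL to a Redis command through regular-expression rules. The rule format, the placeholder syntax ($1, $body), and the URL layout are all hypothetical; they are not mod_redis's actual configuration syntax:

```python
import re

# Each rule: (HTTP method, URL pattern, Redis command template).
# "$1" stands for the first regex capture group, "$body" for the request body.
RULES = [
    ("GET",    re.compile(r"^/keys/([^/]+)$"),  ["GET", "$1"]),
    ("PUT",    re.compile(r"^/keys/([^/]+)$"),  ["SET", "$1", "$body"]),
    ("DELETE", re.compile(r"^/keys/([^/]+)$"),  ["DEL", "$1"]),
    ("POST",   re.compile(r"^/lists/([^/]+)$"), ["RPUSH", "$1", "$body"]),
]


def _expand(token, match, body):
    """Replace a template placeholder with the matching request data."""
    if token == "$body":
        return body
    if token.startswith("$"):
        return match.group(int(token[1:]))
    return token


def to_redis_command(method, url, body=""):
    """Return the Redis command for the first matching rule, or None."""
    for rule_method, pattern, template in RULES:
        if rule_method != method:
            continue
        match = pattern.match(url)
        if match is not None:
            return [_expand(token, match, body) for token in template]
    return None  # no rule matched; a server would answer 404 here


print(to_redis_command("PUT", "/keys/user:1", "alice"))
```

Rules are tried in order and the first match wins, so more specific patterns should be listed before general ones.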

Original title and link: Apache Mod_redis (NoSQL database©myNoSQL)

The Timeline of the Sqoop Project

A bit of history of yet another BigData-ish/NoSQLish graduating project:

A timeline of Sqoop Project

Original title and link: The Timeline of the Sqoop Project (NoSQL database©myNoSQL)


Accumulo: A New BigTable Inspired Distributed Key/Value by NSA

The National Security Agency has submitted to the Apache Incubator a proposal to open source Accumulo, a BigTable-inspired key-value store that it has been building since 2008. The project proposal page provides more details about Accumulo's history, its building blocks, and how it compares to the other open source BigTable implementation, HBase:

  • Access Labels: Accumulo has an additional portion of its key that sorts after the column qualifier and before the timestamp. It is called column visibility and enables expressive cell-level access control. Authorizations are passed with each query to control what data is returned to the user.

  • Iterators: Accumulo has a novel server-side programming mechanism that can modify the data written to disk or returned to the user. This mechanism can be configured for any of the scopes where data is read from or written to disk. It can be used to perform joins on data within a single tablet.

  • Flexibility: Accumulo places no restrictions on the column families. Also, each column family in HBase is stored separately on disk. Accumulo allows column families to be grouped together on disk, as does BigTable.

  • Logging: HBase uses a write-ahead log on the Hadoop Distributed File System. Accumulo has its own logging service that does not depend on communication with the HDFS NameNode.

  • Storage: Accumulo has a relative key file format that improves compression.
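
The access-label idea from the first bullet can be sketched in a few lines. This is a simplified Python illustration, not Accumulo's Java API: each cell carries a visibility expression built from labels combined with & (and), | (or), and parentheses, and a scan returns only the cells whose expression is satisfied by the authorizations passed with the query. The tokenizer and the sample data are assumptions for the example, and malformed expressions are not validated here:

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def visible(expression, authorizations):
    """Evaluate a visibility expression against a set of authorizations."""
    if not expression.strip():
        return True  # an empty label means the cell is visible to everyone
    tokens = TOKEN.findall(expression)
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()  # always parse the right side to consume tokens
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_term()
            result = result and rhs
        return result

    def parse_term():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            pos += 1  # skip the closing ")"
            return result
        return token in authorizations

    return parse_or()


# Key layout from the proposal: the visibility sits after the column
# qualifier and before the timestamp.
cells = [
    (("row1", "attr", "name", "", 3), "Alice"),
    (("row1", "attr", "salary", "admin|hr", 3), "100000"),
    (("row1", "attr", "ssn", "admin&audit", 3), "000-00-0000"),
]


def scan(cells, authorizations):
    """Return the values of the cells the caller is allowed to see."""
    return [value for key, value in cells if visible(key[3], authorizations)]


print(scan(cells, {"hr"}))
```

With the hr authorization alone, the scan returns the name and salary cells but not the ssn cell, which needs both admin and audit.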

You can read more about Accumulo here and check the Hacker News and Reddit discussions.

Michael Stack has commented on the HBase mailing list:

The cell based ‘access labels’ seem like a matter of adding an extra field to KV and their Iterators seem like a specialization on Coprocessors. The ability to add column families on the fly seems too minor a difference to call out especially if online schema edits are now (soon) supported. They talk of locality group like functionality too — that could be a significant difference. We would have to see the code but at first blush, differences look small.

Original title and link: Accumulo: A New BigTable Inspired Distributed Key/Value by NSA (NoSQL database©myNoSQL)

Building an Ad Network Ready for Failure

The architecture of a fault-tolerant ad network built on top of HAProxy, Apache with mod_wsgi and Python, Redis, a bit of PostgreSQL and ActiveMQ deployed on AWS:

The real workhorse of our ad targeting platform was Redis. Each box slaved from a master Redis, and on failure of the master (which happened once), a couple “slaveof” calls got us back on track after the creation of a new master. A combination of set unions/intersections with algorithmically updated targeting parameters (this is where experimentation in our setup was useful) gave us a 1 round-trip ad targeting call for arbitrary targeting parameters. The 1 round-trip thing may not seem important, but our internal latency was dominated by network round-trips in EC2. The targeting was similar in concept to the search engine example I described last year, but had quite a bit more thought regarding ad targeting. It relied on the fact that you can write to Redis slaves without affecting the master or other slaves. Cute and effective. On the Python side of things, I optimized the redis-py client we were using for a 2-3x speedup in network IO for the ad targeting results.
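
The set union/intersection approach can be illustrated with plain Python sets standing in for Redis sets. This is a sketch of the general technique only, not the author's actual targeting scheme: the key layout (dimension:value, with a dimension:any set for untargeted ads) is hypothetical. In Redis, the unions and the final intersection would map to commands such as SUNIONSTORE and SINTER, which can be batched into a single round trip.

```python
# Each "key" holds the set of ad ids that target that value.
# "<dimension>:any" holds ads that do not restrict that dimension.
INDEX = {
    "country:us": {"ad1", "ad2"},
    "country:any": {"ad3"},
    "device:mobile": {"ad2"},
    "device:desktop": {"ad1"},
    "device:any": {"ad3"},
}


def eligible_ads(index, request):
    """Intersect, across dimensions, the union of exact and 'any' matches."""
    result = None
    for dimension, value in request.items():
        # Union: ads targeting this exact value plus ads targeting anything.
        matching = (index.get(f"{dimension}:{value}", set())
                    | index.get(f"{dimension}:any", set()))
        # Intersection: an ad must qualify on every dimension.
        result = matching if result is None else result & matching
    return result or set()


print(sorted(eligible_ads(INDEX, {"country": "us", "device": "mobile"})))
```

Because the whole computation is expressed as set operations on the server, the client needs no intermediate reads, which is what makes a one-round-trip call possible.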

Original title and link: Building an Ad Network Ready for Failure (NoSQL database©myNoSQL)


Apache CouchDB 1.1.0 Released: Native SSL, HTTP Range Requests

Robert Newson just announced a new version of Apache CouchDB, 1.1.0, featuring native SSL, HTTP range requests, and other features and improvements listed below:

  • Native SSL support.
  • Added support for HTTP range requests for attachments.
  • Added built-in filters for _changes: _doc_ids and _design.
  • Added configuration option for TCP_NODELAY aka “Nagle”.
  • Allow wildcards in vhosts definitions.
  • More granular ETag support for views.
  • More flexible URL rewriter.
  • Added OS Process module to manage daemons outside of CouchDB.
  • Added HTTP Proxy handler for more scalable externals.
  • Added _replicator database to manage replications.
  • Multiple micro-optimizations when reading data.
  • Added CommonJS support to map functions.
  • Added stale=update_after query option that triggers a view update after returning a stale=ok response.
  • More explicit error messages when it’s not possible to access a file due to lack of permissions.
  • Added a “change password”-feature to Futon.
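
As an illustration of the stale=update_after option from the list above, here is a small Python sketch that builds a view query URL. The host, database, design document, and view names are hypothetical; only the /db/_design/ddoc/_view/name path shape and the stale parameter come from CouchDB itself:

```python
from urllib.parse import quote, urlencode


def view_url(base, db, design_doc, view, **params):
    """Build a CouchDB view URL with optional query parameters."""
    path = "/".join([
        base.rstrip("/"),
        quote(db, safe=""),
        "_design",
        quote(design_doc, safe=""),
        "_view",
        quote(view, safe=""),
    ])
    return path + ("?" + urlencode(params) if params else "")


# Hypothetical database "blog" with design document "posts", view "by_date".
print(view_url("http://localhost:5984", "blog", "posts", "by_date",
               stale="update_after", limit=10))
```

A GET on that URL returns the possibly stale view result right away, as stale=ok does, and the view index is then updated after the response has been sent.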

While all of these sound interesting, many of the items listed in this user-suggested post-1.0 CouchDB roadmap haven’t made it in yet.

Original title and link: Apache CouchDB 1.1.0 Released: Native SSL, HTTP Range Requests (NoSQL databases © myNoSQL)