Apache: All content tagged as Apache in NoSQL databases and polyglot persistence
Wednesday, 27 March 2013
Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop
- Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
- Rudimentary ETL that transforms one data format into another.
- Support various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
- Command line interface to allow users to submit SQL queries
- Java API to enable clients to submit SQL queries to Tajo
Just another example of the way of the future.
Original title and link: Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop (©myNoSQL)
Monday, 11 March 2013
Open Source “Purity,” Hadoop, and Market Realities
Merv Adrian (Gartner):
The question is whether it is somehow inappropriate, even “evil,” for EMC to enter the market without having “enough” committers to open source Apache projects. More broadly, it’s about whether other people can use, incorporate, add to and profit from Apache Hadoop.
After reading a lot of reactions to EMC’s announcement, the question floating in my head was: how many similar complaints have I read about IBM, Amazon, and all the other companies that either distribute Hadoop or offer services around it without contributing directly to the Apache Hadoop project? None.
I love open source and I would love it if every business using an open source project found a way to contribute back. But the reality today is different. There are many businesses making use of open source and contributing nothing back. There are also numerous companies making money from open source and contributing back almost nothing. There are very few companies making money directly from their open source projects. And there are very few open source projects that receive any sort of funds to support their communities. Maybe things will change. Or maybe we should take another look at how the open source market works and come up with a different, more sustainable approach.
Original title and link: Open Source “Purity,” Hadoop, and Market Realities (©myNoSQL)
via: http://blogs.gartner.com/merv-adrian/2013/03/09/open-source-purity-hadoop-and-market-realities/
Monday, 30 July 2012
Using Apache to Save Data in Redis
Using a bash script and redis-cli to write Apache stats directly into Redis:
In one of my projects, I was using redis database to collect some statistics and I thought of saving data into it at apache level. This would considerably enhance the speed of saving data as it would not require the interception of grails to save data.
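As a rough illustration of the idea (not the script from the linked post, which uses bash and redis-cli), the sketch below turns one access-log line into the Redis counter updates it might trigger. It assumes Apache hands log lines to an external program, for example through its piped-log feature, and that the log uses the Common Log Format. The key names are hypothetical.

```python
import re

# Common Log Format: host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def redis_commands_for(line):
    """Translate one access-log line into Redis counter updates.

    The key names (stats:hits, stats:status:<code>, stats:path) are
    hypothetical; they are not taken from the original post.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return []  # skip lines that do not look like an access-log entry
    return [
        ["INCR", "stats:hits"],
        ["INCR", "stats:status:" + match.group("status")],
        ["ZINCRBY", "stats:path", "1", match.group("path")],
    ]


sample = '127.0.0.1 - - [30/Jul/2012:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'
for command in redis_commands_for(sample):
    # Each printed line has the shape of arguments one could pass to redis-cli.
    print(" ".join(command))
```

Each command list could be passed as arguments to redis-cli or sent through a client library. Either way the web application never sees the request, which is the speed-up the author describes.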
Original title and link: Using Apache to Save Data in Redis (©myNoSQL)
via: http://www.intelligrape.com/blog/2012/07/30/using-apache-to-save-data-in-redis/
Thursday, 26 July 2012
MongoDB GridFS Over HTTP With mod_gridfs
Aristarkh Zagordnikov wrote me an email describing the reasons that led his company to create and open source mod_gridfs.
Some time ago we were looking for a way to serve files to the web right from the GridFS database. We considered different options, including IIS handler (we use .NET on Windows as a backend) that requires a Windows machine to serve files (we planned to use Windows as backend only), nginx-gridfs that was too slow (because it’s synchronous and nginx isn’t, and uses the not-very-much-up-to-date MongoDB C driver that doesn’t do connection pooling, etc.) and does not support slaveOk (horizontal sharding).
At last I decided to roll our own method: a module for Apache 2.2 or higher that uses MongoDB’s own C++ driver. It supports replica sets, slaveOk reads, proper output caching headers (Last-Modified, Etag, Cache-Control, Expires), properly responds to conditional requests (If-Modified-Since/If-None-Match), and uses Apache brigade API to serve large files with less in-memory copying.
While Apache isn’t the most resource-friendly server for a high-load environment (it consumes too much memory per connection and does not yet support production-quality event-based I/O), it really shines as a backend for something like nginx+proxy_cache with optional SSD as proxy_cache storage that does the heavy lifting.
Serving a 4KiB file over a gigabit network on modern hardware, 100 concurrent requests, MongoDB replica set of 3 machines as a backend:
- NGINX + nginx-gridfs: 1.2kr/s
- Apache + mod_gridfs: 6.6kr/s
- Apache + mod_gridfs with slaveOk: 12.1kr/s
I didn’t test with larger files, because this way I’ll be benchmarking OS I/O performance instead of user-mode code.
The public Mercurial repo is here. It uses Simplified 2-clause BSD license, and contains installation instructions and docs in the README file (building might seem hard, but after building if you have to mass-deploy, you just install dependent libraries like boost and copy the mod_gridfs.so file around).
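To make the conditional-request behaviour mentioned in the email concrete, here is a minimal sketch of how a server might choose between 200 and 304 from a file's ETag and modification time. It is not mod_gridfs code (the module is written in C++ against the Apache API). The function name is hypothetical, and weak-validator handling is deliberately left out.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def response_status(etag, last_modified, request_headers):
    """Return 304 if the client's cached copy is still valid, else 200.

    etag is the quoted entity tag of the stored file, last_modified a
    timezone-aware datetime. If-None-Match is checked first and, when
    present, If-Modified-Since is ignored.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        return 304 if "*" in candidates or etag in candidates else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable date: serve the full response
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= since:
            return 304
    return 200


stored = datetime(2012, 7, 26, 12, 0, 0, tzinfo=timezone.utc)
print(response_status('"abc123"', stored, {"If-None-Match": '"abc123"'}))
print(response_status('"abc123"', stored, {"If-Modified-Since": "Thu, 26 Jul 2012 11:00:00 GMT"}))
```

A 304 response carries no body, so a caching proxy in front of Apache can revalidate its copy without the file being read from GridFS again.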
Original title and link: MongoDB GridFS Over HTTP With Mod_gridfs (©myNoSQL)
Thursday, 19 April 2012
Apache Mod_redis
This Apache module uses a rule-based engine (based on regular expression parser) to map URLs to REDIS commands on the fly. It supports an unlimited number of rules and can match on the full URL and the request method (GET, POST, PUT or DELETE) to provide a very flexible option for defining a RESTful interface to REDIS.
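The rule-matching idea can be sketched in a few lines of Python. This is only an illustration of mapping a method and URL to a Redis command through regular expressions. The rule table and URL layout below are hypothetical and do not reproduce mod_redis's actual configuration syntax.

```python
import re

# Each rule: (HTTP method, URL pattern, Redis command template).
# Illustrative rules only; not mod_redis's configuration format.
RULES = [
    ("GET", re.compile(r"^/keys/(\w+)$"), ["GET", "{0}"]),
    ("PUT", re.compile(r"^/keys/(\w+)$"), ["SET", "{0}", "{body}"]),
    ("DELETE", re.compile(r"^/keys/(\w+)$"), ["DEL", "{0}"]),
    ("GET", re.compile(r"^/sets/(\w+)/members$"), ["SMEMBERS", "{0}"]),
]


def to_redis_command(method, url, body=""):
    """Return the Redis command for the first matching rule, or None."""
    for rule_method, pattern, template in RULES:
        if rule_method != method:
            continue
        match = pattern.match(url)
        if match:
            # Fill captured URL groups and the request body into the template.
            return [part.format(*match.groups(), body=body) for part in template]
    return None


print(to_redis_command("GET", "/keys/color"))
print(to_redis_command("PUT", "/keys/color", body="blue"))
print(to_redis_command("GET", "/sets/tags/members"))
```

Because rules are tried in order and the first match wins, more specific patterns would need to be listed before more general ones.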
Original title and link: Apache Mod_redis (©myNoSQL)
Tuesday, 3 April 2012
The Timeline of the Sqoop Project
A bit of history of yet another BigData-ish/NoSQLish graduating project:

Original title and link: The Timeline of the Sqoop Project (©myNoSQL)
via: https://blogs.apache.org/sqoop/entry/apache_sqoop_graduates_from_incubator
Monday, 5 September 2011
Accumulo: A New BigTable Inspired Distributed Key/Value by NSA
The National Security Agency has submitted to the Apache Incubator a proposal to open source Accumulo, a BigTable-inspired key-value store that it has been building since 2008. The project proposal page provides more details about Accumulo’s history, its building blocks, and how it compares to HBase, the other open source BigTable implementation:
- Access Labels: Accumulo has an additional portion of its key that sorts after the column qualifier and before the timestamp. It is called column visibility and enables expressive cell-level access control. Authorizations are passed with each query to control what data is returned to the user.
- Iterators: Accumulo has a novel server-side programming mechanism that can modify the data written to disk or returned to the user. This mechanism can be configured for any of the scopes where data is read from or written to disk. It can be used to perform joins on data within a single tablet.
- Flexibility: Accumulo places no restrictions on the column families. Also, each column family in HBase is stored separately on disk. Accumulo allows column families to be grouped together on disk, as does BigTable.
- Logging: HBase uses a write-ahead log on the Hadoop Distributed File System. Accumulo has its own logging service that does not depend on communication with the HDFS NameNode.
- Storage: Accumulo has a relative key file format that improves compression.
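To give a feel for what cell-level access labels mean in practice, here is a small Python sketch that checks a label expression against the authorizations passed with a query. It is a simplified model and not Accumulo's implementation: the grammar is an assumption (labels combined with `&`, `|` and parentheses, with `&` binding tighter), and the cell tuples are made up for the example.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression):
    tokens, position = [], 0
    expression = expression.strip()
    while position < len(expression):
        match = TOKEN.match(expression, position)
        if match is None:
            raise ValueError("unexpected character at %d" % position)
        tokens.append(match.group(1))
        position = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if the authorizations satisfy the label expression."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # this sketch treats an empty label as visible to everyone
    index = 0

    def parse_or():
        nonlocal index
        result = parse_and()
        while index < len(tokens) and tokens[index] == "|":
            index += 1
            right = parse_and()  # always parse the right side to consume tokens
            result = result or right
        return result

    def parse_and():
        nonlocal index
        result = parse_term()
        while index < len(tokens) and tokens[index] == "&":
            index += 1
            right = parse_term()
            result = result and right
        return result

    def parse_term():
        nonlocal index
        if index >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[index]
        index += 1
        if token == "(":
            result = parse_or()
            if index >= len(tokens) or tokens[index] != ")":
                raise ValueError("missing closing parenthesis")
            index += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError("unexpected token %r" % token)
        return token in authorizations

    result = parse_or()
    if index != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


# Hypothetical cells: (row, visibility label, value).
cells = [
    ("row1", "public", "a"),
    ("row1", "admin&(audit|ops)", "b"),
    ("row2", "admin&audit", "c"),
]
user_authorizations = {"public", "admin", "ops"}
print([value for _, label, value in cells if is_visible(label, user_authorizations)])
```

The filter runs per cell, so two columns in the same row can be visible to different users. This is the property the proposal calls cell-level access control.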
You can read more about Accumulo here and check the Hacker News and Reddit discussions.
Michael Stack has commented on the HBase mailing list:
The cell based ‘access labels’ seem like a matter of adding an extra field to KV and their Iterators seem like a specialization on Coprocessors. The ability to add column families on the fly seems too minor a difference to call out especially if online schema edits are now (soon) supported. They talk of locality group like functionality too — that could be a significant difference. We would have to see the code but at first blush, differences look small.
Original title and link: Accumulo: A New BigTable Inspired Distributed Key/Value by NSA (©myNoSQL)
Monday, 27 June 2011
Building an Ad Network Ready for Failure
The architecture of a fault-tolerant ad network built on top of HAProxy, Apache with mod_wsgi and Python, Redis, a bit of PostgreSQL and ActiveMQ deployed on AWS:
The real workhorse of our ad targeting platform was Redis. Each box slaved from a master Redis, and on failure of the master (which happened once), a couple “slaveof” calls got us back on track after the creation of a new master. A combination of set unions/intersections with algorithmically updated targeting parameters (this is where experimentation in our setup was useful) gave us a 1 round-trip ad targeting call for arbitrary targeting parameters. The 1 round-trip thing may not seem important, but our internal latency was dominated by network round-trips in EC2. The targeting was similar in concept to the search engine example I described last year, but had quite a bit more thought regarding ad targeting. It relied on the fact that you can write to Redis slaves without affecting the master or other slaves. Cute and effective. On the Python side of things, I optimized the redis-py client we were using for a 2-3x speedup in network IO for the ad targeting results.
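The quote does not spell out the targeting scheme, so the following is only a guess at the general shape: one set of ad IDs per targeting value, a union across the acceptable values of each attribute, and an intersection across attributes. Plain Python sets stand in for Redis sets here (Redis offers the corresponding SUNION and SINTER commands). All key names and data are hypothetical.

```python
# Hypothetical index: one set of ad IDs per "attribute:value" key,
# mirroring how the data might be laid out as Redis sets.
ADS_BY_ATTRIBUTE = {
    "country:us": {"ad1", "ad2", "ad3"},
    "country:ca": {"ad3", "ad4"},
    "device:mobile": {"ad2", "ad3", "ad4"},
    "topic:sports": {"ad1", "ad3"},
}


def eligible_ads(request):
    """request maps an attribute name to a list of acceptable values.

    Values of one attribute are OR-ed (set union); different attributes
    are AND-ed (set intersection).
    """
    result = None
    for attribute, values in request.items():
        matching = set()
        for value in values:
            matching |= ADS_BY_ATTRIBUTE.get("%s:%s" % (attribute, value), set())
        result = matching if result is None else result & matching
    return result if result is not None else set()


print(sorted(eligible_ads({"country": ["us", "ca"], "device": ["mobile"]})))
print(sorted(eligible_ads({"country": ["us"], "topic": ["sports"], "device": ["mobile"]})))
```

In a Redis deployment the same unions and intersections would run server-side, which is what makes it possible to answer a targeting query without several network round-trips.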
Original title and link: Building an Ad Network Ready for Failure (©myNoSQL)
via: http://dr-josiah.blogspot.com/2011/06/building-ad-network-ready-for-failure.html
Monday, 6 June 2011
Apache CouchDB 1.1.0 Released: Native SSL, HTTP Range Requests
Robert Newson just announced a new version of Apache CouchDB, 1.1.0, featuring native SSL, HTTP range requests, and the other features and improvements listed below:
- Native SSL support.
- Added support for HTTP range requests for attachments.
- Added built-in filters for _changes: _doc_ids and _design.
- Added configuration option for TCP_NODELAY aka “Nagle”.
- Allow wildcards in vhosts definitions.
- More granular ETag support for views.
- More flexible URL rewriter.
- Added OS Process module to manage daemons outside of CouchDB.
- Added HTTP Proxy handler for more scalable externals.
- Added _replicator database to manage replications.
- Multiple micro-optimizations when reading data.
- Added CommonJS support to map functions.
- Added stale=update_after query option that triggers a view update after returning a stale=ok response.
- More explicit error messages when it’s not possible to access a file due to lack of permissions.
- Added a “change password”-feature to Futon.
While all these sound interesting, many of the items listed in this user-suggested post-1.0 CouchDB roadmap haven’t made it in yet.
Original title and link: Apache CouchDB 1.1.0 Released: Native SSL, HTTP Range Requests (NoSQL databases © myNoSQL)