ZooKeeper: All content tagged as ZooKeeper in NoSQL databases and polyglot persistence
Let’s start the year with a quick review of the latest releases that happened in December. Make sure that you scroll to the end as there are quite a few important ones.
Announced on Dec.15th, MongoDB 2.0.2 is a bug fix release:
- Hit config server only once per mongos on meta data change to not overwhelm
- Removed unnecessary connection close and open between mongos and mongod after getLastError
- Replica set primaries close all sockets on stepDown()
- Do not require authentication for the buildInfo command
- scons option for using system libraries
Apache Hive 0.8.0
Just as a side note, who came out with the idea of having a Hive fans’ page on Facebook?
Apache ZooKeeper 3.4.2
ZooKeeper 3.4.0 has been followed up shortly by two new minor version updates fixing some critical bugs. The list of issues fixed in ZooKeeper 3.4.1 can be found here and for ZooKeeper 3.4.2 the 2 fixed bugs are listed here.
As with ZooKeeper 3.4.0, these versions are not yet production ready.
Apache Whirr 0.7.0
Apache Whirr 0.7.0 has been released on Dec.21st featuring 56 improvements and bug fixes including support for Puppet & Chef, and Mahout and Ganglia as a service. The complete list can be found here.
Some more details about Whirr 0.7.0 can be found here.
Apache HBase 0.90.5
Redis 2.4.5 was released on Dec.23rd and provides 4 bug fixes:
- [BUGFIX] Fixed a ZUNIONSTORE/ZINTERSTORE bug that can cause a NaN to be inserted as a sorted set element score. This happens when one of the elements has
-infscore and the weight used is 0.
- [BUGFIX] Fixed memory leak in
- [BUGFIX] Fixed a non critical
SORTbug (Issue 224).
- [BUGFIX] Fixed a replication bug: now the timeout configuration is respected during the connection with the master.
--quietoption implemented in the Redis test.
Last but definitely one of the most important announcements that came in December:
Based on the 0.20-security code line, Hadoop 1.0.0 was announced on Dec.29. This release includes support for:
- HBase (append/hsynch/hflush) and Security
- Webhdfs (with full support for security)
- Performance enhanced access to local files for HBase
- Other performance enhancements, bug fixes, and features
- All version 0.20.205 and prior 0.20.2xx features
Complete release notes are available here.
And with this we are ready for 2012.
Original title and link: Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0 ( ©myNoSQL)
Apache ZooKeeper, the high-performance coordination service exposing services like naming, configuration management, synchronization, etc. for distributed applications, has reached version 3.4.0.
The most important ones are summarized by Patrick Hunt in this Cloudera blog post:
- ZooKeeper 3.3.3 clients are compatible with 3.4.0 servers
- Native Windows version of C client
- Support Kerberos authentication of clients
- Support Kerberos authentication of clients
- Improved REST Interface
- Existing monitoring support has been extended through the introduction of a new ‘mntr’ 4 letter word
- Add tools and recipes for monitoring as a contrib
- Web-based Administrative Interface
- Automating log and snapshot cleaning
- Add logging/stats to identify production deployment issues
- Support for building RPM and DEB packages
Something to keep in mind though: ZooKeeper 3.4.0 is not production ready yet. After extensive testing, it will be followed soon by a minor release that will be production-ready.
Original title and link: Apache ZooKeeper 3.4.0 Released to Be Followed Soon by Production-Ready Version ( ©myNoSQL)
Last night, Mark Phillips sent me the following message:
Yesterday a Riak community member, Joseph Blumstedt released a pair of repos: riak_zab and riak_zab_example. In short, riak_zab is an extension for riak_core that provides totally ordered atomic broadcast capabilities.
If your first reaction was something along the line “so what?”, then you are not alone. Anyway I hope I’ve done my homework and now I understand why this is both interesting and important.
ZooKeeper is a coordination service created by Yahoo! to allow large scale applications to perform coordination tasks such as leader election, status propagation, and rendezvous. Embedded into ZooKeeper is a totally ordered broadcast protocol: Zab.
ZooKeeper makes the following requirements on the Zab broadcast protocol:
- Reliable delivery
- If a message, m, is delivered by one server, then it will be eventually delivered by all correct servers.
- Total order
- If message a is delivered before message b by one server, then every server that delivers a and b delivers a before b.
- Causal order
- If message a causally precedes message b and both messages are delivered, then a must be ordered before b.
- Prefix property:
- If m is the last message delivered for a leader L, any message proposed before m by L must also be delivered
Putting it together:
riak_zab is an extension for riak_core that provides totally ordered atomic broadcast capabilities. This is accomplished through a pure Erlang implementation of Zab, the Zookeeper Atomic Broadcast protocol invented by Yahoo! Research.
Basically this means that many operations that were typically hard to do well using the Dynamo eventual consistency model which Riak implements with riak_core are now theoretically possible. In case you’d like to read more about riak_core I’d suggest these posts Building blocks of Dynamo-like distributed systems, riak_core: building distributed applications without shared state, and Where to start with riak_core .
Indeed there are other Dynamo aspects like synchronization, scaling, ring rebalancing, handoff that riak_zab had to address and its approach is documented on the project page
Technically, riak_zab isn’t focused on message delivery. Rather, riak_zab provides consistent replicated state.
So why is this important? While parts of an application can deal with eventual consistency, there are typically a few aspects of an app where stronger consistency is needed for various reasons. A few such examples:
- Distributed Counters — just ask Digg people how they implemented distributed counters in Cassandra or check how Twitter built a ZooKeeper and Cassandra based solution for realtime analytics .
- Ordered Messaging
- Compare-and-Swap Operations
- Set Operations
- Sub-document updates
- Multi-key transactions
riak_zab will theoretically allow users to reap the availability of the standard eventually consistent model Riak provides and opt in to stronger consistency for the parts of the app that require it. In other words, it expands the number of CAP trade offs that a developer can make beyond the R, W and N model and makes possible functionality that relies on consistency guarantees.
After connecting all these dots, I’ve finally got it (nb: thanks Mark for sending this out!). And it definitely sounds like an exciting addition to riak_core. The caveat here is that even if this code has been tested in an academic environment the project is still an alpha version. Hopefully other members of the Riak community will pick it up and together with Basho guys will make it production ready.
Kevin Weil presented Twitter’s ZooKeeper and Cassandra based solution for realtime analytics named Rainbird at Strata 2011:
Until recently, counters where a unique feature of HBase. While the latest version of Cassandra does not include distributed counters, this feature is available in Cassandra’s trunk.
Original title and link: Rainbird: Twitter’s ZooKeeper + Cassandra Based Realtime Analytics Solution (NoSQL databases © myNoSQL)
- A cloud-neutral way to run services. You don’t have to worry about the idiosyncrasies of each provider.
- A common service API. The details of provisioning are particular to the service.
- Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed.
For now, Whirr supports only Hadoop, Cassandra, and Zookeeper on Amazon EC2, but roadmap specify support for both more products and cloud solutions.
While understanding the advantages of abstracting away each cloud custom API, I’m wondering why this project have been started from scratch and not built on top of solution like Chef or Puppet1.
Could anyone please remind me what similar projects are available in Python? ↩
Just a quick reference of the continuously growing Hadoop tools ecosystem.
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
A distributed file system that provides high throughput access to application data.
A software framework for distributed processing of large data sets on compute clusters.
Amazon Elastic MapReduce
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Cloudera Distribution for Hadoop (CDH)
Cloudera’s Distribution for Hadoop (CDH) sets a new standard for Hadoop-based data management platforms.
A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
A scalable, distributed database that supports structured data storage for large tables.
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
- Imports individual tables or entire databases to files in HDFS
- Generates Java classes to allow you to interact with your imported data
- Provides the ability to import from SQL databases straight into your Hive data warehouse
Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.
Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called Hive QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language.
A high-level data-flow language and execution framework for parallel computation. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
Oozie is a workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce).
Cascading is a Query API and Query Planner used for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.
Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.
Hue is a graphical user interface to operate and develop applications for Hadoop. Hue applications are collected into a desktop-style environment and delivered as a Web application, requiring no additional installation for individual users.
You can read more about HUE on ☞ Cloudera blog.
Chukwa is a data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a ﬂexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
A Scalable machine learning and data mining library.
Integration with Relational databases
Integration with Data Warehouses
The only list I have is MapReduce, RDBMS, and Data Warehouse, but I’m afraid it is quite a bit old. So maybe someone could help me update it.
Anything else? Once we validate this list, I’ll probably have to move it on the NoSQL reference