ZooKeeper: All content tagged as ZooKeeper in NoSQL databases and polyglot persistence

ZooKeeper - Distributed Process Coordination

Go get the book now [1]:


From an interview with the authors, Flavio Junqueira [2]:

Although ZooKeeper has been around for a while, there was no good starting point for beginners or even a comprehensive reference for advanced developers. The online documentation is ok, but it lacks depth and is a bit scattered. Also, in some sense, ZooKeeper is a little bit too easy to jump in and use. We found that people were jumping in without fully understanding what they needed to watch out for, ZooKeeper’s limitations, and how to deal with some faulty scenarios.


  1. No commission, no affiliate links. Just a book you need. 

  2. Flavio Junqueira works at Microsoft Research and is the PMC Chair of Apache ZooKeeper.

Original title and link: ZooKeeper - Distributed Process Coordination (NoSQL database©myNoSQL)


Using Apache ZooKeeper to Build Distributed Apps (And Why)

A great intro by Sean Mackrory to ZooKeeper and the problems it can help solve:

Done wrong, distributed software can be the most difficult to debug. Done right, however, it can allow you to process more data, more reliably and in less time. Using Apache ZooKeeper allows you to confidently reason about the state of your data, and coordinate your cluster the right way. You’ve seen how easy it is to get a ZooKeeper server up and running. (In fact, if you’re a CDH user, you may already have an ensemble running!) Think about how ZooKeeper could help you build more robust systems

Leaving aside for a second the main topic of the post, another important lesson here is that the NIH (not-invented-here) syndrome is very expensive in distributed systems.
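
The post walks through getting a server running and writing against the client API. To give a sense of how little code basic coordination takes, here is a minimal sketch using the standard ZooKeeper Java client; the connect string, session timeout, and znode path are placeholders, not anything taken from the post:

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZkHello {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            // Connect to a local standalone server; use the full ensemble string in production.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // An ephemeral znode is removed automatically when this session ends,
            // which is the building block for liveness and membership tracking.
            String path = zk.create("/demo-service", "hello".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            Stat stat = new Stat();
            byte[] data = zk.getData(path, false, stat);
            System.out.println(path + " -> " + new String(data) + " (version " + stat.getVersion() + ")");

            zk.close();
        }
    }

Ephemeral znodes like this one are the building block behind the membership, liveness, and locking recipes the post refers to.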

Original title and link: Using Apache ZooKeeper to Build Distributed Apps (And Why) (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/02/how-to-use-apache-zookeeper-to-build-distributed-apps-and-why/


Next Neo4j Version Implementing HA Without ZooKeeper

The next version of Neo4j will remove the dependency on ZooKeeper for high availability setups. In a post on the Neo4j blog, the team announced the availability of the first milestone of Neo4j 1.9, which already contains the new implementation of the Neo4j High Availability cluster:

With Neo4j 1.9 M01, cluster members communicate directly with each other, based on an implementation of the Paxos consensus protocol for master election.

According to the updated documentation, annotated with my own comments:

  • Write transactions can be performed on any database instance in a cluster. (nb: writes are performed on the master first, but the cluster does the routing automatically)
  • If the master fails, a new master is elected and started automatically within just a few seconds; during this time no writes can take place (writes will block or, in rare cases, throw an exception)
  • If the master goes down, any running write transaction will be rolled back, and new transactions will block or fail until a new master is available.
  • The cluster automatically handles instances becoming unavailable (for example due to network issues), and also makes sure to accept them as members in the cluster when they are available again.
  • Transactions are atomic, consistent and durable but eventually propagated out to other slaves. (nb: a transaction includes only the write to the master)
  • Updates to slaves are eventually consistent by nature but can be configured to be pushed optimistically from the master during commit. (nb: writes to slaves will still not be part of the transaction)
  • In case there were changes on the master that didn't get replicated before it failed, two different versions may exist if the failed master recovers. This situation is resolved by having the old master discard its copy of the data (nb: the documentation says "move away")
  • Reads are highly available and the ability to handle read load scales with more database instances in the cluster.
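
For context, the ZooKeeper-based approach this milestone replaces typically relies on the standard leader-election recipe: each cluster member creates an ephemeral sequential znode under an election path, the member holding the lowest sequence number acts as master, and every other member watches the node immediately below its own so that a failure triggers a re-check. A minimal sketch of that generic recipe with the ZooKeeper Java client follows; the /election path and class are illustrative, and this is not Neo4j's actual code:

    import java.util.Collections;
    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch of the generic ZooKeeper leader-election recipe; not Neo4j's implementation.
    public class LeaderElection {
        private static final String ROOT = "/election";  // persistent parent, assumed to exist
        private final ZooKeeper zk;
        private String myNode;  // e.g. /election/member-0000000003

        public LeaderElection(ZooKeeper zk) {
            this.zk = zk;
        }

        // Each member announces itself with an ephemeral sequential znode.
        public boolean volunteer() throws KeeperException, InterruptedException {
            myNode = zk.create(ROOT + "/member-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            return checkLeadership();
        }

        // The member with the lowest sequence number is the master; everyone else
        // watches the node just below its own and re-checks when it disappears.
        private boolean checkLeadership() throws KeeperException, InterruptedException {
            List<String> members = zk.getChildren(ROOT, false);
            Collections.sort(members);
            int myIndex = members.indexOf(myNode.substring(ROOT.length() + 1));
            if (myIndex == 0) {
                return true;  // this member is the master
            }
            String toWatch = ROOT + "/" + members.get(myIndex - 1);
            if (zk.exists(toWatch, event -> {
                try {
                    checkLeadership();
                } catch (Exception e) {
                    // a real implementation would resign and retry here
                }
            }) == null) {
                return checkLeadership();  // the watched node vanished in the meantime; re-check
            }
            return false;
        }
    }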

Original title and link: Next Neo4j Version Implementing HA Without ZooKeeper (NoSQL database©myNoSQL)


Dynamic Reconfiguration of Apache ZooKeeper

A slide deck by Alexander Shraer providing a shorter version of the Dynamic Reconfiguration of Primary/Backup Clusters in Apache ZooKeeper paper.

There are still some scenarios in which the proposed reconfiguration will not succeed, though I cannot tell how often these occur:

  1. A quorum of the new ensemble is not yet in sync
  2. Another reconfiguration is already in progress
  3. The version condition check fails

One of the most interesting slides in the deck is the one explaining the failure-free flow:

ZooKeeper Failure Free Flow


Dynamic Reconfiguration of Primary/Backup Clusters in Apache ZooKeeper

A paper by Alexander Shraer, Benjamin Reed, Flavio Junqueira (Yahoo) and Dahlia Malkhi (Microsoft):

Dynamically changing (reconfiguring) the membership of a replicated distributed system while preserving data consistency and system availability is a challenging problem. In this paper, we show that reconfiguration can be simplified by taking advantage of certain properties commonly provided by Primary/Backup systems. We describe a new reconfiguration protocol, recently implemented in Apache Zookeeper. It fully automates configuration changes and minimizes any interruption in service to clients while maintaining data consistency. By leveraging the properties already provided by Zookeeper our protocol is considerably simpler than state of the art.

The corresponding ZooKeeper issue was created in 2008, and the new protocol should be part of ZooKeeper 3.5.0.
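
For a feel of what a client-driven membership change looks like once the feature ships, here is a sketch against the admin-side reconfigure call that later 3.5.x releases expose. Treat the class name, argument layout, and server-spec format as assumptions drawn from the 3.5 documentation rather than from this paper, and note that reconfiguration has to be enabled on (and authorized by) the ensemble:

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.admin.ZooKeeperAdmin;
    import org.apache.zookeeper.data.Stat;

    public class ReconfigSketch {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            // Assumes reconfiguration is enabled on the ensemble and the caller is authorized.
            ZooKeeperAdmin admin = new ZooKeeperAdmin("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Incremental change: add server 5 and remove server 3 in a single step.
            // Server spec: server.<id>=<host>:<quorumPort>:<electionPort>;<clientPort>
            String joining = "server.5=zk5:2888:3888;2181";
            String leaving = "3";
            Stat stat = new Stat();
            byte[] newConfig = admin.reconfigure(joining, leaving, null, -1, stat);
            System.out.println("New ensemble config:\n" + new String(newConfig));

            admin.close();
        }
    }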


Hortonworks Data Platform 1.0

Hortonworks announced the 1.0 release of the Hortonworks Data Platform ahead of Hadoop Summit 2012, together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.

Some info points:

  1. Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop

    hdp-diagram

    1. HDP 1.0 is based on Apache Hadoop 1.0
    2. Apache Ambari is used for installation and provisioning
    3. The same Apache Ambari is behind the Hortonworks Management Console
    4. For data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
    5. Apache HCatalog is the solution offering metadata and table management
  2. Hortonworks Data Platform is 100% open source—I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community

  3. HDP comes with three levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster

One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on running the NameNode and JobTracker in VMware-powered virtual machines. My first thought was that this approach was chosen to strengthen a partnership with VMware. On the other hand, Hadoop 2.0 already contains a new highly available version of the NameNode (Cloudera's Hadoop distribution uses this solution), and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.

You can read a lot of posts about this announcement, but you'll find all the details in the post by Hortonworks' John Kreisa and in the PR announcement.

Original title and link: Hortonworks Data Platform 1.0 (NoSQL database©myNoSQL)


ZkFarmer: Tools for Managing Distributed Server Farms Using Apache ZooKeeper

ZkFarmer:

With ZkFarmer, each server registers itself in one or several farms. Thru this registration, hosts can expose arbitrary information about their status.

On the other end, ZkFarmer helps consumers of the farm to maintain a configuration file in sync with the list of hosts registered in the farm with their respective configuration.

In the middle, ZkFarmer helps monitoring and administrative services to easily read and change the configuration of each host.

Currently ZkFarmer provides the following functionality:

  1. Registering services to one or several farms in ZooKeeper (joining a farm)
  2. Listing farms and hosts
  3. Reading and writing farm content
  4. Farm monitoring
  5. Syncing farm configuration
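
Functionality like this maps naturally onto ZooKeeper's ephemeral znodes: a host writes its status under the farm's path and the entry disappears automatically when the host's session dies, while consumers list and watch the farm to keep their local configuration in sync. A minimal sketch of that underlying pattern in Java follows; the /farms layout and method names are illustrative and are not ZkFarmer's actual znode format (ZkFarmer itself is a Python tool):

    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch of the farm-membership pattern; assumes the persistent parent
    // /farms/<farm> already exists.
    public class FarmMembership {
        // A host joins a farm with an ephemeral znode carrying its status;
        // the entry disappears automatically if the host's session dies.
        public static void join(ZooKeeper zk, String farm, String host, String statusJson)
                throws KeeperException, InterruptedException {
            zk.create("/farms/" + farm + "/" + host, statusJson.getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        // Consumers list the farm and register a watch so they can rewrite their
        // local configuration file whenever membership changes.
        public static List<String> members(ZooKeeper zk, String farm, Watcher onChange)
                throws KeeperException, InterruptedException {
            return zk.getChildren("/farms/" + farm, onChange);
        }
    }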

Original title and link: ZkFarmer: Tools for Managing Distributed Server Farms Using Apache ZooKeeper (NoSQL database©myNoSQL)


Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop

Apache Bigtop:

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

Currently packaging:

  • Apache Hadoop 1.0.x
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. Practically though, its goal is to ensure that all the components of the wider Hadoop ecosystem remain interoperable.

Original title and link: Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop (NoSQL database©myNoSQL)


ZooKeeper 3.4.3 First Beta Quality in the 3.4 Series

After three alpha versions in the 3.4 series, ZooKeeper 3.4.3, announced the other day, is the first to be considered a beta release: it's still not production ready, but the more serious bugs have been fixed. ZooKeeper is such a critical component of the Hadoop ecosystem that it is essential for it to go through extensive testing before being declared production ready.

Besides the release notes, both Hortonworks and Cloudera have short summaries of ZooKeeper 3.4.

Original title and link: ZooKeeper 3.4.3 First Beta Quality in the 3.4 Series (NoSQL database©myNoSQL)


The components and their functions in the Hadoop ecosystem

Edd Dumbill enumerates the various components of the Hadoop ecosystem:

Hadoop ecosystem

My quick reference to the Hadoop ecosystem includes a couple of other tools that are not in this list, though not Ambari and HCatalog, which were released after it was put together.

Original title and link: The components and their functions in the Hadoop ecosystem (NoSQL database©myNoSQL)


Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0

Let’s start the year with a quick review of the latest releases that happened in December. Make sure that you scroll to the end as there are quite a few important ones.

MongoDB 2.0.2

Announced on Dec. 15th, MongoDB 2.0.2 is a bug fix release:

  • Hit config server only once per mongos on meta data change to not overwhelm
  • Removed unnecessary connection close and open between mongos and mongod after getLastError
  • Replica set primaries close all sockets on stepDown()
  • Do not require authentication for the buildInfo command
  • scons option for using system libraries

Apache Hive 0.8.0

Apache Hive 0.8.0 came out on Dec. 19th. The list of new features, improvements, and bug fixes is extremely long.

Just as a side note, who came up with the idea of having a Hive fans' page on Facebook?

Apache ZooKeeper 3.4.2

ZooKeeper 3.4.0 was followed shortly by two minor updates fixing some critical bugs. The list of issues fixed in ZooKeeper 3.4.1 can be found here, and the 2 bugs fixed in ZooKeeper 3.4.2 are listed here.

As with ZooKeeper 3.4.0, these versions are not yet production ready.

Apache Whirr 0.7.0

Apache Whirr 0.7.0 was released on Dec. 21st, featuring 56 improvements and bug fixes, including support for Puppet and Chef, and for Mahout and Ganglia as a service. The complete list can be found here.

Some more details about Whirr 0.7.0 can be found here.

Apache HBase 0.90.5

Released on Dec. 23rd, HBase 0.90.5 packs 81 bug fixes. The complete list can be found here.

Redis 2.4.5

Redis 2.4.5 was released on Dec. 23rd with 4 bug fixes and a test improvement:

  • [BUGFIX] Fixed a ZUNIONSTORE/ZINTERSTORE bug that can cause a NaN to be inserted as a sorted set element score. This happens when one of the elements has +inf/-inf score and the weight used is 0.
  • [BUGFIX] Fixed memory leak in CLIENT INFO.
  • [BUGFIX] Fixed a non critical SORT bug (Issue 224).
  • [BUGFIX] Fixed a replication bug: now the timeout configuration is respected during the connection with the master.
  • --quiet option implemented in the Redis test.

And last, but definitely one of the most important announcements that came in December:

Hadoop 1.0.0

Based on the 0.20-security code line, Hadoop 1.0.0 was announced on Dec. 29th. This release includes support for:

  • HBase (append/hsynch/hflush) and Security
  • Webhdfs (with full support for security)
  • Performance enhanced access to local files for HBase
  • Other performance enhancements, bug fixes, and features
  • All version 0.20.205 and prior 0.20.2xx features

Complete release notes are available here.


And with this we are ready for 2012.

Original title and link: Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0 (NoSQL database©myNoSQL)


Cassandra, Zookeeper, Scribe, and Node.js Powering Rackspace Cloud Monitoring

Paul Querna describes the original architecture of Cloudkick and the one that powers the recently announced Rackspace Cloud Monitoring service:

Development framework: from Twisted Python and Django to Node.js

Cloudkick was primarily written in Python. Most backend services were written in Twisted Python. The API endpoints and web server were written in Django, and used mod_wsgi. […] Cloud Monitoring is primarily written in Node.js.

Storage: from master-slave MySQL to Cassandra

Cloudkick was reliant upon a MySQL master and slaves for most of its configuration storage. This severely limited both scalability, performance and multi-region durability. These issues aren’t necessarily a property of MySQL, but Cloudkick’s use of the Django ORM made it very difficult to use MySQL radically differently. The use of MySQL was not continued in Cloud Monitoring, where metadata is stored in Apache Cassandra.

Even more Cassandra:

Cloudkick used Apache Cassandra primarily for metrics storage. This was a key element in keeping up with metrics processing, and providing a high quality user experience, with fast loading graphs. Cassandra’s role was expanded in Cloud Monitoring to include both configuration data and metrics storage.

Event processing: from RabbitMQ to Zookeeper and a bit more Cassandra

RabbitMQ is not used by Cloud Monitoring. Its use cases are being filled by a combination of Apache Zookeeper, point to point REST or Thrift APIs, state storage in Cassandra and changes in architecture.
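
To make the ZooKeeper side of that mix concrete, one of the standard recipes that can take over broker-mediated coordination duties is a distributed lock built from ephemeral sequential znodes. A minimal sketch of that generic recipe follows; paths and class names are illustrative, and this is not Rackspace's implementation:

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch of the generic ZooKeeper lock recipe; not Rackspace's code.
    public class ZkLock {
        private final ZooKeeper zk;
        private final String root;  // persistent parent, e.g. /locks/scheduler
        private String myNode;

        public ZkLock(ZooKeeper zk, String root) {
            this.zk = zk;
            this.root = root;
        }

        public void acquire() throws KeeperException, InterruptedException {
            // Every contender creates an ephemeral sequential znode under the lock path.
            myNode = zk.create(root + "/lock-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            while (true) {
                List<String> waiters = zk.getChildren(root, false);
                Collections.sort(waiters);
                int myIndex = waiters.indexOf(myNode.substring(root.length() + 1));
                if (myIndex == 0) {
                    return;  // lowest sequence number holds the lock
                }
                // Block until the contender just ahead of us goes away, then re-check.
                CountDownLatch released = new CountDownLatch(1);
                String ahead = root + "/" + waiters.get(myIndex - 1);
                if (zk.exists(ahead, event -> released.countDown()) != null) {
                    released.await();
                }
            }
        }

        public void release() throws KeeperException, InterruptedException {
            zk.delete(myNode, -1);  // -1 matches any version
        }
    }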

And finally Scribe:

Cloudkick used an internal fork of Facebook’s Scribe for transporting certain types of high volume messages and data. Scribe’s simple configuration model and API made it easy to extend for our bulk messaging needs. Cloudkick extended Scribe to include a write ahead journal and other features to improve durability. Cloud Monitoring continues to use Scribe for some of our event processing flows.

Original title and link: Cassandra, Zookeeper, Scribe, and Node.js Powering Rackspace Cloud Monitoring (NoSQL database©myNoSQL)

via: http://journal.paul.querna.org/articles/2011/12/17/technology-cloud-monitoring/