ZooKeeper: All content tagged as ZooKeeper in NoSQL databases and polyglot persistence
Apache ZooKeeper, the high-performance coordination service exposing services like naming, configuration management, synchronization, etc. for distributed applications, has reached version 3.4.0.
The most important ones are summarized by Patrick Hunt in this Cloudera blog post:
- ZooKeeper 3.3.3 clients are compatible with 3.4.0 servers
- Native Windows version of C client
- Support Kerberos authentication of clients
- Support Kerberos authentication of clients
- Improved REST Interface
- Existing monitoring support has been extended through the introduction of a new ‘mntr’ 4 letter word
- Add tools and recipes for monitoring as a contrib
- Web-based Administrative Interface
- Automating log and snapshot cleaning
- Add logging/stats to identify production deployment issues
- Support for building RPM and DEB packages
Something to keep in mind though: ZooKeeper 3.4.0 is not production ready yet. After extensive testing, it will be followed soon by a minor release that will be production-ready.
Original title and link: Apache ZooKeeper 3.4.0 Released to Be Followed Soon by Production-Ready Version ( ©myNoSQL)
Last night, Mark Phillips sent me the following message:
Yesterday a Riak community member, Joseph Blumstedt released a pair of repos: riak_zab and riak_zab_example. In short, riak_zab is an extension for riak_core that provides totally ordered atomic broadcast capabilities.
If your first reaction was something along the line “so what?”, then you are not alone. Anyway I hope I’ve done my homework and now I understand why this is both interesting and important.
ZooKeeper is a coordination service created by Yahoo! to allow large scale applications to perform coordination tasks such as leader election, status propagation, and rendezvous. Embedded into ZooKeeper is a totally ordered broadcast protocol: Zab.
ZooKeeper makes the following requirements on the Zab broadcast protocol:
- Reliable delivery
- If a message, m, is delivered by one server, then it will be eventually delivered by all correct servers.
- Total order
- If message a is delivered before message b by one server, then every server that delivers a and b delivers a before b.
- Causal order
- If message a causally precedes message b and both messages are delivered, then a must be ordered before b.
- Prefix property:
- If m is the last message delivered for a leader L, any message proposed before m by L must also be delivered
Putting it together:
riak_zab is an extension for riak_core that provides totally ordered atomic broadcast capabilities. This is accomplished through a pure Erlang implementation of Zab, the Zookeeper Atomic Broadcast protocol invented by Yahoo! Research.
Basically this means that many operations that were typically hard to do well using the Dynamo eventual consistency model which Riak implements with riak_core are now theoretically possible. In case you’d like to read more about riak_core I’d suggest these posts Building blocks of Dynamo-like distributed systems, riak_core: building distributed applications without shared state, and Where to start with riak_core .
Indeed there are other Dynamo aspects like synchronization, scaling, ring rebalancing, handoff that riak_zab had to address and its approach is documented on the project page
Technically, riak_zab isn’t focused on message delivery. Rather, riak_zab provides consistent replicated state.
So why is this important? While parts of an application can deal with eventual consistency, there are typically a few aspects of an app where stronger consistency is needed for various reasons. A few such examples:
- Distributed Counters — just ask Digg people how they implemented distributed counters in Cassandra or check how Twitter built a ZooKeeper and Cassandra based solution for realtime analytics .
- Ordered Messaging
- Compare-and-Swap Operations
- Set Operations
- Sub-document updates
- Multi-key transactions
riak_zab will theoretically allow users to reap the availability of the standard eventually consistent model Riak provides and opt in to stronger consistency for the parts of the app that require it. In other words, it expands the number of CAP trade offs that a developer can make beyond the R, W and N model and makes possible functionality that relies on consistency guarantees.
After connecting all these dots, I’ve finally got it (nb: thanks Mark for sending this out!). And it definitely sounds like an exciting addition to riak_core. The caveat here is that even if this code has been tested in an academic environment the project is still an alpha version. Hopefully other members of the Riak community will pick it up and together with Basho guys will make it production ready.
Kevin Weil presented Twitter’s ZooKeeper and Cassandra based solution for realtime analytics named Rainbird at Strata 2011:
Until recently, counters where a unique feature of HBase. While the latest version of Cassandra does not include distributed counters, this feature is available in Cassandra’s trunk.
Original title and link: Rainbird: Twitter’s ZooKeeper + Cassandra Based Realtime Analytics Solution (NoSQL databases © myNoSQL)
- A cloud-neutral way to run services. You don’t have to worry about the idiosyncrasies of each provider.
- A common service API. The details of provisioning are particular to the service.
- Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed.
For now, Whirr supports only Hadoop, Cassandra, and Zookeeper on Amazon EC2, but roadmap specify support for both more products and cloud solutions.
While understanding the advantages of abstracting away each cloud custom API, I’m wondering why this project have been started from scratch and not built on top of solution like Chef or Puppet1.
Could anyone please remind me what similar projects are available in Python? ↩
Just a quick reference of the continuously growing Hadoop tools ecosystem.
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
A distributed file system that provides high throughput access to application data.
A software framework for distributed processing of large data sets on compute clusters.
Amazon Elastic MapReduce
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Cloudera Distribution for Hadoop (CDH)
Cloudera’s Distribution for Hadoop (CDH) sets a new standard for Hadoop-based data management platforms.
A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
A scalable, distributed database that supports structured data storage for large tables.
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
- Imports individual tables or entire databases to files in HDFS
- Generates Java classes to allow you to interact with your imported data
- Provides the ability to import from SQL databases straight into your Hive data warehouse
Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.
Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called Hive QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language.
A high-level data-flow language and execution framework for parallel computation. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
Oozie is a workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce).
Cascading is a Query API and Query Planner used for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.
Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.
Hue is a graphical user interface to operate and develop applications for Hadoop. Hue applications are collected into a desktop-style environment and delivered as a Web application, requiring no additional installation for individual users.
You can read more about HUE on ☞ Cloudera blog.
Chukwa is a data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a ﬂexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
A Scalable machine learning and data mining library.
Integration with Relational databases
Integration with Data Warehouses
The only list I have is MapReduce, RDBMS, and Data Warehouse, but I’m afraid it is quite a bit old. So maybe someone could help me update it.
Anything else? Once we validate this list, I’ll probably have to move it on the NoSQL reference
I didn’t realize Hadoop has been so long on the market: 5 years. In just a couple of hours, the celebration will start at ☞ Hadoop Summit in Santa Clara.
Yahoo!, the most active contributor to Hadoop, will ☞ open source today two new tools: Hadoop with Security and Oozie, a workflow engine.
Hadoop Security integrates Hadoop with Kerberos, providing secure access and processing of business-sensitive data.This enables organizations to leverage and extract value from their data and hardware investment in Hadoop across the enterprise while maintaining data security, allowing new collaborations and applications with business-critical data.
Oozie is an open-source workflow solution to manage jobs running on Hadoop, including HDFS, Pig, and MapReduce. Oozie — a name for an elephant tamer — was designed for Yahoo!’s rigorous use case of managing complex workflows and data pipelines at global scale. It is integrated with Hadoop Security and is quickly becoming the de-facto standard for ETL (extraction, transformation, loading) processing at Yahoo!.
Update: It looks like the news are not stopping here, Cloudera making ☞ a big announcement accompanying the new release of Cloudera’s Distribution for Hadoop CDHv3 Beta2:
The additional packages include HBase, the popular distributed columnar storage system with fast read-write access to data managed by HDFS, Hive and Pig for query access to data stored in a Hadoop cluster, Apache Zookeeper for distributed process coordination and Sqoop for moving data between Hadoop and relational database systems. We’ve adopted the outstanding workflow engine out of Yahoo!, Oozie, and have made contributions of our own to adapt it for widespread use by general enterprise customers. We’ve also released – this is a big deal, and I’m really pleased to announce it – our continuous data loading system, Flume, and our Hadoop User Environment software (formerly Cloudera Desktop, and henceforth “Hue”) under the Apache Software License, version 2.
Also worth mentioning, going forward Cloudera will also have a commercial offering: ☞ Cloudera Enterprise:
Cloudera Enterprise combines the open source CDHv3 platform with critical monitoring, management and administrative tools that our enterprise customers have told us they need to put Hadoop into production. We’ve added dashboards for critical IT tasks, including monitoring cluster status and activity, keeping track of data flows into Hadoop in real time based on the services that Flume provides, and controlling access to data and resources by users and groups. We’ve integrated access controls with Active Directory and other LDAP implementations so that IT staff can control rights and identities in the same way as they do for other business platforms they use. Cloudera Enterprise is available by annual subscription and includes maintenance, updates and support.