YARN: All content tagged as YARN in NoSQL databases and polyglot persistence

Docker, Hadoop and YARN

Jack Clark (The Register) covers the work done to integrate Docker with Hadoop:

“Where Docker makes perfect sense for YARN is that we can use Docker Images to fully describe the entire unix filesystem image for any YARN container,” explained Arun Murthy, a founder and architect at Hortonworks, to El Reg in an email.

Original title and link: Docker, Hadoop and YARN (NoSQL database©myNoSQL)


Apache Hadoop 2 - YARN is GA

Even though almost 3 weeks have passed since the announcement, Apache Hadoop 2 is too big a piece of news not to mention here. If you want to read something about it, here are a couple of links:

  • The Apache Software Foundation Announces Apache™ Hadoop™ 2 (a bit PRish)

    Doug Cutting:

    What started out a few years ago as a scalable batch processing system for Java programmers has now emerged as the kernel of the operating system for big data.

  • A short interview with Rohit Bakhshi (product manager at Hortonworks), YARN Brings New Capabilities To Hadoop:

    By turning Apache Hadoop 2.0 into a multi-application data system, YARN enables the Hadoop community to address a generation of new requirements IN Hadoop. YARN responds to these enterprise challenges by addressing the actual requirements at a foundational level rather than being commercial bolt-ons that complicate the environment for customers.

  • Mike Miller’s post on GigaOm: Why the world should care about Hadoop 2:

    This might be surprising, because Hadoop 2 is not a blow-your-socks-off release. It is not packed with revolutionary new features from a user perspective. Instead, its greatest innovation is a glorious refactoring of some internal plumbing. But that plumbing grants the community of Hadoop developers the pathways they need to address some of Hadoop’s greatest shortcomings in comparison to both the commercial and the internal Google tools that Hadoop was derived from.

  • Last but not least, any article you can find about YARN signed by Arun C. Murthy will be well worth reading (e.g. Apache Hadoop YARN – Background and an Overview, an old but very detailed series about YARN’s objectives, or Moving Hadoop Beyond Batch with Apache YARN).

Original title and link: Apache Hadoop 2 - YARN is GA (NoSQL database©myNoSQL)

Hoya, HBase on YARN, Architecture

The architecture of HBase on top of YARN, a project named Hoya:


The main question I had about what YARN would bring to HBase is answered in the post. But I’m still not sure I get the whole picture of how YARN improves HBase’s availability (if it does):

YARN keeps an eye on the health of the containers, telling the AM when there is a problem. It also monitors the Hoya AM itself. When the AM fails, YARN allocates a new container for it, and restarts it. This provides an availability solution to Hoya without it having to code it in itself.
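To make the quoted behavior concrete, here is a minimal sketch of that supervision loop. It is a toy model under my own assumptions: `ToyResourceManager`, `Container`, and `check_health` are hypothetical names, not YARN or Hoya APIs, and the real system does this with heartbeats and RPC rather than a single in-process pass.

```python
# Toy model only: none of these names are YARN or Hoya APIs.
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Container:
    container_id: int
    role: str  # "am" for the application master, "worker" for anything else
    healthy: bool = True


@dataclass
class ToyResourceManager:
    containers: dict = field(default_factory=dict)
    am_notifications: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def allocate(self, role):
        """Hand out a new container for the given role."""
        container = Container(next(self._ids), role)
        self.containers[container.container_id] = container
        return container

    def check_health(self):
        """One monitoring pass over all known containers."""
        for container in list(self.containers.values()):
            if container.healthy:
                continue
            del self.containers[container.container_id]
            if container.role == "am":
                # The AM itself failed: allocate a new container and restart it.
                self.allocate("am")
            else:
                # A worker failed: only tell the AM, which decides what to do.
                self.am_notifications.append(container.container_id)
```

In this sketch the application master never has to watch itself, which is the part of the availability story the quote describes. What the AM does with a failure notification is left out entirely.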

Original title and link: Hoya, HBase on YARN, Architecture (NoSQL database©myNoSQL)


Improvements in the Hadoop YARN Fair Scheduler

Sandy Ryza goes through the changes in the YARN Fair Scheduler:

A big change in the YARN Fair Scheduler is how it defines a “resource”. In MR1, the basic unit of scheduling was the “slot”, an abstraction of a space for a task on a machine in the cluster. Because YARN expects to schedule jobs with heterogeneous task resource requests, it instead allows containers to request variable amounts of memory and schedules based on those. Cluster resources no longer need to be partitioned into map and reduce slots, meaning that a large job can use all the resources in the cluster in its map phase and then do so again in its reduce phase. This allows for better utilization of the cluster, better treatment of tasks with high resource requests, and more portability of jobs between clusters — a developer no longer needs to worry about a slot meaning different things on different clusters; rather, they can request concrete resources to satisfy their jobs’ needs. Additionally, work is being done (YARN-326) that will allow the Fair Scheduler to schedule based on CPU requirements and availability as well.
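The difference between slots and memory-based containers is easier to see with numbers. The sketch below is a toy model under my own assumptions: `slot_schedule` and `memory_schedule` are hypothetical helpers, the placement policy is plain first-fit, and none of the Fair Scheduler's actual fairness logic is modelled.

```python
def slot_schedule(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MR1 style: a task can only run in a slot of its own type.

    Returns how many map and reduce tasks can run at the same time.
    """
    return min(map_slots, map_tasks), min(reduce_slots, reduce_tasks)


def memory_schedule(node_free_mb, requests_mb):
    """YARN style: each container asks for a concrete amount of memory.

    Places every request on the first node with enough free memory and
    returns the (request, node index) placements plus the remaining memory.
    """
    free = list(node_free_mb)
    placements = []
    for request in requests_mb:
        for node, available in enumerate(free):
            if available >= request:
                free[node] -= request
                placements.append((request, node))
                break
    return placements, free


# A cluster carved into 4 map and 4 reduce slots runs only 4 of 8 map tasks
# at once, even though the reduce slots sit idle during the map phase.
running_maps, running_reduces = slot_schedule(4, 4, 8, 0)

# The same cluster described as two nodes with 8192 MB each can run all
# eight 2048 MB map containers at once.
placements, free = memory_schedule([8192, 8192], [2048] * 8)
```

The second call also accepts mixed request sizes, which is the heterogeneous case the quote says slots handled poorly.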

Basically the scheduler in Hadoop goes from a minimum viable product to a resource-aware scheduler. But as far as I know, schedulers in commercial MPP systems are even smarter and more configurable, so there’s still room for improvement.

Original title and link: Improvements in the Hadoop YARN Fair Scheduler (NoSQL database©myNoSQL)


The origin of YARN

Klint Finley tracking the origin of YARN:

Arun C. Murthy awoke to a phone call. It was 3 a.m., and an ad-targeting application at Yahoo, where he was an engineer, was running at painfully slow speeds. The culprit: a piece of software code that tapped into the open source number-crunching platform Hadoop. Someone else had written the code, but it was Murthy’s job to fix it.

Like many other brilliant things, YARN was born out of, or at least inspired by, hatred of the status quo.

Original title and link: The origin of YARN (NoSQL database©myNoSQL)


Facebook Corona: A Different Approach to Job Scheduling and Resource Management

Facebook engineering: Under the Hood: Scheduling MapReduce jobs more efficiently with Corona:

It was pretty clear that we would ultimately need a better scheduling framework that would improve this situation in the following ways:

  • Better scalability and cluster utilization
  • Lower latency for small jobs
  • Ability to upgrade without disruption
  • Scheduling based on actual task resource requirements rather than a count of map and reduce tasks

  1. Hadoop deployment at Facebook:

    • 100PB

    • 60000 Hive queries/day

    • used by > 1000 people

    Is Hive the preferred way Hadoop is used at Facebook?

  2. Facebook is running its own version of HDFS. Once you fork, integrating upstream changes becomes a nightmare.

  3. How to deploy and test new features at scale: rank types of users and roll out the new feature starting with the less critical scenarios. You must be able to correctly route traffic or users.
  4. At scale, cluster utilization is a critical metric. All the improvements in Corona are derived from this.
  5. Traditional analytic databases have advanced resource-based scheduling for a long time. Hadoop needs this.
  6. Open source at Facebook:
    1. create a tool that addresses an internal problem
    2. open source it and throw it out into the wild (nb: is there any Facebook open source project they continued to maintain?)
    3. Option 1: continue to develop it internally. Option 2: drop it
    4. if by any chance the open source project survives and becomes a standalone project, catch up from time to time
    5. re-fork it
  7. Why not YARN? The best answer I could find is Joydeep Sen Sarma’s on Quora. Summarized:
    1. Corona uses a push-based, event-driven, callback oriented message flow
    2. Corona’s JobTracker can run in the same VM with the Job Client
    3. Corona integrated with the Hadoop trunk Fair-Scheduler which got rewritten at Facebook
    4. Corona’s resource manager uses optimistic locking
    5. Corona uses Thrift, while others are looking at using Protobuf or Avro
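
For readers unfamiliar with the optimistic locking mentioned in the list above, here is a generic sketch of the idea. I don't know how Corona actually implements it; `ToyResourcePool`, `StaleVersionError`, and `grant_with_retry` are hypothetical names, and a real resource manager would need the check-and-update step to be atomic across threads.

```python
class StaleVersionError(Exception):
    """Raised when the pool changed after the caller took its snapshot."""


class ToyResourcePool:
    def __init__(self, free_slots):
        self.free_slots = free_slots
        self.version = 0

    def snapshot(self):
        """Read the current state without taking any lock."""
        return self.version, self.free_slots

    def commit_grant(self, seen_version, wanted):
        """Apply a grant only if nobody else changed the pool in the meantime."""
        if seen_version != self.version:
            raise StaleVersionError("pool changed since snapshot")
        if wanted > self.free_slots:
            raise ValueError("not enough free resources")
        self.free_slots -= wanted
        self.version += 1


def grant_with_retry(pool, wanted, max_attempts=3):
    """Decide on a snapshot, try to commit, and retry if the snapshot was stale."""
    for _ in range(max_attempts):
        version, free = pool.snapshot()
        if wanted > free:
            return False
        try:
            pool.commit_grant(version, wanted)
            return True
        except StaleVersionError:
            continue
    return False
```

The point of the pattern is that scheduling decisions are made without holding a lock, and a conflict only costs a retry.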

Original title and link: Facebook Corona: A Different Approach to Job Scheduling and Resource Management (NoSQL database©myNoSQL)

What Can You Do With YARN and Mesos?

Edward Capriolo asks some very good questions about the use and advantages of Apache Hadoop YARN and Mesos:

  1. Can a technology like YARN or Mesos be used together with puppet or chef?
    1. What are the best practices when using these two things together?
  2. In YARN’s case, how many current software packages can YARN manage outside Hadoop?
    1. MPI?
    2. Then what?
  3. Aren’t YARN/Mesos just sneaky forms of devops/noops?
  4. With clusters spinning up and falling on command how do we monitor this environment and guarantee quality of service?
  5. Couldn’t AWS/open stack do this on a more general scale?
  6. Shouldn’t we just all be using solaris zones?

When I first learned about YARN—I still need to get more familiar with Mesos beyond Jay Kreps’s YARN and Mesos comparison—I had a much simpler question: How exactly would you use YARN?

I still don’t have a good answer to my question, but now we have a couple more specific ones. Maybe someone could help us out.

Original title and link: What Can You Do With YARN and Mesos? (NoSQL database©myNoSQL)


The Future of Hadoop: YARN Explained

You’ve probably read about the central goals of YARN and seen the architecture of YARN, but it’s worth having as many details about it as possible:

Hadoop YARN Architecture

A key paragraph in Arun Murthy’s post about Apache Hadoop YARN:

MapReduce is great for many applications, but not everything; other programming models better serve requirements such as graph processing (Google Pregel / Apache Giraph) and iterative modeling (MPI). When all the data in the enterprise is already available in Hadoop HDFS, multiple paths for processing data is critical.

That’s for all the critiques Hadoop is getting.

Original title and link: The Future of Hadoop: YARN Explained (NoSQL database©myNoSQL)


PaaS on Hadoop Yarn - Idea and Prototype

Related to the earlier Hadoop YARN, Beyond MapReduce, here’s a very interesting experiment by SAP Labs:

This post describes a prototype implementation of a simple PAAS built on the Hadoop YARN framework and the key findings from the experiment. While there are some advantages to using Hadoop YARN, there is at least one unsolved issue that would be difficult to overcome at this point.


Code of this experiment was made available on GitHub.

Original title and link: PaaS on Hadoop Yarn - Idea and Prototype (NoSQL database©myNoSQL)


Hadoop YARN - Beyond MapReduce

In a conversation with Curt Monash, Arun Murthy (Hortonworks) explains what YARN (aka Hadoop MapReduce 2.0 or MRv2) is about:

YARN, as an aspect of Hadoop, has two major kinds of benefits:

  1. The ability to use programming frameworks other than MapReduce.
  2. Scalability, no matter what programming framework you use.


The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) JobTracker:

  • Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.
  • Managing the parallelization execution of any specific job. Under YARN, this will be done separately for each job.
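
The separation described in the two bullets above can be sketched as two small classes. This is a toy model under my own assumptions: `ClusterResourceTracker` and `JobMaster` are hypothetical names standing in for the global and per-job roles, and they are not YARN classes.

```python
class ClusterResourceTracker:
    """Global role: knows which node has how much free memory."""

    def __init__(self, node_free_mb):
        self.node_free_mb = dict(node_free_mb)

    def grant(self, memory_mb):
        """Reserve memory on the first node that has room, or return None."""
        for node, free in self.node_free_mb.items():
            if free >= memory_mb:
                self.node_free_mb[node] = free - memory_mb
                return node
        return None


class JobMaster:
    """Per-job role: splits one job into tasks and tracks where they run."""

    def __init__(self, name, task_count, task_memory_mb):
        self.name = name
        self.task_count = task_count
        self.task_memory_mb = task_memory_mb
        self.placements = {}  # task index -> node name

    def run(self, tracker):
        """Ask the global tracker for resources, one task at a time."""
        for task in range(self.task_count):
            node = tracker.grant(self.task_memory_mb)
            if node is None:
                break  # cluster is full; a real master would wait and retry
            self.placements[task] = node
        return self.placements


tracker = ClusterResourceTracker({"node-a": 2048, "node-b": 1024})
first_job = JobMaster("job-1", task_count=2, task_memory_mb=1024)
second_job = JobMaster("job-2", task_count=2, task_memory_mb=1024)
first_placements = first_job.run(tracker)
second_placements = second_job.run(tracker)
```

There is one tracker for the whole cluster and one master per job, so job-specific bookkeeping no longer piles up in a single JobTracker-like component.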

Original title and link: Hadoop YARN - Beyond MapReduce (NoSQL database©myNoSQL)


New Hadoop MapReduce 2.0 (MRv2 or YARN) Explained

If your job or interest has anything to do with Hadoop, this is the article you want to print out and understand in every detail (nb: I’m still working on the second part).

Hadoop MapReduce 2.0 YARN MRv2

Original title and link: New Hadoop MapReduce 2.0 (MRv2 or YARN) Explained (NoSQL database©myNoSQL)


How Does YARN (NextGen MapReduce) and Mesos Compare?

Jay Kreps (LinkedIn) provides an excellent response to the question in the title on a Quora thread:

  1. Java vs C++
  2. Memory scheduling vs both memory and CPU scheduling
  3. Unix processes vs Linux container groups
  4. Pull vs push resource request model
  5. 3x more code in YARN
  6. YARN integrated pluggable schedulers vs Mesos’ upcoming hierarchical scheduling
  7. YARN integrates with Kerberos and inherits Hadoop security
  8. YARN provides rack and machine locality out of the box vs Mesos allowing you to implement these
  9. YARN is still a work in progress vs Mesos being a mature project
  10. YARN is the next generation of Hadoop MapReduce so you’ll be able to use it on your Hadoop cluster
  11. YARN is written by Yahoo/Hortonworks, which has shown itself to be experienced with multi-tenancy and very large-scale cluster computing. But YARN still needs work and testing.
  12. Mesos ships with a number of out-of-the-box frameworks.

You can find out more about Mesos here and about YARN here. And if you have a Quora account go upvote Jay’s answer.

Original title and link: How Does YARN (NextGen MapReduce) and Mesos Compare? (NoSQL database©myNoSQL)