YARN: All content tagged as YARN in NoSQL databases and polyglot persistence
Tuesday, 20 November 2012
Facebook Corona: A Different Approach to Job Scheduling and Resource Management
Facebook engineering: Under the Hood: Scheduling MapReduce jobs more efficiently with Corona:
It was pretty clear that we would ultimately need a better scheduling framework that would improve this situation in the following ways:
- Better scalability and cluster utilization
- Lower latency for small jobs
- Ability to upgrade without disruption
- Scheduling based on actual task resource requirements rather than a count of map and reduce tasks
- Hadoop deployment at Facebook:
  - 100PB
  - 60000 Hive queries/day
  - used by > 1000 people
  Is Hive the preferred way Hadoop is used at Facebook?
- Facebook is running its own version of HDFS. Once you fork, integrating upstream changes becomes a nightmare.
- How to deploy and test new features at scale: rank types of users and roll out the new feature starting with the less critical scenarios. You must be able to correctly route traffic or users.
- At scale, cluster utilization is a critical metric. All the improvements in Corona are derived from this.
- Traditional analytic databases have advanced resource-based scheduling for a long time. Hadoop needs this.
- Open source at Facebook:
  - create a tool that addresses an internal problem
  - open source it, or rather throw it out in the wild (nb: is there any Facebook open source project they continued to maintain?)
  - Option 1: continue to develop it internally. Option 2: drop it
  - if by any chance the open source project survives and becomes a standalone project, catch up from time to time
  - re-fork it
- Why not YARN? The best answer I could find is Joydeep Sen Sarma’s on Quora. Summarized:
  - Corona uses a push-based, event-driven, callback-oriented message flow
  - Corona’s JobTracker can run in the same VM as the Job Client
  - Corona is integrated with the Hadoop trunk Fair Scheduler, which was rewritten at Facebook
  - Corona’s resource manager uses optimistic locking
  - Corona uses Thrift, while others are looking at using Protobuf or Avro
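The last goal in the quoted list, scheduling on actual task resource requirements rather than a count of map and reduce tasks, is easier to see with a toy example. What follows is only a sketch under my own assumptions: the `Task` and `Node` types, the numbers, and both scheduling functions are made up for illustration and are not Corona's or Hadoop's code.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    memory_mb: int
    cpus: int


@dataclass
class Node:
    memory_mb: int
    cpus: int
    slots: int  # fixed slot count, as used by the slot-based scheduler


def schedule_by_slots(node, tasks):
    """Admit tasks until the slots run out, ignoring what each task needs."""
    return [t.name for t in tasks[:node.slots]]


def schedule_by_resources(node, tasks):
    """Admit a task only if its declared memory and CPU still fit on the node."""
    free_mem, free_cpu = node.memory_mb, node.cpus
    admitted = []
    for t in tasks:
        if t.memory_mb <= free_mem and t.cpus <= free_cpu:
            free_mem -= t.memory_mb
            free_cpu -= t.cpus
            admitted.append(t.name)
    return admitted


# Hypothetical node and workload, chosen only to show the difference.
node = Node(memory_mb=8192, cpus=8, slots=2)
tasks = [
    Task("small-1", 1024, 1),
    Task("small-2", 1024, 1),
    Task("small-3", 1024, 1),
    Task("big", 4096, 4),
]

print(schedule_by_slots(node, tasks))
print(schedule_by_resources(node, tasks))
```

With two fixed slots the node stops at two small tasks and leaves most of its memory idle, while the resource-aware version fits all four. That gap is the cluster utilization argument in miniature.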
Original title and link: Facebook Corona: A Different Approach to Job Scheduling and Resource Management (©myNoSQL)
Thursday, 9 August 2012
What Can You Do With YARN and Mesos?
Edward Capriolo asks some very good questions about the use and advantages of Apache Hadoop YARN and Mesos:
- Can a technology like YARN or Mesos be used together with puppet or chef?
- What are the best practices when using these two things together?
- In YARN’s case, how many current software packages can YARN manage outside Hadoop?
- MPI?
- Then what?
- Aren’t YARN/Mesos just sneaky forms of devops/noops?
- With clusters spinning up and falling on command, how do we monitor this environment and guarantee quality of service?
- Couldn’t AWS/OpenStack do this on a more general scale?
- Shouldn’t we just all be using Solaris zones?
When I first learned about YARN—I still need to get more familiar with Mesos beyond Jay Kreps’s YARN and Mesos comparison—I had a much simpler question: How exactly would you use YARN?
I still don’t have a good answer to my question, but now we have a couple more specific ones. Maybe someone could help us out.
via: http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/trying_to_find_a_fit
The Future of Hadoop: YARN Explained
You’ve probably read about the central goals of YARN and seen the architecture of YARN, but it’s worth having as many details about it as possible.
A key paragraph in Arun Murthy’s post about Apache Hadoop YARN:
MapReduce is great for many applications, but not everything; other programming models better serve requirements such as graph processing (Google Pregel / Apache Giraph) and iterative modeling (MPI). When all the data in the enterprise is already available in Hadoop HDFS, multiple paths for processing data is critical.
That’s a response to all the criticism Hadoop is getting.
via: http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
Monday, 23 July 2012
PaaS on Hadoop Yarn - Idea and Prototype
Related to the earlier Hadoop YARN - Beyond MapReduce, here’s a very interesting experiment by SAP Labs:
This post describes a prototype implementation of a simple PAAS built on the Hadoop YARN framework and the key findings from the experiment. While there are some advantages to using Hadoop YARN, there is at least one unsolved issue that would be difficult to overcome at this point.
The code for this experiment is available on GitHub.
via: http://jaigak.blogspot.com/2012/07/paas-on-hadoop-yarn-idea-and-prototype.html
Sunday, 22 July 2012
Hadoop YARN - Beyond MapReduce
In a conversation with Curt Monash, Arun Murthy (Hortonworks) explains what YARN (aka Hadoop MapReduce 2.0 or MRv2) is about:
YARN, as an aspect of Hadoop, has two major kinds of benefits:
- The ability to use programming frameworks other than MapReduce.
- Scalability, no matter what programming framework you use.
[…]
The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) JobTracker:
- Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.
- Managing the parallelized execution of any specific job. Under YARN, this will be done separately for each job.
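To make that separation concrete, here is a minimal sketch of the two responsibilities as two objects: one global tracker that only knows which nodes have how much free memory, and one manager per job that only knows its own tasks. The class names, methods, and numbers are my own placeholders for illustration. This is not YARN's actual API.

```python
class ClusterResourceTracker:
    """Global view: which nodes have how much free memory."""

    def __init__(self, free_mb_by_node):
        self.free_mb_by_node = dict(free_mb_by_node)

    def allocate(self, memory_mb):
        """Reserve memory on the first node that has room; return its name or None."""
        for node, free in self.free_mb_by_node.items():
            if free >= memory_mb:
                self.free_mb_by_node[node] = free - memory_mb
                return node
        return None


class PerJobManager:
    """Per-job view: which of this job's tasks are pending and where the rest run."""

    def __init__(self, job_id, task_memory_mb, task_count, tracker):
        self.task_memory_mb = task_memory_mb
        self.tracker = tracker
        self.pending = [f"{job_id}-task-{i}" for i in range(task_count)]
        self.running = {}

    def request_resources(self):
        """Ask the global tracker for room for each pending task."""
        still_pending = []
        for task in self.pending:
            node = self.tracker.allocate(self.task_memory_mb)
            if node is None:
                still_pending.append(task)
            else:
                self.running[task] = node
        self.pending = still_pending


# Hypothetical two-node cluster shared by two independent jobs.
tracker = ClusterResourceTracker({"node-a": 2048, "node-b": 1024})
job1 = PerJobManager("job1", task_memory_mb=1024, task_count=2, tracker=tracker)
job2 = PerJobManager("job2", task_memory_mb=1024, task_count=2, tracker=tracker)
job1.request_resources()
job2.request_resources()

print(job1.running, job1.pending)
print(job2.running, job2.pending)
```

The tracker never learns what a job is doing with its allocations, and a job never sees another job's tasks. That is the split the quote describes: cluster state is global, job execution is managed once per job.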
via: http://www.dbms2.com/2012/07/23/hadoop-yarn-beyond-mapreduce/
Wednesday, 7 March 2012
New Hadoop MapReduce 2.0 (MRv2 or YARN) Explained
If your job or interest has anything to do with Hadoop, this is the article you want to print out and understand in every detail (nb: I’m still working on the second part).

via: http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/
Tuesday, 6 September 2011
How Does YARN (NextGen MapReduce) and Mesos Compare?
Jay Kreps (LinkedIn) provides an excellent response to the question in the title on a Quora thread:
- Java vs C++
- Memory scheduling vs both memory and CPU scheduling
- Unix processes vs Linux container groups
- Pull vs push resource request model
- 3x more code in YARN
- YARN integrated pluggable schedulers vs Mesos’ upcoming hierarchical scheduling
- YARN integrates with Kerberos and inherits Hadoop security
- YARN provides rack and machine locality out of the box vs Mesos letting you implement these yourself
- YARN is still under development vs Mesos being a mature project
- YARN is the next generation of Hadoop MapReduce, so you’ll be able to use it on your Hadoop cluster
- YARN is written by Yahoo/Hortonworks, which have proven experienced with multi-tenancy and very large-scale cluster computing. But YARN still needs work and testing.
- Mesos ships with a number of out-of-the-box frameworks.
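The "pull vs push resource request model" item is the most abstract one in the list, so here is a rough sketch of the difference. I'm assuming "pull" means the framework asks the scheduler for a specific amount of resources, and "push" means the scheduler offers whatever is free and the framework accepts or declines. Both functions and the toy cluster are hypothetical and do not come from either project's code.

```python
def pull_model(free_mb_by_node, request_mb):
    """Framework states what it needs; the scheduler picks a node that fits."""
    for node, free in free_mb_by_node.items():
        if free >= request_mb:
            return node
    return None


def push_model(free_mb_by_node, accept_offer):
    """Scheduler offers each node's free memory; the framework decides."""
    for node, free in free_mb_by_node.items():
        if accept_offer(node, free):
            return node
    return None


# Hypothetical cluster state: free memory in MB per node.
cluster = {"node-a": 512, "node-b": 4096, "node-c": 2048}

# Pull: the request carries the requirement, placement is the scheduler's call.
print(pull_model(cluster, request_mb=1024))

# Push: the framework applies its own policy to each offer, here a
# made-up preference for node-c combined with a minimum memory size.
print(push_model(cluster, lambda node, free: node == "node-c" and free >= 1024))
```

In the pull version the placement logic lives in the scheduler. In the push version it moves into the framework's accept function, which fits the locality bullet above: the framework can implement its own placement rules.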
You can find out more about Mesos here and about YARN here. And if you have a Quora account go upvote Jay’s answer.

