YARN: All content tagged as YARN in NoSQL databases and polyglot persistence

Facebook Corona: A Different Approach to Job Scheduling and Resource Management

Facebook engineering: Under the Hood: Scheduling MapReduce jobs more efficiently with Corona:

It was pretty clear that we would ultimately need a better scheduling framework that would improve this situation in the following ways:

  • Better scalability and cluster utilization
  • Lower latency for small jobs
  • Ability to upgrade without disruption
  • Scheduling based on actual task resource requirements rather than a count of map and reduce tasks

  1. Hadoop deployment at Facebook:

    • 100PB

    • 60000 Hive queries/day

    • used by > 1000 people

    Is Hive the preferred way Hadoop is used at Facebook?

  2. Facebook is running its own version of HDFS. Once you fork, integrating upstream changes becomes a nightmare.

  3. How to deploy and test new features at scale: rank types of users and roll out the new feature starting with the less critical scenarios. You must be able to correctly route traffic or users.
  4. At scale, cluster utilization is a critical metric. All the improvements in Corona are derived from this.
  5. Traditional analytic databases have advanced resource-based scheduling for a long time. Hadoop needs this.
  6. Open source at Facebook:
    1. create a tool that addresses an internal problem
    2. open source it and throw it out into the wild (nb: is there any Facebook open source project they continued to maintain?)
    3. Option 1: continue to develop it internally. Option 2: drop it
    4. if by any chance the open source project survives and becomes a standalone project, catch up from time to time
    5. re-fork it
  7. Why not YARN? The best answer I could find is Joydeep Sen Sarma’s on Quora. Summarized:
    1. Corona uses a push-based, event-driven, callback oriented message flow
    2. Corona’s JobTracker can run in the same VM with the Job Client
    3. Corona integrated with the Hadoop trunk Fair-Scheduler which got rewritten at Facebook
    4. Corona’s resource manager uses optimistic locking
    5. Corona’s using Thrift, while others are looking at using Protobuf or Avro
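
To make "scheduling based on actual task resource requirements rather than a count of map and reduce tasks" concrete, here is a minimal sketch of resource-based placement. It is not Corona's or YARN's scheduler: the `Node`, `Task`, and `schedule` names, the two resource dimensions, and the greedy first-fit policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_memory_mb: int
    free_cpus: int


@dataclass
class Task:
    name: str
    memory_mb: int
    cpus: int


def schedule(tasks, nodes):
    """Greedy first-fit: place each task on the first node with enough free resources.

    Hypothetical policy for illustration only; real schedulers also weigh
    fairness, locality, and priorities.
    """
    placement = {}
    for task in tasks:
        for node in nodes:
            if node.free_memory_mb >= task.memory_mb and node.free_cpus >= task.cpus:
                # Reserve what the task declared instead of consuming a fixed "slot".
                node.free_memory_mb -= task.memory_mb
                node.free_cpus -= task.cpus
                placement[task.name] = node.name
                break
        else:
            placement[task.name] = None  # no node can currently fit this task
    return placement


nodes = [Node("node-a", 8192, 4), Node("node-b", 4096, 2)]
tasks = [
    Task("small-map", 1024, 1),
    Task("big-reduce", 6144, 2),
    Task("medium-map", 4096, 2),
]
placement = schedule(tasks, nodes)
```

With fixed map and reduce slots, a small task and a large task each occupy one slot regardless of what they need. Here the small and the large task share `node-a` because their declared requirements fit together, which is the utilization argument made above.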

Original title and link: Facebook Corona: A Different Approach to Job Scheduling and Resource Management (NoSQL database©myNoSQL)


What Can You Do With YARN and Mesos?

Edward Capriolo asks some very good questions about the use and advantages of Apache Hadoop YARN and Mesos:

  1. Can a technology like YARN or Mesos be used together with Puppet or Chef?
    1. What are the best practices when using these two things together?
  2. In YARN’s case, how many current software packages can YARN manage outside Hadoop?
    1. MPI?
    2. Then what?
  3. Aren’t YARN/Mesos just sneaky forms of devops/noops?
  4. With clusters spinning up and falling on command, how do we monitor this environment and guarantee quality of service?
  5. Couldn’t AWS/OpenStack do this on a more general scale?
  6. Shouldn’t we just all be using Solaris zones?

When I first learned about YARN—I still need to get more familiar with Mesos beyond Jay Kreps’s YARN and Mesos comparison—I had a much simpler question: How exactly would you use YARN?

I still don’t have a good answer to my question, but now we have a couple more specific ones. Maybe someone could help us out.

via: http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/trying_to_find_a_fit


The Future of Hadoop: YARN Explained

You’ve probably read about the central goals of YARN and seen the architecture of YARN, but it’s worth having as many details about it as possible:

[Figure: Hadoop YARN architecture]

A key paragraph in Arun Murthy’s post about Apache Hadoop YARN:

MapReduce is great for many applications, but not everything; other programming models better serve requirements such as graph processing (Google Pregel / Apache Giraph) and iterative modeling (MPI). When all the data in the enterprise is already available in Hadoop HDFS, multiple paths for processing data is critical.

That’s the answer to all the critiques Hadoop is getting.

via: http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/


PaaS on Hadoop Yarn - Idea and Prototype

Related to the earlier Hadoop YARN, Beyond MapReduce, here’s a very interesting experiment by SAP Labs:

This post describes a prototype implementation of a simple PaaS built on the Hadoop YARN framework and the key findings from the experiment. While there are some advantages to using Hadoop YARN, there is at least one unsolved issue that would be difficult to overcome at this point.

[Figure: PaaS on YARN architecture]

The code for this experiment was made available on GitHub.

via: http://jaigak.blogspot.com/2012/07/paas-on-hadoop-yarn-idea-and-prototype.html


Hadoop YARN - Beyond MapReduce

In a conversation with Curt Monash, Arun Murthy (Hortonworks) explains what YARN (aka Hadoop MapReduce 2.0 or MRv2) is about:

YARN, as an aspect of Hadoop, has two major kinds of benefits:

  1. The ability to use programming frameworks other than MapReduce.
  2. Scalability, no matter what programming framework you use.

[…]

The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) JobTracker:

  • Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.
  • Managing the parallelized execution of any specific job. Under YARN, this will be done separately for each job.
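
The split described in these two bullets can be sketched as two small classes: one global object that only knows which nodes have free capacity, and one object per job that only knows about that job's tasks. This is a toy model under stated assumptions, not the YARN API: the class names `ResourceManager` and `JobManager`, the counting of capacity in whole containers, and all method names are hypothetical.

```python
class ResourceManager:
    """Global view: how many free containers each node has. Knows nothing about jobs."""

    def __init__(self, capacity):
        self.free = dict(capacity)  # node name -> free container count

    def allocate(self, count):
        """Grant up to `count` containers; return the node name hosting each one."""
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < count:
                self.free[node] -= 1
                granted.append(node)
        return granted

    def release(self, node):
        """Return one container on `node` to the free pool."""
        self.free[node] += 1


class JobManager:
    """Per-job view: tracks the tasks of a single job and where they run."""

    def __init__(self, job_id, tasks, resource_manager):
        self.job_id = job_id
        self.pending = list(tasks)
        self.running = {}  # task name -> node name
        self.resource_manager = resource_manager

    def step(self):
        """Ask for containers for all pending tasks and start whatever was granted."""
        for node in self.resource_manager.allocate(len(self.pending)):
            self.running[self.pending.pop(0)] = node

    def finish(self, task):
        """Mark a task done and hand its container back to the global manager."""
        self.resource_manager.release(self.running.pop(task))


rm = ResourceManager({"n1": 2, "n2": 1})
job_a = JobManager("job-a", ["a1", "a2"], rm)
job_b = JobManager("job-b", ["b1", "b2"], rm)

job_a.step()         # job-a gets both containers on n1
job_b.step()         # job-b gets the single container on n2; b2 stays pending
job_a.finish("a1")   # frees one container on n1
job_b.step()         # b2 can now start on n1
```

The point of the separation is visible in what each class holds: adding more jobs adds more `JobManager` instances, while the global object keeps tracking only per-node capacity.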

via: http://www.dbms2.com/2012/07/23/hadoop-yarn-beyond-mapreduce/


New Hadoop MapReduce 2.0 (MRv2 or YARN) Explained

If your job or interest has anything to do with Hadoop, this is the article you want to print out and understand in every detail (nb: I’m still working on the second part).

[Figure: Hadoop MapReduce 2.0 (YARN/MRv2)]

via: http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/


How Does YARN (NextGen MapReduce) and Mesos Compare?

Jay Kreps (LinkedIn) provides an excellent response to the question in the title on a Quora thread:

  1. Java vs C++
  2. Memory scheduling vs both memory and CPU scheduling
  3. Unix processes vs Linux container groups
  4. Pull vs push resource request model
  5. 3x more code in YARN
  6. YARN integrated pluggable schedulers vs Mesos’ upcoming hierarchical scheduling
  7. YARN integrates with Kerberos and inherits Hadoop security
  8. YARN provides rack and machine locality out of the box vs Mesos allowing you to implement these
  9. YARN is still a work in progress vs Mesos being a mature project
  10. YARN is the next generation of Hadoop MapReduce so you’ll be able to use it on your Hadoop cluster
  11. YARN is written by Yahoo/Hortonworks, which has shown to be experienced with multi-tenancy and very large-scale cluster computing. But YARN still needs work and testing.
  12. Mesos ships with a number of out-of-the-box frameworks.
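
The "pull vs push resource request model" item is easier to see in code. The sketch below assumes the list reads "YARN vs Mesos" in each item, so pull stands for the framework asking for resources and push stands for the manager offering them. Every class and method name here is hypothetical, and a single CPU counter stands in for real resource vectors.

```python
class PullManager:
    """Pull model: the framework asks for what it needs, the manager answers."""

    def __init__(self, free_cpus):
        self.free_cpus = free_cpus

    def request(self, cpus):
        granted = min(cpus, self.free_cpus)
        self.free_cpus -= granted
        return granted


class Framework:
    """A framework that needs a fixed number of CPUs and reacts to offers."""

    def __init__(self, needed_cpus):
        self.needed_cpus = needed_cpus

    def on_offer(self, offered_cpus):
        # Callback invoked by the manager; take only what is still needed.
        take = min(self.needed_cpus, offered_cpus)
        self.needed_cpus -= take
        return take


class PushManager:
    """Push model: the manager offers what is free, the framework accepts a part."""

    def __init__(self, free_cpus):
        self.free_cpus = free_cpus

    def offer_to(self, framework):
        accepted = framework.on_offer(self.free_cpus)
        # Never hand out more than was offered, whatever the framework says.
        accepted = max(0, min(accepted, self.free_cpus))
        self.free_cpus -= accepted
        return accepted


pull = PullManager(free_cpus=4)
first = pull.request(3)    # granted in full
second = pull.request(3)   # only the remainder is left

push = PushManager(free_cpus=4)
framework = Framework(needed_cpus=3)
accepted = push.offer_to(framework)
```

In the pull variant the manager has to understand requests; in the push variant the placement decision moves into the framework's callback, which is also why a push-based design tends to be described as event-driven and callback-oriented.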

You can find out more about Mesos here and about YARN here. And if you have a Quora account go upvote Jay’s answer.
