YARN: All content tagged as YARN in NoSQL databases and polyglot persistence
Even though almost 3 weeks have passed since the announcement, Apache Hadoop 2 is too big a piece of news not to mention here. If you want to read more about it, here are a couple of links:
What started out a few years ago as a scalable batch processing system for Java programmers has now emerged as the kernel of the operating system for big data.
A short interview with Rohit Bakhshi (product manager at Hortonworks), YARN Brings New Capabilities To Hadoop:
By turning Apache Hadoop 2.0 into a multi-application data system, YARN enables the Hadoop community to address a generation of new requirements IN Hadoop. YARN responds to these enterprise challenges by addressing the actual requirements at a foundational level rather than being commercial bolt-ons that complicate the environment for customers.
Mike Miller’s post on GigaOm: Why the world should care about Hadoop 2:
This might be surprising, because Hadoop 2 is not a blow-your-socks-off release. It is not packed with revolutionary new features from a user perspective. Instead, its greatest innovation is a glorious refactoring of some internal plumbing. But that plumbing grants the community of Hadoop developers the pathways they need to address some of Hadoop’s greatest shortcomings in comparison to both the commercial and the internal Google tools that Hadoop was derived from.
Last but not least, any article you can find about YARN signed by Arun C. Murthy is well worth reading (e.g. Apache Hadoop YARN – Background and an Overview, an old but very detailed series about YARN’s objectives, or Moving Hadoop Beyond Batch with Apache YARN).
Original title and link: Apache Hadoop 2 - YARN is GA ( ©myNoSQL)
Facebook engineering: Under the Hood: Scheduling MapReduce jobs more efficiently with Corona:
It was pretty clear that we would ultimately need a better scheduling framework that would improve this situation in the following ways:
- Better scalability and cluster utilization
- Lower latency for small jobs
- Ability to upgrade without disruption
- Scheduling based on actual task resource requirements rather than a count of map and reduce tasks
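To make the last point concrete, here is a minimal sketch contrasting slot counting with admission based on declared resources on a single node. It is not Corona's code: the `Task` and `Node` classes, their fields, and all the numbers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    memory_mb: int
    cpus: int


@dataclass
class Node:
    memory_mb: int
    cpus: int
    slots: int  # fixed number of task slots, as in slot-based scheduling


def admit_by_slots(node, tasks):
    """Admit tasks until the fixed slot count is used up, ignoring their size."""
    return [t.name for t in tasks[: node.slots]]


def admit_by_resources(node, tasks):
    """Admit each task only if its declared memory and CPU still fit on the node."""
    free_mem, free_cpus = node.memory_mb, node.cpus
    admitted = []
    for t in tasks:
        if t.memory_mb <= free_mem and t.cpus <= free_cpus:
            free_mem -= t.memory_mb
            free_cpus -= t.cpus
            admitted.append(t.name)
    return admitted


# Hypothetical node and workload: one large task and three small ones.
node = Node(memory_mb=8192, cpus=8, slots=2)
tasks = [
    Task("big", 6144, 2),
    Task("small-1", 512, 1),
    Task("small-2", 512, 1),
    Task("small-3", 512, 1),
]
by_slots = admit_by_slots(node, tasks)
by_resources = admit_by_resources(node, tasks)
```

With these made-up numbers the slot-based version stops after two tasks even though the node has memory and CPU to spare, while the resource-based version fits all four. That gap is the kind of utilization loss the list above alludes to.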
Hadoop deployment at Facebook:
- 60000 Hive queries/day
- used by > 1000 people
Is Hive the preferred way Hadoop is used at Facebook?
Facebook is running its own version of HDFS. Once you fork, integrating upstream changes becomes a nightmare.
- How to deploy and test new features at scale: rank types of users and roll out the new feature starting with the less critical scenarios. You must be able to correctly route traffic or users.
- At scale, cluster utilization is a critical metric. All the improvements in Corona are derived from this.
- Traditional analytic databases have advanced resource-based scheduling for a long time. Hadoop needs this.
- Open source at Facebook:
- create a tool that addresses an internal problem
- open source it, or rather throw it out in the wild (nb: is there any Facebook open source project they continued to maintain?)
- Option 1: continue to develop it internally. Option 2: drop it
- if by any chance the open source project survives and becomes a standalone project, catch up from time to time
- re-fork it
- Why not YARN? The best answer I could find is Joydeep Sen Sarma’s on Quora. Summarized:
- Corona uses a push-based, event-driven, callback-oriented message flow
- Corona’s JobTracker can run in the same VM with the Job Client
- Corona integrated with the Hadoop trunk Fair-Scheduler which got rewritten at Facebook
- Corona’s resource manager uses optimistic locking
- Corona uses Thrift, while others are looking at using Protobuf or Avro
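For readers unfamiliar with the term, the following is a generic sketch of optimistic locking applied to a pool of free memory. It is not Corona's implementation: the `ResourcePool` class, the `grant` helper, and the retry count are hypothetical. It is also single-threaded, so it only shows the read, check-version, commit pattern. A real concurrent system would need the version check and the update in `commit` to happen atomically.

```python
class VersionConflict(Exception):
    """Raised when the state changed between the snapshot and the commit."""


class ResourcePool:
    def __init__(self, free_mb):
        self.free_mb = free_mb
        self.version = 0  # bumped on every successful commit

    def snapshot(self):
        """Read the current state without taking a lock."""
        return self.free_mb, self.version

    def commit(self, new_free_mb, expected_version):
        """Apply the update only if nobody committed since the snapshot."""
        if self.version != expected_version:
            raise VersionConflict()
        self.free_mb = new_free_mb
        self.version += 1


def grant(pool, request_mb, max_retries=3):
    """Try to reserve memory, retrying from a fresh snapshot on conflict."""
    for _ in range(max_retries):
        free_mb, version = pool.snapshot()
        if request_mb > free_mb:
            return False  # not enough memory left, nothing to retry
        try:
            pool.commit(free_mb - request_mb, version)
            return True
        except VersionConflict:
            continue  # someone else committed first; re-read and try again
    return False
```

The design choice is that readers never block: a writer only pays a cost (a retry) when another commit actually landed in between, which tends to suit workloads where conflicts are rare.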
Original title and link: Facebook Corona: A Different Approach to Job Scheduling and Resource Management ( ©myNoSQL)
- Java vs C++
- Memory scheduling vs both memory and CPU scheduling
- Unix processes vs Linux container groups
- Pull vs push resource request model
- 3x more code in YARN
- YARN integrated pluggable schedulers vs Mesos’ upcoming hierarchical scheduling
- YARN integrates with Kerberos and inherits Hadoop security
- YARN provides rack and machine locality out of the box vs Mesos allowing you to implement these
- YARN is still a work in progress vs Mesos being a mature project
- YARN is the next generation of Hadoop MapReduce so you’ll be able to use it on your Hadoop cluster
- YARN is written by Yahoo/Hortonworks, which have shown themselves to be experienced with multi-tenancy and very large-scale cluster computing. But YARN still needs work and testing.
- Mesos ships with a number of out-of-the-box frameworks.
Original title and link: How Does YARN (NextGen MapReduce) and Mesos Compare? ( ©myNoSQL)