Hadoop: All content tagged as Hadoop in NoSQL databases and polyglot persistence
In 4 years of writing this blog I haven’t seen such a prolific month:
- Apache Hadoop 2.2.0 (more links here)
- Apache HBase 0.96 (here and here)
- Apache Hive 0.12 (more links here)
- Apache Ambari 1.4.1
- Apache Pig 0.12
- Apache Oozie 4.0.0
- Plus Presto.
Actually, I don’t think I’ve ever seen an ecosystem like the one created around Hadoop.
Original title and link: A prolific season for Hadoop and its ecosystem ( ©myNoSQL)
At the end of October, Hortonworks shared a new set of results:
Historically, even simple Hive queries could not run in less than 30 seconds, yet many of these queries are running in less than 10 seconds. How did that happen? The answer mainly boils down to Apache Tez and Apache Hadoop YARN, which proves that Hadoop is more than just batch. Tez features such as container pre-launch and re-use overcome Hadoop’s traditional latency barriers, and are available to any data processing framework running in Hadoop.
Original title and link: Status update on Project Stinger, the interactive query for Apache Hive ( ©myNoSQL)
I’ve learned that there’s an Apache Hadoop compatibility guide that covers API, wire, and Java binary compatibility, and many other such aspects.
✚ Karthik Kambatla posted Writing Hadoop programs that work across releases on Cloudera’s blog; it looks at the Hadoop API annotations and compatibility policies.
Original title and link: Apache Hadoop Compatibility Guide ( ©myNoSQL)
At the time I’m reading the Ask HN thread To everybody who uses MapReduce: what problems do you solve?, there aren’t many interesting answers.
✚ Compare it with AskReddit: What is an invention that the human race is fully capable of making, but hasn’t been made yet?
Original title and link: To everybody who uses MapReduce: what problems do you solve? ( ©myNoSQL)
Even though it’s been almost 3 weeks since the announcement, Apache Hadoop 2 is too big a piece of news not to mention here. If you want to read something about it, here are a couple of links:
What started out a few years ago as a scalable batch processing system for Java programmers has now emerged as the kernel of the operating system for big data.
A short interview with Rohit Bakhshi (product manager at Hortonworks), YARN Brings New Capabilities To Hadoop:
By turning Apache Hadoop 2.0 into a multi-application data system, YARN enables the Hadoop community to address a generation of new requirements IN Hadoop. YARN responds to these enterprise challenges by addressing the actual requirements at a foundational level rather than being commercial bolt-ons that complicate the environment for customers.
Mike Miller’s post on GigaOm: Why the world should care about Hadoop 2:
This might be surprising, because Hadoop 2 is not a blow-your-socks-off release. It is not packed with revolutionary new features from a user perspective. Instead, its greatest innovation is a glorious refactoring of some internal plumbing. But that plumbing grants the community of Hadoop developers the pathways they need to address some of Hadoop’s greatest shortcomings in comparison to both the commercial and the internal Google tools that Hadoop was derived from.
Last but not least, any article about YARN signed by Arun C. Murthy is well worth reading (e.g. Apache Hadoop YARN – Background and an Overview, an old but very detailed series about YARN’s objectives, or Moving Hadoop Beyond Batch with Apache YARN).
Original title and link: Apache Hadoop 2 - YARN is GA ( ©myNoSQL)