hadoop: All content tagged as hadoop in NoSQL databases and polyglot persistence

Hadoop for graphs - GraphLab picks up $6.75m from Madrona and NEA

Robin Wauters for TNW:

Seattle startup GraphLab claims it is building the “fastest machine-learning analytics engine for graph datasets”, based on the popular open-source distributed graph computation framework with the same name, and it has just raised capital to come through on its promise.

Good luck to GraphLab’s team.

✚ Here’s a short list of MapReduce implementations for graphs.

Original title and link: Hadoop for graphs - GraphLab picks up $6.75m from Madrona and NEA (NoSQL database©myNoSQL)

via: http://thenextweb.com/insider/2013/05/14/graphlab-funding/


Hadoop, Moh's Law and Corollaries

Robert Novak proposes Moh’s law and Rob’s corollaries to Hadoop and Big Data:

  1. Hadoop is hard.
  2. Make sure you’re measuring what you think you’re measuring.
  3. Make sure you’re measuring what you need to be measuring.

For the first, I’m somewhat confident that Cloudera, Hortonworks, and others will eventually solve it. But for the other two, you alone are responsible. Not even a SaaS can save you.

Original title and link: Hadoop, Moh’s Law and Corollaries (NoSQL database©myNoSQL)

via: https://rsts11.wordpress.com/2013/05/14/mohs-law-and-big-data-rsts11/


What Open Source Hadoop Coming to Windows Means to IT

This will open up Hadoop to a large number of organizations that have no in-house Linux skills. Shaun Connolly, vice president of Corporate Strategy at Hortonworks, explains the thinking behind moving HDP to Windows in this way: “Essentially it’s a market-driven decision,” he says. “Hadoop is built for the scaleout commodity hardware market, and the commodity hardware market is 70% Windows by install base and expertise.”

Employees in Windows-only companies will be able to make use of Hadoop easily because Excel can be used as a business intelligence tool to view the results of Hadoop Big Data analysis (whether Hadoop is running on Windows or Linux). “Ideally we want Microsoft users to be oblivious to the fact that everything is coming from Hadoop,” says Connolly. “If end users can consume data without any learning curve, thanks to tools like Excel, then they get more value.”

Either the data or the logic above is not sound:

  1. those Windows machines that make up the 70% of the market are probably running Excel
  2. those 70% of the market Windows machines are not going to run Hadoop

Following this sort of market-share reasoning, tomorrow we should see Hadoop for iOS, Android, and Nokia. Sometime soon Microsoft will release Excel for iOS and maybe Android.

Original title and link: What Open Source Hadoop Coming to Windows Means to IT (NoSQL database©myNoSQL)

via: http://www.cio.com/article/733260/What_Open_Source_Hadoop_Coming_to_Windows_Means_to_IT


Cloudera Announces Cloudera Developer Kit, Enabling Developers to Build Hadoop Apps Faster

I didn’t know what to think of this announcement after reading the WSJ title. After checking the project’s GitHub page, I still don’t know what to make of it.

Original title and link: Cloudera Announces Cloudera Developer Kit, Enabling Developers to Build Hadoop Apps Faster (NoSQL database©myNoSQL)


Hadoop Drives Down Costs

Darryl K. Taft reporting on the experience of using Hadoop at UC Irvine Medical Center:

Because they were bleeding money, the team wanted a cost-effective solution. “Our target was $500 per terabyte. We were at $100,000 per terabyte with the old system,” Peterson said. “With our Hadoop cluster, we’re now at $900 per terabyte.”

How are these costs calculated?

  1. Fixed costs: hardware, any one time licenses
  2. Recurring costs: hardware replacement, energy, HR

Is this all?
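
To make the question concrete, here is a minimal sketch of one way a per-terabyte figure could be put together. Every number and every category name below is hypothetical; none of it comes from the UC Irvine report, and the function `cost_per_terabyte` is just an illustration.

```python
def cost_per_terabyte(fixed_costs, recurring_costs_per_year, years, usable_terabytes):
    """Total cost over `years` divided by usable capacity.

    fixed_costs and recurring_costs_per_year are dicts of
    category name -> amount in dollars (all hypothetical).
    """
    total = sum(fixed_costs.values()) + years * sum(recurring_costs_per_year.values())
    return total / usable_terabytes


# Made-up inputs, for illustration only.
fixed = {"hardware": 60_000, "one_time_licenses": 0}
recurring = {"hardware_replacement": 5_000, "energy": 4_000, "staff": 21_000}

# Counting both fixed and recurring costs over three years.
full_cost = cost_per_terabyte(fixed, recurring, years=3, usable_terabytes=200)

# Counting only the fixed costs gives a much lower figure for the same cluster.
hardware_only = cost_per_terabyte(fixed, {}, years=3, usable_terabytes=200)

print(full_cost, hardware_only)
```

With these made-up inputs the two figures differ by more than a factor of two, which is why it matters which cost categories a quoted per-terabyte number includes.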

Original title and link: Hadoop Drives Down Costs (NoSQL database©myNoSQL)

via: http://www.eweek.com/print/cloud/hadoop-drives-down-costs-drives-up-usability-with-sql-convergence/


Impala 1.0 - That was fast

Cloudera announces Impala 1.0 GA release.

That was fast. I guess this is one of the (little) advantages of having Hortonworks working on Stinger, Pivotal on HAWQ, and Qubole offering Hive, Pig, and Sqoop as a service.

Original title and link: Impala 1.0 - That was fast (NoSQL database©myNoSQL)


Hadoop Virtualization

Roberto V. Zicari interviewing Joe Russell [1] about Hadoop virtualization with Serengeti:

A common misconception when virtualizing Hadoop clusters is that we decouple the data nodes from the physical infrastructure. This is not necessarily true. When users virtualize a Hadoop cluster using Project Serengeti, they separate data from compute while preserving data locality. By preserving data locality, we ensure that performance isn’t negatively impacted, or essentially making the infrastructure appear as static. Additionally, it creates true multi-tenancy within more layers of the Hadoop stack, not just the name node.

I’m not 100% sure I get this, but the only way I can make sense of it is that HDFS lives directly on the physical hardware and only the compute part is virtualized. Is that what he means?


  1. Joe Russell is Product Line Marketing Manager at VMware. 

Original title and link: Hadoop Virtualization (NoSQL database©myNoSQL)

via: http://www.odbms.org/blog/2013/04/on-virtualize-hadoop-interview-with-joe-russell/


Project Savanna: Hadoop and OpenStack

Timothy Prickett Morgan for The Register about Project Savanna, a collaboration between Mirantis, Hortonworks, and Red Hat:

Batman and Robin. Peanut butter and chocolate. OpenStack and Hadoop. These are things that go together, with the latter pairing being something that commercial OpenStack distie Mirantis, commercial Hadoop distie Hortonworks, and commercial KVM and Linux distie (and soon to be OpenStack commercializer) Red Hat are putting together under a new OpenStack effort dubbed Project Savanna.

Hadoop is at the age where everyone tries to package it and claim they’ll be the Red Hat of the Hadoop ecosystem. I cannot really dot the i’s and cross the t’s, but my gut feeling is that right now all these efforts are more similar to the attempts to bring Linux to the desktop.

We know how successful these have been so far.

Original title and link: Project Savanna: Hadoop and OpenStack (NoSQL database©myNoSQL)

via: http://www.theregister.co.uk/2013/04/18/project_savanna_hadoop_on_openstack/


Project Falcon: Tackling Hadoop Data Lifecycle Management

Venkatesh Seetharam announcing a new Apache incubating project in the Hadoop ecosystem open sourced by InMobi and Hortonworks:

Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.

I think this diagram describes Project Falcon best:

[Diagram: Project Falcon at a Glance]

✚ Was there any other project addressing this space?

Original title and link: Project Falcon: Tackling Hadoop Data Lifecycle Management (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/project-falcon-tackling-hadoop-data-lifecycle-management-via-community-driven-open-source/


Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo!

Andy Feng wrote a post on the YDN blog about the data processing architecture at Yahoo! for delivering personalized content by analyzing billions of events for 700 million users and 2.2 billion content pieces every day, using a combination of batch processing (Hadoop) and stream processing (Storm):

Enabling low-latency big-data processing is one of the primary design goals of Yahoo!’s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of design patterns applicable to many Yahoo! use cases. In Q1 2013, we added Storm as a new service to our big-data platform. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for stream/micro-batch processing.

✚ I don’t think I’ve seen the term micro-batch processing used before. Any ideas why it is used as an alternative to the well-established term stream processing?
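
One possible reading of the distinction, sketched in plain Python: per-event processing hands each event to the handler on its own, while micro-batching groups a few consecutive events and hands them over together. This is only an illustration of the two terms, not Storm code and not necessarily what the Yahoo! post means; the names `per_event` and `micro_batches` are made up.

```python
from itertools import islice


def per_event(events):
    """Stream style: yield every event on its own, as soon as it arrives."""
    for event in events:
        yield [event]


def micro_batches(events, batch_size):
    """Micro-batch style: yield small lists of consecutive events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


events = list(range(1, 8))  # stand-in for an unbounded event stream

singles = list(per_event(events))
batches = list(micro_batches(events, batch_size=3))

print(len(singles), "handler calls per event vs", len(batches), "per micro-batch")
```

Under this reading, the trade-off is latency against per-call overhead: a micro-batch waits for a few events to accumulate but amortizes the cost of each handler call.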

Original title and link: Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo! (NoSQL database©myNoSQL)

via: http://developer.yahoo.com/blogs/ydn/storm-hadoop-convergence-big-data-low-latency-processing-54503.html


Schema on Writes vs Schema on Reads - Apache Hadoop and Data Agility

Ofer Mendelevitch for Hortonworks blog:

Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.

Most often when speaking about Hadoop, people refer to costs (commodity servers), parallelism and scalability. I do not remember how many times I’ve written that the main difference between Hadoop and traditional data warehouses is in the agility it offers.

One Hadoop tagline could be: “Collect data today. Analyse it when and how you want.”
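
A minimal sketch of what “schema on read” can look like, in plain Python rather than any Hadoop API. Raw records are stored as they arrive, and each application applies its own schema when it reads them. The sample records, the field names, and the helper `read_with_schema` are all hypothetical.

```python
import json

# Write side: raw events are stored as-is; nothing validates them.
raw_lines = [
    '{"user": "alice", "ts": "2013-05-14T10:00:00", "bytes": "512"}',
    '{"user": "bob", "ts": "2013-05-14T10:05:00"}',
    '{"user": "carol", "ts": "2013-05-14T10:07:00", "bytes": "2048", "agent": "curl"}',
]


def read_with_schema(lines, schema):
    """Apply a schema while reading.

    schema maps field name -> (converter, default). Fields missing from a
    record get the default; fields not in the schema are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


# Two applications read the same raw data with different schemas.
traffic_schema = {"user": (str, None), "bytes": (int, 0)}
agent_schema = {"user": (str, None), "agent": (str, "unknown")}

traffic = list(read_with_schema(raw_lines, traffic_schema))
agents = list(read_with_schema(raw_lines, agent_schema))

print(traffic)
print(agents)
```

The second schema could be written long after the data was collected, which is the agility argument: the questions asked of the data do not have to be known when it is stored.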

Original title and link: Schema on Writes vs Schema on Reads - Apache Hadoop and Data Agility (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/hadoop-data-agility/


Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler

Eric Baldeschwieler’s keynote from Hadoop Summit has been published on YouTube. It’s mainly about the goals and effort behind Hadoop 2.0 and the new tools in the Hadoop ecosystem meant to simplify different aspects of a Hadoop deployment (HCatalog, Ambari, Tez, Stinger Initiative).

✚ Datanami has published a summary of the keynote.

Original title and link: Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler (NoSQL database©myNoSQL)