


yahoo: All content tagged as yahoo in NoSQL databases and polyglot persistence

Pig: Making Hadoop Easy

A short slide deck on the advantages of using Pig instead of MapReduce:

Pig and Hive at Yahoo gives more details about the data processing flow and the places where Pig and Hive fit in.

Original title and link for this post: Pig: Making Hadoop Easy (published on the NoSQL blog: myNoSQL)

High Availability MySQL at Yahoo!

Jay Janssen talks about Yahoo!’s approach:

Now, what makes our solution different? Not much. The layout is this: two master databases, one in each of our two colocations. These masters replicate from each other, but we would never have more than two masters in this replication loop for the same reason we don’t use token ring networks today: one master outage would break replication in a chain of size > 2. Our slaves replicate from one of the two masters, often half of the slaves in a given colocation replicate from one of the masters, and half from the other master.
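The ring argument in the quote can be checked with a small model. This is a minimal sketch, not Yahoo!'s tooling: the host names (`m1`, `s1`, ...) and the helpers `ring`, `reachable_from`, `replication_intact`, and `assign_slaves` are hypothetical, and replication is reduced to "each master pulls changes from exactly one other master".

```python
from typing import Dict, List, Set


def ring(masters: List[str]) -> Dict[str, str]:
    """Circular replication: each master replicates from the previous one."""
    return {m: masters[i - 1] for i, m in enumerate(masters)}


def reachable_from(writer: str, source_of: Dict[str, str], down: Set[str]) -> Set[str]:
    """Return the live masters that still receive a write made on `writer`."""
    live = {m for m in source_of if m not in down}
    received = {writer}
    changed = True
    while changed:
        changed = False
        for m in live:
            src = source_of[m]
            # A master gets the write only if its (live) source already has it.
            if m not in received and src in received and src not in down:
                received.add(m)
                changed = True
    return received


def replication_intact(masters: List[str], down: Set[str]) -> bool:
    """True if every write on a live master still reaches all live masters."""
    source_of = ring(masters)
    live = [m for m in masters if m not in down]
    return all(reachable_from(w, source_of, down) == set(live) for w in live)


def assign_slaves(slaves: List[str], masters: List[str]) -> Dict[str, str]:
    """Split slaves round-robin so each master feeds about half of them."""
    return {s: masters[i % len(masters)] for i, s in enumerate(slaves)}


if __name__ == "__main__":
    # Two masters: losing one leaves a single consistent survivor.
    print(replication_intact(["m1", "m2"], down={"m2"}))
    # Three masters: losing m2 means writes on m1 never reach m3.
    print(replication_intact(["m1", "m2", "m3"], down={"m2"}))
    print(assign_slaves(["s1", "s2", "s3", "s4"], ["m1", "m2"]))
```

In this model, a two-master loop survives a single master outage, while a three-master ring does not: the surviving masters stop seeing each other's writes in one direction.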

But there is much more in the original article (e.g. allowing writes to a single master, dealing with failure, etc.). There are also three slide decks on infrastructure resiliency, high availability/business continuity planning, and application resiliency.

Infrastructure resiliency at Yahoo

High availability/Business continuity planning at Yahoo

Application resiliency at Yahoo

It may not sound as exciting as what Google or Facebook are doing, but it is probably something many could learn from.



Pig and Hive at Yahoo!

A fantastic post on the Yahoo! Hadoop blog presenting a series of scenarios where using Pig and Hive makes things much better:

The widespread use of Pig at Yahoo! has enabled the migration of our data factory processing to Hadoop. With the adoption of Hive, we will be able to move much of our data warehousing to Hadoop as well. Having the data factory and the data warehouse on the same system will lower data-loading time into the warehouse — as soon as the factory is finished, it is available in the warehouse. It will also enable us to share — across both the factory and the warehouse — metadata, monitoring, and management tools; support and operations teams; and hardware. So we are excited to add Hive to our toolkit, and look forward to using both these tools together as we lean on Hadoop to do more and more of our data processing.

The use cases mentioned in the post:

  • data preparation and presentation:

    Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.

  • data factories: pipelines (Pig + Oozie), iterative processing (Pig), research (Pig)
  • data warehouse: business-intelligence analysis and ad-hoc queries

    In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field. The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop.
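The factory/warehouse split described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the sample records, the `events` table, and the function names are hypothetical, plain Python stands in for a Pig pipeline, and an in-memory SQLite database stands in for Hive's SQL interface. Neither Pig nor Hive is actually used.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple

# Hypothetical raw feed; one line is deliberately malformed.
RAW_EVENTS = [
    "2010-03-01,search,us",
    "2010-03-01,Mail,US ",
    "malformed line",
    "2010-03-02,search,de",
]


def factory(raw: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Data factory stage: parse, drop malformed rows, normalize fields."""
    for line in raw:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip rows that do not have exactly three fields
        day, product, country = parts
        yield day, product.lower(), country.lower()


def load(conn: sqlite3.Connection, rows: Iterable[Tuple[str, str, str]]) -> None:
    """Publish factory output into the store that the warehouse queries."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (day TEXT, product TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()


def events_per_product(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Data warehouse stage: an ad-hoc SQL aggregate over the prepared data."""
    cur = conn.execute(
        "SELECT product, COUNT(*) FROM events GROUP BY product ORDER BY product"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(conn, factory(RAW_EVENTS))
    print(events_per_product(conn))
```

Because both stages share one store in this sketch, the cleaned rows are queryable as soon as the factory step commits, which mirrors the reduced data-loading time the post attributes to keeping the factory and the warehouse on the same system.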

Yahoo! gets way too little credit for its work on big data and its contributions to open source.



Howl: Unifying Metadata Layer for Hive and Pig

Yet another contribution from Yahoo!:

Common metadata layer for Hadoop’s Map Reduce, Pig, and Hive



Hadoop User Group March Meeting Recap

The meeting hosted lots of discussions and 3 presentations:

Owen O’Malley: Upcoming Hadoop Security release

Owen O’Malley from the Yahoo! Hadoop Team provided an overview of the upcoming Hadoop Security release. Owen described the features and capabilities included as well as operational benefits. Yahoo! is very excited about adding security capabilities to Hadoop and views this as a major milestone in continuing to make Hadoop an enterprise-grade platform.

Tyson Condie: Hadoop Online

Tyson Condie, a Ph.D. student at the University of California, Berkeley, presented the research behind the Hadoop Online effort led by Prof. Joseph M. Hellerstein. Tyson described a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization. Tyson included examples from HOP, the Hadoop Online Prototype project.
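The difference between batch and pipelined operators can be sketched with Python generators. This is only an illustration of the idea, not the Hadoop Online Prototype's API: the word-count job, the function names, and the `snapshot_every` parameter are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map operator: emit (word, 1) pairs one at a time."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def batch_count(lines: List[str]) -> Counter:
    """Batch style: the reducer starts only after all map output exists."""
    pairs = list(map_words(lines))  # full materialization between stages
    totals: Counter = Counter()
    for word, n in pairs:
        totals[word] += n
    return totals


def pipelined_count(lines: Iterable[str], snapshot_every: int) -> Iterator[Counter]:
    """Pipelined style: reduce pairs as they arrive, yielding running snapshots."""
    totals: Counter = Counter()
    seen = 0
    for word, n in map_words(lines):
        totals[word] += n
        seen += 1
        if seen % snapshot_every == 0:
            yield Counter(totals)  # early, partial answer
    yield Counter(totals)  # final answer, equal to the batch result


if __name__ == "__main__":
    lines = ["pig hive pig", "hive hadoop"]
    for snapshot in pipelined_count(lines, snapshot_every=2):
        print(dict(snapshot))
    print(dict(batch_count(lines)))
```

The pipelined version never holds the full map output in memory and can report partial results before the input is exhausted, while its last snapshot matches the batch total.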

Bradford Cross: FlightCaster

Bradford Cross from FlightCaster provided an exciting overview of the FlightCaster flight-delay prediction service and some cool insights into the airline industry. Bradford described how they built a scalable machine learning and data analysis platform using the Clojure dynamic programming language, wrapping Cascading and Hadoop. Bradford demonstrated how the use of Hadoop makes building scalable systems much simpler.