


yahoo: All content tagged as yahoo in NoSQL databases and polyglot persistence

Hortonworks’ Hadoop secret weapon is... Yahoo

Derrick Harris:

Hortonworks was working right alongside Yahoo all through that process. They’ve also worked together on things like rolling upgrades so Hadoop users can upgrade software without taking down a cluster.

  1. who didn’t know about Hortonworks and Yahoo’s collaboration?
  2. what company and product management team would choose not to work with one of the largest users of the technology it is building?

    This is the perfect example of testing and validating new ideas while learning about the pain your customers face in real life. Basically, by-the-book product/market fit.

Original title and link: Hortonworks’ Hadoop secret weapon is… Yahoo (NoSQL database©myNoSQL)


Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo!

Andy Feng wrote a post on the YDN blog about the data processing architecture Yahoo! uses to deliver personalized content. It analyzes billions of events for 700 million users and 2.2 billion content pieces every day, using a combination of batch processing (Hadoop) and stream processing (Storm):

Enabling low-latency big-data processing is one of the primary design goals of Yahoo!’s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of design patterns applicable to many Yahoo! use cases. In Q1 2013, we added Storm as a new service to our big-data platform. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for stream/micro-batch processing.

✚ I don’t think I’ve seen the term micro-batch processing used before. Any idea why it is used as an alternative to the well-established term stream processing?
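
One common reading of the term, offered here as an assumption and not as Yahoo!'s definition: a micro-batch system groups incoming events into small, bounded batches and processes each batch as a unit, while a pure stream processor handles one event at a time. A minimal sketch in plain Python follows. It does not use any Storm API, and `micro_batches` and `count_clicks` are hypothetical names.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an event stream into lists of at most batch_size events."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_clicks(batch: List[str]) -> int:
    """Per-batch computation: runs once per micro-batch, not once per event."""
    return sum(1 for event in batch if event == "click")


events = ["click", "view", "click", "view", "click"]
per_batch = [count_clicks(b) for b in micro_batches(events, batch_size=2)]
```

With a batch size of 1 this degenerates into event-at-a-time processing, which may be why the two terms are sometimes used side by side.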

Original title and link: Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo! (NoSQL database©myNoSQL)


Which Big Data Company Has the World's Biggest Hadoop Cluster?

Jimmy Wong:

Which companies use Hadoop for analyzing big data? How big are their clusters? I thought it would be fun to compare companies by the size of their Hadoop installations. The size would indicate the company’s investment in Hadoop, and subsequently their appetite to buy big data products and services from vendors, as well as their hiring needs to support their analytics infrastructure.


Unfortunately, the available data is very sparse and very old.

Original title and link: Which Big Data Company Has the World’s Biggest Hadoop Cluster? (NoSQL database©myNoSQL)


Groundhog: Hadoop Automated Testing at Yahoo!

Yahoo!, and probably every other large Hadoop installation, has to deal with upgrading Hadoop clusters. Scheduled rolling upgrades are the strategy applied everywhere, but depending on the size of the cluster they can take far too long. Yahoo! has developed an interesting internal tool that can help with Hadoop upgrades:

Groundhog is an automated testing tool to help ensure backwards compatibility (in terms of API, functionality, and performance) between releases of Hadoop before deploying a new release onto clusters with a high QoS. Groundhog does this by providing an automated mechanism to capture user jobs (currently limited to pig scripts) as they are run on a cluster and then replay them on a different cluster with a different version of Hadoop to verify that they still produce the same results.


This is the sort of tool I have always wanted for most of the applications I’ve developed: a system able to capture all, or a percentage, of the real traffic and then replay it, at every layer of the application.
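
The capture-and-replay idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Groundhog's implementation: a "job" is modeled as a function over a list of records, and `Recorder`, `replay`, and the three release functions are hypothetical names.

```python
from typing import Callable, List, Tuple

Runner = Callable[[List[int]], List[int]]
Captured = Tuple[str, List[int], List[int]]  # (job name, inputs, output)


class Recorder:
    """Runs jobs on the current release and records inputs and outputs."""

    def __init__(self, runner: Runner) -> None:
        self.runner = runner
        self.captured: List[Captured] = []

    def run(self, name: str, inputs: List[int]) -> List[int]:
        output = self.runner(inputs)
        self.captured.append((name, list(inputs), output))
        return output


def replay(captured: List[Captured], candidate: Runner) -> List[str]:
    """Replay captured jobs on a candidate release; return mismatching job names."""
    return [name for name, inputs, expected in captured
            if candidate(list(inputs)) != expected]


def old_release(records: List[int]) -> List[int]:
    return sorted(records)


def compatible_release(records: List[int]) -> List[int]:
    out = list(records)
    out.sort()  # different code path, same result
    return out


def broken_release(records: List[int]) -> List[int]:
    return sorted(records, reverse=True)  # behavior change


recorder = Recorder(old_release)
recorder.run("daily_rollup", [3, 1, 2])
recorder.run("single_record", [7])

ok = replay(recorder.captured, compatible_release)
regressions = replay(recorder.captured, broken_release)
```

Only `daily_rollup` is flagged for the broken release, because a single-record job produces the same output either way. This hints at why replaying a broad sample of real jobs matters more than a handful of synthetic ones.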

Original title and link: Groundhog: Hadoop Automated Testing at Yahoo! (NoSQL database©myNoSQL)


Yahoo Patent Letter to Facebook Referring to Memcached and Other Open Source Technologies

Sarah Lacy:

The technologies in question include things like memcached which was created in 2003 by LiveJournal and has been used longer than Facebook has been alive.[…]

Other examples include Open Compute, an open hardware project started by Facebook that focuses on low-cost, energy efficient server and data center hardware; Tornado a Python-based web server used for building real-time Web services; and HPHP, a source code transformer that turns PHP into C++.

I have no other details about the patent letter Yahoo sent Facebook, but I seriously doubt it targets these technologies individually. Most probably it refers to some combination of them, one that Facebook has mentioned as part of its own IP.

Original title and link: Yahoo Patent Letter to Facebook Referring to Memcached and Other Open Source Technologies (NoSQL database©myNoSQL)


Yahoo! Sherpa: Status and Advances

Information about Yahoo! PNUTS/Sherpa is rare. Apart from the original PNUTS architecture paper (PDF) and the Sherpa: Cloud Computing of the Third Kind slides (PDF), it’s difficult to find anything else. But in September, the Yahoo! Developer blog posted an update about Sherpa.

Sherpa Status

  • 500+ Sherpa tables
  • 50,000+ Sherpa tablets (shards) in operation
  • tablets can be copied, moved, or split dynamically
  • multi-data center
  • multi-tenant: it must support different ranges of read/write ratios
  • 75,000 requests per second
  • supports heterogeneous servers
    • some are SSD exclusively
  • Sherpa users have access to monitoring tools to examine their app latency and throughput SLA. This implies each application could negotiate different SLAs with the infrastructure team.
  • Sherpa default storage engine is MySQL/InnoDB
    • storage access has been abstracted
    • Yahoo! has tested BDB, BDB-Java, and a Log-Structured Merge (LSM) Tree backend developed by Yahoo! Labs
  • Sherpa relies on a reliable messaging system
    • can guarantee reliable in-colo and cross-colo transactions

Sherpa Advances

Selective Record Replication

  • previous versions supported table-level replication
  • the new version supports record-level replication
  • designed for efficiency (minimizing transfer and storage costs) and legality (complying with copyright regulations and other legal constraints)
  • replication locations are declarative/static
  • Sherpa maintains a ‘stub’ of the record in locations that do not have a full copy
    • the stub is updated only when a record is created, deleted, or when replication rules change
    • the stub is used for routing requests
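
The stub mechanism described above can be sketched as follows. This is a toy model under stated assumptions, not Sherpa's actual design: the `Region` and `Cluster` classes are hypothetical, regions are in-process objects, and a stub is simply the list of regions that hold a full copy.

```python
from typing import Dict, List, Set, Tuple


class Region:
    def __init__(self, name: str) -> None:
        self.name = name
        self.records: Dict[str, dict] = {}     # full copies
        self.stubs: Dict[str, List[str]] = {}  # key -> regions holding a full copy


class Cluster:
    def __init__(self, region_names: List[str]) -> None:
        self.regions = {name: Region(name) for name in region_names}

    def create(self, key: str, value: dict, replicas: Set[str]) -> None:
        """Store full copies in `replicas`; every other region gets a stub."""
        holders = sorted(replicas)
        for name, region in self.regions.items():
            if name in replicas:
                region.records[key] = dict(value)
            else:
                region.stubs[key] = holders

    def update(self, key: str, value: dict) -> None:
        """Ordinary updates touch full copies only; stubs stay unchanged."""
        for region in self.regions.values():
            if key in region.records:
                region.records[key] = dict(value)

    def get(self, key: str, local: str) -> Tuple[dict, str]:
        """Serve locally if possible, otherwise route via the stub."""
        region = self.regions[local]
        if key in region.records:
            return region.records[key], local
        holder = region.stubs[key][0]
        return self.regions[holder].records[key], holder


cluster = Cluster(["us", "eu", "asia"])
cluster.create("user:1", {"name": "Ann"}, replicas={"us", "eu"})
stub_before = list(cluster.regions["asia"].stubs["user:1"])
cluster.update("user:1", {"name": "Ann B."})
value, served_by = cluster.get("user:1", local="asia")
```

Here the read issued in `asia` is routed to `eu` without the record's data ever being stored in `asia`, and the update did not touch the stub.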


Backup and Restore

  • Support for full table backups has been added
  • Point-in-time recovery is planned
  • Also planned: full, cross-colo, and automatic table restoration

Task Manager using Sherpa Tables for Task State

  • Sherpa has added a general workflow manager to execute long-running tasks
  • It is used for backup and restore operations

The complete post can be read here.

Considering Yahoo! has always been a big proponent of open source projects, it is a pity that we don’t get to hear about PNUTS/Sherpa more often and in more detail.

Original title and link: Yahoo! Sherpa: Status and Advances (NoSQL database©myNoSQL)

Hadoop Market: Hortonworks’ Positioning

Eric Baldeschweiler in a recent briefing (transcript by Bert Latamore over at Wikibon):

We’re really committed to building out Apache Hadoop and doing it in the Open Source community, so what really differentiates us is being really committed, besides shipping 100% pure Apache Hadoop code, which nobody else does, to taking a very partnering ecosystem-centric approach.[…] We’re the only ones committed to shipping Apache Hadoop code. We’ve been the drivers behind every major release of Apache Hadoop since its inception. Other companies are packaging and distributing Hadoop, but when they do that they add lots of their own custom stuff, both as patches to the Apache Hadoop distribution and also as independent products. A lot of that work is going into Apache, and since we committed to the Open Source model we’ve seen a lot more third party code going into Apache, which is obviously a win for the community. But to date no other company is actually taking releases from Apache & supporting them. They create their own versions that are slightly different from what comes from Apache, and try to build a business around that.

The political message from both Cloudera and Hortonworks is “we compete as businesses, but collaborate for the good of Hadoop”. But behind the curtains, they both prepare the big guns.

Original title and link: Hadoop Market: Hortonworks’ Positioning (NoSQL database©myNoSQL)

Memcached and Sherpa for Yahoo! News Activity Data Service

Mixer, Yahoo!’s recently announced data service for news activities, uses Memcached and Sherpa as its data backend, plus a combination of asynchronous libraries and task execution tools:


The data processing model and the clear separation between the read and write data solutions are not only compelling, but essential for maintaining the SLA (max. 250 ms per response):

Memcache maintains two types of materialized views: 1) Consumer-pivoted, and 2) Producer-pivoted. Consumer-pivoted views (e.g. user’s friends’ latest read activity) are refreshed at query time by refresh tasks. Producer-pivoted views (e.g. user’s latest read activity) are refreshed at update time (i.e. when “read” event is posted). And producer-pivoted views are used to refresh consumer-pivoted views.
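
The two view types can be illustrated with a small in-memory sketch. It is an assumption-laden toy, not Mixer's code: plain dictionaries stand in for Memcached, and `ActivityCache`, `post_read`, and `friends_latest` are hypothetical names.

```python
import heapq
from itertools import islice
from typing import Dict, List, Tuple

Activity = Tuple[int, str, str]  # (timestamp, producer, article)


class ActivityCache:
    def __init__(self, friends: Dict[str, List[str]], limit: int = 10) -> None:
        self.friends = friends
        self.limit = limit
        self.producer_views: Dict[str, List[Activity]] = {}
        self.consumer_views: Dict[str, List[Activity]] = {}

    def post_read(self, user: str, timestamp: int, article: str) -> None:
        """Producer-pivoted view: refreshed at update time, newest first."""
        view = self.producer_views.setdefault(user, [])
        view.append((timestamp, user, article))
        view.sort(reverse=True)
        del view[self.limit:]

    def friends_latest(self, user: str) -> List[Activity]:
        """Consumer-pivoted view: refreshed at query time from producer views."""
        sources = [self.producer_views.get(friend, [])
                   for friend in self.friends.get(user, [])]
        merged = heapq.merge(*sources, reverse=True)  # inputs are newest first
        view = list(islice(merged, self.limit))
        self.consumer_views[user] = view
        return view


cache = ActivityCache({"carol": ["alice", "bob"]}, limit=2)
cache.post_read("alice", 1, "a1")
cache.post_read("bob", 2, "b1")
cache.post_read("alice", 3, "a2")
feed = cache.friends_latest("carol")
```

Writes stay cheap because they touch only the producer's own view. The fan-in across friends is paid at read time, and only for users who actually ask for their feed.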

Sherpa is Yahoo!’s cloud-based NoSql data store that provides low-latency reads and writes of key-value records and short range scans. Efficient range scans are particular important for the Mixer use cases. The “read” event is stored in the Updates table. The Updates table is a Sherpa Distributed Ordered Table that is ordered by “user,timestamp desc”. This provides efficient scans through a user’s latest read activity. A reference to the “read” record is stored in the UpdatesIndex table to support efficient point lookups. UpdatesIndex is a Sherpa Distributed Hash Table
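
A rough sketch of the two access paths follows, assuming a sorted in-memory list in place of a Distributed Ordered Table and a dictionary in place of a Distributed Hash Table. `UpdatesStore` and its methods are hypothetical and bear no relation to Sherpa's real API.

```python
import bisect
from typing import Dict, List, Tuple

Key = Tuple[str, int, str]  # (user, -timestamp, record_id)


class UpdatesStore:
    def __init__(self) -> None:
        self._keys: List[Key] = []       # ordered by user, then timestamp desc
        self._rows: Dict[Key, str] = {}  # stands in for the Updates table
        self._index: Dict[str, Key] = {}  # stands in for UpdatesIndex

    def put(self, record_id: str, user: str, timestamp: int, payload: str) -> None:
        # Negating the timestamp makes ascending key order equal "timestamp desc".
        key = (user, -timestamp, record_id)
        bisect.insort(self._keys, key)
        self._rows[key] = payload
        self._index[record_id] = key

    def latest(self, user: str, limit: int) -> List[str]:
        """Short range scan: start at the user's first key and read forward."""
        start = bisect.bisect_left(self._keys, (user,))
        out = []
        for key in self._keys[start:start + limit]:
            if key[0] != user:
                break
            out.append(self._rows[key])
        return out

    def lookup(self, record_id: str) -> str:
        """Point lookup through the hash index."""
        return self._rows[self._index[record_id]]


store = UpdatesStore()
store.put("r1", "alice", 100, "read A")
store.put("r2", "alice", 200, "read B")
store.put("r3", "bob", 150, "read C")
alice_latest = store.latest("alice", limit=5)
by_id = store.lookup("r3")
```

Because a user's rows are adjacent and already sorted newest first, fetching the latest activity is one seek plus a short sequential read. The separate index covers the case where only the record id is known.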

Original title and link: Memcached and Sherpa for Yahoo! News Activity Data Service (NoSQL database©myNoSQL)


Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others

Shawn Rogers has a short but compelling list of Big Data deployments in his article Big Data is Scaling BI and Analytics. This list also shows that even if there are some common components like Hadoop, there are no blueprints yet for dealing with Big Data.

  • Facebook: Hadoop analytic data warehouse, using HDFS to store more than 30 petabytes of data. Their Big Data stack is based only on open source solutions.

  • Quantcast: 3,000 core, 3,500 terabyte Hadoop deployment that processes more than a petabyte of raw data each day

  • University of Nebraska-Lincoln: Hadoop cluster holding 1.6 petabytes of physics data

  • Yahoo!: 100,000 CPUs in 40,000 computers, all running Hadoop. Also running a 12 terabyte MOLAP cube based on Tableau Software

  • eBay: has 3 separate analytics environments:

    • 6PB data warehouse for structured data and SQL access
    • 40PB deep analytics (Teradata)
    • 20PB Hadoop system to support advanced analytic workload on unstructured data

Original title and link: Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others (NoSQL database©myNoSQL)

Aster Data SQL-MapReduce Technology Patent

From a Teradata PR announcement:

SQL-MapReduce® is a framework which enables fast, investigative analysis of complex information by data scientists and business analysts. It enables procedural expressions in software languages (such as Java, C#, Python, C++, and R) to be parallelized across a group of linked computers (compute cluster) and then activated for use (invoked) with standard SQL.  

The closest open source solution I can think of is Pig, created and open sourced by Yahoo! (PDF).

Original title and link: Aster Data SQL-MapReduce Technology Patent (NoSQL database©myNoSQL)

Yahoo Launches Hadoop Spinoff

GigaOm breaks the news of the Yahoo! Hadoop engineering spinoff, HortonWorks:

By incorporating next-generation features and capabilities, HortonWorks hopes to make Hadoop easier to consume and better suited for running production workloads. Its products, which likely will include higher-level management tools on top of the core MapReduce and file system layers, will be open source and HortonWorks will try to maintain a close working relationship with Apache. The goal is to make HortonWorks the go-to vendor for a production-ready Hadoop distribution and support, but also to advance Yahoo’s repeated mission of making the official Apache Hadoop distribution the place to go for core software. Earlier this year, Yahoo discontinued its own Hadoop distribution, recommitting all that code and all its development efforts to Apache.

Judging by all the cool projects the Yahoo! engineering team has created, I’ve already said this makes a lot of sense to me.

Original title and link: Yahoo Launches Hadoop Spinoff (NoSQL database©myNoSQL)


Yahoo Could Spinoff Hadoop Software Unit


Yahoo is now weighing spinning off its Hadoop engineering unit into a new firm that would continue to develop the free software and charge companies for its expertise in using it, according to people familiar with the matter.

Google news search[1]

Makes a lot of sense to me.


  1. Link circumventing the WSJ paywall  

Original title and link: Yahoo Could Spinoff Hadoop Software Unit (NoSQL databases © myNoSQL)