yahoo: All content tagged as yahoo in NoSQL databases and polyglot persistence
Monday, 12 December 2011
Yahoo! Sherpa: Status and Advances
Information about Yahoo! PNUTS/Sherpa is rare. Apart from the original PNUTS architecture paper (PDF) and the Sherpa: Cloud Computing of the Third Kind slides (PDF), it’s difficult to find anything else. But in September, the Yahoo! Developer blog posted an update about Sherpa.
Sherpa Status
- 500+ Sherpa tables
- 50,000+ Sherpa tablets (shards) in operation
- tablets can be copied, moved, or split dynamically
- multi-data center
- multi-tenant: it must support different ranges of read/write ratios
- 75,000 requests per second
- supports heterogeneous servers
- some are SSD exclusively
- Sherpa users have access to monitoring tools to examine their app latency and throughput SLA. This implies each application could negotiate different SLAs with the infrastructure team.
- Sherpa default storage engine is MySQL/InnoDB
- storage access has been abstracted
- Yahoo! has tested BDB, BDB-Java, and a Log-Structured Merge (LSM) tree backend developed by Yahoo! Labs
- Sherpa relies on a reliable messaging system
- can guarantee reliable in-colo and cross-colo transactions
Sherpa Advances
Selective Record Replication
- previous versions supported Table-level replication
- the new version supports Record-level replication
- designed for efficiency (minimize costs of transfer and storage) and legality (ensure copyright regulations, other legal concerns)
- replication locations are declarative/static
- Sherpa maintains a ‘stub’ of the record in locations that do not have a full copy
- the stub is updated only when a record is created, deleted, or when replication rules change
- the stub is used for routing requests
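The stub mechanism described above can be illustrated with a small in-memory sketch. This is not Sherpa code: the `Region` and `Cluster` classes, the region names, and the rule of forwarding to the alphabetically first replica are all assumptions made for the example. It only shows the idea that a region without a full copy keeps a stub listing where the full copies live, that ordinary updates never touch the stub, and that reads are routed through it.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """One data center (colo). Hypothetical model, not Sherpa's."""
    name: str
    full: dict = field(default_factory=dict)   # key -> record value
    stubs: dict = field(default_factory=dict)  # key -> regions holding a full copy


class Cluster:
    def __init__(self, region_names):
        self.regions = {n: Region(n) for n in region_names}

    def create(self, key, value, replicas):
        """Store full copies in `replicas`; every other region gets a stub."""
        for name, region in self.regions.items():
            if name in replicas:
                region.full[key] = value
            else:
                region.stubs[key] = set(replicas)

    def update(self, key, value):
        """Ordinary updates touch only full copies; stubs stay unchanged."""
        for region in self.regions.values():
            if key in region.full:
                region.full[key] = value

    def read(self, key, at):
        """Serve locally if possible, otherwise follow the stub."""
        local = self.regions[at]
        if key in local.full:
            return local.full[key], at
        # Assumption: forward to the alphabetically first replica.
        target = sorted(local.stubs[key])[0]
        return self.regions[target].full[key], target
```

With three regions and a record replicated to only two of them, a read issued in the third region is forwarded to one of the replicas, while the stub itself is written once at creation time.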
Backup/Restore
- Support for full table backups has been added
- Point-in-time recovery planned
- Also planned: full, cross-colo, and automatic table restoration
Task Manager using Sherpa Tables for Task State
- Sherpa has added a general workflow manager to execute long-running tasks
- It is used for backup and restore operations
The complete post can be read here.
Considering Yahoo! has always been a big proponent of open source projects, it is a pity that we don’t get to hear more often, and in more detail, about PNUTS/Sherpa.
Original title and link: Yahoo! Sherpa: Status and Advances (©myNoSQL)
Monday, 5 December 2011
Hadoop Market: Hortonworks’ Positioning
Eric Baldeschweiler in a recent briefing (transcript by Bert Latamore over at Wikibon):
We’re really committed to building out Apache Hadoop and doing it in the Open Source community, so what really differentiates us is being really committed, besides shipping 100% pure Apache Hadoop code, which nobody else does, to taking a very partnering ecosystem-centric approach.[…] We’re the only ones committed to shipping Apache Hadoop code. We’ve been the drivers behind every major release of Apache Hadoop since its inception. Other companies are packaging and distributing Hadoop, but when they do that they add lots of their own custom stuff, both as patches to the Apache Hadoop distribution and also as independent products. A lot of that work is going into Apache, and since we committed to the Open Source model we’ve seen a lot more third party code going into Apache, which is obviously a win for the community. But to date no other company is actually taking releases from Apache & supporting them. They create their own versions that are slightly different from what comes from Apache, and try to build a business around that.
The political message from both Cloudera and Hortonworks is “we compete as businesses, but collaborate for the good of Hadoop”. But behind the curtains, they are both preparing the big guns.
Original title and link: Hadoop Market: Hortonworks’ Positioning (©myNoSQL)
Friday, 23 September 2011
Memcached and Sherpa for Yahoo! News Activity Data Service
Mixer, Yahoo!’s recently announced data service for news activities, uses Memcached and Sherpa for its data backend, plus a combination of asynchronous libraries and task execution tools.

The data processing model and the clear separation between read and write data solutions is not only compelling, but essential for maintaining the SLA (max. 250ms/response):
Memcache maintains two types of materialized views: 1) Consumer-pivoted, and 2) Producer-pivoted. Consumer-pivoted views (e.g. user’s friends’ latest read activity) are refreshed at query time by refresh tasks. Producer-pivoted views (e.g. user’s latest read activity) are refreshed at update time (i.e. when “read” event is posted). And producer-pivoted views are used to refresh consumer-pivoted views.
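The two kinds of views in the quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain dictionaries stand in for Memcached, the `friends` graph and the function names are hypothetical, and the refresh runs inline rather than as an asynchronous task. The point is the direction of the data flow: producer-pivoted views are written when an event is posted, and consumer-pivoted views are rebuilt from them when someone asks.

```python
from collections import defaultdict

# Plain dicts stand in for the cache; all names here are hypothetical.
producer_views = defaultdict(list)     # user -> that user's own read events
consumer_views = {}                    # user -> merged events of the user's friends
friends = {"alice": ["bob", "carol"]}  # made-up social graph


def post_read_event(user, article, ts):
    """Update time: refresh the producer-pivoted view of the posting user."""
    producer_views[user].append((ts, user, article))
    producer_views[user].sort(reverse=True)  # newest first


def refresh_consumer_view(user, limit=10):
    """Query time: rebuild the consumer-pivoted view from producer-pivoted views."""
    merged = [e for f in friends.get(user, []) for e in producer_views[f]]
    merged.sort(reverse=True)
    consumer_views[user] = merged[:limit]
    return consumer_views[user]
```

Posting three events for `bob` and `carol` and then calling `refresh_consumer_view("alice")` returns their events merged newest first, without any write having touched Alice's view at update time.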
Sherpa is Yahoo!’s cloud-based NoSql data store that provides low-latency reads and writes of key-value records and short range scans. Efficient range scans are particular important for the Mixer use cases. The “read” event is stored in the Updates table. The Updates table is a Sherpa Distributed Ordered Table that is ordered by “user,timestamp desc”. This provides efficient scans through a user’s latest read activity. A reference to the “read” record is stored in the UpdatesIndex table to support efficient point lookups. UpdatesIndex is a Sherpa Distributed Hash Table
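The pairing of an ordered table with a hash index can be mimicked in memory to show why both are needed. The sketch below is an assumption-laden stand-in, not the Sherpa API: the `UpdatesStore` class and its methods are hypothetical, a sorted Python list plays the role of the Distributed Ordered Table, and a dict plays the role of the Distributed Hash Table. Negating the timestamp in the sort key yields the "user, timestamp desc" ordering.

```python
import bisect


class UpdatesStore:
    """In-memory stand-in for the Updates and UpdatesIndex tables (hypothetical)."""

    def __init__(self):
        self.keys = []           # sorted (user, -ts, event_id): the ordered table
        self.rows = {}           # ordered-table key -> payload
        self.updates_index = {}  # event_id -> ordered-table key: the hash table

    def put(self, event_id, user, ts, payload):
        key = (user, -ts, event_id)    # -ts gives "timestamp desc" per user
        bisect.insort(self.keys, key)
        self.rows[key] = payload
        self.updates_index[event_id] = key

    def latest(self, user, limit=10):
        """Short range scan over one user's most recent events."""
        start = bisect.bisect_left(self.keys, (user,))
        out = []
        for key in self.keys[start:start + limit]:
            if key[0] != user:
                break
            out.append(self.rows[key])
        return out

    def get(self, event_id):
        """Point lookup through the index, with no scan involved."""
        return self.rows[self.updates_index[event_id]]
```

`latest("bob", limit=5)` touches only a contiguous run of keys, while `get("e2")` goes through the index and never scans.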
Original title and link: Memcached and Sherpa for Yahoo! News Activity Data Service (©myNoSQL)
Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others
Shawn Rogers has a short but compelling list of Big Data deployments in his article Big Data is Scaling BI and Analytics. This list also shows that even if there are some common components like Hadoop, there are no blueprints yet for dealing with Big Data.
- Facebook: Hadoop analytic data warehouse, using HDFS to store more than 30 petabytes of data. Their Big Data stack is based only on open source solutions.
- Quantcast: 3,000 core, 3,500 terabyte Hadoop deployment that processes more than a petabyte of raw data each day
- University of Nebraska-Lincoln: 1.6 petabytes of physics data Hadoop cluster
- Yahoo!: 100,000 CPUs in 40,000 computers, all running Hadoop. Also running a 12 terabyte MOLAP cube based on Tableau Software
- eBay: has 3 separate analytics environments:
  - 6PB data warehouse for structured data and SQL access
  - 40PB deep analytics (Teradata)
  - 20PB Hadoop system to support advanced analytic workload on unstructured data
Original title and link: Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others (©myNoSQL)
Monday, 4 July 2011
Aster Data SQL-MapReduce Technology Patent
From a Teradata PR announcement:
SQL-MapReduce® is a framework which enables fast, investigative analysis of complex information by data scientists and business analysts. It enables procedural expressions in software languages (such as Java, C#, Python, C++, and R) to be parallelized across a group of linked computers (compute cluster) and then activated for use (invoked) with standard SQL.
The closest open source solution I can think of is Pig, created and open sourced by Yahoo! (PDF).
Original title and link: Aster Data SQL-MapReduce Technology Patent (©myNoSQL)
Monday, 27 June 2011
Yahoo Launches Hadoop Spinoff
GigaOm breaks the news of the Yahoo! Hadoop engineering spinoff, HortonWorks:
By incorporating next-generation features and capabilities, HortonWorks hopes to make Hadoop easier to consume and better suited for running production workloads. Its products, which likely will include higher-level management tools on top of the core MapReduce and file system layers, will be open source and HortonWorks will try to maintain a close working relationship with Apache. The goal is to make HortonWorks the go-to vendor for a production-ready Hadoop distribution and support, but also to advance Yahoo’s repeated mission of making the official Apache Hadoop distribution the place to go for core software. Earlier this year, Yahoo discontinued its own Hadoop distribution, recommitting all that code and all its development efforts to Apache.
Judging by all the cool projects the Yahoo! engineering team has created, I’ve already said this makes a lot of sense to me.
Original title and link: Yahoo Launches Hadoop Spinoff (©myNoSQL)
via: http://gigaom.com/cloud/exclusive-yahoo-launching-hadoop-spinoff-this-week/
Wednesday, 27 April 2011
Yahoo Could Spinoff Hadoop Software Unit
WSJ:
Yahoo is now weighing spinning off its Hadoop engineering unit into a new firm that would continue to develop the free software and charge companies for its expertise in using it, according to people familiar with the matter.
Makes a lot of sense to me.
Original title and link: Yahoo Could Spinoff Hadoop Software Unit (NoSQL databases © myNoSQL)
Friday, 25 March 2011
Mapr: a Competitor to Hadoop Leader Cloudera
They are said to be building a proprietary replacement for the Hadoop Distributed File System that’s allegedly three times faster than the current open-source version. It comes with snapshots and no NameNode single point of failure (SPOF), and is supposed to be API-compatible with HDFS, so it can be a drop-in replacement.
Where can one get the Mapr product?
Considering Yahoo is now focusing on Apache Hadoop and their plans for the next generation Hadoop MapReduce, I wouldn’t hold my breath for Mapr improvements.
Original title and link: Mapr: a Competitor to Hadoop Leader Cloudera (NoSQL databases © myNoSQL)
via: http://gigaom.com/cloud/meet-mapr-a-competitor-to-hadoop-leader-cloudera/
Monday, 28 February 2011
Cloudera’s Distribution for Apache Hadoop version 3 Beta 4
New version of Cloudera’s Hadoop distro — complete release notes available here:
CDH3 Beta 4 also includes new versions of many components. Highlights include:
- HBase 0.90.1, including much improved stability and operability.
- Hive 0.7.0rc0, including the beginnings of authorization support, support for multiple databases, and many other new features.
- Pig 0.8.0, including many new features like scalar types, custom partitioners, and improved UDF language support.
- Flume 0.9.3, including support for Windows and improved monitoring capabilities.
- Sqoop 1.2, including improvements to usability and Oracle integration.
- Whirr 0.3, including support for starting HBase clusters on popular cloud platforms.
Plus many scalability improvements contributed by Yahoo!.
Cloudera’s CDH is the most popular Hadoop distro bringing together many components of the Hadoop ecosystem. Yahoo remains the main innovator behind Hadoop.
Original title and link: Cloudera’s Distribution for Apache Hadoop version 3 Beta 4 (NoSQL databases © myNoSQL)
via: http://www.cloudera.com/blog/2011/02/cdh3-beta-4-now-available
Friday, 18 February 2011
The Next Generation of Apache Hadoop MapReduce
I’m not sure how many companies have already hit this limit, but Yahoo! is again showing its Hadoop leadership:
The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale high-availability is built-in from the beginning; as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.
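The split described in the quote, a generic resource scheduler plus a per-job component that manages execution, can be shown with a toy model. Nothing below is Hadoop code: `ResourceScheduler`, `JobManager`, and the slot-counting scheme are hypothetical names and simplifications chosen for the sketch, and tasks run sequentially in-process instead of on remote machines.

```python
class ResourceScheduler:
    """Generic scheduler: hands out slots and knows nothing about MapReduce."""

    def __init__(self, total_slots):
        self.free = total_slots

    def allocate(self, wanted):
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class JobManager:
    """Per-job component: owns the task bookkeeping for exactly one job."""

    def __init__(self, scheduler, tasks):
        self.scheduler = scheduler
        self.pending = list(tasks)
        self.results = []

    def run(self):
        while self.pending:
            slots = self.scheduler.allocate(len(self.pending))
            if slots == 0:
                raise RuntimeError("no capacity available")
            batch, self.pending = self.pending[:slots], self.pending[slots:]
            self.results.extend(task() for task in batch)  # run one wave of tasks
            self.scheduler.release(slots)
        return self.results
```

Because the scheduler only counts slots, a different kind of job manager could be plugged in without changing it, which is the decoupling the quote is after.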

There are way too many interesting aspects covered in the post to spoil the pleasure of diving into them.
Original title and link: The Next Generation of Apache Hadoop MapReduce (NoSQL databases © myNoSQL)
via: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
Monday, 7 February 2011
YCSB Benchmark Results for Cassandra, HBase, MongoDB, Riak
A recent slide deck presenting the results of a YCSB benchmark run against the latest versions of Cassandra (0.6.10), HBase (0.20.6), MongoDB (1.6.5), and Riak (0.14.0):
Some of the results are striking, so I cannot help but wonder whether there were some configuration issues.
Update: A few users who had more luck reading the details on the slides have pointed out that this is not the YCSB benchmark, but rather a new one developed by the presenter. Another important detail is that the data set used was rather small and could easily fit in memory.
Original title and link: YCSB Benchmark Results for Cassandra, HBase, MongoDB, Riak (NoSQL databases © myNoSQL)
Sunday, 6 February 2011
The Backstory of Yahoo and Hadoop
We currently have nearly 100 people working on Apache Hadoop and related projects, such as Pig, ZooKeeper, Hive, Howl, HBase and Oozie. Over the last 5 years, we’ve invested nearly 300 person-years into these projects. […] Today Yahoo runs on over 40,000 Hadoop machines (>300k cores). They are used by over a thousand regular users from our science and development teams. Hadoop is at the center of our research in search, advertising, spam detection, personalization and many other topics.
I assume it’s no surprise to anyone that I’m a big fan of Yahoo! open source initiatives.
Original title and link: The Backstory of Yahoo and Hadoop (NoSQL databases © myNoSQL)
via: http://developer.yahoo.com/blogs/hadoop/posts/2011/01/the-backstory-of-yahoo-a