Yahoo: All content tagged as Yahoo in NoSQL databases and polyglot persistence
Information about Yahoo! PNUTS/Sherpa is rare. Apart from the original PNUTS architecture paper (PDF) and the Sherpa: Cloud Computing of the Third Kind slides (PDF), it’s difficult to find anything else. But in September, the Yahoo! Developer blog posted an update about Sherpa.
- 500+ Sherpa tables
- 50,000+ Sherpa tablets (shards) in operation
- tablets can be copied, moved, or split dynamically
- multi-data center
- multi-tenant: it must support different ranges of read/write ratios
- 75,000 requests per second
- supports heterogeneous servers
- some are SSD exclusively
- Sherpa users have access to monitoring tools to examine their app latency and throughput SLA. This implies each application could negotiate different SLAs with the infrastructure team.
- Sherpa default storage engine is MySQL/InnoDB
- storage access has been abstracted
- Yahoo! tested BDB, BDB-Java, and a Log-Structured Merge (LSM) Tree backend developed by Yahoo Labs
- Sherpa relies on a reliable messaging system
- can guarantee reliable in-colo and cross-colo transactions
Selective Record Replication
- previous versions supported Table-level replication
- the new version supports Record-level replication
- designed for efficiency (minimize costs of transfer and storage) and legality (comply with copyright regulations and other legal concerns)
- replication locations are declarative/static
- Sherpa maintains a ‘stub’ of the record in locations that do not have a full copy
- the stub is updated only when a record is created, deleted, or when replication rules change
- the stub is used for routing requests
- Support for full table backups has been added
- Point-in-time recovery planned
- Full, cross-colo, and automatic table restoration is also planned
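To make the selective record replication bullets above more concrete, here is a minimal sketch of how stubs could be used to route reads. It is only an illustration under stated assumptions: the `Cluster`, `Colo`, and `Stub` names, the colo names, and the "pick the first full-copy colo" routing rule are all hypothetical and not taken from the Yahoo! post or from Sherpa's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Stub:
    """Placeholder kept in a colo that holds no full copy of a record."""
    full_copy_colos: frozenset


@dataclass
class Colo:
    name: str
    records: dict = field(default_factory=dict)  # key -> full record value
    stubs: dict = field(default_factory=dict)    # key -> Stub


class Cluster:
    """Hypothetical model of a multi-colo table, not the Sherpa API."""

    def __init__(self, colo_names):
        self.colos = {name: Colo(name) for name in colo_names}

    def create(self, key, value, replicate_to):
        """Store full copies in `replicate_to`; leave stubs everywhere else."""
        targets = frozenset(replicate_to)
        for colo in self.colos.values():
            if colo.name in targets:
                colo.records[key] = value
            else:
                colo.stubs[key] = Stub(targets)

    def update(self, key, value):
        """Ordinary updates touch only full copies; stubs stay unchanged."""
        for colo in self.colos.values():
            if key in colo.records:
                colo.records[key] = value

    def read(self, key, at):
        """Serve locally when possible, otherwise route via the stub."""
        colo = self.colos[at]
        if key in colo.records:
            return colo.records[key], at
        remote = sorted(colo.stubs[key].full_copy_colos)[0]  # assumed rule
        return self.colos[remote].records[key], remote


cluster = Cluster(["us-west", "us-east", "eu"])
cluster.create("user:42", {"name": "Ana"}, replicate_to=["us-west", "us-east"])
cluster.update("user:42", {"name": "Ana B."})
print(cluster.read("user:42", at="eu"))
```

The point of the sketch is that a plain update never rewrites the stub, which matches the note that stubs change only on create, delete, or a replication rule change.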
Task Manager using Sherpa Tables for Task State
- Sherpa has added a general workflow manager to execute long-running tasks
- It is used for backup and restore operations
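The idea of keeping task state in a table can be sketched as follows. This is a guess at the general pattern, not Sherpa's task manager: `TaskTable` is an in-memory stand-in for a Sherpa table, and `run_task`, the `next_step` and `status` fields, and the step functions are hypothetical names introduced for illustration.

```python
class TaskTable:
    """In-memory stand-in for a table that holds task state (assumption)."""

    def __init__(self):
        self.rows = {}

    def put(self, task_id, row):
        self.rows[task_id] = dict(row)

    def get(self, task_id):
        return dict(self.rows[task_id])


def run_task(table, task_id, steps):
    """Run `steps` in order, recording progress so a rerun resumes."""
    if task_id not in table.rows:
        table.put(task_id, {"next_step": 0, "status": "PENDING"})
    row = table.get(task_id)
    for index in range(row["next_step"], len(steps)):
        table.put(task_id, {"next_step": index, "status": "RUNNING"})
        steps[index]()  # may raise; the stored row still points at this step
        table.put(task_id, {"next_step": index + 1, "status": "RUNNING"})
    table.put(task_id, {"next_step": len(steps), "status": "DONE"})
    return table.get(task_id)
```

Because progress lives in the table and not in the worker's memory, a long-running backup that fails halfway can be picked up again by calling `run_task` with the same task id; completed steps are skipped and the failed step is retried.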
The complete post can be read here.
Considering Yahoo! has always been a big proponent of open source projects, it is a pity that we don’t get to hear more often, and in more detail, about PNUTS/Sherpa.
Original title and link: Yahoo! Sherpa: Status and Advances ( ©myNoSQL)
Eric Baldeschweiler in a recent briefing (transcript by Bert Latamore over at Wikibon):
We’re really committed to building out Apache Hadoop and doing it in the Open Source community, so what really differentiates us is being really committed, besides shipping 100% pure Apache Hadoop code, which nobody else does, to taking a very partnering ecosystem-centric approach.[…] We’re the only ones committed to shipping Apache Hadoop code. We’ve been the drivers behind every major release of Apache Hadoop since its inception. Other companies are packaging and distributing Hadoop, but when they do that they add lots of their own custom stuff, both as patches to the Apache Hadoop distribution and also as independent products. A lot of that work is going into Apache, and since we committed to the Open Source model we’ve seen a lot more third party code going into Apache, which is obviously a win for the community. But to date no other company is actually taking releases from Apache & supporting them. They create their own versions that are slightly different from what comes from Apache, and try to build a business around that.
The political message from both Cloudera and Hortonworks is “we compete as businesses, but collaborate for the good of Hadoop”. But behind the curtains, they are both preparing the big guns.
Original title and link: Hadoop Market: Hortonworks’ Positioning ( ©myNoSQL)
Shawn Rogers has a short but compelling list of Big Data deployments in his article Big Data is Scaling BI and Analytics. This list also shows that even if there are some common components like Hadoop, there are no blueprints yet for dealing with Big Data.
Facebook: Hadoop analytic data warehouse, using HDFS to store more than 30 petabytes of data. Their Big Data stack is based only on open source solutions.
Quantcast: 3,000-core, 3,500-terabyte Hadoop deployment that processes more than a petabyte of raw data each day
University of Nebraska-Lincoln: Hadoop cluster holding 1.6 petabytes of physics data
Yahoo!: 100,000 CPUs in 40,000 computers, all running Hadoop. Also running a 12 terabyte MOLAP cube based on Tableau Software
eBay: has 3 separate analytics environments:
- 6PB data warehouse for structured data and SQL access
- 40PB deep analytics (Teradata)
- 20PB Hadoop system to support advanced analytic workload on unstructured data
Original title and link: Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others ( ©myNoSQL)
From a Teradata PR announcement:
SQL-MapReduce® is a framework which enables fast, investigative analysis of complex information by data scientists and business analysts. It enables procedural expressions in software languages (such as Java, C#, Python, C++, and R) to be parallelized across a group of linked computers (compute cluster) and then activated for use (invoked) with standard SQL.
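As a rough analogy for what the announcement describes (procedural code applied in parallel over partitions of a table, with results that come back as rows), here is a small Python sketch. It does not use or imitate Aster Data's actual API or SQL syntax; `apply_partitioned`, `sessionize`, the 30-unit gap, and the thread pool standing in for a compute cluster are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def sessionize(user, events, gap=30):
    """Procedural logic: number a user's sessions, splitting on time gaps."""
    out, session, last = [], 0, None
    for event in sorted(events, key=lambda r: r["ts"]):
        if last is not None and event["ts"] - last > gap:
            session += 1
        out.append({"user": user, "ts": event["ts"], "session": session})
        last = event["ts"]
    return out


def apply_partitioned(rows, key, func):
    """Group rows by `key`, run `func` on each group in parallel,
    and return the combined output rows (ordered by partition key)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda kv: func(kv[0], kv[1]),
                              sorted(groups.items())))
    return [row for part in parts for row in part]


clicks = [
    {"user": "a", "ts": 0}, {"user": "b", "ts": 5},
    {"user": "a", "ts": 10}, {"user": "a", "ts": 100},
]
for row in apply_partitioned(clicks, "user", sessionize):
    print(row)
```

In a SQL-invoked system the partitioning key and the function would be named in the query itself; here they are ordinary Python arguments.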
Original title and link: Aster Data SQL-MapReduce Technology Patent ( ©myNoSQL)
Yahoo is now weighing spinning off its Hadoop engineering unit into a new firm that would continue to develop the free software and charge companies for its expertise in using it, according to people familiar with the matter.
Makes a lot of sense to me.
Link circumventing the WSJ paywall