Yahoo: All content tagged as Yahoo in NoSQL databases and polyglot persistence
Monday, 6 September 2010
High Availability MySQL at Yahoo!
Jay Janssen talks about Yahoo!’s approach:
Now, what makes our solution different? Not much. The layout is this: two master databases, one in each of our two colocations. These masters replicate from each other, but we would never have more than two masters in this replication loop for the same reason we don’t use token ring networks today: one master outage would break replication in a chain of size > 2. Our slaves replicate from one of the two masters, often half of the slaves in a given colocation replicate from one of the masters, and half from the other master.
But there is much more in the original article (e.g. allowing writes to a single master, dealing with failure, etc.). There are also three slide decks on infrastructure resiliency, high availability/business continuity planning, and application resiliency.
Infrastructure resiliency at Yahoo
High availability/Business continuity planning at Yahoo
Application resiliency at Yahoo
It may not sound as exciting as what Google or Facebook are doing, but it is probably something many could learn from.
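As a rough illustration of the layout quoted above, here is a minimal Python sketch. The host names (m1, s1, and so on) and both helper functions are invented for this example and do not come from Yahoo!'s tooling. It only models who replicates from whom: slaves are split across the two masters, and a replication ring is walked to see which surviving masters stop receiving writes when one master goes down.

```python
def assign_slaves(slaves, masters):
    """Round-robin the slaves so each master feeds about half of them."""
    return {s: masters[i % len(masters)] for i, s in enumerate(slaves)}


def stranded_masters(ring, failed, origin):
    """Masters that are still up but never receive a write made on `origin`.

    `ring` lists masters in replication order: each entry replicates to the
    next one, and the last one replicates back to the first.
    """
    reached = {origin}
    i = ring.index(origin)
    for _ in range(len(ring) - 1):
        i = (i + 1) % len(ring)
        if ring[i] == failed:
            break  # the chain stops at the dead master
        reached.add(ring[i])
    return [m for m in ring if m != failed and m not in reached]


# Hypothetical hosts: two masters, four slaves in one colocation.
layout = assign_slaves(["s1", "s2", "s3", "s4"], ["m1", "m2"])

# Two masters: losing m2 leaves no surviving master out of date.
two_ring = stranded_masters(["m1", "m2"], failed="m2", origin="m1")

# Three masters: losing m2 cuts m3 off from writes made on m1.
three_ring = stranded_masters(["m1", "m2", "m3"], failed="m2", origin="m1")
```

With two masters the list of stranded masters is empty, while with three masters m3 is still running but no longer sees writes from m1, which is the failure mode the quote describes for chains longer than two.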
Original title and link for this post: High Availability MySQL at Yahoo! (published on the NoSQL blog: myNoSQL)
via: http://mysqlguy.net/blog/2010/08/03/mysql-master-ha-yahoo
Monday, 30 August 2010
Pig and Hive at Yahoo!
A fantastic post on the Yahoo! Hadoop blog presents a series of scenarios where using Pig and Hive makes things a lot better:
The widespread use of Pig at Yahoo! has enabled the migration of our data factory processing to Hadoop. With the adoption of Hive, we will be able to move much of our data warehousing to Hadoop as well. Having the data factory and the data warehouse on the same system will lower data-loading time into the warehouse — as soon as the factory is finished, it is available in the warehouse. It will also enable us to share — across both the factory and the warehouse — metadata, monitoring, and management tools; support and operations teams; and hardware. So we are excited to add Hive to our toolkit, and look forward to using both these tools together as we lean on Hadoop to do more and more of our data processing.
The use cases mentioned in the post:
- data preparation and presentation:
Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.
- data factories: pipelines (Pig + Oozie), iterative processing (Pig), research (Pig)
- data warehouse: business-intelligence analysis and ad-hoc queries
In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field. The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop.
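To make the factory/warehouse split a bit more concrete, here is a small Python sketch. It is only an analogy: a plain Python pipeline stands in for the step-by-step data flow of a Pig job, and the standard-library sqlite3 module stands in for Hive's SQL interface. The click-log rows, the column names, and the function names are all invented for illustration.

```python
import sqlite3

# Hypothetical raw click log, invented for illustration.
raw_clicks = [
    {"visitor": "alice", "page": "/home", "ms": "120"},
    {"visitor": "bob", "page": "/home", "ms": "bad"},  # malformed row
    {"visitor": "alice", "page": "/news", "ms": "300"},
    {"visitor": "bob", "page": "/news", "ms": "80"},
]


def factory(rows):
    """Data factory: a step-by-step pipeline that cleans and reshapes rows."""
    valid = (r for r in rows if r["ms"].isdigit())  # filter step
    return [(r["visitor"], r["page"], int(r["ms"])) for r in valid]  # transform step


def load_warehouse(clean_rows):
    """Data warehouse: load the cleaned rows into a relational table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE clicks (visitor TEXT, page TEXT, ms INTEGER)")
    db.executemany("INSERT INTO clicks VALUES (?, ?, ?)", clean_rows)
    return db


db = load_warehouse(factory(raw_clicks))

# Ad-hoc analyst query, expressed declaratively in SQL.
report = db.execute(
    "SELECT page, COUNT(*), AVG(ms) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
```

The factory half is written as explicit processing steps, while the warehouse half is a single declarative query. That mirrors the division of labor the post describes, with the output of the factory becoming queryable as soon as it is loaded.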
Yahoo! gets way too little credit for its work on big data and its contributions to open source.
via: http://developer.yahoo.net/blogs/hadoop/2010/08/pig_and_hive_at_yahoo.html
Tuesday, 17 August 2010
Howl: Unifying Metadata Layer for Hive and Pig
Yet another contribution from Yahoo!:
Common metadata layer for Hadoop’s Map Reduce, Pig, and Hive
Tuesday, 6 April 2010
Hadoop User Group March Meeting Recap
The meeting hosted lots of discussions and 3 presentations:
Owen O’Malley: Upcoming Hadoop Security release
Owen O’Malley from the Yahoo! Hadoop Team provided an overview of the upcoming Hadoop Security release. Owen described the features and capabilities included as well as operational benefits. Yahoo! is very excited about adding security capabilities to Hadoop and views this as a major milestone in continuing to make Hadoop an enterprise-grade platform.
Tyson Condie: Hadoop Online
Tyson Condie, a Ph.D. student at the University of California, Berkeley, presented the innovative research around the Hadoop Online effort led by Prof. Joseph M. Hellerstein. Tyson described a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing and can reduce completion times and improve system utilization. Tyson included examples from the Hadoop Online Prototype (HOP) project.
Bradford Cross: FlightCaster
Bradford Cross from FlightCaster provided an exciting overview of the FlightCaster flight delay prediction service and some cool insights into the airline industry. Bradford described how they built a scalable machine learning and data analysis platform using Clojure, a dynamic programming language, wrapping Cascading and Hadoop. Bradford demonstrated how the use of Hadoop makes building scalable systems much simpler.
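The pipelining idea from the Hadoop Online talk can be sketched in a few lines of Python. This is not how HOP is implemented: generators in a single process stand in for map tasks streaming output to reducers, and the function names, the sample input, and the periodic snapshot are assumptions made for this example rather than details from the recap.

```python
from collections import Counter


def map_words(lines):
    """Map stage: emit (word, 1) pairs one at a time instead of storing them all."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def pipelined_reduce(pairs, snapshot_every):
    """Reduce stage: consume pairs as they arrive and yield running snapshots."""
    counts = Counter()
    for n, (word, one) in enumerate(pairs, start=1):
        counts[word] += one
        if n % snapshot_every == 0:
            yield dict(counts)  # early, partial answer
    yield dict(counts)  # final answer once the input is exhausted


# Hypothetical input, invented for illustration.
lines = ["hadoop online", "hadoop pipelines data", "online"]
snapshots = list(pipelined_reduce(map_words(lines), snapshot_every=2))
```

Because the reducer pulls pairs from the mapper as they are produced, it can start working, and report partial results, before the map stage has finished. A batch version would first build the full list of pairs and only then start reducing.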