distributed systems: All content tagged as distributed systems in NoSQL databases and polyglot persistence

Distributed System Reliability: It's About Operations, Not Architecture or Design

Jay Kreps[1]:

I have come around to the view that the real core difficulty of these systems is operations, not architecture or design. Both are important but good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations. […] I really think there is really only one thing to talk about with respect to reliability: continuous hours of successful production operations.


  1. Jay Kreps: works for LinkedIn where he is the technical lead for the SNA team that handles search, social graph, data infrastructure, and recommendation systems. 

Original title and link: Distributed System Reliability: It’s About Operations, Not Architecture or Design (NoSQL database©myNoSQL)

via: http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability


Visualizing System Latency

Besides the many practical lessons emphasized in Jack Clark’s interview with Adrian Cockcroft on ZDNet (luckily I’ve had the chance to see some of Cockcroft’s presentations about Netflix architecture and also talk to him directly), one thing that stuck with me was the ending paragraph:

The thing I’ve been publicly asking for has been better IO in the cloud. Obviously I want SSDs in there. We’ve been asking cloud vendors to do that for a while. With Cassandra, we’ve had to go onto horizontal scale and use the internal disks and triple replicate across availability zones, so you end up with a triple-redundant data store that is careful not to overload the disks.

That reminded me of this old ACM article authored by Brendan Gregg:

When I/O latency is presented as a visual heat map, some intriguing and beautiful patterns can emerge. These patterns provide insight into how a system is actually performing and what kinds of latency end-user applications experience. Many characteristics seen in these patterns are still not understood, but so far their analysis is revealing systemic behaviors that were previously unknown.

Visualizing system latency - Sequential disk reads, stepping disk count
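To make the heat map idea concrete, here is a minimal sketch of how latency samples could be bucketed into such a map. It is not taken from Gregg's article: the function names, the bucket sizes, and the text rendering are all assumptions made for illustration. Each cell counts the I/O operations that completed in a given time slice with a given latency range.

```python
from collections import Counter

def latency_heatmap(samples, time_bucket_s=1.0, latency_bucket_ms=10.0):
    """Count I/O completions per (time bucket, latency bucket) cell.

    samples: iterable of (timestamp_seconds, latency_ms) pairs.
    Bucket sizes are illustrative defaults, not values from the article.
    """
    cells = Counter()
    for ts, latency in samples:
        cell = (int(ts // time_bucket_s), int(latency // latency_bucket_ms))
        cells[cell] += 1
    return cells

def render(cells, shades=" .:*#"):
    """Render cells as text: x axis is time, y axis is latency, denser means more I/O."""
    if not cells:
        return ""
    max_t = max(t for t, _ in cells)
    max_l = max(l for _, l in cells)
    peak = max(cells.values())
    rows = []
    for l in range(max_l, -1, -1):  # highest latency bucket on the top row
        row = ""
        for t in range(max_t + 1):
            count = cells.get((t, l), 0)
            # Empty cells stay blank; the busiest cell gets the densest shade.
            idx = 0 if count == 0 else 1 + count * (len(shades) - 2) // peak
            row += shades[idx]
        rows.append(row)
    return "\n".join(rows)

if __name__ == "__main__":
    demo = [(0.1, 5), (0.2, 7), (1.5, 25)]  # made-up samples
    print(render(latency_heatmap(demo)))
```

A real tool would feed this from traced I/O events and use color instead of characters, but the bucketing step is the part that turns raw latencies into patterns.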

I was wondering whether any monitoring tools in the NoSQL database space (and the data storage space in general) provide such advanced visualization of latency data. Do you know of any?

Original title and link: Visualizing System Latency (NoSQL database©myNoSQL)


What Can Be Learned From Heroku Outage Postmortem

While some may learn a few new things or get a confirmation in the very details of the outage, what caught my attention in Heroku’s postmortem analysis were the conclusions:

  • higher sensitivity and more aggressive monitoring on a variety of metrics
  • improved early warning systems
  • better containment
  • improved flow controls, both manual and automatic
  • expanding simulations of unusual load conditions in our staging environment

None of these are particular to a specific storage or NoSQL database. But they all reflect the reality of operating at large scale where even the most operationally friendly solutions—think of Dynamo-inspired NoSQL databases—cannot and should not be left unmonitored or unsupervised or with no clear recovery strategies and processes in place.

In the NoSQL world, one of the most covered outages was the MongoDB outage at Foursquare. And in case you don’t remember the details, most of the circumstances that led to that event could have been prevented by having:

  1. better monitoring
  2. early warnings
  3. better operational procedures

Don’t these two lists look very much alike?

Original title and link: What Can Be Learned From Heroku Outage Postmortem (NoSQL database©myNoSQL)

via: https://status.heroku.com/incident/308


Eventual and Strong Consistency, Sloppy and Strict Quorums, and Other Lessons and Thoughts on Distributed Systems

Anything I’d write would just steal from your time to read and think about the email Joseph Blomstedt posted to the Riak list.

Original title and link: Eventual and Strong Consistency, Sloppy and Strict Quorums, and Other Lessons and Thoughts on Distributed Systems (NoSQL database©myNoSQL)


Spotify Architecture: The Peer to Peer Network

Spotify architecture

  • Spotify’s p2p network works like a BitTorrent network to locate peers […]. It uses a proprietary protocol designed especially for streaming music.
  • There are no “preferred” peers or supernodes, but a future improvement might be to use peer-to-peer overlays to exploit the overlap in interests between users.
  • Each client connects to at most 60 peers, with a soft limit of 50 peers.
  • The client uploads to at most 4 peers at a time.
  • Server-side trackers and network queries are used to locate other users who have the music you’re listening to.
  • Spotify uses TCP as the transport protocol instead of UDP, since it can take advantage of TCP’s congestion controls and ability to re-send lost packets.
  • Most users have a large cache […]. This helps keep network traffic down since most users listen to tracks more than once.
  • At Spotify’s end, there’s a master storage area (290TB) and two production storage areas (90TB in London, 90TB in Stockholm).
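The connection and upload limits listed above can be sketched in a few lines. This is only an illustration, not Spotify's implementation: the PeerSet class and its method names are hypothetical, and the way the soft limit is applied (the client stops opening new connections at 50 but still accepts incoming ones up to 60) is an assumption about how such limits are typically used.

```python
class PeerSet:
    """Tracks one client's peer connections under a soft and a hard limit."""

    def __init__(self, soft_limit=50, hard_limit=60, max_uploads=4):
        self.soft_limit = soft_limit    # assumed: cap for connections we initiate
        self.hard_limit = hard_limit    # absolute cap on connected peers
        self.max_uploads = max_uploads  # peers we upload to at the same time
        self.peers = set()
        self.uploading_to = set()

    def connect(self, peer, incoming=False):
        """Add a peer; return True if the connection was accepted."""
        if peer in self.peers:
            return True
        # Assumption: outgoing connections stop at the soft limit,
        # incoming ones are accepted until the hard limit is reached.
        limit = self.hard_limit if incoming else self.soft_limit
        if len(self.peers) >= limit:
            return False
        self.peers.add(peer)
        return True

    def start_upload(self, peer):
        """Begin uploading to a connected peer if an upload slot is free."""
        if peer not in self.peers or len(self.uploading_to) >= self.max_uploads:
            return False
        self.uploading_to.add(peer)
        return True

if __name__ == "__main__":
    ps = PeerSet()
    accepted = sum(ps.connect(f"out-{i}") for i in range(55))
    print("outgoing connections accepted:", accepted)
```

Keeping headroom between the two limits means a client that is already busy can still be reached by peers that need a track from it.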

How many characteristics of the distributed NoSQL databases can you identify in Spotify’s architecture?

Original title and link: Spotify Architecture: The Peer to Peer Network (NoSQL database©myNoSQL)

via: http://pansentient.com/2011/04/spotify-technology-some-stats-and-how-spotify-works/


Apache ZooKeeper 3.4.0 Released to Be Followed Soon by Production-Ready Version

Apache ZooKeeper, the high-performance coordination service exposing services like naming, configuration management, synchronization, etc. for distributed applications, has reached version 3.4.0.

Even if the official announcement was laconic, ZooKeeper 3.4.0 features over 150 fixes.

The most important ones are summarized by Patrick Hunt in this Cloudera blog post:

  • ZooKeeper 3.3.3 clients are compatible with 3.4.0 servers
  • Native Windows version of C client
  • Support Kerberos authentication of clients
  • Improved REST Interface
  • Existing monitoring support has been extended through the introduction of a new ‘mntr’ 4 letter word
  • Add tools and recipes for monitoring as a contrib
  • Web-based Administrative Interface
  • Automating log and snapshot cleaning
  • Add logging/stats to identify production deployment issues
  • Support for building RPM and DEB packages
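As a quick illustration of the new ‘mntr’ command from the list above, here is a minimal sketch of reading it from a script. It assumes a ZooKeeper server listening on localhost:2181 and that the reply is one tab-separated key/value pair per line; the helper names parse_mntr and query_mntr are made up for this example.

```python
import socket

def parse_mntr(text):
    """Parse 'mntr' output (one tab-separated key/value pair per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip lines that do not contain a tab
            stats[key] = value
    return stats

def query_mntr(host="localhost", port=2181, timeout=5.0):
    """Send the four letter word 'mntr' to a ZooKeeper server and parse the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return parse_mntr(b"".join(chunks).decode("utf-8", errors="replace"))

if __name__ == "__main__":
    # Requires a reachable server; host and port here are assumptions.
    for key, value in sorted(query_mntr().items()):
        print(key, value)
```

The parsed dictionary is easy to forward to whatever monitoring system you already run, which is the point of exposing the metrics in a machine-friendly format.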

Something to keep in mind though: ZooKeeper 3.4.0 is not production ready yet. After extensive testing, it will be followed soon by a minor release that will be production-ready.

Original title and link: Apache ZooKeeper 3.4.0 Released to Be Followed Soon by Production-Ready Version (NoSQL database©myNoSQL)


The Main Benefits of Riak

Dave Smith[1] about Riak:

Low, predictable latency data access and operational ease-of-use. When used appropriately (i.e. don’t expect Riak to be a RDBMS), Riak is able to provide an excellent latency profile, even in the case of multiple node failures. In addition, it’s easy to add/remove nodes while the system is running without a lot of operational juggling. Another useful benefit is that multiple nodes can take writes for the same key, thus making it easier to construct a geographically distributed data store.


  1. Dave Smith, Director of Engineering at Basho Technologies, Inc.  

Original title and link: The Main Benefits of Riak (NoSQL database©myNoSQL)

via: http://pdincau.wordpress.com/2011/06/24/an-interview-with-dave-smith-dizzyd/


Zynga, Data Centers, Polyglot Persistence, and Big Data

Cadir Lee (CTO Zynga) quoted in a VentureBeat post:

It’s not the amount of hardware that matters. It’s the architecture of the application. You have to work at making your app architecture so that it takes advantage of Amazon. You have to have complete fluidity with the storage tier, the web tier. We are running our own data centers. We are looking more at doing our own data centers with more of a private cloud.

A couple of thoughts:

  1. Zynga is going in the opposite direction from Netflix. While Netflix is focusing (by using Amazon for most of their infrastructure), Zynga is diversifying (building their own data centers).
  2. Zynga’s applications are great examples of where fully distributed NoSQL databases fit. Availability is key.
  3. My answer to the question: “how many Zyngas are out there” would be: “enough to ensure some good business for the most reliable and scalable distributed databases”
  4. Zynga has contributed to and is an investor in Membase, the company that merged with CouchOne to form Couchbase. But Zynga was using a custom version of Membase.
  5. Zynga also operates a large MySQL cluster.
  6. Zynga processes over 15 terabytes of game data every day (according to their SEC filing). That’s Hadoop’s sweet spot.

PS: I’d love to talk to someone from Zynga about their data storage approach. If you have any connections I’d really appreciate an introduction.

Original title and link: Zynga, Data Centers, Polyglot Persistence, and Big Data (NoSQL database©myNoSQL)


Google: 'At Scale, Everything Breaks'

Urs Hölzle interviewed by ZDNet:

I think the big challenges haven’t changed that much. I’d say that it’s dealing with failure, because at scale everything breaks no matter what you do and you have to deal reasonably cleanly with that and try to hide it from the people actually using your system.

Make sure you read the whole article, especially the part about what Hölzle considers to be today’s captivating technical problems: fault tolerance, dealing with state and mutable states, using commodity hardware while dealing with rapid innovation cycles.

Original title and link: Google: ‘At Scale, Everything Breaks’ (NoSQL database©myNoSQL)

via: http://www.zdnet.co.uk/news/cloud/2011/06/22/google-at-scale-everything-breaks-40093061/


How to Decrease the Pain in Building Distributed Systems

Bradford Stephens[1] talking about distributed systems:

Building distributed systems is painful. Many organizations are approaching the point where their data and application infrastructures are being run on many servers (in the cloud or datacenter). Our software practices don’t reflect that, often with disastrous results. This talk is a collection of scalability principles, anecdotes and practices from experts (and personal experience) from engineering, operations, and business perspectives.


Distributed Systems Papers

Just in case you ever run out of readings on distributed systems, Dr. Indranil Gupta has put together an extensive list of papers including all the classics, but also some you might not have heard about. Here they are for you to enjoy.

Todd Hoff

Original title and link: Distributed Systems Papers (NoSQL databases © myNoSQL)