


distributed: All content tagged as distributed in NoSQL databases and polyglot persistence

Card Payment Systems and the CAP Theorem

On the surface it would appear that building such a system would be easy: the card vault can be implemented in a data store (either an RDBMS or a NoSQL store), and the schema can be simple, containing just the PAN, the token, and perhaps some timestamp information. Plenty of companies have attempted to build their own card vaults, and many vendors offer commercial products. However, as we shall see later in this article, designing a card vault requires a distributed data store, and a decision is needed about which compromises of the CAP theorem your system is willing to accept.
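To make the "simple schema" point concrete, here is a minimal single-node sketch of the vault's core mapping. All names (`CardVault`, `tokenize`, `detokenize`) are illustrative, not from any real product; a production vault would of course be a distributed store, which is exactly where the CAP trade-offs discussed in the article come in.

```python
import secrets
import time

class CardVault:
    """Minimal single-node token vault: token -> (PAN, created_at)."""

    def __init__(self):
        self._by_token = {}
        self._by_pan = {}

    def tokenize(self, pan: str) -> str:
        # Idempotent: the same PAN always maps to the same token.
        if pan in self._by_pan:
            return self._by_pan[pan]
        token = secrets.token_hex(8)
        self._by_token[token] = (pan, time.time())
        self._by_pan[pan] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._by_token[token][0]
```

The difficulty is not this logic; it is replicating the token/PAN mapping across nodes so that a token issued in one data center can be resolved in another, which is where consistency versus availability must be decided.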

Firstly a small correction to the original post: instead of “partition tolerance is not an option”, read “partition tolerance is not optional”.

One of the most frequently asked questions about NoSQL databases is “how do they handle transactions? Like in a banking system?” I’ve never developed a banking system, so I don’t know how those work. But I’d bet most of those asking haven’t worked on one either. So why not ask instead what a NoSQL database would require for the system you are actually working on?

Original title and link: Card Payment Systems and the CAP Theorem (NoSQL database©myNoSQL)


Distributed Caches, NoSQL Databases, and RDBMS

Greg Luck[1], following up on his article Ehcache: Distributed Cache or NoSQL Store?, talks about the architectural differences between distributed caches, NoSQL databases, and RDBMSs, and where distributed caches fit:

NoSQL and RDBMS are generally on disk. Disks are mechanical devices and exhibit large latencies due to seek time as the head moves to the right track and read or write times dependent on the RPM of the disk platter. NoSQL tends to optimise disk use, for example, by only appending to logs with the disk head in place and occasionally flushing to disk. By contrast, caches are principally in memory. […] With RDBMS a cache is added to avoid these scale out difficulties. For NoSQL, scale out is built-in, so the cache will get used when lower latencies are required.
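The "only appending to logs with the disk head in place" optimization Luck mentions can be sketched in a few lines. This is a toy illustration, not any particular store's implementation: writes always go to the tail of the log (sequential I/O, no seeks), and an in-memory index maps each key to its latest offset.

```python
import io

class AppendOnlyStore:
    """Toy append-only write path: sequential log writes plus an
    in-memory index of key -> (offset, length) for reads."""

    def __init__(self):
        self._log = io.BytesIO()   # stands in for a file opened in append mode
        self._index = {}

    def put(self, key: str, value: bytes):
        offset = self._log.seek(0, io.SEEK_END)  # always write at the tail
        self._log.write(value)
        self._index[key] = (offset, len(value))  # newest write wins on read

    def get(self, key: str) -> bytes:
        offset, length = self._index[key]
        self._log.seek(offset)
        return self._log.read(length)
```

Because the index always points at the most recent offset, overwrites never seek back to old data; stale log entries are left behind for a later compaction pass.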

  1. Greg Luck: Founder and CTO, Ehcache  

Original title and link: Distributed Caches, NoSQL Databases, and RDBMS (NoSQL database©myNoSQL)


The NoSQL Fad

Adam D’Angelo[1]:

I think the “NoSQL” fad will end when someone finally implements a distributed relational database with relaxed semantics.

I believe that defining these relaxed semantics will actually lead to figuring out the origins of many of the NoSQL solutions—just as an example, relaxing the relational model would lead to options like the document model or the BigTable-like columnar model.
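One way to picture "relaxing the relational model toward the document model" is the denormalization step below. The table and field names are made up for illustration: two relational tables joined by `order_id` collapse into self-contained documents, trading the join away for duplicated data.

```python
# Relational shape: two tables joined by order_id.
orders = [{"order_id": 1, "customer": "alice"}]
items = [
    {"order_id": 1, "sku": "A", "qty": 2},
    {"order_id": 1, "sku": "B", "qty": 1},
]

def to_documents(orders, items):
    """Denormalize the join: each order document embeds its line items."""
    docs = []
    for o in orders:
        docs.append({
            **o,
            "items": [i for i in items if i["order_id"] == o["order_id"]],
        })
    return docs
```

The relaxed semantics fall out of the shape: once the items live inside the order, there is no cross-document join to keep consistent, which is precisely what makes the document model easier to distribute.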

  1. Adam D’Angelo: Quora Founder  

Original title and link: The NoSQL Fad (NoSQL database©myNoSQL)


Druid: Distributed In-Memory OLAP Data Store

Over the last twelve months, we tried and failed to achieve scale and speed with relational databases (Greenplum, InfoBright, MySQL) and NoSQL offerings (HBase).

Stepping back from our two failures, let’s examine why these systems failed to scale for our needs:

  1. Relational Database Architectures

    • Full table scans were slow, regardless of the storage engine used
    • Maintaining proper dimension tables, indexes and aggregate tables was painful
    • Parallelization of queries was not always supported or non-trivial
  2. Massive NoSQL With Pre-Computation

    • Supporting high dimensional OLAP requires pre-computing an exponentially large amount of data
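The "exponentially large" claim in point 2 is easy to verify: a fully pre-computed OLAP cube needs one aggregate per subset of the dimension set, i.e. 2^d GROUP BY combinations for d dimensions. A quick sketch (dimension names are invented for the example):

```python
from itertools import combinations

def groupby_sets(dimensions):
    """Every GROUP BY combination a full cube pre-computes:
    all subsets of the dimension set, 2**d of them."""
    subsets = []
    for r in range(len(dimensions) + 1):
        subsets.extend(combinations(dimensions, r))
    return subsets

dims = ["country", "device", "campaign", "hour"]
assert len(groupby_sets(dims)) == 2 ** len(dims)  # 16 aggregates for 4 dims
```

At 4 dimensions that is 16 aggregate tables; at 20 dimensions it is over a million, which is why pre-computation stops scaling for high-dimensional OLAP.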

Many of the questions you might have have already been asked in this comment thread, though few answers have appeared so far.

Original title and link: Druid: Distributed In-Memory OLAP Data Store (NoSQL databases © myNoSQL)


MongoDB, memcached, EHCache: Compared as Distributed L2 Caches

As can be seen, whether the off-host process that manages the cache-data is MongoD or MemcacheD or Terracotta-Server, architecturally they all look equivalent - i.e. a pure-L2 with no-L1 - so that all data needs to be retrieved from over the network and then massaged into a POJO for consumption by the application.

MongoDB, memcached, EHCache compared
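The "pure-L2 with no-L1" read path described above can be contrasted with a tiered design in a few lines. This is a generic sketch, not any of the three products' actual code; `l2_get` stands in for whatever remote client call (memcached, MongoDB, a server array) fetches the bytes and deserializes them.

```python
class TieredCache:
    """L1 (local, in-process) over L2 (remote).  The pure-L2 design in
    the quote is this class with the L1 dict deleted: every read pays
    the network round trip plus deserialization."""

    def __init__(self, l2_get):
        self._l1 = {}
        self._l2_get = l2_get    # hypothetical remote fetch callable
        self.remote_reads = 0

    def get(self, key):
        if key in self._l1:
            return self._l1[key]       # no network, no object "massaging"
        self.remote_reads += 1
        value = self._l2_get(key)      # network + deserialize into a POJO
        self._l1[key] = value
        return value
```

The architectural equivalence the author points out is visible here: swap the `l2_get` implementation and nothing else changes, which is why MongoD, MemcacheD, and Terracotta-Server all look alike from the application's side.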

When speaking about caching systems, I’d also include criteria like:

  • warm up strategy
  • locking strategy
  • single-machine memory spill strategy

Original title and link: MongoDB, memcached, EHCache: Compared as Distributed L2 Caches (NoSQL databases © myNoSQL)


Dealing With Distributed State

Jeff Darcy[1]:

The general rule to avoid these kinds of unresolvable conflicts is: don’t pass around references to values that might be inconsistent across systems. It’s like passing a pointer from one address space to a process in another; you just shouldn’t expect it to work. Either pass around the actual values or do calculations involving those values and replicate the result.
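Darcy's rule — pass values or replicated results, never references — can be illustrated with two replicas holding divergent copies of a counter. The scenario and names are invented for the example:

```python
# Two "nodes", each with its own (possibly divergent) copy of the data.
node_a = {"balance": 100}
node_b = {"balance": 90}   # stale replica

def apply_delta(node, key, delta):
    # Anti-pattern: the message means "add delta to whatever the value
    # is over there" -- the outcome depends on which replica gets it.
    node[key] += delta

def apply_value(node, key, new_value):
    # Darcy's rule: do the calculation once, then replicate the result,
    # so every replica that applies the message converges.
    node[key] = new_value

new_balance = node_a["balance"] + 10   # compute at one place
apply_value(node_a, "balance", new_balance)
apply_value(node_b, "balance", new_balance)
assert node_a["balance"] == node_b["balance"] == 110
```

Had `apply_delta` been sent to both nodes instead, node_a would hold 110 and node_b 100: the reference ("the balance over there") pointed at inconsistent values, exactly the unresolvable conflict the quote warns about.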

When dealing with distributed state think about the actor model.

  1. Jeff Darcy: @Obdurodon  

Original title and link: Dealing With Distributed State (NoSQL databases © myNoSQL)


The Key Technical Challenge of Cloud Computing

Adrian Cockcroft[1]:

The key challenge is to get into the same mind-set as the Google’s of this world, the availability and robustness of your apps and services has to be designed into your software architecture, you have to assume that the hardware and underlying services are ephemeral, unreliable and may be broken or unavailable at any point, and that the other tenants in the multi-tenant public cloud will add random congestion and variance. In reality you always had this problem at scale, even with the most reliable hardware, so cloud ready architecture is about taking the patterns you have to use at large scale, and using them at a smaller scale to leverage the lowest cost infrastructure.

  1. Adrian Cockcroft: Cloud Architect at Netflix, @adrianco  

Original title and link: The Key Technical Challenge of Cloud Computing (NoSQL databases © myNoSQL)


Google Paper: Availability in Globally Distributed Storage Systems

Google paper presented at the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010:

Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services. We characterize the availability properties of cloud storage systems based on an extensive one year study of Google’s main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.

Original title and link: Google Paper: Availability in Globally Distributed Storage Systems (NoSQL databases © myNoSQL)


An Introduction to Distributed Filesystems

Jeff Darcy obliged:

[…] when should one consider using a distributed filesystem instead of an oh-so-fashionable key/value or table/document store for one’s scalable data needs? First, when the data and API models fit. Filesystems are good at hierarchical naming and at manipulating data within large objects (beyond the whole-object GET and PUT of S3-like systems), but they’re not so good for small objects and don’t offer the indices or querying of databases (SQL or otherwise). Second, it’s necessary to consider the performance/cost curve of a particular workload on a distributed filesystem vs. that on some other type of system. If there’s a fit for data model and API and performance, though, I’d say a distributed filesystem should often be preferred to other options. The advantage of having something that’s accessible from every scripting language and command-line tool in the world, without needing special libraries, shouldn’t be taken lightly. Getting data in and out, or massaging it in any of half a million ways, is a real problem that isn’t well addressed by any storage system with a “unique” API (including REST-based ones) no matter how cool that storage system might be otherwise.

Original title and link: An Introduction to Distributed Filesystems (NoSQL databases © myNoSQL)


Clustrix: Distribution, Fault Tolerance, and Availability Models

Using as a pretext a comparison with MongoDB — why MongoDB? — Sergei Tsarev provides some details about Clustrix data distribution, fault tolerance, and availability models.

At Clustrix, we think that Consistency, Availability, and Performance are much more important than Partition tolerance. Within a cluster, Clustrix keeps availability in the face of node loss while keeping strong consistency guarantees. But we do require that more than half of the nodes in the cluster group membership are online before accepting any user requests. So a cluster provides fully ACID compliant transactional semantics while keeping a high level of performance, but you need majority of the nodes online.
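The majority-membership rule in the quote ("more than half of the nodes in the cluster group membership") is a standard quorum gate, sketched here in a hedged, generic form (not Clustrix's actual code):

```python
def accepts_requests(online: int, total: int) -> bool:
    """Majority-quorum gate: serve user requests only while more than
    half of the group membership is online."""
    return online > total // 2

assert accepts_requests(3, 5)       # 3 of 5 is a majority
assert not accepts_requests(2, 5)   # minority side refuses requests
assert not accepts_requests(2, 4)   # exactly half is NOT a majority
```

This is the CP-leaning trade the post describes: during a partition, the minority side sacrifices availability so that strong consistency survives on the majority side.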

Clustrix Distribution Model

Original title and link: Clustrix: Distribution, Fault Tolerance, and Availability Models (NoSQL databases © myNoSQL)


The NoSQL Gene in SQL Azure Federations

[SQL Azure] Federations bring the great benefits of the NoSQL model into SQL Azure where it is needed most. I have a special love for RDBMSs after having worked on two (Informix and SQL Server), but I also have a great appreciation for NoSQL qualities after having worked on challenging web platforms. These web platforms need flexible app models with elasticity to handle unpredictable capacity requirements, the ability to deliver great computational capacity to handle peaks, and at the same time to deliver that with great economics. NoSQL does bring advantages in this space, and I’d argue SQL Azure is inheriting some of these properties of NoSQL through federations.

The way I read it: “we’ve scaled SQL Server as much as we could. Now we need to look at how other scalable distributed systems are built to get us past the dead ends we’ve hit.”

Original title and link: The NoSQL Gene in SQL Azure Federations (NoSQL databases © myNoSQL)