


scalability: All content tagged as scalability in NoSQL databases and polyglot persistence

What Scales Best?

Tony Bain:

What is best?  Well that comes down to the resulting complexity, cost, performance and other trade-offs.  Trade-offs are key as there are almost always significant concessions to be made as you scale up.


So what is my point? Well I guess what I am saying is physical scalability is of course an important consideration in determining what is best. But it is only one side of the coin. What it “costs” you in terms of complexity, actual dollars, performance, flexibility, availability, consistency, etc. is all important too. And these are often relative: what is complex for you may not be complex for someone else.

I concur—a long time ago I wrote: Complexity is a dimension of scalability.

Original title and link: What Scales Best? (NoSQL database©myNoSQL)


The Server Architecture Debate Rages On

Big processors or little processors, scale-up or scale-out, on-premise or in the cloud […] The plethora of choices for application architecture and delivery model are great if you like variety, but I don’t envy anyone tasked with choosing which system on which to spend their limited budget dollars.

Too few options are bad[1]. Too many options are paralyzing[2]. Then what’s the solution? I think the only answer is to build experience. By trying, failing, learning, and sharing with everyone else.

Original title and link: The Server Architecture Debate Rages On (NoSQL database©myNoSQL)


The NoSQL Fad

Adam D’Angelo[1]:

I think the “NoSQL” fad will end when someone finally implements a distributed relational database with relaxed semantics.

I believe that defining these relaxed semantics will actually lead to figuring out the origins of many of the NoSQL solutions—just as an example, relaxing the relational model would lead to options like the document model or the BigTable-like columnar model.

  1. Adam D’Angelo: Quora Founder  

Original title and link: The NoSQL Fad (NoSQL database©myNoSQL)


HBase Load Balancing Explained

Ted Yu explains the internals of HBase load balancing, with references to the corresponding JIRA tickets and the latest improvements:

If at least one region server joined the cluster just before the current balancing action, both new and old regions from overloaded region servers would be moved onto underloaded region servers. Otherwise, I find the new regions and put them on different underloaded servers. Previously one underloaded server would be filled up before the next underloaded server is considered.

I am planning for the next generation of load balancer where request histogram would play an important role in deciding which regions to move.
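The round-robin placement Ted describes (spreading regions across underloaded servers instead of filling one up before moving to the next) can be sketched roughly like this. This is a toy model, not HBase’s actual code: the server and region names, the `avg_load` threshold, and the data structures are all invented for illustration, and the special case for newly joined servers is not modeled.

```python
# Hypothetical sketch of the balancing policy described above -- NOT
# HBase's actual implementation. All names are invented.

def balance(servers, avg_load):
    """Move regions off overloaded servers, distributing them round-robin
    across underloaded servers rather than filling one up at a time."""
    overloaded = {s: r for s, r in servers.items() if len(r) > avg_load}
    underloaded = [s for s, r in servers.items() if len(r) < avg_load]
    moves = []
    i = 0  # round-robin index over the underloaded servers
    for src, regions in overloaded.items():
        while len(regions) > avg_load and underloaded:
            dst = underloaded[i % len(underloaded)]
            region = regions.pop()
            servers[dst].append(region)
            moves.append((region, src, dst))
            if len(servers[dst]) >= avg_load:
                underloaded.remove(dst)  # this server is now full enough
            else:
                i += 1  # advance to the next underloaded server
    return moves

servers = {"rs1": ["r1", "r2", "r3", "r4"], "rs2": [], "rs3": []}
moves = balance(servers, avg_load=2)
# rs1 sheds two regions; they land on rs2 and rs3, one each,
# instead of both going to rs2.
```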

HBase load balancing has also been discussed in this (older) conversation on the mailing list.

Original title and link: HBase Load Balancing Explained (NoSQL databases © myNoSQL)


Optimizing MongoDB: Lessons Learned at Localytics

These slides have generated quite a reaction on Twitter. I’ll let you decide the reasons for yourself:

While there have been lots of retweets, here’s just a glimpse of what type of reactions I’m referring to:

Big Data: Achieve the Impossible in Real-Time

Jean-Pierre Dijcks (Oracle):

The main components in the big data platform provide:

  • Deep Analytics — a fully parallel, extensive and extensible toolbox full of advanced and novel statistical and data mining capabilities
  • High Agility — the ability to create temporary analytics environments in an end-user driven, yet secure and scalable environment to deliver new and novel insights to the operational business
  • Massive Scalability — the ability to scale analytics and sandboxes to previously unknown scales while leveraging previously untapped data potential
  • Low Latency — the ability to instantly act based on these advanced analytics in your operational, production environments

Big Data Platform

If I were being picky, the only thing I’d change would be the order: 1) low latency; 2) massive scalability; 3) high agility; 4) deep analytics.

Original title and link: Big Data: Achieve the Impossible in Real-Time (NoSQL databases © myNoSQL)


Where Riak Fits? Riak’s Sweetspot

Martin Schneider (Basho) trying to answer the question in the title:

Riak can be a data store to a purpose-built enterprise app; a caching layer for an Internet app, or part of the distributed fabric and DNA of a Global app. Those are of course highly arbitrary and vague examples, but it shows how flexible Riak is as a platform.

“Can be” is not quite equivalent to being the right solution, much less the best solution. And Martin’s answer to this is:

For super scalable enterprise and global apps — those where the data inside is inherently valuable and dependability of the system to capture, process and store data/writes is imperative — well I see Riak outperforming any perceived competitor in the space in providing value here.

But even for these scenarios, there’s competition from solutions like Cassandra, HBase, and Hypertable — the whole spectrum of scalable storage solutions based on Google BigTable and Amazon Dynamo being covered: HBase (a BigTable implementation), Cassandra (a solution using the BigTable data model and the Dynamo distributed model), and Riak (a solution based mainly on the Amazon Dynamo paper).

While Riak presents itself as the cleanest Dynamo-based solution, I would venture to say that both Cassandra and HBase come to the table with some interesting characteristics that cannot be ignored:

  1. Strong communities and community driven development processes — both HBase and Cassandra are top Apache Foundation projects
  2. Excellent integration with Hadoop, the leading batch processing solution. DataStax, the company offering services for Cassandra, went the extra-mile of creating a custom Hadoop solution, Brisk, making this integration even better.

Bottom line, I don’t think we can declare a winner in this space and I believe all three solutions will stay around for a while competing for every scenario requiring dependability of the system to capture, process and store data.

Original title and link: Where Riak Fits? Riak’s Sweetspot (NoSQL databases © myNoSQL)

The Social Graph Challenge

Nati Shalom (Gigaspaces) describes an approach to solving a large-scale graph problem:

  • Use Memory as the main storage
    • Random I/O access works much better on memory devices than on disk
  • Execute the code with the data - Using Real Time Map/Reduce
    • To reduce the number of iterations required to execute a particular query we use the executor API. The executor API enables us to push the code to the data. By doing that we can execute fairly complex data processing on the data node at memory speed vs network speed.
  • De-normalize the data
    • To reduce the amount of traversal access and network hops per query on the graph we need to copy elements of the graph into each node. For example the list of Friends and friends of friends (up to a certain degree) could be stored in each node and thus become available to any element of the graph without the need to consult with other nodes.
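The de-normalization idea above can be sketched as follows. This is purely illustrative: the names are invented, and a single-process dict stands in for a distributed in-memory data grid.

```python
# Illustrative sketch of de-normalizing friends-of-friends into each node,
# so a 2-hop query can be answered locally without network hops.
# A plain dict stands in for the distributed in-memory store.

graph = {
    "alice": {"friends": {"bob", "carol"}},
    "bob":   {"friends": {"alice", "dave"}},
    "carol": {"friends": {"alice"}},
    "dave":  {"friends": {"bob"}},
}

def denormalize(graph):
    """Copy each node's friends-of-friends into the node itself."""
    for node, data in graph.items():
        fof = set()
        for friend in data["friends"]:
            fof |= graph[friend]["friends"]
        fof -= {node}             # a node is not its own friend-of-friend
        fof -= data["friends"]    # direct friends are already stored
        data["friends_of_friends"] = fof

denormalize(graph)
# A 2-hop query for "alice" is now a local lookup:
# graph["alice"]["friends_of_friends"] -> {"dave"}
```

The trade-off is exactly the one raised in the comments below: every copied friends list is extra memory on every node that stores it.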

A couple of comments:

  • if all you have is memory, then you’ll have to replicate data at least 2 or 3 times. Result: more memory needed.
  • de-normalized data means even more memory
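A rough back-of-the-envelope illustrates how quickly this adds up. The dataset size, replication factor, and de-normalization overhead below are my own assumptions for illustration, not numbers from the talk.

```python
# Illustrative arithmetic only; all three inputs are assumptions.
raw_data_gb = 500              # assumed size of the raw graph
replication_factor = 3         # keep 3 in-memory copies for durability
denorm_overhead = 2.0          # assume de-normalized copies double the data

ram_needed_gb = raw_data_gb * replication_factor * denorm_overhead
# -> 3000 GB of RAM to hold 500 GB of raw graph data
```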

All of this boils down to an idea Nati has been advocating for a while: RAM is the new disk. But I don’t think it applies to Big Data.

Below is the complete video:

Original title and link: The Social Graph Challenge (NoSQL databases © myNoSQL)


Multi-tenancy and Cloud Storage Performance

Adrian Cockcroft[1] has a great explanation of the impact of multi-tenancy on cloud storage performance. The connection with NoSQL databases is not necessarily in the Amazon EBS and SSD Price, Performance, QoS comparison, but:


If you ever see public benchmarks of AWS that only use m1.small, they are useless, it shows that the people running the benchmark either didn’t know what they were doing or are deliberately trying to make some other system look better. You cannot expect to get consistent measurements of a system that has a very high probability of multi-tenant interference.
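A quick way to see Adrian’s point is to look at run-to-run variance. The sketch below is hypothetical: the timings are invented, and the coefficient of variation is just one simple way to quantify how noisy repeated measurements are.

```python
# Hypothetical illustration: the same benchmark repeated on a quiet
# dedicated box vs. a noisy multi-tenant instance. Numbers are invented.
import statistics

def coefficient_of_variation(samples):
    """Stddev relative to the mean; a high value means the measurements
    are dominated by interference, not by the system under test."""
    return statistics.stdev(samples) / statistics.mean(samples)

quiet = [102, 99, 101, 100, 98]    # ms per run, dedicated hardware
noisy = [95, 210, 130, 400, 105]   # ms per run, shared m1.small

cv_quiet = coefficient_of_variation(quiet)
cv_noisy = coefficient_of_variation(noisy)
# cv_quiet is a couple of percent; cv_noisy is far larger, so any single
# run on the noisy instance tells you very little about the database.
```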

  1. Adrian Cockcroft: Netflix, @adrianco  

Original title and link: Multi-tenancy and Cloud Storage Performance (NoSQL databases © myNoSQL)


Scaling an RDBMS in 6 Steps

From Gavin Heavyside’s slides:

  • Launch successful service
  • Read saturation: add caching
  • Write saturation: add hardware
  • Queries slow down: denormalize
  • Reads still too slow: prematerialise common queries, stop joining
  • Writes too slow: drop secondary indexes and triggers

Scaling an RDBMS in 6 Steps
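The second step, adding a cache to relieve read saturation, is the classic read-through pattern. The sketch below is illustrative only: plain dicts stand in for the RDBMS and for a cache like memcached, and the key and data are invented.

```python
# Minimal read-through cache sketch for the "add caching" step.
# Plain dicts stand in for an RDBMS and a memcached-style cache.

database = {"user:1": {"name": "Ada"}}   # pretend this lookup is expensive
cache = {}
db_hits = 0

def get(key):
    """Serve reads from the cache; fall back to the database on a miss."""
    global db_hits
    if key in cache:
        return cache[key]       # cache hit: no database round trip
    db_hits += 1
    value = database[key]       # the expensive round trip
    cache[key] = value          # populate the cache for next time
    return value

get("user:1")   # miss: goes to the database
get("user:1")   # hit: served from the cache; db_hits stays at 1
```

The later steps in the list are what this blog keeps coming back to: once you are dropping joins, secondary indexes, and triggers, you have already given up much of what the relational model was buying you.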

Original title and link: Scaling an RDBMS in 6 Steps (NoSQL databases © myNoSQL)

Scale Fail

Josh Berkus:

Better than “MongoDB is Web Scale”.

Original title and link: Scale Fail (NoSQL databases © myNoSQL)