


Defining Linearizability and Serializability

Peter Bailis provides definitions for linearizability and serializability in plain English:

One of the reasons these definitions are so confusing is that linearizability hails from the distributed systems and concurrent programming communities, and serializability comes from the database community. Today, almost everyone uses both distributed systems and databases, which often leads to overloaded terminology (e.g., “consistency,” “atomicity”).

Original title and link: Defining Linearizability and Serializability (NoSQL database©myNoSQL)


When to Use a NoSQL Database

Gil Allouche:

If your company has complicated, large sets of data that it’s looking to analyze, and that data isn’t simple, structured or predictable data then SQL is not going to meet your needs. While SQL specializes in many things, large amounts of unstructured data is not one of those areas. There are other methods for gathering and analyzing your data that will be much more effective and efficient and probably cost you less too.

It fascinates me how our industry is still looking for generic blueprints for making technical decisions. Based on your own experience, how many times has this worked? How many times have you been able to make a decision (leading to a successful project) based on a checklist? I can understand that checklists are useful in reducing the initial search area, but the rest should always be based on a combination of experience, learning and understanding, and trial and error. It doesn’t sound scientific, but I’d argue it’s more scientific than a generic checklist.

Original title and link: When to Use a NoSQL Database (NoSQL database©myNoSQL)


Whitepaper clarifies ACID support in Aerospike [sponsor]

A new whitepaper from Aerospike, myNoSQL’s supporter:

Srini Srinivasan, author and Aerospike VP of Engineering & Operations, has published a new whitepaper in which he explains how Aerospike maintains high consistency by using techniques to reduce the possibility of partitions.

Download the whitepaper ACID Support in Aerospike (PDF) or read it below.

Hybrid Logical Clocks: Logical clocks and Physical time

Murat Demirbas:

In our recent work (in collaboration with Sandeep Kulkarni at Michigan State University), we introduce Hybrid Logical Clocks (HLC). HLC captures the causality relationship like LC, and enables easy identification of consistent snapshots in distributed systems. Dually, HLC can be used in lieu of PT clocks since it maintains its logical clock to be always close to the PT clock.

Many distributed systems depend on ordering events and in many cases time is the way this ordering is achieved. Spanner’s TrueTime is probably the most “famous” example.
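
To make the idea a bit more concrete, here is a minimal Python sketch of a hybrid logical clock as I understand the update rules described for HLC. It is only an illustration under assumptions, not the authors' implementation: the class name, the method names (now, update), and the millisecond wall clock are all my own choices. A timestamp is a pair (l, c), where l tracks the largest physical time seen so far and c is a counter that breaks ties when l does not advance.

```python
import time
from typing import Callable, Tuple

# A timestamp is (l, c): l follows physical time, c orders events sharing the same l.
Timestamp = Tuple[int, int]


def wall_clock_ms() -> int:
    """Physical time in milliseconds (an assumption; any monotonic-ish source works)."""
    return int(time.time() * 1000)


class HybridLogicalClock:
    """Illustrative hybrid logical clock; names are hypothetical, not from the paper."""

    def __init__(self, physical_time: Callable[[], int] = wall_clock_ms) -> None:
        self.physical_time = physical_time
        self.l = 0  # largest physical time observed so far
        self.c = 0  # logical counter used when l does not move forward

    def now(self) -> Timestamp:
        """Timestamp a local or send event."""
        pt = self.physical_time()
        prev_l = self.l
        self.l = max(prev_l, pt)
        # If physical time did not advance l, bump the counter; otherwise reset it.
        self.c = self.c + 1 if self.l == prev_l else 0
        return (self.l, self.c)

    def update(self, remote: Timestamp) -> Timestamp:
        """Timestamp a receive event, merging the sender's timestamp."""
        remote_l, remote_c = remote
        pt = self.physical_time()
        prev_l = self.l
        self.l = max(prev_l, remote_l, pt)
        if self.l == prev_l and self.l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif self.l == prev_l:
            self.c += 1
        elif self.l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0
        return (self.l, self.c)
```

Because timestamps are plain tuples, comparing them with the usual tuple ordering gives an order consistent with causality: if one event happened before another, its timestamp compares lower. The l component never falls behind the local physical clock, which is what lets it stand in for a physical timestamp.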


Building a self-serve platform for Hadoop

What big users, in this case Pinterest, would ideally want from Hadoop:

Though Hadoop is a powerful processing and storage system, it’s not a plug and play technology. Because it doesn’t have cloud or elastic computing, or non-technical users in mind, its original design falls short as a self-serve platform. Fortunately there are many Hadoop libraries/applications and service providers that offer solutions to these limitations. Before choosing from these solutions, we mapped out our Hadoop setup requirements.

If you go through the 7 items listed in this post, you’ll have to agree that none sounds unreasonable. Some of these requirements might be Pinterest-specific, or at least derived from their size, but I can see how each of them would simplify things. On the other hand, I’m not aware of work being done in any of these areas (nb: security is a hairy topic and everyone wants exactly what they are using).

Original title and link: Building a self-serve platform for Hadoop (NoSQL database©myNoSQL)


Introducing Ark: A Consensus Algorithm For TokuMX and MongoDB

Zardosht Kasheff from Tokutek:

Ark is an implementation of a consensus algorithm (also known as elections) similar to Paxos and Raft that we are working on to handle replica set elections and failovers in TokuMX. It has many similarities to Raft, but also has some big differences.

The paper is unfortunately not very readable as it’s constructed as “the patched version of the current protocol”.


Whitepaper Clarifies ACID Support in Aerospike [sponsor]

Aerospike, myNoSQL’s long-time supporter, has published a new paper about ACID support in Aerospike. Check out the details below:

In our latest whitepaper, author and Aerospike VP of Engineering & Operations, Srini Srinivasan, defines ACID support in Aerospike, and explains how Aerospike maintains high consistency by using techniques to reduce the possibility of partitions.

Read the whitepaper:

Original title and link: Whitepaper Clarifies ACID Support in Aerospike [sponsor] (NoSQL database©myNoSQL)

7 books for Machine Learning with R

Jason Brownlee put together a list of 7 machine learning books that make use of R:

In this post I want to point out some resources you can use to get started in R for machine learning.

Original title and link: 7 books for Machine Learning with R (NoSQL database©myNoSQL)


SQL-on-Hadoop: Pivotal HAWQ benchmark

The results bore out Pivotal’s statement that HAWQ is the world’s fastest SQL query engine on Hadoop […] The paper, titled “Orca: A Modular Query Optimizer Architecture for Big Data,” includes benchmark results based on the TPC-DS, a well-known decision support benchmark that models several generally applicable aspects of a decision support system.

Pivotal’s SQL-on-Hadoop solution is based on a cost-based query optimizer.


Spark Summit 2014 roundup

I wasn’t at the Spark Summit, and even though the complete event was streamed online, my schedule didn’t allow me to watch more than a couple of keynotes. Thomas Dinsmore’s notes about the event give a good idea of what happened there.

One thing that caught my attention immediately:

Last December, the 2013 Spark Summit pulled 450 attendees for a two-day event. Six months later, the Spark Summit 2014 sold out at more than a thousand seats for a three-day affair.

Original title and link: Spark Summit 2014 roundup (NoSQL database©myNoSQL)


The expanding alternative universe of Hadoop

Merv Adrian:

Hadoop has moved from a coarse-grained blunt instrument for largely ETL-style workloads to an expanding stack for virtually any IT task big data professionals will want to undertake. What is Hadoop now? It’s a candidate to be the alternative universe for data processing, with over 20 components that span a wide array of functions.

As the Hadoop alternative universe is expanding, its complexity continues to grow too. The whole purpose of “Big data platforms” from Cloudera and Hortonworks is to make this universe navigable, but it feels like the majority of travelers still need a lot of patience and courage to discover it.

Original title and link: The expanding alternative universe of Hadoop (NoSQL database©myNoSQL)



Benchmark(et)ing

Mark Callaghan:

Benchmarketing is a common activity for many DBMS products whether they are closed or open source. Most products need new users to maintain viability and marketing is part of the process. The goal for benchmarketing is to show that A is better than B. Either by accident or on purpose good benchmarketing results focus on the message A is better than B rather than A is better than B in this context. Note that the context can be critical and includes the hardware, workload, whether both systems were properly configured and some attempt to explain why one system was faster.

He’s very right about every aspect in the post.

The only small edit I’d make is to emphasize once more that the context is critical: if it is left out, the benchmark loses its value.

Original title and link: Benchmark(et)ing (NoSQL database©myNoSQL)