


Introducing Ark: A Consensus Algorithm For TokuMX and MongoDB

Zardosht Kasheff from Tokutek:

Ark is an implementation of a consensus algorithm (also known as elections) similar to Paxos and Raft that we are working on to handle replica set elections and failovers in TokuMX. It has many similarities to Raft, but also has some big differences.

The paper is unfortunately not very readable as it’s constructed as “the patched version of the current protocol”.


Whitepaper Clarifies ACID Support in Aerospike [sponsor]

Aerospike, myNoSQL’s long time supporter, has published a new paper about ACID support in Aerospike. Check out the details below:

In our latest whitepaper, author and Aerospike VP of Engineering & Operations, Srini Srinivasan, defines ACID support in Aerospike, and explains how Aerospike maintains high consistency by using techniques to reduce the possibility of partitions.

Read the whitepaper:

Original title and link: Whitepaper Clarifies ACID Support in Aerospike [sponsor] (NoSQL database©myNoSQL)

7 books for Machine Learning with R

Jason Brownlee put together a list of 7 machine learning books that make use of R:

In this post I want to point out some resources you can use to get started in R for machine learning.

Original title and link: 7 books for Machine Learning with R (NoSQL database©myNoSQL)


SQL-on-Hadoop: Pivotal HAWQ benchmark.

The results bore out Pivotal’s statement that HAWQ is the world’s fastest SQL query engine on Hadoop […] The paper, titled “Orca: A Modular Query Optimizer Architecture for Big Data,” includes benchmark results based on the TPC-DS, a well-known decision support benchmark that models several generally applicable aspects of a decision support system.

Pivotal’s SQL-on-Hadoop solution is based on a cost-based query optimizer.


Spark Summit 2014 roundup

I wasn’t at the Spark Summit, and even though the complete event was streamed online, my schedule hasn’t allowed me to watch more than a couple of keynotes. Thomas Dinsmore’s notes about the event were a useful way to get an idea of what happened there.

One thing that caught my attention immediately:

Last December, the 2013 Spark Summit pulled 450 attendees for a two-day event. Six months later, the Spark Summit 2014 sold out at more than a thousand seats for a three-day affair.

Original title and link: Spark Summit 2014 roundup (NoSQL database©myNoSQL)


The expanding alternative universe of Hadoop

Merv Adrian:

Hadoop has moved from a coarse-grained blunt instrument for largely ETL-style workloads to an expanding stack for virtually any IT task big data professionals will want to undertake. What is Hadoop now? It’s a candidate to be the alternative universe for data processing, with over 20 components that span a wide array of functions.

As the Hadoop alternative universe expands, its complexity continues to grow too. The whole purpose of the “Big data platforms” from Cloudera and Hortonworks is to make this universe navigable, but it feels like the majority of travelers still need a lot of patience and courage to discover it.

Original title and link: The expanding alternative universe of Hadoop (NoSQL database©myNoSQL)



Benchmark(et)ing

Mark Callaghan:

Benchmarketing is a common activity for many DBMS products whether they are closed or open source. Most products need new users to maintain viability and marketing is part of the process. The goal for benchmarketing is to show that A is better than B. Either by accident or on purpose good benchmarketing results focus on the message A is better than B rather than A is better than B in this context. Note that the context can be critical and includes the hardware, workload, whether both systems were properly configured and some attempt to explain why one system was faster.

He’s very right about every aspect in the post.

Maybe the only small edit I’d make would be to emphasize once more that the context is critical: if it is left out, the benchmark loses its value.
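One way to make the "A is better than B in this context" point concrete is to refuse to print a comparison unless the context travels with it. The following is only a minimal sketch under my own assumptions: the class names, field names, and the sample numbers are hypothetical and do not come from Mark Callaghan's post. The context fields mirror the items he lists (hardware, workload, configuration of both systems, and an attempt at an explanation).

```python
from dataclasses import dataclass


@dataclass
class BenchmarkContext:
    """Hypothetical container for the context a benchmark result needs."""
    hardware: str
    workload: str
    config_notes: dict  # system name -> how that system was configured
    explanation: str    # attempt to explain why one system was faster


@dataclass
class BenchmarkResult:
    """Hypothetical A-vs-B throughput comparison with mandatory context."""
    system_a: str
    system_b: str
    ops_per_sec_a: float
    ops_per_sec_b: float
    context: BenchmarkContext

    def summary(self) -> str:
        ctx = self.context
        # Collect every piece of context that is empty or absent.
        missing = [name for name in ("hardware", "workload", "explanation")
                   if not getattr(ctx, name).strip()]
        missing += [f"config for {system}"
                    for system in (self.system_a, self.system_b)
                    if not ctx.config_notes.get(system, "").strip()]
        if missing:
            raise ValueError("refusing to report without context: "
                             + ", ".join(missing))
        if self.ops_per_sec_a <= 0 or self.ops_per_sec_b <= 0:
            raise ValueError("throughput values must be positive")
        # Order the two systems by throughput, highest first.
        (fast_ops, fast), (slow_ops, slow) = sorted(
            [(self.ops_per_sec_a, self.system_a),
             (self.ops_per_sec_b, self.system_b)],
            reverse=True,
        )
        return (f"{fast} has {fast_ops / slow_ops:.2f}x the throughput of {slow} "
                f"on workload '{ctx.workload}' (hardware: {ctx.hardware}; "
                f"why: {ctx.explanation})")
```

Calling `summary()` on a result whose context has an empty hardware description, or no configuration note for one of the two systems, raises a `ValueError` instead of producing a bare "A beats B" line.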

Original title and link: Benchmark(et)ing (NoSQL database©myNoSQL)


Dell and Cloudera and Intel join forces for appliances

Me in “Intel kills a Hadoop and feeds another”:

As for Intel, what if this investment also sealed an exclusive deal for Hadoop-centric Cloudera-supported Intel-powered appliance?

I didn’t know about the existing Dell-Cloudera-Intel partnership, but it is reinforced by the recent announcement of an in-memory appliance.

Since 2011, Cloudera, Dell and Intel have built pre-validated reference architectures for Hadoop. […]

The Dell In-Memory Appliances for Cloudera Enterprise is yet another proof point of the collaboration and synergies between the three companies. As the first of a family of appliances, it includes leading Dell hardware, Cloudera’s enterprise data hub -based on Cloudera Enterprise, Intel architecture for fast processing, and ScaleMP’s Versatile SMP (vSMP) architecture to aggregate multiple x86 servers into a single virtual machine to create large memory pools for in-memory processing.

Original title and link: Dell and Cloudera and Intel join forces for appliances (NoSQL database©myNoSQL)

Beating the CAP Theorem Checklist

Your ( ) tweet ( ) blog post ( ) marketing material ( ) online comment advocates a way to beat the CAP theorem. Your idea will not work. Here is why it won’t work:

Andrei Savu

Original title and link: Beating the CAP Theorem Checklist (NoSQL database©myNoSQL)


Aerospike: One week of being open source

Brian Bulkowski, co-founder and CTO of Aerospike[1], about the recent announcement of open sourcing Aerospike (and a new round of funding):

We didn’t want to open source too early and lose the benefits of focus – nor too late and lose the benefits of broad adoption.


I believe Aerospike’s unique open source strategy has the opportunity to deliver a higher quality open source project than has been delivered in the past.

I was trying earlier this week to remember another project going this route[2].

  1. Disclaimer: Aerospike has been a long-time supporter of myNoSQL (and I’m very thankful for that). 

  2. I’m not talking here about TextMate’s open source abandonware.

Original title and link: Aerospike: One week of being open source (NoSQL database©myNoSQL)


Using Elastic MapReduce as a generic Hadoop cluster manager

Steve McPherson for the AWS Blog:

Despite the name Elastic MapReduce, the service goes far beyond batch-oriented processing. Clusters in EMR have a flexible and rich cluster-management framework that users can customize to run any Hadoop ecosystem application such as low-latency query engines like Hbase (with Phoenix), Impala, Spark/Shark and machine learning frameworks like Mahout. These additional components can be installed using Bootstrap Actions or Steps.

Operational simplicity is critical in the early days of many companies, when avoiding large hardware investments and saving time matter so much. Amazon is building a huge data ecosystem to convince its users to stay even afterwards (the more data you put in, the more difficult it is to move it out later).
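To give an idea of what "installed using Bootstrap Actions or Steps" can look like, here is a rough sketch that only builds a fragment of a cluster request as a plain Python dictionary. The key names follow the shape of the EMR `RunJobFlow` request as exposed by boto3's `run_job_flow`, to the best of my knowledge. The function name, the bucket, the script path, and the arguments are all hypothetical, and the fragment leaves out required parts of a real request (such as the instance configuration), so treat it as an illustration rather than something to submit as is.

```python
def build_cluster_request(name, bootstrap_script, bootstrap_args):
    """Build a partial, illustrative EMR cluster request (hypothetical helper).

    Nothing is sent to AWS here; the dictionary would still need instance
    settings, roles, and a release before it could be used for a real cluster.
    """
    return {
        "Name": name,
        # Bootstrap actions run a script on the nodes while the cluster
        # is being provisioned, e.g. to install an extra component.
        "BootstrapActions": [
            {
                "Name": "install-extra-component",
                "ScriptBootstrapAction": {
                    "Path": bootstrap_script,      # hypothetical S3 location
                    "Args": list(bootstrap_args),  # hypothetical arguments
                },
            }
        ],
        # Steps run in order once the cluster is up.
        "Steps": [
            {
                "Name": "post-install-step",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["bash", "-c", "echo cluster ready"],
                },
            }
        ],
    }


request = build_cluster_request(
    "demo-cluster",
    "s3://example-bucket/install-component.sh",
    ["--version", "1.0"],
)
```

The split reflects the quoted description: a bootstrap action changes what is installed on the nodes, while a step is work submitted to the running cluster.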

Original title and link: Using Elastic MapReduce as a generic Hadoop cluster manager (NoSQL database©myNoSQL)


Three questions about MapR and their products.

There are three things that I’d really appreciate some help understanding:

  1. MapR says it is an Apache Hadoop distribution. Does any of the MapR products include the

    While I know there’s no definition of such a thing, as far as I know self-claimed API compatibility is by no means the same thing as Apache Hadoop.

    I’m also not aware of any action from ASF on this matter.

  2. MapR says it’s the most complete distribution of Hadoop. The matrix below, from Kirill Grigorchuk’s summary of Altoros’s Hadoop Distributions: Cloudera vs. Hortonworks vs. MapR paper, doesn’t seem to confirm this.

    Hadoop distros compared: Cloudera vs Hortonworks vs MapR

  3. MapR says it is committed to open source. I’ve checked the list of committers for Apache Hadoop, Apache HBase, Apache Pig, and Apache ZooKeeper, and except for Ted Dunning’s PMC role in Apache ZooKeeper, I couldn’t find any MapR employee listed.

Original title and link: Three questions about MapR and their products. (NoSQL database©myNoSQL)