


Dell and Cloudera and Intel join forces for appliances

Me in Intel kills a Hadoop and feeds another:

As for Intel, what if this investment also sealed an exclusive deal for a Hadoop-centric, Cloudera-supported, Intel-powered appliance?

I didn’t know about the existing Dell-Cloudera-Intel partnership, but it is reinforced by the recent announcement of an in-memory appliance.

Since 2011, Cloudera, Dell and Intel have built pre-validated reference architectures for Hadoop. […]

The Dell In-Memory Appliances for Cloudera Enterprise is yet another proof point of the collaboration and synergies between the three companies. As the first of a family of appliances, it includes leading Dell hardware, Cloudera’s enterprise data hub (based on Cloudera Enterprise), Intel architecture for fast processing, and ScaleMP’s Versatile SMP (vSMP) architecture to aggregate multiple x86 servers into a single virtual machine to create large memory pools for in-memory processing.

Original title and link: Dell and Cloudera and Intel join forces for appliances (NoSQL database©myNoSQL)

Beating the CAP Theorem Checklist

Your ( ) tweet ( ) blog post ( ) marketing material ( ) online comment advocates a way to beat the CAP theorem. Your idea will not work. Here is why it won’t work:

Andrei Savu



Aerospike: One week of being open source

Brian Bulkowski, co-founder and CTO of Aerospike [1], on the recent announcement of open sourcing Aerospike (and a new round of funding):

We didn’t want to open source too early and lose the benefits of focus – nor too late and lose the benefits of broad adoption.


I believe Aerospike’s unique open source strategy has the opportunity to deliver a higher quality open source project than has been delivered in the past.

I was trying earlier this week to remember another project going this route [2].

  1. Disclaimer: Aerospike has been a long-time supporter of myNoSQL (and I’m very thankful for that). 

  2. I’m not talking here about TextMate’s open source abandonware.



Using Elastic MapReduce as a generic Hadoop cluster manager

Steve McPherson for the AWS Blog:

Despite the name Elastic MapReduce, the service goes far beyond batch-oriented processing. Clusters in EMR have a flexible and rich cluster-management framework that users can customize to run any Hadoop ecosystem application such as low-latency query engines like HBase (with Phoenix), Impala, Spark/Shark and machine learning frameworks like Mahout. These additional components can be installed using Bootstrap Actions or Steps.

Operational simplicity is critical in the early days of many companies, when large hardware investments and time are hard to afford. Amazon is building a huge data ecosystem to convince its users to stay even afterwards (the more data you put in, the more difficult it is to move it out later).



Three questions about MapR and their products.

There are three things that I’d really appreciate some help understanding:

  1. MapR says it is an Apache Hadoop distribution. Does any of the MapR products include the

    While I know there’s no definition of such a thing, as far as I know self-claimed API compatibility is by no means the same thing as Apache Hadoop.

    I’m also not aware of any action from ASF on this matter.

  2. MapR says it’s the most complete distribution of Hadoop. The matrix below, from Kirill Grigorchuk’s summary of Altoros’s Hadoop Distributions: Cloudera vs. Hortonworks vs. MapR paper, doesn’t seem to confirm this.

    [Figure: Hadoop distros compared: Cloudera vs Hortonworks vs MapR]

  3. MapR says it is committed to open source. I’ve checked the lists of committers for Apache Hadoop, Apache HBase, Apache Pig, and Apache ZooKeeper and, except for Ted Dunning’s PMC role in Apache ZooKeeper, I couldn’t find any MapR employee listed.


Big Data benchmark: Redshift, Hive, Impala, Shark, Stinger/Tez

Hosted by AMPLab, the origin of Spark, this benchmark compares Redshift, Hive, Shark, Impala, and Stinger/Tez:

Several analytic frameworks have been announced in the last year. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ) and systems which optimize MapReduce to improve performance on analytical workloads (Shark, Stinger/Tez). This benchmark provides quantitative and qualitative comparisons of five systems. It is entirely hosted on EC2 and can be reproduced directly from your computer.

More important than the results:

  1. the clear methodology
  2. and its reproducibility



Cayley: an open-source graph database

From the GitHub repo:

Cayley is an open-source graph inspired by the graph database behind Freebase and Google’s Knowledge Graph.

  • Written in Go
  • Easy to get running (3 or 4 commands, below)
  • RESTful API
      ◦ or a REPL if you prefer
  • Built-in query editor and visualizer
  • Multiple query languages:
      ◦ JavaScript, with a Gremlin-inspired graph object.
      ◦ (simplified) MQL, for Freebase fans
  • Plays well with multiple backend stores:
      ◦ LevelDB for single-machine storage
      ◦ MongoDB
      ◦ In-memory, ephemeral
  • Modular design; easy to extend with new languages and backends
  • Good test coverage
  • Speed, where possible.

✚ What’s interesting is that, even though it lives under Google’s GitHub account, the project is not backed by Google.

✚ The Hacker News thread focuses on the existing graph database market.


LinkedIn's new search platform

In this post introducing the new search solution implemented at LinkedIn, you can find a pretty good list of the requirements for a good search tool, presented as the showstoppers hit with the previous solution:

  • Rebuilding a complete index is extremely difficult
  • Live updates are at an entity granularity
  • Scoring is inflexible
  • Too many small open source components

On top of these, add flexibility and extensibility, something that is important for every critical component, but much more so for search, which depends so heavily on format, behavior, and fine-tuning.

The rest of the post dives into some details of the new solution, which is a distributed layer of extensions on top of Lucene, code named Galene.



Visualizing Algorithms

Mike Bostock:

Algorithms are a fascinating use case for visualization. To visualize an algorithm, we don’t merely fit data to a chart; there is no primary dataset. Instead there are logical rules that describe behavior. This may be why algorithm visualizations are so unusual, as designers experiment with novel forms to better communicate. This is reason enough to study them.

You are in for a BIG treat. Set aside at least 30 minutes to savor this article.



Aerospike in-Memory NoSQL database is now Open Source [sponsor]

Big news coming from myNoSQL’s supporters Aerospike:

Aerospike in-memory NoSQL database is now open-source.

Read the news and see who scales with Aerospike. Check out the code on GitHub!


Moving product recommendations from Hadoop to Redshift saves us time and money

Our old relational data warehousing solution, Hive, was not performant enough for us to generate product recommendations in SQL in our configuration.

This right here describes the common theme across all the “Redshift is so much faster and cheaper than Hive” posts: expecting a relational data warehouse from Hadoop and Hive. You tell me if that’s the right expectation.

Here are other similar “revelations”:



Neo4j unit testing with GraphUnit

Testing the state of an Embedded Neo4j database is now much easier if you use GraphUnit, a component of the GraphAware Neo4j Framework.

Interesting approach. The only downside I can see at first glance is that it might become a maintenance nightmare if your model evolves and the data changes.
