
Using Elastic MapReduce as a generic Hadoop cluster manager

Steve McPherson for the AWS Blog:

Despite the name Elastic MapReduce, the service goes far beyond batch-oriented processing. Clusters in EMR have a flexible and rich cluster-management framework that users can customize to run any Hadoop ecosystem application, such as low-latency query engines like HBase (with Phoenix), Impala, Spark/Shark, and machine learning frameworks like Mahout. These additional components can be installed using Bootstrap Actions or Steps.
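
✚ As a rough illustration of what that looks like in practice, here is a minimal sketch of launching a cluster with a bootstrap action using boto3; the S3 script path, bucket, and instance settings are hypothetical placeholders, not values from the AWS post:

```python
# Minimal sketch: an EMR cluster with a bootstrap action that installs an
# extra Hadoop-ecosystem component. The S3 script path, instance types, and
# counts below are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="hadoop-cluster-with-extras",
    ReleaseLabel="emr-4.0.0",                 # assumed release label
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # keep it as a long-running cluster
    },
    BootstrapActions=[
        {
            "Name": "install-extra-component",
            "ScriptBootstrapAction": {
                # Hypothetical install script stored in your own bucket
                "Path": "s3://my-bucket/bootstrap/install-component.sh",
                "Args": ["--version", "latest"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```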

Operational simplicity is critical in the early days of many companies, when large hardware investments and time matter so much. Amazon is building a huge data ecosystem to convince its users to stay even afterwards (the more data you put in, the more difficult it is to move it out later).

Original title and link: Using Elastic MapReduce as a generic Hadoop cluster manager (NoSQL database©myNoSQL)

via: http://aws.amazon.com/blogs/aws/emr-as-generic-hadoop-clister-manager/


Three questions about MapR and their products.

There are three things that I’d really appreciate some help understanding:

  1. MapR says it is an Apache Hadoop distribution. Do any of the MapR products actually include Apache Hadoop, rather than an API-compatible reimplementation?

    While I know there’s no definition of such a thing, as far as I know self-claimed API compatibility is by no means the same thing as Apache Hadoop.

    I’m also not aware of any action from ASF on this matter.

  2. MapR says it’s the most complete distribution of Hadoop. The matrix below, from Kirill Grigorchuk’s summary of Altoros’s Hadoop Distributions: Cloudera vs. Hortonworks vs. MapR paper, doesn’t seem to confirm this.

    [Image: Hadoop distros compared: Cloudera vs Hortonworks vs MapR]

  3. MapR says it is committed to open source. I’ve checked the list of committers for Apache Hadoop, Apache HBase, Apache Pig, and Apache ZooKeeper, and except for Ted Dunning’s PMC role in Apache ZooKeeper, I couldn’t find any MapR employee listed.

Original title and link: Three questions about MapR and their products. (NoSQL database©myNoSQL)


Big Data benchmark: Redshift, Hive, Impala, Shark, Stinger/Tez

Hosted by AMPLab, the lab where Spark originated, this benchmark compares Redshift, Hive, Impala, Shark, and Stinger/Tez:

Several analytic frameworks have been announced in the last year. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ) and systems which optimize MapReduce to improve performance on analytical workloads (Shark, Stinger/Tez). This benchmark provides quantitative and qualitative comparisons of five systems. It is entirely hosted on EC2 and can be reproduced directly from your computer.

More important than the results:

  1. the clear methodology
  2. and its reproducibility

Original title and link: Big Data benchmark: Redshift, Hive, Impala, Shark, Stinger/Tez (NoSQL database©myNoSQL)

via: https://amplab.cs.berkeley.edu/benchmark/


Cayley: an open-source graph database

From the GitHub repo:

Cayley is an open-source graph database inspired by the graph database behind Freebase and Google’s Knowledge Graph.

  • Written in Go
  • Easy to get running (3 or 4 commands, below)
  • RESTful API, or a REPL if you prefer
  • Built-in query editor and visualizer
  • Multiple query languages:
    • JavaScript, with a Gremlin-inspired graph object
    • (simplified) MQL, for Freebase fans
  • Plays well with multiple backend stores:
    • LevelDB for single-machine storage
    • MongoDB
    • In-memory, ephemeral
  • Modular design; easy to extend with new languages and backends
  • Good test coverage
  • Speed, where possible.
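
✚ To make the RESTful API and the Gremlin-inspired JavaScript dialect a bit more concrete, here is a minimal sketch of posting a query to a locally running Cayley instance; the default port and the /api/v1/query/gremlin endpoint are taken from my reading of the README, and the data is made up:

```python
# Minimal sketch: querying a local Cayley server over its HTTP API.
# Assumes the default port (64210) and the /api/v1/query/gremlin endpoint;
# the node and predicate names are made-up examples.
import json
import urllib.request

# Gremlin-inspired JavaScript query: everything "alice" follows.
query = 'g.V("alice").Out("follows").All()'

req = urllib.request.Request(
    "http://localhost:64210/api/v1/query/gremlin",
    data=query.encode("utf-8"),
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # expected shape: {"result": [...]}
```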

✚ What’s interesting is that even though it lives under Google’s GitHub account, the project is not officially backed by Google.

✚ The Hacker News thread focuses on the existing graph database market.

Original title and link: Cayley: an open-source graph database (NoSQL database©myNoSQL)


LinkedIn's new search platform

This post introducing the new search solution implemented at LinkedIn contains a pretty good list of requirements for a good search tool, in the form of the showstoppers hit with the previous solution:

  • Rebuilding a complete index is extremely difficult
  • Live updates are at an entity granularity
  • Scoring is inflexible
  • Too many small open source components

On top of these, add flexibility and extensibility, which matter for every critical component, but much more so for search, which depends so heavily on format, behavior, and fine tuning.

The rest of the post dives into some details of the new solution, a distributed layer of extensions on top of Lucene, code-named Galene.

Original title and link: LinkedIn’s new search platform (NoSQL database©myNoSQL)

via: https://engineering.linkedin.com/search/did-you-mean-galene


Visualizing Algorithms

Mike Bostock:

Algorithms are a fascinating use case for visualization. To visualize an algorithm, we don’t merely fit data to a chart; there is no primary dataset. Instead there are logical rules that describe behavior. This may be why algorithm visualizations are so unusual, as designers experiment with novel forms to better communicate. This is reason enough to study them.

You are in for a BIG treat. Set aside at least 30 minutes to savor this article.
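
✚ The core point, that the “dataset” is the algorithm’s own behavior, is easy to see in code: instead of plotting an input, you record the intermediate states the algorithm passes through and render those. A trivial sketch (insertion sort chosen arbitrarily):

```python
# Trivial sketch of the idea behind algorithm visualization: the "data"
# is the sequence of states the algorithm produces, not an external input.
import random

def insertion_sort_states(values):
    """Yield a snapshot of the list after every shift and every insertion."""
    data = list(values)
    for i in range(1, len(data)):
        key = data[i]
        j = i - 1
        while j >= 0 and data[j] > key:
            data[j + 1] = data[j]
            j -= 1
            yield list(data)   # one "frame" a visualization could render
        data[j + 1] = key
        yield list(data)

frames = list(insertion_sort_states(random.sample(range(100), 10)))
print(f"{len(frames)} frames captured")  # each frame could become one row of a chart
```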

Original title and link: Visualizing Algorithms (NoSQL database©myNoSQL)

via: http://bost.ocks.org/mike/algorithms/


Aerospike in-Memory NoSQL database is now Open Source [sponsor]

Big news coming from myNoSQL’s supporters Aerospike:


Aerospike in-memory NoSQL database is now open-source.

Read the news and see who scales with Aerospike. Check out the code on GitHub!

Original title and link: Aerospike in-Memory NoSQL database is now Open Source [sponsor] (NoSQL database©myNoSQL)


Moving product recommendations from Hadoop to Redshift saves us time and money

Our old relational data warehousing solution, Hive, was not performant enough for us to generate product recommendations in SQL in our configuration.

This right here describes the common theme across all the “Redshift is so much faster and cheaper than Hive” stories: expecting a relational data warehouse from Hadoop and Hive. You tell me if that’s the right expectation.

Here are other similar “revelations”:

Original title and link: Moving product recommendations from Hadoop to Redshift saves us time and money (NoSQL database©myNoSQL)

via: http://engineering.monetate.com/2014/06/18/moving-product-recommendations-from-hadoop-to-redshift-saves-us-time-and-money/


Neo4j unit testing with GraphUnit

Testing the state of an Embedded Neo4j database is now much easier if you use GraphUnit, a component of the GraphAware Neo4j Framework.

Interesting approach. The only downside I could see at first glance is that it might become a maintenance nightmare if your model evolves and data changes.
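
✚ GraphUnit itself is Java, so the snippet below is only a conceptual Python analogue of the idea, asserting the state of the graph after the code under test runs, using the plain Neo4j driver; the labels, properties, and connection details are hypothetical:

```python
# Conceptual sketch (not GraphUnit's actual Java API): a test that checks
# the state of a Neo4j database after the code under test has run.
# Labels, properties, and connection details are hypothetical.
from neo4j import GraphDatabase

def friend_count(session, name):
    record = session.run(
        "MATCH (:Person {name: $name})-[:FRIEND_OF]->(f) RETURN count(f) AS c",
        name=name,
    ).single()
    return record["c"]

def test_friendship_is_created():
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test"))
    with driver.session() as session:
        session.run("MATCH (n) DETACH DELETE n")  # start from a known, empty state
        # ... run the code under test; here it is inlined for the sketch ...
        session.run(
            "MERGE (a:Person {name: 'Ann'}) MERGE (b:Person {name: 'Bob'}) "
            "MERGE (a)-[:FRIEND_OF]->(b)"
        )
        # The assertion is the fragile part: it has to be updated every time
        # the model changes, which is the maintenance concern mentioned above.
        assert friend_count(session, "Ann") == 1
    driver.close()
```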

Original title and link: Neo4j unit testing with GraphUnit (NoSQL database©myNoSQL)

via: http://thought-bytes.blogspot.in/2014/06/neo4j-unit-testing-with-graphunit.html


Enterprise-class NoSQL

What is distinctive about an enterprise-class NoSQL database is its support for additional enterprise-scale application requirements, namely: ACID (atomic, consistent, isolated, and durable) transactions, government-grade security and elasticity, as well as automatic failover.

What is distinctive about an enterprise-class NoSQL database is what my company is selling.

If that were true, I doubt we would have any other databases around, considering MarkLogic’s age and perfect fit.

Snarky comments aside, enterprise requirements are so complicated, numerous, political, and sometimes non-technical that I don’t think anyone will ever be able to come up with a definition or checklist (even an extremely long one) of what’s enterprise-grade.

Original title and link: Enterprise-class NoSQL (NoSQL database©myNoSQL)

via: http://www.information-age.com/technology/information-management/123458126/putting-enterprise-nosql-acid-ambiguity-out


What does comprehensive security mean for Hadoop?

Hortonworks and their new security team explain the current status and their plans for a “holistic and comprehensive” security solution for Hadoop:

A comprehensive security approach means that irrespective of how the data is stored and accessed, there should be an integrated framework for securing data. Enterprises may adopt any use case (batch, real time, interactive), but data should be secured through the same standards, and security should be administered centrally and in one place.

✚ If you have only a couple of seconds, focus on the diagram under the section “HDP + XA - Current offering” and skim the following four sections: Authentication, Authorization, Auditing, and Data protection.

[Diagram: HDP security architecture]

✚ It’s safe to assume this post was meant to introduce Hortonworks’ position on Hadoop security as compared to Cloudera’s (and their collaboration on security aspects with Intel).

Original title and link: What does comprehensive security mean for Hadoop? (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/hortonworks-offers-holistic-comprehensive-security-hadoop/


RethinkDB 1.13: new protocol and push-pull APIs

There are some interesting changes and new features in RethinkDB 1.13, announced yesterday. Namely:

  • replacing the protocol buffers-based protocol with a JSON protocol

    • how does the JSON protocol manage the non-JSON data types?
    • how fast is a text-based protocol?
  • notifications about document changes

    I’ve always said this was the coolest feature in CouchDB and that every database should support it (a minimal sketch with the Python driver follows this list).

  • a weird¹ new http command to pull JSON data from the web
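
✚ Changefeeds are the most interesting of the three. A minimal sketch with the Python driver, where the table name and connection details are placeholders:

```python
# Minimal sketch of RethinkDB changefeeds with the Python driver.
# Table name and connection details are placeholders.
import rethinkdb as r

conn = r.connect(host="localhost", port=28015, db="test")

# Blocks and yields an {'old_val': ..., 'new_val': ...} document for every
# insert, update, or delete on the table: the push-style notifications
# mentioned in the list above.
for change in r.table("posts").changes().run(conn):
    print(change["old_val"], "->", change["new_val"])
```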

I’ve checked the RethinkDB stability report again, and I’m not sure it reads as “yep, RethinkDB is finally production ready”.


  1. Knowing the team there, I’m pretty sure this is coming from a use case I’m not seeing. 

Original title and link: RethinkDB 1.13: new protocol and push-pull APIs (NoSQL database©myNoSQL)