


The data analytics handbook

A free book based on interviews with data scientists, data analysts, and researchers. Available here.

Original title and link: The data analytics handbook (NoSQL database©myNoSQL)

MMS and the state of backups in MongoDB land

So just to be clear, if you are doing it yourself, you are probably settling for something other than a consistent snapshot. Even then, it’s not simple.

I’m always fascinated by companies introducing products by calling out how shitty and complicated their other products are. Axion. Now cleans 10 times better than before.

Original title and link: MMS and the state of backups in MongoDB land (NoSQL database©myNoSQL)


Four Easy Steps to Achieve 1 Million TPS on 1 Server using YCSB Benchmark [sponsor]

Words from myNoSQL’s supporters, Aerospike:

Last year, Aerospike published a ‘recipe’ describing how a database can be tuned to deliver 1 million TPS on a $5k server. This year, we simplified the recipe, applied it to Aerospike, and doubled performance using YCSB tests.

Find out how we did it in four easy steps:

Original title and link: Four Easy Steps to Achieve 1 Million TPS on 1 Server using YCSB Benchmark [sponsor] (NoSQL database©myNoSQL)

A proposal for more reliable locks using Redis

Salvatore Sanfilippo:

Can we have a fast and reliable system at the same time based on Redis? This blog post is an exploration in this area. I’ll try to describe a proposal for a simple algorithm to use N Redis instances for distributed and reliable locks, in the hope that the community may help me analyze and comment on the algorithm to see if this is a valid candidate.

As much as I like Redis, use this post as an exercise on how to reason about distributed locks and stick with ZooKeeper for the implementation.
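As an exercise in that spirit, the core of the proposal, acquiring the lock on a majority (quorum) of N independent instances, can be sketched with an in-memory stand-in for Redis. This is a toy illustration of the quorum idea only, not Salvatore’s implementation; `FakeRedisInstance` and its method names are invented for this sketch:

```python
import time
import uuid

class FakeRedisInstance:
    """In-memory stand-in for one Redis node (hypothetical, for illustration)."""
    def __init__(self):
        self.store = {}  # key -> (value, expiry_timestamp)

    def set_nx_px(self, key, value, ttl_ms):
        """Emulates SET key value NX PX ttl: set only if absent or expired."""
        now = time.time()
        current = self.store.get(key)
        if current is not None and current[1] > now:
            return False
        self.store[key] = (value, now + ttl_ms / 1000.0)
        return True

def acquire_lock(instances, resource, ttl_ms=10_000):
    """Try to take the lock on a quorum of N independent instances.
    (A real implementation would also release partial acquisitions
    on failure and account for time spent acquiring; omitted here.)"""
    token = str(uuid.uuid4())  # unique value identifies this lock holder
    acquired = sum(inst.set_nx_px(resource, token, ttl_ms) for inst in instances)
    quorum = len(instances) // 2 + 1
    return token if acquired >= quorum else None

instances = [FakeRedisInstance() for _ in range(5)]
t1 = acquire_lock(instances, "resource:42")
t2 = acquire_lock(instances, "resource:42")  # second caller must fail
print(t1 is not None, t2)  # True None
```

The sketch also makes the failure modes easy to see: everything hinges on TTLs and clocks, which is exactly the kind of reasoning the post invites you to scrutinize.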

Original title and link: A proposal for more reliable locks using Redis (NoSQL database©myNoSQL)


Cascading components for Big Data applications

Jules S. Damji in a quick intro to Cascading:

At the core of most data-driven applications is a data pipeline through which data flows, originating from Taps and Sources (ingestion) and ending in a Sink (retention) while undergoing transformation along a pipeline (Pipes, Traps, and Flows). And should something fail, a Trap (exception) must handle it. In the big data parlance, these are aspects of ETL operations.

You have to agree that, compared with the MapReduce model, these components could bring a lot of readability to your code. On the other hand, at first glance the Cascading API still feels verbose.
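The Tap → Pipe → Sink flow, with a Trap catching failed tuples, maps naturally onto a small pipeline. A loose Python analogy of those components (not the Cascading Java API; the names below just mirror its vocabulary):

```python
def run_flow(source, pipe, sink, trap):
    """Toy flow: read records from the source (Tap), transform them
    through the pipe, retain results in the sink; records that fail
    transformation are diverted to the trap instead of killing the flow."""
    for record in source:
        try:
            sink.append(pipe(record))
        except Exception as exc:
            trap.append((record, str(exc)))

source = ["3", "14", "oops", "15"]   # Tap/Source: ingestion
pipe = lambda s: int(s) * 2          # Pipe: transformation
sink, trap = [], []                  # Sink: retention; Trap: exceptions
run_flow(source, pipe, sink, trap)
print(sink)  # [6, 28, 30]
print(trap)  # [('oops', "invalid literal for int() with base 10: 'oops'")]
```

The readability argument is visible even in the toy version: the ETL shape of the job is stated once, up front, instead of being scattered across map and reduce functions.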

Original title and link: Cascading components for Big Data applications (NoSQL database©myNoSQL)


Play with data: Kinetica

Just wow!

Kinetica is a new app for visualizing and exploring data on tablets. Instead of forcing you to use a boring old spreadsheet, Kinetica lets you touch, sift, and play with your data in a physical environment. Each row of data becomes a circle that can be pulled like a magnet into charts, filtered through screens, and selectively highlighted.

Created by a team from Carnegie Mellon, Kinetica is an iPad app. The future of Tableau Software.

Original title and link: Play with data: Kinetica (NoSQL database©myNoSQL)


The era of the No-Design DataBase

Holger Mueller:

So could the common thread of the new database boom be the absence of a design component, the disposal of the schema design step that was and is key to the success of any relational database?


Original title and link: The era of the No-Design DataBase (NoSQL database©myNoSQL)


Merge and serialization functions for Riak

Tom Crayford (Yeller) describes how to test the merge and serialization functions used to resolve potential conflicts in Riak:

The way I prefer using Riak is with allow_mult=true. This means that whenever you have causally conflicting writes to a key, Riak will store all of them, and upon your next read of that key you have to resolve them yourself. Designing your datatypes such that you can merge them is a huge topic, and an area of active research. However, even once you have a merge strategy worked out, how can you be sure that your reasoning is good? The merge functions you use have to obey a few properties: they have to be commutative, idempotent and associative, or you’ll mess things up when you have conflicts.
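The three required laws are easy to check mechanically for a candidate merge function. Set union is the classic example of a function that satisfies all of them; a quick property check (a sketch in plain Python, not Riak’s API or the property-based tests from the post):

```python
from itertools import product

def merge(a, b):
    """Candidate sibling-resolution function: set union is
    commutative, associative, and idempotent."""
    return a | b

samples = [frozenset({1}), frozenset({1, 2}), frozenset({3})]

for a, b in product(samples, repeat=2):
    assert merge(a, b) == merge(b, a)                       # commutative
assert all(merge(a, a) == a for a in samples)               # idempotent
for a, b, c in product(samples, repeat=3):
    assert merge(merge(a, b), c) == merge(a, merge(b, c))   # associative
print("union satisfies all three merge laws")
```

In practice you would run checks like these over randomly generated values of your actual datatype, which is exactly what property-based testing tools automate.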

Original title and link: Merge and serialization functions for Riak (NoSQL database©myNoSQL)


Cloudera, Hadoop, Data warehouses and SLR camera

Amr Awadallah in an interview with Dan Woods for Forbes:

Our advantage is that we can encompass more data and run more workloads with less friction than any other platform. The analogy I use most often is the difference between the SLR camera and the camera on your smart phone. Almost everyone takes more pictures on their smart phone than on their SLR.

The SLR camera is like the enterprise data warehouse. The SLR camera is really, really good at taking pictures, in the same sense that an enterprise data warehouse is really, really good at running queries. But that’s the only thing it does. The data it picks is only exposed to that workload. The system we provide, the enterprise data hub, is more like the smartphone. It can take decent pictures—they won’t be as good as the SLR camera, and in this I’m referring to the Impala system. So Impala will run queries. The queries won’t run at the same interactive OLAP speeds that you get from a high-end data warehouse. However, for many use cases, that performance might be good enough, given that the cost is 10 times lower.

I’ve linked in the past to Ben Thompson’s visualizations of the innovator’s dilemma:

[Image: Ben Thompson’s innovator’s dilemma visualization]

The explanation goes like this: incumbents’ products are usually over-serving consumer needs thus leaving room to new entrants’ good-enough lower-priced products.

Original title and link: Cloudera, Hadoop, Data warehouses and SLR camera (NoSQL database©myNoSQL)


The state of big data in 2014

The (big) data market through the eyes of a VC, Matt Turck of FirstMark Capital:

Still early: Overall, we’re still in the early innings of this market. Over the last couple of years, some promising companies failed (for example: Drawn to Scale), a number saw early exits (for example: Precog, Prior Knowledge, Lucky Sort, Rapleaf, Nodeable, Karmasphere), and a handful saw more meaningful outcomes (for example: Infochimps, Causata, Streambase, ParAccel, Aspera, GNIP, BlueFin labs, BlueKai).

Original title and link: The state of big data in 2014 (NoSQL database©myNoSQL)


The beauty and challenge of Hadoop

Chad Carson describes in a short but persuasive way how Hadoop gets inside companies and the first challenges that follow:

We hear stories like this all the time, though sometimes the urgent email turns out to be from the CEO! These scenarios follow a common pattern in Hadoop adoption: Hadoop is such a flexible, scalable system that it’s easy for an engineer to quickly grab data that could never before be combined in one place, write some jobs, and get interesting results. Sometimes the results are so interesting that other teams start using them, and all of a sudden the company’s business depends on something that started as an experiment.

Original title and link: The beauty and challenge of Hadoop (NoSQL database©myNoSQL)


What versions of Erlang should you use with CouchDB

Russell Branca goes through a list of Erlang versions to identify those that are safe to use with CouchDB:

There has been some discussion on what versions of Erlang CouchDB should support, and what versions of Erlang are detrimental to use. Sadly there were some pretty substantial problems in the R15 line and even parts of R16 that are landmines for CouchDB. This post will describe the current state of things and make some potential recommendations on approach.

Very useful.

Original title and link: What versions of Erlang should you use with CouchDB (NoSQL database©myNoSQL)