
Comparing NoSQL backup solutions

In a post introducing HyperDex backups, Robert Escriva compares the different backup solutions available in Cassandra, MongoDB, and Riak:

Cassandra: Cassandra’s backups are inconsistent, as they are taken at each server independently without coordination. Further, “Restoring from snapshots and incremental backups temporarily causes intensive CPU and I/O activity on the node being restored.”

MongoDB: MongoDB provides two backup strategies. The first strategy copies the data on backup, and re-inserts it on restore. This approach introduces high overhead because it copies the entire data set without opportunity for incremental backup.

The second approach is to use filesystem-provided snapshots to quickly backup the data of a mongod instance. This approach requires operating system support and will produce larger backup sizes.

Riak: Riak backups are inconsistent, as they are taken at each server independently without coordination, and require care when migrating between IP addresses. Further, Riak requires that each server be shut down before backing up LevelDB-powered backends.
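
For a sense of why the Cassandra and Riak backups above are called inconsistent: a snapshot is a strictly per-node operation, and a "cluster backup" is just the same command run against every node, with nothing coordinating the point in time between them. A minimal sketch of what that might look like from Python (the hostnames and keyspace are hypothetical; only nodetool snapshot itself is real):

    import subprocess

    # Hypothetical node list and keyspace; in practice these come from your
    # cluster topology and schema.
    nodes = ["cass-node-1", "cass-node-2", "cass-node-3"]
    keyspace = "my_keyspace"

    for node in nodes:
        # Each node flushes and snapshots its own SSTables independently.
        # Nothing lines up the snapshot's point in time across nodes, which
        # is why the result is not a consistent cluster-wide view.
        subprocess.check_call(
            ["ssh", node, "nodetool", "snapshot", "-t", "nightly", keyspace]
        )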

Here's how HyperDex's new backup is described:

The HyperDex backup/restore process is strongly consistent, doesn’t require shutting down servers, and enables incremental backup support. Further, the process is quite efficient; it completes quickly, and does not consume CPU or I/O for extended periods of time.

The caveat is that HyperDex puts the cluster in read-only mode while backing up. That's a loss of availability. Considering that both Cassandra and Riak promise high availability, their choice was clear.

Update: This comment from Emin Gün Sirer makes me wonder if I missed something:

HyperDex quiesces the network, takes a snapshot, resumes. Whole operation takes sub-second latency.

The key point is that the system stays online and available while the data copying is taking place.

Original title and link: Comparing NoSQL backup solutions (NoSQL database©myNoSQL)

via: http://hackingdistributed.com/2014/01/14/back-that-nosql-up/


Cassandra CQL and the IN operator

The most succinct description of how to use IN in CQL:

  1. The last column in the partition key, assuming the = operator is used on the first N-1 columns of the partition key
  2. The last clustering column, assuming the = operator is used on the first N-1 clustering columns and all partition keys are restricted
  3. The last clustering column, assuming the = operator is used on the first N-1 clustering columns and ALLOW FILTERING is specified

I like clear rules.
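
To make the first two rules concrete, here is a small sketch using the Python driver; the keyspace, the table (with PRIMARY KEY ((a, b), c, d)), and the values are made up for illustration:

    from cassandra.cluster import Cluster

    # Hypothetical schema: PRIMARY KEY ((a, b), c, d)
    # -> partition key columns: a, b; clustering columns: c, d.
    session = Cluster(["127.0.0.1"]).connect("demo_ks")

    # Rule 1: IN on the last partition key column (b), = on the one before it (a).
    session.execute("SELECT * FROM events WHERE a = 1 AND b IN (10, 20)")

    # Rule 2: IN on the last clustering column (d), = on the preceding clustering
    # column (c), with the whole partition key restricted.
    session.execute(
        "SELECT * FROM events WHERE a = 1 AND b = 10 AND c = 5 AND d IN (1, 2, 3)"
    )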

Original title and link: Cassandra CQL and the IN operator (NoSQL database©myNoSQL)

via: http://planetcassandra.org/blog/post/the-in-operator-in-cassandra-cql


MySQL is a great Open Source project. How about open source NoSQL databases?

In a post titled Some myths on Open Source, the way I see it, Anders Karlsson writes about MySQL:

As far as code, adoption and reaching out to create an SQL-based RDBMS that anyone can afford, MySQL / MariaDB has been immensely successful. But as an Open Source project, something being developed together with the community where everyone work on their end with their skills to create a great combined piece of work, MySQL has failed. This is sad, but on the other hand I’m not so sure that it would have as much influence and as wide adoption if the project would have been a “clean” Open Source project.

The article offers a very black-and-white perspective on open source versus commercial code. But that’s not why I’m linking to it.

The above paragraph made me think about how many of the most popular open source NoSQL databases would die without the companies (or people) that created them.

Here’s my list: MongoDB, Riak, Neo4j, Redis, Couchbase, etc. And I could continue for quite a while considering how many there are out there: RavenDB, RethinkDB, Voldemort, Tokyo, Titan.

Actually, if you reverse the question, the list gets extremely short: Cassandra, CouchDB (still struggling, though), HBase. All of these were at some point driven by the community. Probably the only special case is LevelDB.

✚ As a follow-up to Anders Karlsson's post, Robert Hodges posted The Scale-Out Blog: Why I Love Open Source.

Original title and link: MySQL is a great Open Source project. How about open source NoSQL databases? (NoSQL database©myNoSQL)

via: http://karlssonondatabases.blogspot.com/2014/01/some-myths-on-open-source-way-i-see-it.html


Google Compute Engine and Data

Since announcing the GA a couple of weeks ago, I've been noticing quite a few data-related posts on the Google Compute Engine blog.

If you look at these, you'll notice a theme: covering data from every angle, with Cassandra/DSE from DataStax for OLTP, DataTorrent for stream processing, Qubole for Hadoop, and MapR for their Hadoop-like solution. I can see this continuing for a while and making Google Compute Engine a strong competitor to Amazon Web Services.

One question remains though: will they be able to come up with a good integration strategy for all these 3rd party tools?

Original title and link: Google Compute Engine and Data (NoSQL database©myNoSQL)


An intro to HBase’s Thrift interface

If you've never used Thrift (with or without HBase), the two articles authored by Jesse Anderson and posted on Cloudera's blog will give you both a quick intro and a hands-on walkthrough:

  1. How-to: Use the HBase Thrift Interface, Part 1: setting up, getting the language bindings, and connecting;
  2. How-to: Use the HBase Thrift Interface, Part 2: Inserting/Getting Rows: using HBase’s Thrift API from Python
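
For a rough feel of what talking to HBase over Thrift from Python ends up looking like, here is a minimal sketch using the happybase wrapper, which sits on top of the Thrift 1 interface, rather than the raw generated bindings the articles walk through; the table and column names are made up:

    import happybase

    # Connect to the HBase Thrift server on its default port.
    connection = happybase.Connection("localhost", port=9090)
    table = connection.table("example_table")  # hypothetical table

    # Insert a row: columns are addressed as 'family:qualifier', values are bytes.
    table.put(b"row-key-1", {b"cf:greeting": b"hello thrift"})

    # Read the row back as a dict of column -> value.
    row = table.row(b"row-key-1")
    print(row[b"cf:greeting"])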

Original title and link: An intro to HBase’s Thrift interface (NoSQL database©myNoSQL)


Approaches to Backup and Disaster Recovery in HBase

This must be part of your HBase operational manual:

Let’s start with the least disruptive, smallest data footprint, least performance-impactful mechanism and work our way up to the most disruptive, forklift-style tool:

  • Snapshots
  • Replication
  • Export
  • CopyTable
  • HTable API
  • Offline backup of HDFS data

HBase backup strategies

When you return to the office after the winter holiday, make sure you take a copy of this with you and pass it around.
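
For the two lightest-weight options on that list, here is a rough sketch of driving them from Python; the table name and HDFS path are hypothetical, and in practice you would simply run the equivalent commands from the HBase shell and command line:

    import subprocess

    table = "example_table"  # hypothetical table name

    # Snapshot: a lightweight, metadata-only operation taken through the HBase shell.
    subprocess.run(
        ["hbase", "shell"],
        input=f"snapshot '{table}', '{table}-nightly'\nexit\n",
        text=True,
        check=True,
    )

    # Export: a MapReduce job that copies the table's rows into HDFS files.
    subprocess.run(
        ["hbase", "org.apache.hadoop.hbase.mapreduce.Export",
         table, f"/backups/{table}"],
        check=True,
    )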

Original title and link: Approaches to Backup and Disaster Recovery in HBase (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/11/approaches-to-backup-and-disaster-recovery-in-hbase/


Dropbox: Challenges in mirroring large MySQL systems to HBase

A presentation by Todd Eisenberger about the archival system used by Dropbox, built on MySQL and HBase:

MySQL benefits:

  • fast queries for known keys over a (relatively) small dataset
  • high read throughput

HBase benefits:

  • high write throughput
  • large suite of pre-existing tools for distributed computation
  • easier to perform large processing tasks

✚ Both are consistent

✚ Most of the benefits in HBase's section point toward data processing advantages rather than data storage ones


Apache HBase 0.96.0 released after more than 2000 issues resolved

This is an important release for HBase. Both Hortonworks and Cloudera have posts covering it.

HBase 0.94 was released over a year and a half ago.

Original title and link: Apache HBase 0.96.0 released after more than 2000 issues resolved (NoSQL database©myNoSQL)


Results of collaboration on improving the Mean Time to Recovery in HBase

Hortonworks, eBay, and Scaled Risk have been collaborating on improving the mean time to recovery in HBase, and after extensive testing at eBay, results are now available for two scenarios:

  • Node/RegionServer failures while writing
  • Node/RegionServer failures while reading

Original title and link: Results of collaboration on improving the Mean Time to Recovery in HBase (NoSQL database©myNoSQL)


A prolific season for Hadoop and its ecosystem

In 4 years of writing this blog, I haven't seen such a prolific month:

  • Apache Hadoop 2.2.0 (more links here)
  • Apache HBase 0.96 (here and here)
  • Apache Hive 0.12 (more links here)
  • Apache Ambari 1.4.1
  • Apache Pig 0.12
  • Apache Oozie 4.0.0
  • Plus Presto.

Actually, I don't think I've ever seen an ecosystem like the one created around Hadoop.

Original title and link: A prolific season for Hadoop and its ecosystem (NoSQL database©myNoSQL)


Why NoSQL Can Be Safer than an RDBMS

Robin Schumacher¹:

That said, I disagree with many of the article’s statements, the most important being that companies should not consider NoSQL databases as a first choice for critical data. In this article, I’ll show first how a NoSQL database like Cassandra is indeed being used today as a primary datastore for key data and, second, that Cassandra can actually end up being safer than an RDBMS for important information.

You already know how this goes: “First they ignore you, then they laugh at you, then they fight you, then you win”. I’ll let you decide where major NoSQL databases are today.


  1. Robin Schumacher is VP of Products at DataStax. He's also my boss.

Original title and link: Why NoSQL Can Be Safer than an RDBMS (NoSQL database©myNoSQL)

via: http://www.datastax.com/2013/10/why-nosql-can-be-safer-than-an-rdbms


Quick intro to Apache Cassandra… comic style

You can find it here. Nice job by Alberto Diego Prieto Löfkrantz.

Original title and link: Quick intro to Apache Cassandra… comic style (NoSQL database©myNoSQL)