cassandra: All content tagged as cassandra in NoSQL databases and polyglot persistence

A Tour of Amazon DynamoDB Features and API

Mathias Meyer’s walk through the DynamoDB features and API with commentary:

Sorted range keys, conditional updates, atomic counters, structured data and multi-valued data types, fetching and updating single attributes, strong consistency, and no explicit way to handle and resolve conflicts other than conditions. A lot of features DynamoDB has to offer remind me of everything that’s great about wide column stores like Cassandra, but even more so of HBase. This is great in my opinion, as Dynamo would probably not be well-suited for a customer-facing system. And indeed, Werner Vogels’s post on DynamoDB seems to suggest DynamoDB is a bastard child of Dynamo and SimpleDB, though with lots of sugar sprinkled on top.

Think of it as an extended, better articulated and closer to the API version of my notes about Amazon DynamoDB.
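
To make two of the features in the quote more concrete, here is a minimal in-memory sketch of conditional updates and atomic counters. It is a toy model written for illustration only, not the DynamoDB API: `ToyTable`, `put_if`, `add`, and `ConditionFailed` are hypothetical names, and a single lock stands in for whatever the real service does internally.

```python
import threading


class ConditionFailed(Exception):
    """Raised when the stored item does not match the expected values."""


class ToyTable:
    """In-memory stand-in for a key-value table (hypothetical, not DynamoDB)."""

    def __init__(self):
        self._items = {}
        self._lock = threading.Lock()

    def put_if(self, key, new_attrs, expected=None):
        """Merge new_attrs into the item only if it matches `expected`."""
        expected = expected or {}
        with self._lock:
            item = self._items.get(key, {})
            for name, value in expected.items():
                if item.get(name) != value:
                    raise ConditionFailed(
                        f"{name!r} is {item.get(name)!r}, expected {value!r}"
                    )
            self._items[key] = {**item, **new_attrs}

    def add(self, key, attr, delta=1):
        """Atomically add delta to a numeric attribute; return the new value."""
        with self._lock:
            item = self._items.setdefault(key, {})
            item[attr] = item.get(attr, 0) + delta
            return item[attr]

    def get(self, key):
        """Return a copy of the stored item (empty dict if missing)."""
        with self._lock:
            return dict(self._items.get(key, {}))


table = ToyTable()
table.put_if("user-1", {"status": "new"})
# Succeeds because the stored status is still "new".
table.put_if("user-1", {"status": "active"}, expected={"status": "new"})
# The counter is incremented under the same lock, so no update is lost.
table.add("user-1", "visits")
```

The point of the condition is that a writer working from stale data gets `ConditionFailed` instead of silently overwriting a newer value, which matches the remark in the quote that conditions are the only conflict-handling tool on offer.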

Original title and link: A Tour of Amazon DynamoDB Features and API (NoSQL database©myNoSQL)

via: http://www.paperplanes.de/2012/1/30/a-tour-of-amazons-dynamodb.html


Automating Cassandra Operations and Management With Netflix's Priam Tool

Netflix has open sourced Priam, a tool that simplifies and automates the operations and management of a Cassandra cluster. (Back in November, Netflix released Curator, a ZooKeeper library.)

Priam is a co-process that runs alongside Cassandra on every node to provide the following functionality:

  • Backup and recovery
    • snapshot and incremental backups
    • compression and multipart off-site uploading
    • data recovery and data testing
  • Bootstrapping and automated token assignment

    Priam automates the assignment of tokens to Cassandra nodes as they are added, removed or replaced in the ring. Priam relies on centralized external storage (SimpleDB/Cassandra) for storing token and membership information, which is used to bootstrap nodes into the cluster. It allows us to automate replacing nodes without any manual intervention, since we assume failure of nodes, and create failures using Chaos Monkey. The external Priam storage also provides us valuable information for the backup and recovery process.

  • Centralized configuration management: All our clusters are centrally configured via properties stored in SimpleDB, which includes setup of critical JVM settings and Cassandra YAML properties.

  • RESTful monitoring and metrics: provides hooks that support external monitoring and automation scripts. They provide the ability to backup, restore a set of nodes manually and provide insights into Cassandra’s ring information. They also expose key Cassandra JMX commands such as repair and refresh.
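
The automated token assignment described above can be illustrated with a small sketch. This is not Priam's code or algorithm: it only shows the general idea of evenly spaced tokens on a RandomPartitioner ring (token space 0 to 2^127) plus a shared membership record, so that a replacement node inherits the token of the node it replaces. The `membership` dict stands in for the external storage, and all function names are hypothetical.

```python
RING_SIZE = 2 ** 127  # token space of Cassandra's RandomPartitioner


def balanced_tokens(node_count):
    """Return evenly spaced tokens for a ring of node_count nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [slot * RING_SIZE // node_count for slot in range(node_count)]


def claim_slot(membership, node_id, node_count):
    """Give node_id the first free slot and return that slot's token.

    membership maps slot -> node id and stands in for the external store.
    """
    tokens = balanced_tokens(node_count)
    for slot in range(node_count):
        if slot not in membership:
            membership[slot] = node_id
            return tokens[slot]
    raise RuntimeError("ring is full")


def release_slot(membership, node_id):
    """Free the slot held by a node that has been terminated."""
    for slot, owner in list(membership.items()):
        if owner == node_id:
            del membership[slot]


membership = {}
token_a = claim_slot(membership, "node-a", 3)
token_b = claim_slot(membership, "node-b", 3)
token_c = claim_slot(membership, "node-c", 3)

# node-b dies; its replacement picks up the same slot and token.
release_slot(membership, "node-b")
token_d = claim_slot(membership, "node-d", 3)
```

Because the slot table lives outside the nodes, a replacement can bootstrap itself without an operator choosing a token by hand, which is the property the Netflix post emphasizes.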

via: http://techblog.netflix.com/2012/02/announcing-priam.html


Dealing With JVM Limitations in Apache Cassandra

Several of the most notable NoSQL databases targeting large scalable systems are written in Java, Cassandra and HBase among them. Then there’s also Hadoop, plus a series of caching and data grid solutions like Terracotta and Gigaspaces. They are all facing the same challenge: tuning the JVM garbage collector for predictable latency and throughput.

Jonathan Ellis’s slides from FOSDEM 2012 cover some of the problems with GC and the way Cassandra tackles them. While this is one of those presentations where the slides are not enough to understand the full picture, going through them will still give you a couple of good hints.

For those saying that Java and the JVM are not the platform for writing large concurrent systems, here’s the quote Ellis ends his slides with:

Cliff Click: Many concurrent algorithms are very easy to write with a GC and totally hard (to down right impossible) using explicit free.

Enjoy the slides after the break.


A Question About NoSQL Managed Hosting

It’s impossible to always have the right answers to all the questions. So this time I’ll have to ask you all: why are only some NoSQL databases present in managed hosting offers?

The first wave of NoSQL managed hosting services brought MongoDB, CouchDB, and some Redis. The second wave brought some more MongoDB, CouchDB, and just a bit more of Redis. It was only the third wave that brought some managed services for graph databases: Neo4j and OrientDB. Plus the first proposal for Cassandra managed hosting.

The first answer that comes to mind when thinking about NoSQL managed services is adoption. If a product is not in wide use then the chances for a company to run a profitable hosting business are very low. But I have the feeling that this is not the only or the complete answer.

Please chime in and share your thoughts.


Cassandra at Clearspring with Chris Burroughs - Powered by NoSQL

For today’s Powered by Cassandra video from the Cassandra NYC 2011 event organized by DataStax, I chose Chris Burroughs’s presentation about Clearspring’s usage of Cassandra. In case you wonder what Clearspring does, the sharing buttons you see here on myNoSQL are powered by the AddThis product from Clearspring.


Cassandra 101 for System Administrators with Nathan Milford - Powered by NoSQL

Today was supposed to bring a new educational video from the Cassandra NYC 2011 video series, but I thought that the lessons learned from operating Cassandra at Outbrain to serve over 30 billion impressions monthly would be quite educational too.


The Future of Big Data with Cassandra

One of the best presentations I’ve seen: concise, covering the topic from different angles, providing useful information, pitching a product and company in non-obtrusive ways.

The slidedeck by Matthew F. Dennis talks about realtime data and analytics from the perspective of Cassandra and DataStax. It starts by presenting the most important features of Cassandra:

  • true multi DC support
  • no SPOF
  • linear scalability
  • great read and write performance
  • tunable consistency access
  • durable
  • integrated caching

and a series of use cases for Cassandra:

  • time series
  • sensor data
  • messaging
  • ad tracking
  • financial market data
  • user activity streams
  • fraud detection
  • risk analysis

It then summarizes three major Cassandra case studies with quotes emphasizing why Cassandra plays a critical role in each of them:

  • Netflix
  • Backupify
  • Ooyala

Enjoy it after the break.
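
As a side note on the "tunable consistency" item in the feature list, the usual rule of thumb is that a read is guaranteed to overlap the latest acknowledged write when the number of replicas written plus the number of replicas read exceeds the replication factor. The sketch below only does that arithmetic for the ONE, QUORUM, and ALL levels; the function names are made up for illustration and are not part of any Cassandra client.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a consistency level."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    if level not in levels:
        raise ValueError(f"unsupported consistency level: {level}")
    return levels[level]


def read_overlaps_write(write_level, read_level, replication_factor):
    """True when every read set must share a replica with every write set."""
    written = replicas_required(write_level, replication_factor)
    read = replicas_required(read_level, replication_factor)
    return written + read > replication_factor


# With three replicas, QUORUM writes plus QUORUM reads always overlap,
# while ONE plus ONE trades that guarantee for lower latency.
strong = read_overlaps_write("QUORUM", "QUORUM", 3)
weak = read_overlaps_write("ONE", "ONE", 3)
```

Choosing the levels per operation is what makes the consistency "tunable": the same cluster can serve latency-sensitive reads at ONE and correctness-sensitive reads at QUORUM.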


Lessons in Data Visualization: How to create a visualization

Pete Warden:

Pick a question. Now that I had a rough idea for what I wanted to visualize, I really needed to focus on what I would be doing. The best way to do that is to choose the exact title you want to give your visualization.

Oftentimes, you might be tempted to start with an answer in the form of a hypothesis or preconception. The results you get might be valid but radically different.

As for the technologies used for data crunching, it’s Pig on Hadoop over a Cassandra cluster:

In my case, we have a Cassandra cluster with information on more than 350 million photos shared on Facebook. I’ve been running Pig analytics jobs regularly to get a view of what we have in there. […] In this case I already had some Pig scripts asking similar questions, so I was able to adapt one of those. The biggest surprise was when I ran into issues with some of the joins. The hard part was running the Hadoop job to gather the raw data from our Cassandra cluster, and that worked. I was able to output smaller files containing the gathered data, and then run a local Pig job to do the joins I needed.

via: http://radar.oreilly.com/2012/02/how-to-create-visualization-facebook-vacation.html


Cassandra and MongoDB with Gigaspaces Cloudify

There are two reasons I’m writing about Gigaspaces’s Cloudify (PR announcement):

  1. Besides MySQL, Cloudify recipes include Cassandra and MongoDB.

    Also a bit of vintage claim chowder: if you remember Mike Gualtieri’s (Forrester) NoSQL wants to be elastic caching when it grows up, this should be clear proof that he was wrong.

  2. Gigaspaces is starting to realize that it’s not really necessary to claim a NoSQL affiliation to benefit from the NoSQL buzz. Clear market positioning and smartly showcasing it is much more useful for potential customers. The other company showing it learned this lesson is Terracotta [1].

[1] I’m probably biased on this as I was responsible for talking to Terracotta folks about this better route.


Hosted and Managed NoSQL: Cassandra, Redis, OrientDB

In the last few days I’ve read about some new NoSQL hosting solutions:

  • Cassandra: managed hardware & software hosting:

    Per node:

    • Intel Dual Quad-core (8 cpu’s), 16gb of memory, 2tb primary storage + 500gb commitlog drive
    • 5 public ip addresses, 1000Mbps private network port.
    • Debian, CentOS, RedHat or FreeBSD
    • Cassandra setup, configuration and ongoing maintenance (repairs, cleanups, troubleshooting)
    • Cassandra upgrades (rolling restart)
    • 24x7 real-time monitoring (load, tcp, jmx and cassandra logs)
    • Multi-datacenter environment (we’ll spread your cluster across two or three geographic locations, based on your needs)
    • 30 days test drive

    Cost: $850 per node per month (5tb bandwidth, includes backups & monitoring)

  • OrientDB: NuvolaBase

    • Real-time replicated deployment
    • Managed
    • JSON over HTTP access
    • can offer VPN connections to the cluster

  • Redis: Cloudnode

    • Cloudnode is still in beta
    • “one Redis instance free with every Cloudnode account”, but no further details about the characteristics of the instance

Hosting for NoSQL databases has been available in some form or another for a while, but only for the most popular ones (MongoDB, CouchDB, Redis). Things are changing fast. Neo4j is heavily advertising its Heroku add-on, OrientDB got NuvolaBase, and so on.

This is the market that Amazon is targeting with Amazon RDS, SimpleDB, and DynamoDB: managed data services, as part of a bigger strategy. What should be clear is that Amazon is not after NoSQL database companies.

Anyone considering a business in the managed data services market should realize that Amazon will not get into supporting all the NoSQL databases out there. They’d also better take a deep look and learn from what Amazon is offering with SimpleDB and DynamoDB.


Scaling Video Analytics with Cassandra by Ilya Maykov - Powered by NoSQL

Keeping with last week’s model (an educational video about Cassandra, followed by a Cassandra case study), today’s video in the Cassandra NYC 2011 video series from DataStax is Ilya Maykov describing how Cassandra is used at Ooyala to compute multi-dimensional video analytics reports for 100M+ monthly unique users in near real time.


Cassandra Data Modeling Examples with Matthew F. Dennis - NoSQL videos

Continuing the Cassandra NYC 2011 video series, made available by the folks from DataStax, this week we have Matthew F. Dennis, who covers a couple of different Cassandra data modeling use cases.