bigtable: All content tagged as bigtable in NoSQL databases and polyglot persistence
Monday, 1 October 2012
$25 Million in C Round for DataStax
I’d say that raising another $25 million from Meritech Capital Partners, with the participation of existing investors Lightspeed Venture Partners and Crosslink Capital, is a good enough reason for DataStax to party.
DataStax will use the funds to further enhance its Big Data platform and increase the value for current customers while driving global customer acquisition.
Congrats to DataStax and the Cassandra community!
Original title and link: $25 Million in C Round for DataStax (©myNoSQL)
Tuesday, 25 September 2012
Doing Redundant Work to Speed Up Distributed Queries
Great post by Peter Bailis looking at how some systems are reducing tail latency by distributing reads across nodes:
Open-source Dynamo-style stores have different answers. Apache Cassandra originally sent reads to all replicas, but CASSANDRA-930 and CASSANDRA-982 changed this: one commenter argued that “in IO overloaded situations” it was better to send read requests only to the minimum number of replicas. By default, Cassandra now sends reads to the minimum number of replicas 90% of the time and to all replicas 10% of the time, primarily for consistency purposes. (Surprisingly, the relevant JIRA issues don’t even mention the latency impact.) LinkedIn’s Voldemort also uses a send-to-minimum strategy (and has evidently done so since it was open-sourced). In contrast, Basho Riak chooses the “true” Dynamo-style send-to-all read policy.
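To make the two read policies concrete, here is a minimal sketch of how a coordinator might pick read targets. It is not Cassandra's, Voldemort's, or Riak's actual code: the function names and the `repair_chance` parameter are hypothetical, and the 10% default simply mirrors the figure quoted above.

```python
import random


def choose_read_targets(replicas, required, repair_chance=0.1, rng=random):
    """Pick the replicas one read is sent to (illustrative only).

    replicas:      replica names, assumed ordered by preference
    required:      minimum number of replicas the read needs
    repair_chance: hypothetical probability of contacting every replica
    """
    if required > len(replicas):
        raise ValueError("read needs more replicas than exist")
    if rng.random() < repair_chance:
        # Occasionally contact everyone (the "send-to-all" case).
        return list(replicas)
    # Otherwise contact only as many replicas as the read requires.
    return list(replicas[:required])


def send_to_all_targets(replicas):
    """The Dynamo-style alternative: always contact every replica."""
    return list(replicas)


if __name__ == "__main__":
    rng = random.Random(42)
    replicas = ["replica-a", "replica-b", "replica-c"]
    trials = 10_000
    fan_outs = sum(
        len(choose_read_targets(replicas, 1, rng=rng)) == len(replicas)
        for _ in range(trials)
    )
    print(f"full fan-out on {fan_outs / trials:.1%} of reads")
```

The trade-off the post describes falls out of the two branches: sending to the minimum keeps load low on busy disks, while sending to all lets the coordinator answer with whichever replicas respond first.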
Original title and link: Doing Redundant Work to Speed Up Distributed Queries (©myNoSQL)
via: http://www.bailis.org/blog/doing-redundant-work-to-speed-up-distributed-queries/
Monday, 3 September 2012
Reddit’s Database Has Two Tables
Considering the fast evolution of NoSQL databases, the topic is now very old (from 2010). But read the comments on the original post, Hacker News, and Reddit to see what people think today about extreme denormalization, schemas, relational and NoSQL databases.
Original title and link: Reddit’s Database Has Two Tables (©myNoSQL)
via: http://kev.inburke.com/kevin/reddits-database-has-two-tables/
Tuesday, 7 August 2012
Latency-Consistency Analysis
A very interesting proposal and patch for enhancing nodetool to provide cluster latency-consistency analysis. From JIRA:
We’ve implemented Probabilistically Bounded Staleness, a new technique for predicting consistency-latency trade-offs within Cassandra. Our paper will appear in VLDB 2012, and, in it, we’ve used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than profiling and manually testing each possible replication configuration (especially in production!).
This analysis is important for the many users we’ve talked to and heard about who use “partial quorum” operation (e.g., non-QUORUM ConsistencyLevel). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there’s no existing way to answer these questions.
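For a feel of the question being asked, here is a toy Monte Carlo estimate of how often a partial-quorum read sees the latest write. This is not the PBS model from the paper or the nodetool patch: the exponential replica delays, the uniformly random choice of read replicas, and every name in the code are assumptions made for illustration.

```python
import random


def estimate_consistency(n, r, w, t_ms, mean_delay_ms=5.0, trials=20000, seed=0):
    """Estimate P(read returns the latest write) t_ms after the write commits.

    Toy model: each of the n replicas applies the write after an independent
    exponential delay; the write commits once w replicas have applied it; the
    read consults r replicas chosen uniformly at random and is considered
    fresh if at least one of them has already applied the write.
    """
    rng = random.Random(seed)
    fresh = 0
    for _ in range(trials):
        applied_at = [rng.expovariate(1.0 / mean_delay_ms) for _ in range(n)]
        commit_time = sorted(applied_at)[w - 1]  # w-th fastest replica
        read_time = commit_time + t_ms
        contacted = rng.sample(range(n), r)
        if any(applied_at[i] <= read_time for i in contacted):
            fresh += 1
    return fresh / trials


if __name__ == "__main__":
    for r, w in [(1, 1), (2, 1), (2, 2)]:
        for t_ms in (0.0, 5.0, 20.0):
            p = estimate_consistency(3, r, w, t_ms)
            print(f"N=3 R={r} W={w} t={t_ms:>4}ms -> P(fresh) ~ {p:.3f}")
```

When R + W > N the read set always overlaps the replicas that acknowledged the write, so the estimate is exactly 1. Below that threshold the answer depends on the delay distribution, which is why it varies with the runtime environment.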
Original title and link: Latency-Consistency Analysis (©myNoSQL)
Accumulo, HBase, Cassandra and Some Unanswered Questions
Cade Metz for Wired:
The bill bars the DoD from using the database unless the department can show that the software is sufficiently different from other databases that mimic BigTable. But at the same time, the bill orders the director of the NSA to work with outside organizations to merge the Accumulo security tools with alternative databases, specifically naming HBase and Cassandra.
Is this good for HBase and Cassandra? Is this good for encouraging innovation? Is this good for supporting businesses? Just a few questions I couldn’t answer myself after reading this article about the investigation initiated by the Senate into the NSA’s open-sourced Accumulo.
Original title and link: Accumulo, HBase, Cassandra and Some Unanswered Questions (©myNoSQL)
via: http://www.wired.com/wiredenterprise/2012/07/nsa-accumulo-google-bigtable/
Monday, 6 August 2012
Big Data at Aadhaar With Hadoop, HBase, MongoDB, MySQL, and Solr
It’s unfortunate that the post focuses mostly on the usage of Spring and RabbitMQ and that the slide deck doesn’t dive deeper into the architecture, data flows, and data stores, but the diagrams below should give you an idea of this truly polyglot persistence architecture:
The slide deck, which presents architecture principles and numbers about the platform, is after the break.
A Big Data Trifecta: Storm, Kafka and Cassandra
Brian O’Neill details his first experiments in migrating from JMS to Kafka in a very interesting architecture involving Storm, Kafka, and Cassandra:
Now, Kafka is fast. When running the Kafka Spout by itself, I easily reproduced Kafka’s claim that you can consume “hundreds of thousands of messages per second”. When I first fired up the topology, things went well for the first minute, but then quickly crashed as the Kafka spout emitted too fast for the Cassandra Bolt to keep up. Even though Cassandra is fast as well, it is still orders of magnitude slower than Kafka.
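The failure mode described here, a fast source overwhelming a slower sink, is commonly handled with backpressure. Below is a minimal sketch using a bounded queue; it does not use the Storm, Kafka, or Cassandra APIs, and `run_pipeline`, `max_pending`, and `write_delay_s` are hypothetical names standing in for the spout, its in-flight limit, and the cost of a write.

```python
import queue
import threading
import time


def run_pipeline(num_messages, max_pending, write_delay_s=0.0005):
    """Push messages from a fast producer to a slow sink through a bounded queue.

    Returns the messages the sink wrote and the largest backlog observed.
    """
    pending = queue.Queue(maxsize=max_pending)
    written = []
    max_backlog = 0

    def slow_sink():
        # Stand-in for the slower store: each write takes a fixed delay.
        while True:
            item = pending.get()
            if item is None:  # sentinel: producer is done
                break
            time.sleep(write_delay_s)
            written.append(item)

    consumer = threading.Thread(target=slow_sink)
    consumer.start()

    for i in range(num_messages):
        # put() blocks once max_pending items are waiting, so the producer
        # is slowed to the sink's pace instead of growing an unbounded backlog.
        pending.put(i)
        max_backlog = max(max_backlog, pending.qsize())

    pending.put(None)
    consumer.join()
    return written, max_backlog


if __name__ == "__main__":
    written, backlog = run_pipeline(num_messages=100, max_pending=10)
    print(f"wrote {len(written)} messages, backlog never exceeded {backlog}")
```

The design choice is to trade producer throughput for stability: the source can still run at full speed in short bursts, but sustained throughput is capped by what the sink can absorb.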
Original title and link: A Big Data Trifecta: Storm, Kafka and Cassandra (©myNoSQL)
via: http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
Sunday, 5 August 2012
Cassandra at Scandit
We use Cassandra in two ways: First, it holds our product database. Second, we use it to store and analyze the scans generated by the apps that integrate the Barcode Scanner SDK. We call this Scanalytics.
Scanalytics is a web-based analytics platform that lets app developers see what happens in their app: What kind of products do their users scan? Groceries, electronics, cosmetics, etc.? Where do they scan? At home? In the retail store? And so on. All that goes into Cassandra.
The Product database has 25 million records, so you could probably do it with any database. But I’d be interested to learn how data is modeled in Scanalytics.
Original title and link: Cassandra at Scandit (©myNoSQL)
via: http://www.datastax.com/dev/blog/the-five-minute-interview-scandit
Tuesday, 31 July 2012
The Benefits of Virtual Nodes and Performance Results
Sam Overton and Tom Wilkie of Acunu explain the advantages of using virtual nodes in distributed data storage engines and the performance they’ve measured after introducing virtual nodes in the Acunu platform, compared with Apache Cassandra:
One of the factors that limits the amount of data that can be stored on each node is the amount of time it takes to re-replicate that data when a node fails. That time matters, because it is a period during which the cluster is more vulnerable than normal to data loss. The challenge is that the more data stored on a node, the longer it takes to re-replicate it. Therefore, to store more data per node safely, we want to reduce the time taken to return to normal. This was one of our aims with virtual nodes.
Virtual Nodes reduces the time taken to re-replicate data as it involves every node in the cluster in the operation. In contrast, Apache Cassandra v1.1 will only involve a number of nodes equal to the Replication Factor (RF) of your keyspace. What’s more, with Virtual Nodes, the cluster remains balanced after this operation - you do not need to shuffle the tokens on the other nodes to compensate for the loss!
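The effect is easy to see on a toy token ring. The sketch below is not Acunu's or Cassandra's implementation: it assumes MD5-derived tokens and a simple clockwise replica placement, and all function names are hypothetical. It counts how many surviving nodes hold copies of a failed node's data, which is the set of nodes that can take part in re-replication.

```python
import hashlib


def build_ring(nodes, tokens_per_node):
    """Return a sorted list of (token, node) pairs; tokens come from MD5 hashes."""
    ring = []
    for node in nodes:
        for i in range(tokens_per_node):
            digest = hashlib.md5(f"{node}:{i}".encode()).hexdigest()
            ring.append((int(digest, 16), node))
    ring.sort()
    return ring


def replicas_for(ring, start, rf):
    """Walk clockwise from ring position `start`, collecting rf distinct nodes."""
    owners = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in owners:
            owners.append(node)
            if len(owners) == rf:
                break
    return owners


def recovery_peers(ring, failed, rf):
    """Nodes that share at least one replicated range with the failed node."""
    peers = set()
    for start in range(len(ring)):
        owners = replicas_for(ring, start, rf)
        if failed in owners:
            peers.update(n for n in owners if n != failed)
    return peers


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(12)]
    for tokens in (1, 64):
        ring = build_ring(nodes, tokens)
        peers = recovery_peers(ring, "node-0", rf=3)
        print(f"{tokens:>2} token(s) per node: {len(peers)} peers can help rebuild node-0")
```

With one token per node the helper set stays limited to a few ring neighbours no matter how large the cluster is, while with many tokens per node it grows toward the whole cluster, spreading the re-replication work.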
Original title and link: The Benefits of Virtual Nodes and Performance Results (©myNoSQL)
via: http://www.acunu.com/2/post/2012/07/virtual-nodes-performance-results.html
Friday, 20 July 2012
EC2 Solid State Disks and Cassandra
Jonathan Ellis on using Cassandra with a mix of spinning disks and SSDs:
Finally, I should point out that taking advantage of SSDs in a Cassandra cluster doesn’t have to be all or nothing. You can mix SSD and spinning disks either at the individual node level, or at the cluster level. For the former, Cassandra allows putting “hot” tables on SSD while leaving “cold” ones on spinning disks. But if you want to use a group of nodes for analytical workloads the way DataStax Enterprise does, Cassandra will also be comfortable with having just those nodes be entirely based on cheaper spinning disks, with the remaining, “realtime” nodes based on SSDs. This latter configuration is a good fit for EC2 deployments.
Original title and link: EC2 Solid State Disks and Cassandra (©myNoSQL)
via: http://www.datastax.com/dev/blog/solid-state-disks-now-available-on-amazon-ec2
Cassandra and Solid State Drives
A slide deck by Rick Branson explaining why and how Cassandra takes full advantage of SSDs.
Wednesday, 18 July 2012
Benchmarking High Performance I/O With SSD for Cassandra on AWS
Adrian Cockcroft:
The SSD based system running the same workload had plenty of IOPS left over and could also run compaction operations under full load without affecting response times. The overall throughput of the 12-instance SSD based system was CPU limited to about 20% less than the existing system, but with much lower mean and 99th percentile latency. This sizing exercise indicated that we could replace the 48 m2.4xlarge and 36 m2.xlarge with 15 hi1.4xlarge to get the same throughput, but with much lower latency.
Tons of details and data about the benchmarks Netflix ran against the new high I/O SSD-backed EC2 instances. The results are even more impressive than the IOPS numbers in Werner Vogels’ High performance I/O instances for EC2.
Original title and link: Benchmarking High Performance I/O With SSD for Cassandra on AWS (©myNoSQL)
via: http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html