bigtable: All content tagged as bigtable in NoSQL databases and polyglot persistence
Tuesday, 15 January 2013
Cassandra at MetricsHub for Cloud Monitoring
Charles Lamanna (CEO MetricsHub):
We use Cassandra for recording time series information (e.g. metrics) as well as special events (e.g. server failure) for our customers. We have a multi-tenant Cassandra cluster for this. We record over 16 data points per server per second, 24 hours a day, 7 days a week. We use Cassandra to store and crunch this data.
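To get a feel for the volume implied by the quoted rate, here is a back-of-the-envelope calculation. It treats 16 points per server per second as a lower bound (the quote says "over 16"); the fleet size of 1,000 servers is purely hypothetical and not from the interview.

```python
# Rough data-volume arithmetic for the quoted ingest rate.
# Assumption: exactly 16 points/server/second (the quote says "over 16").
POINTS_PER_SERVER_PER_SECOND = 16
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def points_per_day(servers, rate=POINTS_PER_SERVER_PER_SECOND):
    """Return the number of data points recorded per day for `servers` machines."""
    return servers * rate * SECONDS_PER_DAY


if __name__ == "__main__":
    print(points_per_day(1))      # 1382400 points per server per day
    print(points_per_day(1000))   # 1382400000 for a hypothetical 1,000-server fleet
```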
Many of the NoSQL databases can be used for monitoring. For example, for small-scale self-monitoring you could use Redis.
Original title and link: Cassandra at MetricsHub for Cloud Monitoring (©myNoSQL)
via: http://www.planetcassandra.org/blog/post/5-minute-interview-metricshub
Wednesday, 21 November 2012
Cassandra Application Performance Management With Request Tracing
Jonathan Ellis introduces, in two posts (here and here), a new feature in Cassandra 1.2: request tracing. Such a built-in feature is an improvement over more generic APM tools like AppDynamics or New Relic.
Be judicious with this: tracing a request will usually require at least 10 rows to be inserted, so it is far from free. Unless you are under very light load, tracing all requests (probability 1.0) will probably overwhelm your system. I recommend starting with a small fraction, e.g. 0.001, and increasing that only if necessary.
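The sampling idea in this quote can be sketched in a few lines. This is a minimal illustration, not Cassandra's implementation: `ProbabilisticTracer` and `extra_rows_per_second` are hypothetical names, the request rate of 10,000 per second is made up, and the 10 rows per trace is the lower bound from the quote.

```python
import random

ROWS_PER_TRACE = 10  # lower bound from the quote ("at least 10 rows")


class ProbabilisticTracer:
    """Decides per request whether to trace it (hypothetical helper)."""

    def __init__(self, probability, seed=None):
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be in [0, 1]")
        self.probability = probability
        self._rng = random.Random(seed)

    def should_trace(self):
        # random() is in [0.0, 1.0): probability 0.0 never traces, 1.0 always does.
        return self._rng.random() < self.probability


def extra_rows_per_second(requests_per_second, probability,
                          rows_per_trace=ROWS_PER_TRACE):
    """Expected additional row inserts per second caused by tracing."""
    return requests_per_second * probability * rows_per_trace


if __name__ == "__main__":
    tracer = ProbabilisticTracer(0.001, seed=42)
    traced = sum(tracer.should_trace() for _ in range(100_000))
    print("traced requests out of 100,000:", traced)
    # Hypothetical load of 10,000 requests/second:
    print("extra rows/s at p=0.001:", extra_rows_per_second(10_000, 0.001))
    print("extra rows/s at p=1.0:  ", extra_rows_per_second(10_000, 1.0))
```

At the hypothetical 10,000 requests per second, tracing everything would add on the order of 100,000 row inserts per second, while a 0.001 fraction adds about 100, which is the gap the quote warns about.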
Years ago I had to implement a tracing layer myself [1], after trying to get information from that system using some commercial tools (I’m sure these have gotten better since then, though). There were a few goals I had planned for, and there were many things I learned after deploying it live:
- granularity of the probes is critical to understanding how the system behaves. Use too coarse grained probes and you’ll miss important details, use too fine grained probes and you’ll be flooded with unusable data
- deciding if traces are persistent or volatile and the impact on the system performance. Should you be able to retrieve older traces? If persistent, do they contain enough information to help explain a specific behavior? Can they be used to replay a scenario?
- deciding what requests should be traced and when? Tracing comes with a cost and you must try to minimize the impact it has on the system. The most important data is needed when the system misbehaves or is under high load, but that’s the same time additional work could bring it down
- probabilistic vs pattern vs behavioral tracing. Generic solutions have no knowledge of the system, but a custom one could be created
- trace ordering. Can historical tracing information be ordered?
And there are probably many other things that I don’t remember anymore.
[1] My implementation was specific to the system (in the sense that it had different tracing capabilities based on request types), but it was generic enough to allow us to change the granularity of collected probes, introduce new trace points, and also change the ratio of the requests to be traced.
Tuesday, 20 November 2012
Cassandra Query Language CQL3 Explained
CQL3 (the Cassandra Query Language) provides a new API to work with Cassandra. Where the legacy thrift API exposes the internal storage structure of Cassandra pretty much directly, CQL3 provides a thin abstraction layer over this internal structure. This is A Good Thing as it allows hiding from the API a number of distracting and useless implementation details (such as range ghosts) and allows to provide native syntaxes for common encodings/idioms (like the CQL3 collections as we’ll discuss below), instead of letting each client or client library reimplement them in their own, different and thus incompatible, way.
CQL seems to be the solution Cassandra is using to address its sometimes confusing or complex data model. I also think that CQL is an attempt to bring Cassandra closer to SQL-enabled tools, which might allow more integrations in the future.
Monday, 19 November 2012
HBase Roadmap
Devaraj Das’s post on the Hortonworks blog details the current and future work on HBase:
- Reliability and High Availability (all data always available, and recovery from failures is quick)
- Autonomous operation (minimum operator intervention)
- Wire compatibility (to support rolling upgrades across a couple of versions at least)
- Cross data-center replication (for disaster recovery)
- Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
- Monitoring and Diagnostics (which regionserver is hot or what caused an outage)
Future:
- Better and improved clients (asynchronous clients, and, in multiple languages)
- Cell-level security (access control for every cell in a table)
- Multi-tenancy (HBase becomes a viable shared platform for multiple applications using it)
- Secondary indexing functionality
Current work=reliability. Future work=usability.
Thursday, 25 October 2012
YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak
Put together by the team at Altoros Systems Inc., this benchmark was run on Amazon EC2 and includes Cassandra, HBase, MongoDB, MySQL Cluster, sharded MySQL, and Riak:
After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.
Teaser: HBase got the best results in most of the benchmarks (with flush turned off though). And I’m not sure the setup included the latest HBase read improvements from Facebook.
Tuesday, 23 October 2012
Improving HBase Read Performance at Facebook
Starting from the Hypertable vs. HBase benchmark and building on the things HBase could learn from it, the Facebook team set out to improve the read performance of HBase. And they’ve accomplished it.
Tuesday, 2 October 2012
Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra
A three-part article from Hortonworks showing how Pig can be used with MongoDB, HBase, and Cassandra:
Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.
- Part 1: Pig, MongoDB and Node.js
- Part 2: Pig, HBase, JRuby and Sinatra
- Part 3: TF-IDF Topics with Cassandra, Python Streaming and Flask
Monday, 1 October 2012
$25 Million in C Round for DataStax
I’d say that raising another $25 million from Meritech Capital Partners, with the participation of existing investors Lightspeed Venture Partners and Crosslink Capital, is a good enough reason for DataStax to party.
DataStax will use the funds to further enhance its Big Data platform and increase the value for current customers while driving global customer acquisition.
Congrats to DataStax and the Cassandra community!
Tuesday, 25 September 2012
Doing Redundant Work to Speed Up Distributed Queries
Great post by Peter Bailis looking at how some systems are reducing tail latency by distributing reads across nodes:
Open-source Dynamo-style stores have different answers. Apache Cassandra originally sent reads to all replicas, but CASSANDRA-930 and CASSANDRA-982 changed this: one commenter argued that “in IO overloaded situations” it was better to send read requests only to the minimum number of replicas. By default, Cassandra now sends reads to the minimum number of replicas 90% of the time and to all replicas 10% of the time, primarily for consistency purposes. (Surprisingly, the relevant JIRA issues don’t even mention the latency impact.) LinkedIn’s Voldemort also uses a send-to-minimum strategy (and has evidently done so since it was open-sourced). In contrast, Basho Riak chooses the “true” Dynamo-style send-to-all read policy.
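The two policies described in the quote can be illustrated with a small sketch. This is not Cassandra's code: `replicas_to_read` is a hypothetical function, the replica names are made up, and the list is assumed to be already ordered by preference so that "the minimum number of replicas" means the first `required` entries. The 10% default mirrors the figure in the quote.

```python
import random


def replicas_to_read(replicas, required, read_all_chance=0.10, rng=random):
    """Pick the replicas a coordinator would contact for one read.

    replicas: replica identifiers, assumed ordered by preference.
    required: minimum number of replicas the read needs.
    read_all_chance: fraction of reads sent to every replica (send-to-all).
    """
    if not 1 <= required <= len(replicas):
        raise ValueError("required must be between 1 and len(replicas)")
    if rng.random() < read_all_chance:
        return list(replicas)          # send-to-all, 10% of the time by default
    return list(replicas[:required])   # send-to-minimum otherwise


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical replica names
    all_reads = sum(
        len(replicas_to_read(nodes, 2, rng=rng)) == len(nodes)
        for _ in range(10_000)
    )
    print("fraction of reads sent to all replicas:", all_reads / 10_000)
```

Setting `read_all_chance=1.0` gives the send-to-all policy the quote attributes to Riak, and `0.0` gives a pure send-to-minimum policy like the one attributed to Voldemort.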
via: http://www.bailis.org/blog/doing-redundant-work-to-speed-up-distributed-queries/
Monday, 3 September 2012
Reddit’s Database Has Two Tables
Considering the fast evolution of NoSQL databases, the topic is now very old (from 2010). But read the comments on the original post, Hacker News, and Reddit to see what people think today about extreme denormalization, schemas, relational and NoSQL databases.
via: http://kev.inburke.com/kevin/reddits-database-has-two-tables/
Tuesday, 7 August 2012
Latency-Consistency Analysis
A very interesting proposal and patch for enhancing nodetool to provide cluster latency-consistency analysis. From JIRA:
We’ve implemented Probabilistically Bounded Staleness, a new technique for predicting consistency-latency trade-offs within Cassandra. Our paper will appear in VLDB 2012, and, in it, we’ve used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than profiling and manually testing each possible replication configuration (especially in production!).
This analysis is important for the many users we’ve talked to and heard about who use “partial quorum” operation (e.g., non-QUORUM ConsistencyLevel). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there’s no existing way to answer these questions.
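To make the question in the quote concrete, here is a deliberately simplified Monte Carlo sketch of the kind of prediction involved. It is not the PBS implementation or the nodetool patch: the paper's model accounts for more latency components, while this version only models write propagation delay, assumes it is exponentially distributed with a made-up mean, and assumes a read contacts `r` replicas chosen at random. All names are hypothetical.

```python
import random


def estimate_consistency(n, r, w, t_ms, mean_write_delay_ms=5.0,
                         trials=20_000, seed=0):
    """Estimate P(a read issued t_ms after a write commits sees that write).

    n: replicas, r: replicas contacted per read, w: acks needed to commit.
    Simplifications: exponential write propagation delay, no ack or read
    latency, and the read contacts r replicas picked uniformly at random.
    """
    rng = random.Random(seed)
    consistent = 0
    for _ in range(trials):
        # Time at which each replica receives the write.
        arrivals = [rng.expovariate(1.0 / mean_write_delay_ms) for _ in range(n)]
        # The write commits once w replicas have received it.
        commit_time = sorted(arrivals)[w - 1]
        read_time = commit_time + t_ms
        contacted = rng.sample(range(n), r)
        # The read is consistent if any contacted replica already has the write.
        if any(arrivals[i] <= read_time for i in contacted):
            consistent += 1
    return consistent / trials


if __name__ == "__main__":
    for t in (0.0, 1.0, 5.0, 20.0):
        print("N=3, R=1, W=1, t=%.0f ms:" % t, estimate_consistency(3, 1, 1, t))
    print("N=3, R=2, W=2, t=0 ms:", estimate_consistency(3, 2, 2, 0.0))
```

Under these assumptions, a quorum configuration (R + W > N) always reads the committed write, while partial quorums trade a probability of staleness that shrinks as the time since the write grows.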
Accumulo, HBase, Cassandra and Some Unanswered Questions
Cade Metz for Wired:
The bill bars the DoD from using the database unless the department can show that the software is sufficiently different from other databases that mimic BigTable. But at the same time, the bill orders the director of the NSA to work with outside organizations to merge the Accumulo security tools with alternative databases, specifically naming HBase and Cassandra.
Is this good for HBase and Cassandra? Is this good for encouraging innovation? Is this good for supporting businesses? Just a few questions I couldn’t answer myself after reading this article about the Senate’s investigation into the NSA’s open-sourced Accumulo.
via: http://www.wired.com/wiredenterprise/2012/07/nsa-accumulo-google-bigtable/
