column store: All content tagged as column store in NoSQL databases and polyglot persistence
Friday, 24 May 2013
HBase migration to the new Hadoop Metrics2 system
Elliott Clarke explains a bit of the work he is doing in migrating the HBase metrics to Hadoop’s Metrics2 system:
As HBase’s metrics system grew organically, Hadoop developers were making a new version of the Metrics system called Metrics2. In HADOOP-6728 and subsequent JIRAs, a new version of the metrics system was created. This new subsystem has a new name space, different sinks, different sources, more features, and is more complete than the old metrics. When the Metrics2 system was completed, the old system (aka Metrics1) was deprecated. With all of these things in mind, it was time to update HBase’s metrics system so HBASE-4050 was started. I also wanted to clean up the implementation cruft that had accumulated.
The post is more about the specific implementation details than the wide range of metrics HBase already supports and how this new system would unify and allow extending it.
Original title and link: HBase migration to the new Hadoop Metrics2 system (©myNoSQL)
via: https://blogs.apache.org/hbase/entry/migration_to_the_new_metrics
Thursday, 23 May 2013
Introduction to HBase Mean Time to Recover (MTTR) - HBase Resiliency
A fantastic post by Nicolas Liochon and Devaraj Das looking into possible HBase failure scenarios and configurations to reduce the Mean Time to Recover:
There are no global failures in HBase: if a region server fails, all the other regions are still available. For a given data-subset, the MTTR was often considered as around ten minutes. This rule of thumb was actually coming from a common case where the recovery was taking time because it was trying to use replicas on a dead datanode. Ten minutes would be the time taken by HDFS to declare a node as dead. With the new stale mode in HDFS, it’s not the case anymore, and the recovery is now bounded by HBase alone. If you care about MTTR, with the settings mentioned here, most cases will take less than 2 minutes between the actual failure and the data being available again in another region server.
Stepping back a bit, the overall complexity seems to come from the various components involved (ZooKeeper, HBase, HDFS), each with its own failure detection mechanism. If they are not correctly configured and ordered, things can get pretty ugly; ugly as in a longer MTTR than one would expect.
Original title and link: Introduction to HBase Mean Time to Recover (MTTR) - HBase Resiliency (©myNoSQL)
via: http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/
Cassandra anti-patterns: Queues and queue-like datasets or when Deletes can bite
Aleksey Yeschenko has an interesting post about the impact deletes can have on Cassandra and different workaround solutions:
Specifically, tombstones will bite you if you do lots of deletes (especially column-level deletes) and later perform slice queries on rows with a lot of tombstones.
I wouldn’t call this a “you got your data model wrong” case, but rather a known implementation limitation that affects some scenarios, for which a different data model should be used; the difference, while only semantic, is that the error is not on the user.
In other words, if you use column-level deletes (or expiring columns) heavily and also need to perform slice queries over that data, try grouping columns with close “expiration date” together and getting rid of them in a single move.
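The grouping advice above can be sketched in a few lines. This is only an illustration and not code from the original post: the one-hour bucket width, the `queue:index` row key layout, and the names `bucket_key` and `expired_buckets` are all assumptions. The idea is that columns expiring around the same time share a row, so an expired row can be dropped in one move instead of leaving many column-level tombstones behind.

```python
from datetime import datetime, timedelta, timezone

BUCKET = timedelta(hours=1)  # assumed bucket width, tune to your expiry pattern
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_key(queue: str, expires_at: datetime) -> str:
    """Row key shared by all columns that expire within the same bucket."""
    index = (expires_at - EPOCH) // BUCKET
    return f"{queue}:{index}"


def expired_buckets(rows: dict, queue: str, now: datetime) -> list:
    """Row keys whose whole bucket lies in the past and can be deleted at once."""
    current = (now - EPOCH) // BUCKET
    prefix = queue + ":"
    return [
        key for key in rows
        if key.startswith(prefix) and int(key[len(prefix):]) < current
    ]
```

Each key returned by `expired_buckets` stands for a single row-level delete; slice queries then only ever touch the current bucket.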
Original title and link: Cassandra anti-patterns: Queues and queue-like datasets or when Deletes can bite (©myNoSQL)
via: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
Monday, 20 May 2013
The Master-Slave Architecture of HBase
Fantastic post by Matteo Bertozzi looking at HBase’s master-slave architecture:
At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.
Original title and link: The Master-Slave Architecture of HBase (©myNoSQL)
via: https://blogs.apache.org/hbase/entry/hbase_who_needs_a_master
Friday, 12 April 2013
HBase Data Modeling Tips & Tricks - Timeshifting
Jeff Kolesky describes the data model they are using with HBase and one (strange) trick to reduce the roundtrips to the database:
The idea is to put all of the data about a single entity into a single row in HBase. When you need to run a computation that involves that entity’s data, you have quick access to it by the row key, and all of the data is stored close together on disk.
Additionally, against many suggestions from the HBase community, and general confusion about how timestamps work, we are using timestamps with logical values. Instead of just letting the region server assign a timestamp version to each cell, we are explicitly setting those values so that we can use timestamp as a true queryable dimension in our gets and scans.
In addition to the real timeseries data that is indexed using the cell timestamp, we also have other columns that store metadata about the entity.
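To make the timestamp trick concrete, here is a toy in-memory model of it. It is not the HBase client API; `ToyTable`, the `d:reading` column name, and the half-open time range are assumptions made for the sketch. What it shows is the writer choosing the cell timestamp explicitly, and reads filtering on it as if it were another query dimension.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table whose cells keep timestamped versions."""

    def __init__(self):
        self._cells = defaultdict(dict)  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        # The caller, not the server, picks the timestamp (a logical value).
        self._cells[(row, column)][timestamp] = value

    def get(self, row, column, min_ts, max_ts):
        """Return (timestamp, value) pairs with min_ts <= ts < max_ts, newest first."""
        versions = self._cells.get((row, column), {})
        hits = [(ts, v) for ts, v in versions.items() if min_ts <= ts < max_ts]
        return sorted(hits, reverse=True)
```

With all of an entity's readings in one row, a single `get` over a time range replaces several roundtrips keyed by date.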
It’s amazing how many smart and weird tricks engineers put in their production systems when having to deal with real requirements and SLAs.
Original title and link: HBase Data Modeling Tips & Tricks - Timeshifting (©myNoSQL)
via: http://www.heyitsopower.com/code/timeshifting-in-hbase/
Thursday, 4 April 2013
Kairosdb - Fast Scalable Time Series Database
KairosDB is introduced as a rewrite of OpenTSDB written primarily for Cassandra (nb: OpenTSDB is based on HBase). As for what it brings that is new, this page lists:
- Uses Guice to load modules.
- Incorporates Jetty for Rest API and serving up UI.
- Pure Java build tool (Tablesaw)
- UI uses Flot and is client side rendered.
- Ability to customize UI.
- Relative time now includes month and supports leap years.
- Modular data store interface supports:
  - HBase
  - Cassandra
  - H2 (For development)
- Milliseconds data support when using Cassandra.
- Rest API for querying and submitting data.
- Build produces deployable tar, rpm and deb packages.
- Linux start/stop service scripts.
- Faster.
- Made aggregations optional (easier to get raw data).
- Added abilities to import and export data.
- Aggregators can aggregate data for a specified period.
- Aggregators can be stacked or “piped” together.
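The last two items, aggregating over a specified period and stacking aggregators, can be illustrated with a small sketch. This is not KairosDB's API; `bucketed` and `pipe` are hypothetical helpers, and data points are assumed to be `(timestamp, value)` pairs with integer timestamps.

```python
from functools import reduce


def bucketed(fn, period):
    """Build an aggregator that applies fn to the values in each period-wide window."""
    def aggregate(points):
        windows = {}
        for ts, value in points:
            # Align each point to the start of its window.
            windows.setdefault(ts - ts % period, []).append(value)
        return [(start, fn(values)) for start, values in sorted(windows.items())]
    return aggregate


def pipe(*aggregators):
    """Chain aggregators so each one consumes the output of the previous one."""
    return lambda points: reduce(lambda acc, agg: agg(acc), aggregators, points)
```

For example, `pipe(bucketed(sum, 60), bucketed(max, 120))` first sums values per 60-unit window, then takes the maximum of those sums per 120-unit window.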
Source code lives on GitHub. Let’s see where it goes.
Original title and link: Kairosdb - Fast Scalable Time Series Database (©myNoSQL)
Wednesday, 3 April 2013
5 Steps to Benchmarking Managed NoSQL - DynamoDB Vs Cassandra
Ben Bromhead (instaclustr) for High Scalability:
To determine the suitability of a provider, your first port of call is to benchmark. Choosing a service provider is often done in a number of stages. First is to shortlist providers based on capabilities and claimed performance, ruling out those that do not meet your application requirements. Second is to look for benchmarks conducted by third parties, if any. The final stage is to benchmark the service yourself.
✚ Peter Bailis asks a very valid question: “if it’s the default YCSB and it’s a benchmark, where are the results?”
✚ instaclustr offers a totally managed hosting solution for Cassandra. (Disclaimer: they’ve sponsored myNoSQL in the past)
Original title and link: 5 Steps to Benchmarking Managed NoSQL - DynamoDB Vs Cassandra (©myNoSQL)
Tuesday, 2 April 2013
Improving Secondary Index Write Performance in Cassandra 1.2
Sam Tunnicliffe describes the old and the new, optimized behavior of secondary index writes in Cassandra 1.2:
While secondary indexes can add a lot of flexibility to the way data is modelled and accessed, they do add complexity on the server side as the indexes need to be kept in sync with the primary data. Until recently, this has led to some significant trade offs in write throughput and IO utilisation as we always had to perform a read before the write in order to update any relevant secondary indexes. In Cassandra 1.2, this area has been substantially reworked to remove the need for read-before-write. New index entries are now written at the same time as the primary data is updated and old entries removed lazily at query time. Overall, this has lead to some decent performance improvements.
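The write path described in the quote can be modeled with a toy example. This is a simplified sketch and not Cassandra's implementation; `LazyIndexedTable` and its single-value rows are assumptions. Writes never read the previous value: they add the new index entry next to the data, and stale entries are only discovered and purged when a query touches them.

```python
class LazyIndexedTable:
    """Toy table with one secondary index maintained without read-before-write."""

    def __init__(self):
        self.rows = {}   # primary key -> current indexed value
        self.index = {}  # indexed value -> set of primary keys (may be stale)

    def write(self, key, value):
        # No read of the old row: write the data and the new index entry together.
        self.rows[key] = value
        self.index.setdefault(value, set()).add(key)

    def query(self, value):
        """Return keys whose current value matches, purging stale index entries."""
        keys = self.index.get(value, set())
        stale = {k for k in keys if self.rows.get(k) != value}
        keys -= stale  # lazy cleanup happens here, at query time
        return sorted(keys)
```

The trade-off is visible in the sketch: writes get cheaper, while the first query after an update pays for the cleanup.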
Original title and link: Improving Secondary Index Write Performance in Cassandra 1.2 (©myNoSQL)
via: http://www.datastax.com/dev/blog/improving-secondary-index-write-performance-in-1-2
Thursday, 28 March 2013
Graph Based Recommendation Systems at eBay
Slide deck from eBay explaining how they have implemented a graph-based recommendation system on top of (surprise!) not a graph database, but Cassandra.
Original title and link: Graph Based Recommendation Systems at eBay (©myNoSQL)
Wednesday, 27 March 2013
HBase Compactions Q&A
Ted Yu summarizes some of the most frequent questions related to compactions in HBase:
On user mailing list, questions about compaction are probably the most frequently asked.
Original title and link: HBase Compactions Q&A (©myNoSQL)
via: http://zhihongyu.blogspot.com/2013/03/compactions-q.html
Wednesday, 13 March 2013
RSS Reader With Cassandra and Netflix OSS Tools
This RSS reader app from Netflix is a very good excuse to try Cassandra together with some of Netflix’s open source projects. And why not build an alternative to Google Reader, which is declared defunct or alive every couple of months:
Projects you’ll use: Cassandra with Astyanax, Archaius, Blitz4j, Eureka, Governator, Hystrix, Karyon, Ribbon, Servo. As for myself, I’ve already checked out the code.
Original title and link: RSS Reader With Cassandra and Netflix OSS Tools (©myNoSQL)
via: http://techblog.netflix.com/2013/03/introducing-first-netflixoss-recipe-rss.html
Tuesday, 12 March 2013
Cassandra at Adobe: The Profile Cache Servers
The team I know at Adobe has invested a lot into HBase and they are offering their services globally. But according to this PDF, in a true polyglot database manner, it looks like other parts of the Adobe business have opted for a different solution: Cassandra. The size of the cluster mentioned in the whitepaper is pretty small, 16 nodes, but what is interesting is that these are beefy servers using solid state drives:
The PCS is comprised of large servers using solid state drives (SSDs) for storage […] The PCS is basically Cassandra with a set of custom APIs built on top of it.
Original title and link: Cassandra at Adobe: The Profile Cache Servers (©myNoSQL)
