bigtable: All content tagged as bigtable in NoSQL databases and polyglot persistence
Tuesday, 17 July 2012
Where Cassandra Really Shines
Where Cassandra REALLY shines and is often overlooked is ease of maintenance. Cassandra’s ability to bootstrap new nodes, replicate, reshard and handle down nodes (w/ hinted handoff) is almost magical. I use it in production and it works very reliably.
Sure, it’s got some cool big data stuff, but try doing any of those “maintenance” operations on other databases without ripping your hair out. For example, even bringing up a new MySQL slave is a huge pain in the ass, let alone doing something non-trivial like promoting a new master.
This reinforces exactly what I emphasized as the merits of NoSQL systems in "Is SQL or NoSQL better for programmers".
Original title and link: Where Cassandra Really Shines (©myNoSQL)
Monday, 16 July 2012
eBay's Cassandra Data Modeling Best Practices
Jay Patel (architect at eBay):
Our Cassandra deployment is not huge, but it’s growing at a healthy pace. In the past couple of months, we’ve deployed dozens of nodes across several small clusters spanning multiple data centers. You may ask, why multiple clusters? We isolate clusters by functional area and criticality. Use cases with similar criticality from the same functional area share the same cluster, but reside in different keyspaces.
This first post is focused on two old techniques that have been applied even with relational databases:
- model data around query patterns
- de-normalize and duplicate for read performance.
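A minimal sketch of both techniques, using plain Python dictionaries as a stand-in for column families. The "likes" use case, the class, and every name in it are hypothetical illustrations, not code from the eBay post. Each query pattern gets its own pre-built view, and every write is duplicated into both views so that reads never need a join.

```python
from collections import defaultdict


class LikesStore:
    """Hypothetical store with one denormalized view per query pattern."""

    def __init__(self):
        # Query 1: "which items does this user like?"
        self.items_by_user = defaultdict(dict)
        # Query 2: "which users like this item?"
        self.users_by_item = defaultdict(dict)

    def like(self, user, item, timestamp):
        # The same fact is written twice, once per view (duplication on write).
        self.items_by_user[user][item] = timestamp
        self.users_by_item[item][user] = timestamp

    def items_liked_by(self, user):
        # A single-key read, no join needed.
        return sorted(self.items_by_user[user])

    def users_who_like(self, item):
        return sorted(self.users_by_item[item])
```

The trade-off is the usual one: writes cost more and the two views must be kept in sync, in exchange for reads that touch a single row.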
Original title and link: eBay’s Cassandra Data Modeling Best Practices (©myNoSQL)
via: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Thursday, 12 July 2012
From MongoDB to Cassandra: Why Atlas Platform Is Migrating
Sergio Bossa tells the story of migrating the Atlas platform from MongoDB to Cassandra, emphasizing the reasons behind their decision:
- It works on the JVM, and we have lots of in-house experience on it.
- It scales in terms of processing and storage capacity.
- Its column-based data model gives us some advanced capabilities we will talk about in a few minutes.
- Its tunable consistency levels provide greater control over high availability and consistency requirements.
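To make the last point concrete, here is a small sketch of the arithmetic usually quoted for tunable consistency: with N replicas, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. The function names are made up for this example and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1


def read_overlaps_write(n: int, w: int, r: int) -> bool:
    """True when every read set must share at least one replica with
    every write set, i.e. when R + W > N."""
    return r + w > n
```

For a replication factor of 3, quorum writes plus quorum reads (2 + 2 > 3) overlap, while writing and reading a single replica (1 + 1) does not, which trades consistency for availability and latency.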
As for what made them look for a different solution:
- We need higher resiliency to faults: MongoDB provides replica sets, but we’re experiencing lots of problems with replication lags and during replica synchronization.
- We need higher scalability: MongoDB global lock and huge memory requirements aren’t already going to cope well with our growing data set.
Original title and link: From MongoDB to Cassandra: Why Atlas Platform Is Migrating (©myNoSQL)
via: http://metabroadcast.com/blog/looking-with-cassandra-into-the-future-of-atlas
How to Organize Your HBase Keys
The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation I find it easiest to think of your key like a funnel. Start with the piece of data you always need to partition on, and narrow it down to the more specific data that you don’t often need to distinguish.[…]
As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.
This sounds similar to how Amazon DynamoDB's hash-and-range primary keys or Oracle NoSQL's major-minor keys work.
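The funnel idea from the quote can be sketched in a few lines of Python. This is an illustration only: a sorted list stands in for HBase's lexicographically ordered row keys, and the key layout (customer id, then day, then event id) is an assumption made up for the example.

```python
import bisect
import struct


def make_key(customer_id: int, day: int, event_id: int) -> bytes:
    # Fixed-width big-endian fields, so byte order matches numeric order.
    return struct.pack(">IIH", customer_id, day, event_id)


def prefix_scan(sorted_keys, prefix: bytes):
    """Return all keys starting with `prefix`, using the sort order to
    jump to the first candidate instead of scanning everything."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match
        matches.append(key)
    return matches
```

Scanning by customer, or by customer and day, is a cheap prefix scan. Asking for one day across all customers has no usable prefix and would need a full scan, which is the limitation the quote describes. Leading with a component that takes only a few hot values also concentrates load on a few regions, as the caveat warns.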
Original title and link: How to Organize Your HBase Keys (©myNoSQL)
Monday, 9 July 2012
HBase HFile Explained
This is probably one of the most comprehensible and complete articles about how HBase stores data:
Hadoop comes with a SequenceFile[1] file format that you can use to append your key/value pairs but due to the hdfs append-only capability, the file format cannot allow modification or removal of an inserted value. […] To help you solve this problem Hadoop has another file format, called MapFile[1], an extension of the SequenceFile. The MapFile, in reality, is a directory that contains two SequenceFiles: the data file “/data” and the index file “/index”. The MapFile allows you to append sorted key/value pairs and every N keys (where N is a configurable interval) it stores the key and the offset in the index.
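The MapFile layout described above (sorted data plus an index entry every N keys) can be illustrated with a small in-memory sketch. It is not the Hadoop implementation: list positions stand in for byte offsets into the data file, and both function names are invented for this example.

```python
import bisect


def build_sparse_index(sorted_records, interval):
    """Keep (key, position) for every `interval`-th record only."""
    return [
        (sorted_records[i][0], i)
        for i in range(0, len(sorted_records), interval)
    ]


def lookup(sorted_records, index, key):
    """Find the last index entry <= key, then scan forward from there."""
    index_keys = [k for k, _ in index]
    slot = bisect.bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first record
    start = index[slot][1]
    for k, value in sorted_records[start:]:
        if k == key:
            return value
        if k > key:
            break  # passed the spot where the key would be
    return None
```

A larger interval makes the index smaller but lengthens the forward scan after each index hit, which is the trade-off behind making N configurable.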
Original title and link: HBase HFile Explained (©myNoSQL)
via: http://www.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
Friday, 15 June 2012
Hortonworks Data Platform 1.0
Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.
Some info points:
- Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop
- HDP 1.0 is based on Apache Hadoop 1.0
- Apache Ambari is used for installation and provisioning
- The same Apache Ambari is behind the Hortonworks Management Console
- For Data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
- Apache HCatalog is the solution offering metadata and table management
- Hortonworks Data Platform is 100% open source. I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community
- HDP comes with 3 levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster
One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on using VMware-powered virtual machines for the NameNode and JobTracker. My first thought about this approach is that it was chosen to strengthen a partnership with VMware. On the other hand, Hadoop 2.0 already contains a new highly-available version of the NameNode (Cloudera Hadoop Distribution uses this solution) and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.
You can read a lot of posts about this announcement, but you’ll find all the details in Hortonworks’s John Kreisa’s post here and the PR announcement.
Original title and link: Hortonworks Data Platform 1.0 (©myNoSQL)
Friday, 8 June 2012
Performance Evaluation of HBase and How Hardware Changes Results
Two posts by Oliver Meyn on measuring the performance of two HBase clusters (first the results on the original cluster, then the results on the upgraded cluster) using org.apache.hadoop.hbase.PerformanceEvaluation. The posts include the resulting performance charts, Ganglia charts, and some thoughts and feedback from the HBase community.
Original title and link: Performance Evaluation of HBase and How Hardware Changes Results (©myNoSQL)
Thursday, 24 May 2012
Using R With Cassandra Through JDBC or Hive
A short post by Jake Luciani listing 2 R modules—RJDBC module and RCassandra—that enable using R with Cassandra through either the JDBC or Hive drivers.
This is a good example of what I meant by designing products with openness and integration in mind.
Original title and link: Using R With Cassandra Through JDBC or Hive (©myNoSQL)
via: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive
Thursday, 17 May 2012
HBase 0.94 Released: What’s New
With over 350 enhancements and bug fixes, 0.94 is the new major release of HBase. This Cloudera blog post gives a good summary of the most interesting improvements:
- Read caching improvements
- Seek optimizations
- WAL writes optimizations
- added functionality to HBck: fixing orphaned regions, region holes, overlapping regions
- simplified region sizing
- atomic Put & Delete in a single transaction
Original title and link: HBase 0.94 Released: What’s New (©myNoSQL)
Wednesday, 16 May 2012
Cassandra at Workware Systems: Data Model FTW
One of the stories in which the deciding factor for using Cassandra was primarily the data model and not its scalability characteristics:
We started working with relational databases, and began building things primarily with PostgreSQL at first. But dealing with the kind of data that we do, the data model just wasn’t appropriate. We started with Cassandra in the beginning to solve one problem: we needed to persist large vector data that was updated frequently from many different sources. RDBMS’s just don’t do that very well, and the performance is really terrible for fast read operations. By contrast, Cassandra stores that type of data exceptionally well and the performance is fantastic. We went on from there and just decided to store everything in Cassandra.
Original title and link: Cassandra at Workware Systems: Data Model FTW (©myNoSQL)
via: http://www.datastax.com/2012/04/the-five-minute-interview-workware-systems
Thursday, 10 May 2012
NoSQL and Relational Databases Podcast With Mathias Meyer
EngineYard’s Ines Sombra recorded a conversation with Mathias Meyer about NoSQL databases and their evolution towards friendlier functionality, relational databases and their steps towards non-relational models, and a bit more on what polyglot persistence means.
Mathias Meyer is one of the people I could talk with for days about NoSQL and databases in general, with different infrastructure toppings, and he has some of the most well-balanced thoughts when speaking about this exciting space (see this conversation I had with him in the early days of NoSQL). I strongly encourage you to download the mp3 and listen to it.
Original title and link: NoSQL and Relational Databases Podcast With Mathias Meyer (©myNoSQL)
Monday, 7 May 2012
Cassandra 1.1 Released: What’s New
There are too many interesting new features and improvements in the newly released Cassandra 1.1 to cover them all here, but here’s the gist of them:
- Schema improvements
- Support for compound keys
- Concurrent schema changes
- A new version of Cassandra Query Language (CQL3) supporting compound keys and wide rows
- Better and easier tuning of the key and row caches
- Support for per-table hybrid storage (mixing SSDs and spinning disks)
This DataStax blog entry provides links to more details about all these features and the others I haven’t enumerated above.
Original title and link: Cassandra 1.1 Released: What’s New (©myNoSQL)
