BigTable: All content tagged as BigTable in NoSQL databases and polyglot persistence

Klout Data Architecture: MySQL, HBase, Hive, Pig, Elastic Search, MongoDB, SSAS

Just found a slide deck (embedded below) describing the data workflow at Klout. Their architecture includes many interesting pieces, combining both NoSQL and relational databases with Hadoop, Hive, Pig, and traditional BI. Even Excel gets a mention in the slides:

  1. Pig and Hive
  2. HBase
  3. Elastic Search
  4. MongoDB
  5. MySQL

[Embedded slide deck: Klout Data Architecture]


Configuring HBase Memstore: What You Should Know

A very well documented post by Alex Baranau about HBase Memstore, HBase write and read operations and the importance of correctly configuring Memstore:

  • There are a number of configuration options for Memstore one can use to achieve better performance and avoid issues. HBase will not adjust settings for you based on usage pattern.
  • Frequent Memstore flushes can affect reading performance and can bring additional load to the system
  • The way Memstore flushes work may affect your schema design
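To make the second point concrete, here is a minimal toy sketch of a buffered write path. It is not HBase code: the `TinyStore` class and its `flush_threshold` parameter are hypothetical names, and the threshold counts cells rather than bytes. It only shows why a store that flushes often ends up with more files for a read to consult.

```python
class TinyStore:
    """Toy write path: buffer writes in memory, flush them to immutable sorted files."""

    def __init__(self, flush_threshold):
        # Hypothetical knob: number of buffered cells that triggers a flush.
        self.flush_threshold = flush_threshold
        self.memstore = {}
        # Each flushed "file" is a sorted list of (key, value); newest file is last.
        self.files = []

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memstore:
            self.files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        """Return (value, places_checked), looking at the buffer first, then newest files."""
        checked = 1
        if key in self.memstore:
            return self.memstore[key], checked
        for flushed in reversed(self.files):
            checked += 1
            cells = dict(flushed)
            if key in cells:
                return cells[key], checked
        return None, checked
```

Writing the same ten keys into a store with a threshold of 2 and into one with a threshold of 100 leaves the first with five files to search and the second with none. A real system reduces this cost with compactions and indexes, which the toy leaves out.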

[Figure: HBase read and write path]

Original title and link: Configuring HBase Memstore: What You Should Know (NoSQL database©myNoSQL)

via: http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/


Where Cassandra Really Shines

Steve Corona on Hacker News:

Where Cassandra REALLY shines and is often overlooked is ease of maintenance. Cassandra’s ability to bootstrap new nodes, replicate, reshard and handle down nodes (w/ hinted handoff) is almost magical. I use it in production and it works very reliably.

Sure, it’s got some cool big data stuff, but try doing any of those “maintenance” operations on other databases without ripping your hair out. For example, even bringing up a new MySQL slave is a huge pain in the ass, let alone doing something non-trivial like promoting a new master.

This reinforces exactly what I emphasized as the merits of NoSQL systems in "Is SQL or NoSQL better for programmers".



eBay's Cassandra Data Modeling Best Practices

Jay Patel (architect at eBay):

Our Cassandra deployment is not huge, but it’s growing at a healthy pace. In the past couple of months, we’ve deployed dozens of nodes across several small clusters spanning multiple data centers. You may ask, why multiple clusters? We isolate clusters by functional area and criticality. Use cases with similar criticality from the same functional area share the same cluster, but reside in different keyspaces.

This first post is focused on two old techniques that have been applied even with relational databases:

  1. model data around query patterns
  2. de-normalize and duplicate for read performance.
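A minimal sketch of the two techniques, using plain Python dictionaries as stand-ins for column families. The "likes" scenario, the `LikeStore` class, and all field names are hypothetical illustrations, not taken from the eBay post. Each query gets its own structure keyed by what the query knows, and every write is duplicated into both.

```python
from collections import defaultdict


class LikeStore:
    """Toy model: one denormalized structure per query pattern."""

    def __init__(self):
        # Query 1: "which items does this user like?"  row key = user_id
        self.items_by_user = defaultdict(dict)
        # Query 2: "which users like this item?"       row key = item_id
        self.users_by_item = defaultdict(dict)

    def like(self, user_id, user_name, item_id, item_title):
        # Duplicate the write so each read is a single-row lookup with no join.
        self.items_by_user[user_id][item_id] = item_title
        self.users_by_item[item_id][user_id] = user_name

    def items_liked_by(self, user_id):
        return sorted(self.items_by_user.get(user_id, {}).values())

    def users_who_like(self, item_id):
        return sorted(self.users_by_item.get(item_id, {}).values())
```

The trade-off is the usual one: reads become single lookups, while writes cost more and the duplicated names and titles have to be kept in sync by the application.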


via: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/


From MongoDB to Cassandra: Why Atlas Platform Is Migrating

Sergio Bossa tells the story of migrating the Atlas platform from using MongoDB to Cassandra emphasizing the reasons behind their decision:

  • It works on the JVM, and we have lots of in-house experience on it.
  • It scales in terms of processing and storage capacity.
  • Its column-based data model gives us some advanced capabilities we will talk about in a few minutes.
  • Its tunable consistency levels provide greater control over high availability and consistency requirements.
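The last point can be illustrated with the usual replica-overlap arithmetic: if the number of replicas that acknowledge a write plus the number consulted on a read exceeds the replication factor, the two sets must share at least one replica. The function names below are hypothetical and this is only a sketch of that rule, not the Cassandra API.

```python
def quorum(replication_factor):
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, write_acks, read_acks):
    """True when every read replica set must intersect every write replica set."""
    return write_acks + read_acks > replication_factor
```

With a replication factor of 3, quorum writes and quorum reads (2 + 2) overlap, while writing to one replica and reading from one (1 + 1) does not. The second setting favours availability and latency over reading the latest write.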

As regards what made them look into a different solution:

  • We need higher resiliency to faults: MongoDB provides replica sets, but we’re experiencing lots of problems with replication lags and during replica synchronization.
  • We need higher scalability: MongoDB global lock and huge memory requirements aren’t already going to cope well with our growing data set.


via: http://metabroadcast.com/blog/looking-with-cassandra-into-the-future-of-atlas


How to Organize Your HBase Keys

The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation I find it easiest to think of your key like a funnel. Start with the piece of data you always need to partition on, and narrow it down to the more specific data that you don’t often need to distinguish.[…]

As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.

This sounds similar, in a way, to how Amazon DynamoDB's hash and range type primary keys or Oracle NoSQL's major-minor keys work.
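The funnel idea can be sketched with a sorted list of byte strings standing in for HBase's lexicographically ordered rows. The key layout (`app_id`, `day`, `event_id`) and the function names are hypothetical. Components are packed as fixed-width big-endian integers so that byte order matches numeric order.

```python
import bisect
import struct


def make_key(app_id, day, event_id):
    """Composite row key: broadest component first, most specific last."""
    return struct.pack(">IIQ", app_id, day, event_id)


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, using the sort order to seek."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(key)
    return matches


def scan_by_day(sorted_keys, day):
    """Querying by a non-leading component degrades to a full scan."""
    return [k for k in sorted_keys if struct.unpack(">IIQ", k)[1] == day]
```

A prefix of `app_id` alone, or of `app_id` plus `day`, seeks straight to the matching range. A query on `day` without `app_id` has to read every key, which is the limitation the excerpt describes.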


via: http://tech.flurry.com/137492485


HBase HFile Explained

This is probably the most comprehensible and complete article about how HBase stores data:

Hadoop comes with a SequenceFile[1] file format that you can use to append your key/value pairs but due to the hdfs append-only capability, the file format cannot allow modification or removal of an inserted value. […] To help you solve this problem Hadoop has another file format, called MapFile[1], an extension of the SequenceFile. The MapFile, in reality, is a directory that contains two SequenceFiles: the data file “/data” and the index file “/index”. The MapFile allows you to append sorted key/value pairs and every N keys (where N is a configurable interval) it stores the key and the offset in the index.
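The MapFile layout described above can be sketched in memory: a sorted data list plus a sparse index holding every Nth key and its offset. The `TinyMapFile` class is a hypothetical illustration, not the Hadoop API, and as a simplification it rejects duplicate keys.

```python
import bisect


class TinyMapFile:
    """Toy MapFile: sorted data entries plus a sparse index of every Nth key."""

    def __init__(self, index_interval):
        self.index_interval = index_interval
        self.data = []           # stands in for the "data" file
        self.index_keys = []     # stands in for the "index" file: sampled keys...
        self.index_offsets = []  # ...and their positions in the data

    def append(self, key, value):
        # Simplification of this toy: keys must be strictly increasing.
        if self.data and key <= self.data[-1][0]:
            raise ValueError("keys must be appended in increasing order")
        if len(self.data) % self.index_interval == 0:
            self.index_keys.append(key)
            self.index_offsets.append(len(self.data))
        self.data.append((key, value))

    def get(self, key):
        # Binary search the sparse index, then scan at most one interval of data.
        slot = bisect.bisect_right(self.index_keys, key) - 1
        if slot < 0:
            return None
        start = self.index_offsets[slot]
        for k, v in self.data[start:start + self.index_interval]:
            if k == key:
                return v
            if k > key:
                break
        return None
```

The interval is the trade-off: a smaller N means a larger index but a shorter scan per lookup, and a larger N the reverse.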


via: http://www.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/


Hortonworks Data Platform 1.0

Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.

Some info points:

  1. Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop

    [Figure: HDP diagram]

    1. HDP 1.0 is based on Apache Hadoop 1.0
    2. Apache Ambari is used for installation and provisioning
    3. The same Apache Ambari is behind the Hortonworks Management Console
    4. For Data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
    5. Apache HCatalog is the solution offering metadata and table management
  2. Hortonworks Data Platform is 100% open source—I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community

  3. HDP comes with 3 levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster

One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on using VMware-powered virtual machines for the NameNode and JobTracker. My first thought about this approach is that it was chosen to strengthen a partnership with VMware. On the other hand, Hadoop 2.0 already contains a new highly-available version of the NameNode (Cloudera Hadoop Distribution uses this solution) and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.

You can read a lot of posts about this announcement, but you’ll find all the details in Hortonworks’s John Kreisa’s post here and the PR announcement.



Performance Evaluation of HBase and How Hardware Changes Results

Two posts by Oliver Meyn on measuring the performance of two HBase clusters (first the results on the original cluster, then the results on the upgraded cluster) using org.apache.hadoop.hbase.PerformanceEvaluation. They include the resulting performance charts, Ganglia charts, and some thoughts and feedback from the HBase community.



Using R With Cassandra Through JDBC or Hive

A short post by Jake Luciani listing two R modules, RJDBC and RCassandra, that enable using R with Cassandra through either the JDBC or Hive drivers.

This is a good example of what I meant by designing products with openness and integration in mind.


via: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive


HBase 0.94 Released: What’s New

With over 350 enhancements and bug fixes, 0.94 is the new major release of HBase. This Cloudera blog post gives a good summary of the most interesting improvements:

  • Read caching improvements
  • Seek optimizations
  • WAL writes optimizations
  • added functionality to HBck: fixing orphaned regions, region holes, overlapping regions
  • simplified region sizing
  • atomic Put & Delete in a single transaction



Cassandra at Workware Systems: Data Model FTW

One of the stories in which the deciding factor for using Cassandra was primarily the data model and not its scalability characteristics:

We started working with relational databases, and began building things primarily with PostgreSQL at first. But dealing with the kind of data that we do, the data model just wasn’t appropriate. We started with Cassandra in the beginning to solve one problem: we needed to persist large vector data that was updated frequently from many different sources. RDBMS’s just don’t do that very well, and the performance is really terrible for fast read operations. By contrast, Cassandra stores that type of data exceptionally well and the performance is fantastic. We went on from there and just decided to store everything in Cassandra.


via: http://www.datastax.com/2012/04/the-five-minute-interview-workware-systems