Cloudera Announces Support for Apache Accumulo - what, how, why

Cloudera, the leader in enterprise analytic data management powered by Apache Hadoop™, today announced its formal support for, and integration with, Apache Accumulo, a highly distributed, massively parallel processing database that is capable of analyzing structured and unstructured data and delivers fine-grained user access control and authentication. Accumulo uniquely enables system administrators to assign data access at the cell level, ensuring that only authorized users can view and manipulate individual data points. This increased control allows a database to be accessed by a maximum number of users, while remaining compliant with data privacy and security regulations.
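
The cell-level model is easy to picture: each cell carries a boolean expression over security labels, and a read returns the cell only if the reader's authorizations satisfy it. A simplified sketch (illustration only; Accumulo's real visibility language also supports parentheses and quoting, and its API is Java):

```python
def cell_visible(expression, authorizations):
    """Return True if a user holding `authorizations` may see a cell
    labeled with `expression`. Supports '|' (OR) over '&' (AND) terms,
    e.g. 'admin&pii|ops' means (admin AND pii) OR ops. Accumulo's real
    syntax also has parentheses and quoting; this sketch omits them."""
    if not expression:
        return True  # unlabeled cells are visible to everyone
    return any(
        all(label in authorizations for label in clause.split("&"))
        for clause in expression.split("|")
    )

# A read filters each cell through its expression per user:
row = {("user123", "ssn"): ("admin&pii", "xxx-xx-1234")}
alice = {"admin", "pii"}
visible = {col: val for col, (vis, val) in row.items()
           if cell_visible(vis, alice)}   # alice holds both labels
```

The point of pushing this check into the database is that one table can safely serve users with very different clearances, which is exactly the compliance story in the press release.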

What about HBase?

Mike Olson:

It offers a strong complement to HBase, which has been part of our CDH offering since 2010, and remains the dominant high-performance delivery engine for NoSQL workloads running on Hadoop. However, Accumulo was expressly built to augment sensitive data workloads with fine-grained user access and authentication controls that are of mission-critical importance for federal and highly regulated industries.

The way I read this is: if you don’t need security, go with HBase. If you need advanced security features, go with Accumulo.

How?

While there aren’t any details about what formal support means, I assume Cloudera will start offering Accumulo as an alternative to HBase.

[Image: CE diagram]

I might be wrong, though, about Accumulo being a replacement for HBase. I’d love to learn how and why the two would co-exist.

Why?

The obvious reason is that Cloudera wants to win contracts in government and other heavily regulated markets, where security is a top requirement.

Another reason might be that Cloudera is continuing to expand its portfolio to catch as many customers as possible. Something à la Oracle or IBM. The alternative would be to stay focused. Like Teradata.

Original title and link: Cloudera Announces Support for Apache Accumulo (NoSQL database©myNoSQL)

via: http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1859607


Facebook’s Cassandra paper, annotated and compared to Apache Cassandra 2.0

The evolution from the original paper to Cassandra 2.0 in an interesting format:

The release of Apache Cassandra 2.0 is a good point to look back at the past five years of progress after Cassandra’s release as open source. Here, we annotate the Cassandra paper from LADIS 2009 with the new features and improvements that have been added since.

Original title and link: Facebook’s Cassandra paper, annotated and compared to Apache Cassandra 2.0 (NoSQL database©myNoSQL)

via: http://www.datastax.com/documentation/articles/cassandra/cassandrathenandnow.html


Hoya, HBase on YARN, Architecture

The architecture of HBase on top of YARN, a project named Hoya:

[Image: Hoya application architecture]

The main question I had about what YARN would bring to HBase is answered in the post. But I’m still not sure I get the whole picture of how YARN improves HBase’s availability (if it does at all):

YARN keeps an eye on the health of the containers, telling the AM when there is a problem. It also monitors the Hoya AM itself. When the AM fails, YARN allocates a new container for it, and restarts it. This provides an availability solution to Hoya without it having to code it in itself.
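
The supervision loop described in the quote can be caricatured in a few lines. This is a toy model of the behavior, not the YARN API: the resource manager watches the application master's container and, on failure, allocates a fresh container and restarts the AM in it, so Hoya never has to implement its own failover.

```python
class Container:
    """Stand-in for a YARN container hosting an application process."""
    def __init__(self, container_id):
        self.container_id = container_id
        self.healthy = True

class ResourceManager:
    """Toy model of restart-on-failure: monitor the AM's container and
    replace it when it dies. The application gets availability without
    coding it in itself, which is the claim in the quote above."""
    def __init__(self):
        self.next_id = 0
        self.restarts = 0
        self.am = self._allocate()

    def _allocate(self):
        self.next_id += 1
        return Container(self.next_id)

    def heartbeat(self):
        """Called periodically; replaces the AM container if it died."""
        if not self.am.healthy:
            self.am = self._allocate()
            self.restarts += 1
        return self.am

rm = ResourceManager()
rm.am.healthy = False   # simulate an AM crash
am = rm.heartbeat()     # RM notices and restarts the AM in a new container
```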

Original title and link: Hoya, HBase on YARN, Architecture (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/hoya-hbase-on-yarn-application-architecture/


Considering TokuDB as an engine for timeseries data... or Cassandra or OpenTSDB

Vadim Tkachenko:

  • Provide high insertion rate
  • Provide a good compression rate to store more data on expensive SSDs
  • Engine should be SSD friendly (fewer writes per time period to help with SSD wear)
  • Provide a reasonable response time (within ~50 ms) on SELECT queries on hot recently inserted data

Looking at these requirements, I actually think that TokuDB might be a good fit for this task.

There are solutions in the NoSQL space that are optimized for this scenario: Cassandra or OpenTSDB. Indeed, using one of these will have an impact on the application side.
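
For context, the storage pattern that engines like Cassandra and OpenTSDB (and, differently, fractal-tree engines like TokuDB) exploit for these requirements is roughly: append points into time-ordered buckets, so writes are sequential (SSD-friendly, compresses well) and queries over recent data touch only the newest buckets. A toy sketch, not any of these engines' actual implementations:

```python
from collections import defaultdict

class TimeSeriesStore:
    """Minimal time-bucketed append store: inserts are appends into the
    bucket covering the timestamp, and a range query scans only the
    buckets overlapping the requested window."""
    def __init__(self, bucket_seconds=3600):
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(list)  # bucket start -> [(ts, value)]

    def insert(self, ts, value):
        # Append-only: no random rewrites, friendly to SSD wear.
        self.buckets[ts - ts % self.bucket_seconds].append((ts, value))

    def query(self, start_ts, end_ts):
        """Scan only the buckets overlapping [start_ts, end_ts]."""
        out = []
        b = start_ts - start_ts % self.bucket_seconds
        while b <= end_ts:
            out.extend(p for p in self.buckets.get(b, ())
                       if start_ts <= p[0] <= end_ts)
            b += self.bucket_seconds
        return out
```

Queries on hot, recently inserted data stay fast because they never visit cold buckets, which is the property the ~50 ms SELECT requirement is really asking for.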

Most of the time, when the requirements dictate looking into different solutions, the easiest costs to estimate are the initial ones: development (nb: this doesn’t include only pure development, but also learning costs, etc.) and hardware.

Unfortunately, many times we fail to take into consideration the long-term costs:

  • maintenance costs (hardware, operations, enhancements)
  • opportunity costs (features that the current architecture won’t be able to support as being either impossible or too expensive)
  • accounting for the risks of failed initial designs (the technical debt costs)

Way too many times we optimize for the initial costs (the general excuse is that familiarity delivers faster, with the more scientific forms: time to market is essential, and premature optimization is the root of all evil), while almost completely ignoring the ongoing costs.

Original title and link: Considering TokuDB as an engine for timeseries data… or Cassandra or OpenTSDB (NoSQL database©myNoSQL)

via: http://www.mysqlperformanceblog.com/2013/08/29/considering-tokudb-as-an-engine-for-timeseries-data/


Big Data Debate: HBase or Cassandra

This debate about the pros and cons of HBase and Cassandra, set up by Doug Henschen for InformationWeek and featuring Jonathan Ellis (Cassandra, DataStax) and Michael Hausenblas (MapR), will stir some strong feelings:

Michael Hausenblas: An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for their internal use.

Jonathan Ellis: The technical shortcomings driving HBase’s lackluster adoption fall into two major categories: engineering problems that can be addressed given enough time and manpower, and architectural flaws that are inherent to the design and cannot be fixed.

✚ One question I couldn’t answer about this dialog is why the HBase side wasn’t represented by either an HBase community member or a user. Indeed, MapR has an interest in HBase, but their product is not HBase.

Original title and link: Big Data Debate: HBase or Cassandra (NoSQL database©myNoSQL)

via: http://www.informationweek.com/software/enterprise-applications/big-data-debate-will-hbase-become-domina/240159475?nomobile=1


$45 million more for DataStax

Holy cow! That’s a 4 followed by a 5… with no dots in between.

  1. Derrick Harris for GigaOm: NoSQL startup DataStax raises $45M to ride Cassandra’s wave:

    Cassandra’s success with such large users has to do with its ability to handle large-scale online applications that demand steady levels of performance, DataStax CEO Billy Bosworth told me. Scalability and performance have never been among Cassandra’s shortcomings, and the database is capable of replicating data across data centers. Large companies used to choose Oracle for applications that needed these capabilities, but now that NoSQL options are around and relatively mature, companies are rethinking whether the relational database model was ever really correct for some applications in the first place.

  2. Alex Williams for TC: DataStax Readies For IPO, Raises $45M For Modern Database Platform Suited To New Data Intensive World:

    DataStax will use the funding to build out globally and invest in Apache Cassandra, the NoSQL open-source project and foundation for the company’s database distributions. The funding also signals a potential IPO for DataStax but much will depend on the direction of the markets, said CEO Billy Bosworth in an interview yesterday. “We are building the company for that direction (IPO),” he said. “A lot depends on external factors. Internally, the company is already starting that process.”

According to my books:

  1. This is the largest round raised by a NoSQL company. It tops 10gen’s $42mil. round for MongoDB.
  2. This is the 3rd largest round raised in the new data market, after Cloudera’s $65mil. and Hortonworks’s $50mil. rounds.

Original title and link: $45 million more for DataStax (NoSQL database©myNoSQL)


Get up and Running with Cassandra on Google Compute Engine

On the Google Cloud Platform blog:

The guide walks you through creating your nodes (instances), setting up Java, and creating and configuring a firewall. Included in the guide are several scripts that make the configuration and setup easy to understand and execute. Once you are finished with your cluster, a simple call to a teardown script cleans up your project’s environment.

Can you speculate why Cassandra is the first NoSQL database that gets mentioned on Google’s blog? (hint: maybe this?)

Original title and link: Get up and Running with Cassandra on Google Compute Engine (NoSQL database©myNoSQL)

via: http://googlecloudplatform.blogspot.com/2013/07/get-up-and-running-with-cassandra-on-google-compute-engine.html


How do you decide what database to use for what task?

Nathan Milford of Outbrain answering the question of how to decide what database to use for what task:

We look at how the data will be queried, its size, and how it needs to be distributed. We might use things like MySQL for historical reasons and MongoDB for smaller tasks, and then Cassandra for situations where data doesn’t all fit into memory or where it spans multiple machines and possibly data centers.

This is indeed a good recipe: data access model, data size, distribution model.

Original title and link: How do you decide what database to use for what task? (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/the-five-minute-interview-outbrain-touches-over-80-of-all-us-online-users-with-help-from-cassandra


Cassandra Summit’s Bests

If you haven’t been to Cassandra Summit 2013 or you missed some presentations, now you can (re)watch them on YouTube. Jonathan Ellis put together his list of favorites here and here.

I’m posting this on a Saturday, as there are a lot of interesting talks, and if Cassandra is on your radar, it will take a couple of weekends to go through them.

Original title and link: Cassandra Summit’s Bests (NoSQL database©myNoSQL)


Best argument for official drivers

Jonathan Ellis:

More qualitatively but perhaps even more important, this addresses the paradox of choice we’ve had in the Cassandra Java world: multiple driver choices provide another barrier to newcomers, where each must evaluate the options for applicability to his project. Having just done such an evaluation to settle on Cassandra itself, this is the last thing they want to spend time on.

And that’s the best-case scenario. More often, a fragmented landscape leads to many solutions, each of which solve a different 80% of the problem. Better to have a single, well-thought-out solution, that lets people get started writing their application immediately.

This is the best argument ever for having official drivers.

✚ In the early days, and over the long term, it’s quite difficult for a company to offer only official drivers. But there’s a solution for that too: recommend one. And support its maintainers.

Original title and link: Best argument for official drivers (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/the-native-cql-java-driver-goes-ga


Titan: Data Loading and Transactional Benchmark

The Aurelius team describing an advanced benchmark of Titan, a massive-scale property graph database allowing real-time traversals and updates. The benchmark, sponsored by Pearson, was developed and run over 5 months:

The 10 terabyte, 121 billion edge graph was loaded into the cluster in 1.48 days at a rate of approximately 1.2 million edges a second with 0 failed transactions. These numbers were possible due to new developments in Titan 0.3.0 whereby graph partitioning is achieved using a domain-based byte order partitioner.

✚ The answer to why Titan is built on Cassandra can be found in this interview between Aurelius CTO Matthias Broecheler and DataStax co-founder Matt Pfeil:

[…] we don’t have to worry about things like replication, backup, and snapshots because all of that stuff is handled by Cassandra. We really just focus on: “How do you distribute a graph?”, “How do you represent a graph efficiently in a big table model?”, “How do you do things like edge compression and other things that are very graph-specific in order to make the database fast?” And, lastly, “How do you build intelligent index structures so that the graph traversals, which are the core of any graph database, are as fast as possible?”
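
The “graph in a big table model” question usually comes down to an adjacency-list layout: one row per vertex, one column per incident edge, so a traversal step is a single row read. A hypothetical sketch of that layout (my illustration, not Titan’s actual storage format):

```python
class AdjacencyTable:
    """Each vertex is one row; each outgoing edge is one column keyed by
    (label, neighbor). This mirrors the wide-row adjacency layout graph
    databases commonly use on top of stores like Cassandra; a sketch
    only, not how Titan really lays out bytes."""
    def __init__(self):
        self.rows = {}  # vertex id -> {(label, neighbor): edge properties}

    def add_edge(self, src, label, dst, **props):
        self.rows.setdefault(src, {})[(label, dst)] = props

    def neighbors(self, vertex, label=None):
        """One 'row read' yields every outgoing edge, which is why a
        traversal step is cheap in this layout."""
        return [dst for (lbl, dst) in self.rows.get(vertex, {})
                if label is None or lbl == label]
```

With edges colocated in their vertex’s row, a multi-hop traversal is a chain of row reads rather than scattered point lookups, which is the property the interview is pointing at.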

Original title and link: Titan: Data Loading and Transactional Benchmark (NoSQL database©myNoSQL)

via: http://www.planetcassandra.org/blog/post/educating-the-planet-with-pearson


HBase migration to the new Hadoop Metrics2 system

Elliott Clark explains a bit of the work he’s doing migrating the HBase metrics to Hadoop’s Metrics2 system:

As HBase’s metrics system grew organically, Hadoop developers were making a new version of the Metrics system called Metrics2. In HADOOP-6728 and subsequent JIRAs, a new version of the metrics system was created. This new subsystem has a new name space, different sinks, different sources, more features, and is more complete than the old metrics. When the Metrics2 system was completed, the old system (aka Metrics1) was deprecated. With all of these things in mind, it was time to update HBase’s metrics system so HBASE-4050 was started. I also wanted to clean up the implementation cruft that had accumulated.

The post is more about the specific implementation details than about the wide range of metrics HBase already supports and how this new system would unify them and allow extending them.
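
For readers new to the Metrics2 vocabulary the quote uses: sources produce metric snapshots (e.g. a region server) and sinks consume them (e.g. Ganglia, a file), decoupling what is measured from where it is shipped. A minimal sketch of that pattern (illustration only; Hadoop’s actual Metrics2 API is Java and considerably richer):

```python
class MetricsSource:
    """Anything that can report a snapshot of its metrics."""
    def get_metrics(self):
        raise NotImplementedError

class MetricsSystem:
    """Toy source/sink fan-out: poll every registered source and hand
    each snapshot to every registered sink. New sinks can be attached
    without touching the sources, which is the design win of Metrics2."""
    def __init__(self):
        self.sources, self.sinks = [], []

    def register_source(self, source):
        self.sources.append(source)

    def register_sink(self, sink):
        # A sink is any callable accepting a snapshot dict.
        self.sinks.append(sink)

    def publish(self):
        for source in self.sources:
            snapshot = source.get_metrics()
            for sink in self.sinks:
                sink(snapshot)

class RegionServerSource(MetricsSource):
    """Hypothetical source standing in for an HBase region server."""
    def get_metrics(self):
        return {"requests": 42, "regions": 7}

ms = MetricsSystem()
ms.register_source(RegionServerSource())
collected = []
ms.register_sink(collected.append)   # a trivial in-memory sink
ms.publish()
```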

Original title and link: HBase migration to the new Hadoop Metrics2 system (NoSQL database©myNoSQL)

via: https://blogs.apache.org/hbase/entry/migration_to_the_new_metrics