


Redshift: All content tagged as Redshift in NoSQL databases and polyglot persistence

Big Data benchmark: Redshift, Hive, Impala, Shark, Stinger/Tez

Hosted by AMPLab, the origin of Spark, this benchmark compares Redshift, Hive, Shark, Impala, and Stinger/Tez:

Several analytic frameworks have been announced in the last year. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ) and systems which optimize MapReduce to improve performance on analytical workloads (Shark, Stinger/Tez). This benchmark provides quantitative and qualitative comparisons of five systems. It is entirely hosted on EC2 and can be reproduced directly from your computer.

More important than the results:

  1. the clear methodology
  2. and its reproducibility

Original title and link: Big Data benchmark: Redshift, Hive, Impala, Shark, Stinger/Tez (NoSQL database©myNoSQL)


Moving product recommendations from Hadoop to Redshift saves us time and money

Our old relational data warehousing solution, Hive, was not performant enough for us to generate product recommendations in SQL in our configuration.

This right here describes the common theme across all the “Redshift is so much faster and cheaper than Hive” stories: expecting a relational data warehouse from Hadoop and Hive. You tell me if that’s the right expectation.

Here are other similar “revelations”:

Original title and link: Moving product recommendations from Hadoop to Redshift saves us time and money (NoSQL database©myNoSQL)


Hadoop vs Redshift

This is how Yaniv Mor’s “Hadoop vs. Redshift” ends:

We have a tie! Huh!? Didn’t Hadoop win most of the rounds? Yes, it did, but Big Data’s superheroes are better off working together as a team rather than fighting. Turn on the Hadoop-Signal when you need relatively cheap data storage, batch processing of petabytes, or processing data in non-relational formats. Call out to red-caped Redshift for analytics, fast performance for terabytes, and an easier transition for your PostgreSQL team. As Airbnb concluded in their benchmark: “We don’t think Redshift is a replacement of the Hadoop family due to its limitations, but rather it is a very good complement to Hadoop for interactive analytics”. We Agree.

I’m wondering why anyone would spend 1337 words on an apples-to-oranges comparison.

Original title and link: Hadoop vs Redshift (NoSQL database©myNoSQL)


Amazon Redshift Update

A couple of interesting points from Werner Vogels’s post about Amazon Redshift’s security:

  1. Amazon Redshift has over 1000 customers and is adding new ones at a rate of 100/week. I’m not familiar with customer acquisition numbers in the data warehouse space, but it doesn’t look like ParAccel, at least in its Redshift incarnation, is failing.
  2. Amazon Redshift positioning: “price, performance and simplicity”. I cannot see many companies being able to compete against this triplet.
  3. Amazon has reduced the cost of DynamoDB read operations to 1/4 of the previous price to make that data more accessible to Redshift.

Original title and link: Amazon Redshift Update (NoSQL database©myNoSQL)


Amazon Web Services Annual Revenue Estimation

Over the weekend, Christopher Mims published an article in which he derives a figure for Amazon Web Services’s annual revenue, $2.4 billion:

Amazon is famously reticent about sales figures, dribbling out clues without revealing actual numbers. But it appears the company has left enough hints to, finally, discern how much revenue it makes on its cloud computing business, known as Amazon Web Services, which provides the backbone for a growing portion of the internet: about $2.4 billion a year.

There’s no way to decompose this number into the revenue of each AWS solution. For the data space I’d be interested in:

  1. S3 revenues. This is the space Basho’s Riak CS competes in.

    After writing my first post about Riak CS, I’ve learned that in Japan, where Riak CS is used by Yahoo!’s new cloud storage, Gemini Mobile Technologies has been offering local ISPs a similar S3-like service built on top of Cassandra.

  2. Redshift is pretty new and, while I’m not aware of immediate competitors (what am I missing?), I don’t think it accounts for a significant part of this revenue, even if some of the early users, like Airbnb, report getting very good performance and costs from it.

    Redshift is powered by ParAccel, which, over the weekend, was acquired by Actian.

  3. Amazon Elastic MapReduce. This is another interesting space, from which Microsoft wants a share with its Azure HDInsight, developed in collaboration with Hortonworks.

    In this space there’s also the MapR and Google Compute combination, which seems to be extremely performant.

  4. Interestingly, Amazon is also making money from some of the competitors of its Amazon DynamoDB and RDS services. The advantage of owning the infrastructure.

Original title and link: Amazon Web Services Annual Revenue Estimation (NoSQL database©myNoSQL)

What Makes Amazon Redshift Faster Than Hive?

I’m not implying that this question appeared on Quora after my link and comments about Redshift’s performance and costs at Airbnb, but Reynold Xin’s answer covers, in a more formal way, the reasons I suggested in that post for Redshift being faster than Hive:

Some of the advantages you gain from massive scale and flexibility make it challenging to build a more performant query engine. The following outlines how various features (or lack of features) influences performance:

  1. data format
  2. task launch overhead (nb: this can be optimized in Hive/Hadoop)
  3. intermediate data materialization vs pipelining
  4. columnar data format
  5. columnar query engine
  6. faster S3 connection
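To make two of the points above more concrete (columnar data format, and intermediate data materialization vs pipelining), here is a toy sketch in Python. It is only an illustration under my own assumptions: the table, the column names, and the function names are all made up, and nothing here reflects how Redshift or Hive are actually implemented.

```python
from array import array

# Hypothetical toy table: (user_id, country, revenue). Values are invented.
ROWS = [
    (1, "US", 10.0),
    (2, "DE", 20.5),
    (3, "US", 5.25),
    (4, "JP", 7.0),
]


def sum_revenue_row_store(rows):
    """Row layout: every whole row is visited although one field is needed."""
    total = 0.0
    for row in rows:
        total += row[2]
    return total


def to_column_store(rows):
    """Columnar layout: one contiguous sequence per column."""
    return {
        "user_id": array("q", (r[0] for r in rows)),
        "country": [r[1] for r in rows],
        "revenue": array("d", (r[2] for r in rows)),
    }


def sum_revenue_column_store(columns):
    """Only the 'revenue' column is scanned; the other columns are untouched."""
    return sum(columns["revenue"])


def us_revenue_materialized(rows):
    """Each step writes out a full intermediate result before the next starts."""
    filtered = [r for r in rows if r[1] == "US"]  # intermediate list
    projected = [r[2] for r in filtered]          # another intermediate list
    return sum(projected)


def us_revenue_pipelined(rows):
    """Each row flows through filter, projection and sum with no intermediates."""
    filtered = (r for r in rows if r[1] == "US")  # lazy generator
    projected = (r[2] for r in filtered)          # lazy generator
    return sum(projected)


if __name__ == "__main__":
    columns = to_column_store(ROWS)
    print(sum_revenue_row_store(ROWS), sum_revenue_column_store(columns))
    print(us_revenue_materialized(ROWS), us_revenue_pipelined(ROWS))
```

Both variants in each pair return the same answer; the difference is in how much data is touched or written along the way, which is where an analytic engine wins or loses time at scale.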

Original title and link: What Makes Amazon Redshift Faster Than Hive? (NoSQL database©myNoSQL)


Redshift Performance & Cost at Airbnb

Henry Cai from Airbnb reports on their experiment with, and move from, Hive on Hadoop to Amazon Redshift:

As shown above the performance gain is pretty significant, and the cost saving is even more impressive: $13.60/hour versus $57/hour. This is hard to compare due to the different pricing models, but check out pricing here for more info. In fact, our analysts like Redshift so much that they don’t want to go back to Hive and other tools even though a few key features are lacking in Redshift. Also, we have noticed that big joins of billions of rows tend to run for a very long time, so for that we’d go back to hadoop for help.
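For a rough sense of scale, the two hourly figures in the quote can be put side by side. This is a back-of-the-envelope sketch that assumes $13.60/hour is the Redshift cluster and $57/hour is the Hadoop one (my reading of the quote), and it ignores the pricing-model differences the quote itself warns about. The function names are mine.

```python
# Hourly figures taken from the quote above; which system each belongs to
# is an assumption based on the surrounding sentence.
REDSHIFT_USD_PER_HOUR = 13.60
HADOOP_USD_PER_HOUR = 57.00


def cost_ratio(expensive, cheap):
    """How many times more the expensive option costs per hour."""
    return expensive / cheap


def savings_fraction(expensive, cheap):
    """Fraction of the hourly bill saved by switching to the cheap option."""
    return 1 - cheap / expensive


if __name__ == "__main__":
    ratio = cost_ratio(HADOOP_USD_PER_HOUR, REDSHIFT_USD_PER_HOUR)
    saving = savings_fraction(HADOOP_USD_PER_HOUR, REDSHIFT_USD_PER_HOUR)
    print(f"Hadoop costs about {ratio:.1f}x as much per hour")  # about 4.2x
    print(f"Hourly saving: about {saving:.0%}")                 # about 76%
```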

If I’m not mistaken, this is the second story in the last week about the performance of Redshift. But here’s something I don’t understand (or I don’t see mentioned in this post):

  1. you use Hadoop to store your data. The reason is that 12 months ago, 6 months ago, and today there has been no more cost-effective and productive solution.
  2. in this time you learn about the data. You develop models and queries
  3. your analysts prefer SQL because that’s what makes them more productive
  4. you take the data, the knowledge you’ve built in this time, you craft it to fit into a columnar analytic database
  5. then you write that the columnar analytic-oriented database is more performant than using Hive over Hadoop

To me this feels like saying that you are more efficient in your mother tongue than in a foreign language. Or am I missing something?

Original title and link: Redshift Performance & Cost at Airbnb (NoSQL database©myNoSQL)

Amazon Preparing 'Disruptive' Big Data AWS Service?

Interesting speculation by The Register:

AWS already has the AWS Data Pipeline, which helps administrators schedule and shuttle data among various services, AWS Redshift for data warehousing which lets people store large quantities of data in the cloud and run queries on it, its NoSQL SSD-backed DynamoDB, and its Relational Database Service (RDS). So where does MADS fit?

The Reg’s take is that MADS will allow Amazon to build services that can net together the above components and help automate the passing of data among them. It may also become a standalone product in its own right, based on its similarities to the TransLattice and Google Spanner tech.

I almost never bet, but I’d say this could be Amazon’s Spanner.

Original title and link: Amazon Preparing ‘Disruptive’ Big Data AWS Service? (NoSQL database©myNoSQL)


Amazon Redshift - Now Broadly Available

Jeff Barr:

We announced Amazon Redshift, our fast and powerful, fully managed, petabyte-scale data warehouse service, late last year (see my earlier blog post for more info).


We’ve designed Amazon Redshift to be cost-effective, easy to use, and flexible.


  1. who is the ideal Redshift user? I assume it should be AWS users that already have data in the Amazon cloud. Otherwise I have a bit of a hard time imagining trucks carrying tons of hard drives into Amazon data centers.
  2. what happens if for some reason you decide to move your data out of Redshift? How would that work?
  3. what is the next move and counter-argument of Greenplum, Netezza, Vertica, etc. to Redshift?

Original title and link: Amazon Redshift - Now Broadly Available (NoSQL database©myNoSQL)