This is how Yaniv Mor’s “Hadoop vs. Redshift” ends:
We have a tie! Huh!? Didn’t Hadoop win most of the rounds? Yes, it did, but
Big Data’s superheroes are better off working together as a team rather than
fighting. Turn on the Hadoop-Signal when you need relatively cheap data
storage, batch processing of petabytes, or processing data in non-relational
formats. Call out to red-caped Redshift for analytics, fast performance for
terabytes, and an easier transition for your PostgreSQL team. As Airbnb
concluded in their benchmark: “We don’t think Redshift is a replacement of
the Hadoop family due to its limitations, but rather it is a very good
complement to Hadoop for interactive analytics”. We agree.
I’m left wondering why anyone would waste 1337 words on an apples-to-oranges comparison.
Original title and link: Hadoop vs Redshift
A couple of interesting points from Werner Vogels’s post about Amazon Redshift’s security:
- Amazon Redshift has over 1,000 customers and is adding new ones at a rate of 100 per week. I’m not familiar with customer acquisition numbers in the data warehouse space, but it doesn’t look like ParAccel, at least in its Redshift incarnation, is failing
- Amazon Redshift positioning: “price, performance and simplicity”. I cannot see many companies being able to compete against this triplet.
- Amazon has reduced the cost of read operations from DynamoDB to a quarter of the previous price, making that data more accessible to Redshift
Original title and link: Amazon Redshift Update ( ©myNoSQL)
Over the weekend, Christopher Mims published an article in which he derives a figure for Amazon Web Services’ annual revenue: $2.4 billion:
Amazon is famously reticent about sales figures, dribbling out clues without
revealing actual numbers. But it appears the company has left enough hints
to, finally, discern how much revenue it makes on its cloud computing
business, known as Amazon Web Services, which provides the backbone for a
growing portion of the internet: about $2.4 billion a year.
There’s no way to decompose this number into the revenue of each AWS solution. For the data space, I’d be interested in:
S3 revenue. This is the space Basho’s Riak CS competes in.
After writing my first post about Riak CS, I’ve learned that in Japan, the same place where Riak CS runs Yahoo!’s new cloud storage, Gemini Mobile Technologies has been offering local ISPs a similar S3-style service built on top of Cassandra.
Redshift is pretty new, and while I’m not aware of immediate competitors (what am I missing?), I don’t think it accounts for a significant part of this revenue, even if some of the early users, like Airbnb, report getting very good performance and costs from it.
Redshift is powered by ParAccel, which, over the weekend, was acquired by Actian.
Amazon Elastic MapReduce. This is another interesting space, one Microsoft wants a share of with its Azure HDInsight, developed in collaboration with Hortonworks.
In this space there’s also the MapR and Google Compute Engine combination, which seems to be extremely performant.
Interestingly, Amazon is also making money from some of the competitors of its DynamoDB and RDS services. That’s the advantage of owning the infrastructure.
Original title and link: Amazon Web Services Annual Revenue Estimation ( ©myNoSQL)
I’m not implying that this question appeared on Quora because of my link and comments about Redshift’s performance and costs at Airbnb, but Reynold Xin’s answer covers, in a more formal way, the reasons Redshift is faster than Hive that I suggested in that post:
Some of the advantages you gain from massive scale and flexibility make it
challenging to build a more performant query engine. The following outlines how various features (or lack of
features) influences performance:
- data format
- task launch overhead (nb: this can be optimized in Hive/Hadoop)
- intermediate data materialization vs pipelining
- columnar data format
- columnar query engine
- faster S3 connection
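The columnar points in the list above are, I think, the biggest ones. Here is a minimal sketch of why a columnar layout helps analytic scans — plain Python, not Redshift or Hive code, with made-up data, purely to illustrate the idea:

```python
# Illustration only: why columnar storage speeds up analytic queries.
# Row storage forces a scan to touch every field of every record;
# columnar storage lets the query read only the column it needs.
rows = [{"user_id": i, "country": "US", "amount": i * 0.5}
        for i in range(100_000)]

# Row-oriented aggregation: iterates whole records to read one field.
total_rows = sum(r["amount"] for r in rows)

# Columnar layout: each column is a contiguous array of one type,
# which also compresses far better in practice.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "country": [r["country"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# Columnar aggregation touches only the "amount" array.
total_columnar = sum(columns["amount"])

assert total_rows == total_columnar
```

The same reasoning extends to the query engine itself: an engine that operates on column vectors (as Redshift's does) avoids per-row interpretation overhead, whereas Hive's MapReduce plans materialize intermediate results between stages.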
Original title and link: What Makes Amazon Redshift Faster Than Hive? ( ©myNoSQL)
Henry Cai from Airbnb reports on their experiment with, and move from, Hive on Hadoop to Amazon Redshift:
As shown above the performance gain is pretty significant, and the cost
saving is even more impressive: $13.60/hour versus $57/hour. This is hard to
compare due to the different pricing models, but check out pricing here for
more info. In fact, our analysts like Redshift so much that they don’t want
to go back to Hive and other tools even though a few key features are
lacking in Redshift. Also, we have noticed that big joins of billions of
rows tend to run for a very long time, so for that we’d go back to hadoop
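For what it’s worth, the quoted hourly figures work out to roughly a 4x difference. A quick back-of-the-envelope check — the only inputs are the two numbers from the quote:

```python
# Back-of-the-envelope check of the hourly costs quoted above.
hive_cost_per_hour = 57.00      # reported Hive/Hadoop cluster cost
redshift_cost_per_hour = 13.60  # reported Redshift cluster cost

ratio = hive_cost_per_hour / redshift_cost_per_hour
savings_pct = (1 - redshift_cost_per_hour / hive_cost_per_hour) * 100

print(f"Redshift is ~{ratio:.1f}x cheaper ({savings_pct:.0f}% savings)")
# → Redshift is ~4.2x cheaper (76% savings)
```

As the quote itself notes, the two pricing models aren’t directly comparable (reserved instances, spot pricing, and cluster sizing all muddy the picture), so treat this as a rough ratio, not a rigorous TCO comparison.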
If I’m not mistaken, this is the second story in the last week about the performance of Redshift. But here’s something I don’t understand (or don’t see mentioned in this post):
- you use Hadoop to store your data. The reason is that 12 months ago, 6 months ago (and today), there was no more cost-effective and productive solution
- in this time you learn about the data; you develop models and queries
- your analysts prefer SQL because that’s what makes them most productive
- you take the data and the knowledge you’ve built in this time, and you craft it to fit into a columnar analytic database
- then you write that the columnar, analytics-oriented database is more performant than Hive over Hadoop
To me this feels like saying that you are more efficient in your mother tongue than in a foreign language. Or am I missing something?
Original title and link: Redshift Performance & Cost at Airbnb ( ©myNoSQL)
Interesting speculation by The Register:
AWS already has the AWS Data Pipeline, which helps administrators
schedule and shuttle data among various services, AWS Redshift for
data warehousing which lets people store large quantities of data in
the cloud and run queries on it, its NoSQL SSD-backed DynamoDB, and
its Relational Database Service (RDS). So where does MADS fit?
The Reg’s take is that MADS will allow Amazon to build services that
can net together the above components and help automate the passing
of data among them. It may also become a standalone product in its
own right, based on its similarities to TransLattice and Google Spanner.
I almost never bet, but I’d say this could be Amazon’s Spanner.
Original title and link: Amazon Preparing ‘Disruptive’ Big Data AWS Service? ( ©myNoSQL)
We announced Amazon Redshift, our fast and powerful, fully managed,
petabyte-scale data warehouse service, late last year (see my
earlier blog post for more info).
We’ve designed Amazon Redshift to be cost-effective, easy to use, …
A few questions:
- who is the ideal Redshift user? I assume it should be AWS users that already have data in the Amazon cloud. Otherwise I have a bit of a hard time imagining trucks carrying tons of hard drives into Amazon data centers.
- what happens if for some reason you decide to move your data out of Redshift? How would that work?
- what is the next move and counter-argument of Greenplum, Netezza, Vertica, etc. to Redshift?
Original title and link: Amazon Redshift - Now Broadly Available ( ©myNoSQL)