


Vertica: All content tagged as Vertica in NoSQL databases and polyglot persistence

Benchmarking graph databases... with unexpected results

A team from MIT CSAIL set out to benchmark a graph database against 3 relational databases with different models: row-based (MySQL), in-memory (VoltDB), and column-based (Vertica). The results are interesting, to say the least:

We can see that relational databases outperform Neo4j on PageRank by up to two orders of magnitude. This is because PageRank involves full scanning and joining of the nodes and edges table, something that relational databases are very good at doing. Finding Shortest Paths involves starting from a source node and successively exploring its outgoing edges, a very different access pattern from PageRank. Still, we see from Figure 1(b) that relational databases match or outperform Neo4j in most cases. In fact, Vertica is more than twice faster than Neo4j. The only exception is VoltDB over Twitter dataset.

Being beaten at your own game is not a good thing. I hope this is just a fluke in the benchmark (a misconfiguration) or a result particular to these data sets.
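The point about PageRank being join-friendly is easy to see in code: each iteration is a join of the current rank table with the edges table, followed by an aggregation grouped on the destination node. A minimal in-memory sketch in Python (standing in for the SQL; it ignores dangling nodes for brevity):

```python
from collections import defaultdict

def pagerank(edges, num_iters=20, d=0.85):
    """Each iteration mirrors:
    SELECT dst, SUM(rank / out_degree) FROM ranks JOIN edges ON node = src GROUP BY dst
    Nodes without outgoing edges (dangling nodes) are ignored for brevity."""
    nodes = set()
    out_degree = defaultdict(int)
    for src, dst in edges:
        nodes.update((src, dst))
        out_degree[src] += 1
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(num_iters):
        contrib = defaultdict(float)
        for src, dst in edges:                # the "join" over the edge list
            contrib[dst] += rank[src] / out_degree[src]
        # the "GROUP BY dst" plus the damping term
        rank = {v: (1 - d) / n + d * contrib[v] for v in nodes}
    return rank
```

A full scan, a hash join, and a GROUP BY per iteration is exactly the access pattern column stores are optimized for, which goes some way toward explaining the relational engines' numbers here.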

Original title and link: Benchmarking graph databases… with unexpected results (NoSQL database©myNoSQL)


Scaling Big Data Mining Infrastructure at Twitter

I almost always enjoy the lessons-learned-style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, were presented at Hadoop Summit. Besides the technical and practical details, there are two things that I really like:

DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”

and then the reality check:

  1. Your boss says something vague
  2. You think very hard on how to move the needle
  3. Where’s the data?
  4. What’s in this dataset?
  5. What’s all the f#$#$ crap in the data?
  6. Clean the data
  7. Run some off-the-shelf data mining algorithm
  8. Productionize, act on the insight
  9. Rinse, repeat


Counting Triangles Smarter (Or How to Beat Big Data Vendors at Their Own Game)

Davy Suvee showing that Datablend’s custom datastore could deliver better performance than generic solutions like Hadoop, Vertica, or Exadata:

Although Vertica and Oracle’s results are impressive, they require a significant hardware setup of 4 nodes, each containing 96GB of RAM and 12 cores. My challenge: beating the Big Data vendors at their own game by calculating triangles through a smarter algorithm that is able to deliver similar performance on commodity hardware (i.e. my MacBook Pro Retina).

Considering the size of the data (86mil. relationships), I wonder what the result would be using a graph database like Neo4j. Anyone up for testing it?
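For reference, the standard single-machine trick that makes triangle counting cheap is to orient every edge from the lower to the higher node id and intersect forward-neighbour sets. This is a generic sketch of that idea in Python, not necessarily Suvee’s exact algorithm:

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                    # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    count = 0
    for u in adj:
        fwd = {v for v in adj[u] if v > u}   # orient edges low id -> high id
        for v in fwd:
            # common forward neighbours of u and v close a triangle u-v-w;
            # the ordering u < v < w guarantees each triangle counts once
            count += len(fwd & {w for w in adj[v] if w > v})
    return count
```

With a compact representation (sorted arrays instead of Python sets), the same idea comfortably handles tens of millions of edges on a laptop.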

Original title and link: Counting Triangles Smarter (Or How to Beat Big Data Vendors at Their Own Game) (NoSQL database©myNoSQL)


Hadoop and Vertica: Using the Right Tools for Managing the Obama Campaign

Dan Woods, in a post titled “How Vertica was the star of the Obama campaign, and other revelations” (nb: I feel my title is more accurate though), details how the technical team behind the Obama campaign used a combination of Hadoop and Vertica to handle data:

The Obama campaign did have Hadoop running in the background, doing the noble work of aggregating huge amounts of data, but the biggest win came from good old SQL on a Vertica data warehouse and from providing access to data to dozens of analytics staffers who could follow their own curiosity and distill and analyze data as they needed.

If you are a vendor, you’d probably emphasize the importance of your own tool, and the closer that tool sits to the end user, the easier that is to do. But we all know that all components of a large system are critical. Take one out and you lose the ability to ingest, or process, or store, or present data. Pretty much a useless system.

Original title and link: Hadoop and Vertica: Using the Right Tools for Managing the Obama Campaign (NoSQL database©myNoSQL)


Hadoop: Answering the Basic Questions: Why, What, How, Where

A post on the HP blog answering 5 questions about Hadoop:

  1. Why Hadoop?
  2. What is Hadoop?
  3. What does it do?
  4. What is it good for?
  5. What’s the future?

With the current momentum behind Hadoop, there’s no question that it’s here to stay. But it’s best to think of Hadoop as a starting point for interesting developments to come.

As you probably know, HP owns Vertica, a column-oriented SQL database for data warehousing and BI, but when talking about Big Data HP includes both Vertica and Hadoop.

Original title and link: Hadoop: Answering the Basic Questions: Why, What, How, Where (NoSQL database©myNoSQL)


Big Data Market Analysis: Vendors Revenue and Forecasts

I think this is the first extensive Big Data report I’ve read that includes enough relevant and quite exhaustive data about the majority of players in the Big Data market, plus some captivating forecasts.

As of early 2012, the Big Data market stands at just over $5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big Data and related analytic capabilities to gain competitive advantage and to improve operational efficiencies, coupled with developments in the technologies and services that make Big Data a practical reality, will result in a super-charged CAGR of 58% between now and 2017.

2011 Big Data Pure-Play Vendors Yearly Big Data Revenue

While there are many stories behind these numbers and many things to think about, here is what I’ve jotted down while studying the report:

  • it’s no surprise that “megavendors” (IBM, HP, etc.) account for the largest part of today’s Big Data market revenue
  • still, the revenue ratio of pure-players vs megavendors feels quite unbalanced: $311mil out of $5.1bil
    • the pure-player category includes: Vertica, Aster Data, Splunk, Greenplum, 1010data, Cloudera, Think Big Analytics, MapR, Digital Reasoning, Datameer, Hortonworks, DataStax, HPCC Systems, Karmasphere
    • there are a couple of names that position themselves in the Big Data market that do not show up anywhere in the report (e.g. 10gen, Couchbase)
  • this could lead to the conclusion that the companies that include hardware in their offer benefit from larger revenues
    • I’m wondering, though, what the margins are in the hardware market segment. While I don’t have any data at hand, I think I’ve read reports about HP and Dell not doing so well due precisely to lower margins
    • see bullet point further down about revenue by hardware, software, and services
  • this could explain why so many companies are trying their hand at appliances
  • by looking at the various numbers you can see that those selling appliances usually have a large corporation behind them supporting the production costs for hardware and probably the cost of the sales force
  • in the Big Data revenue by vendor you can find quite a few well-known names from the consulting segment
  • the revenue-by-type pie lists services as accounting for 44%, hardware for 31%, and software for 13%, which might give an idea of what makes up the megavendors’ sales packages
    • most of the NoSQL database and Hadoop companies sit in the software and services segments

Great job done by the Wikibon team.

Original title and link: Big Data Market Analysis: Vendors Revenue and Forecasts (NoSQL database©myNoSQL)


Vertica and Hadoop for Big Data

Here is what I’ve jotted down during Vertica’s webinar Hadoop vs. RDBMS for Big Data Analytics: Why Choose?

  • the webinar has focused on clarifying where and how Vertica and Hadoop fit in the Big Data space
  • Vertica’s strengths:
    • support for SQL, extended SQL, and analytics, enabling interactive investigation of data
    • storage space efficiency (I don’t think it’s correct to interpret Hadoop’s data redundancy as storage space inefficiency, though)
    • analytics SDK (allows customizing in-database analytic functions)
    • ease of operation and maintenance (auto-tuning features)
  • the following slide is pretty eloquent about Hadoop and Vertica being complementary solutions: Vertica vs Hadoop - Analytics Feature Comparison
  • when covering a scenario for using both Hadoop and Vertica, they chose the easy one: Hadoop as ETL. It’s not that it’s not a good one, but it’s the only one database vendors bring up these days when speaking about integration with Hadoop.

    Hadoop + Vertica Use Case Example

  • other possible Hadoop + Vertica use cases:

    • Filter, join, and aggregation in Vertica with intermediate results fed into MR jobs
    • parallel import and export to HDFS
    • Hadoop MapReduce for data transformation and Vertica for optimized storage and retrieval
  • there will be a community edition of Vertica. It was announced in October for the end of 2011, but I don’t think it’s out yet
  • there’s a GitHub repo for user defined extensions for Vertica
  • the following categorization of Big Data tools is interesting, but it feels skewed in favor of Vertica, which would be placed somewhere close to the center of the triangle

    Triangle of Big Data Tools
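The Hadoop-as-ETL use case is simple enough to sketch in miniature: a map phase that parses and filters messy raw records, and a reduce phase that deduplicates them into clean tab-delimited rows ready for a warehouse bulk load. The field layout and cleanup rules below are invented for illustration:

```python
from itertools import groupby

def map_phase(lines):
    """Parse raw comma-separated log lines, dropping malformed records."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:                      # drop malformed records
            continue
        user, url, ts = (p.strip() for p in parts)
        yield (user, url), ts

def reduce_phase(pairs):
    """Deduplicate by (user, url), keeping the earliest timestamp."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        user, url = key
        first_ts = min(ts for _, ts in group)
        yield f"{user}\t{url}\t{first_ts}"       # tab-delimited for bulk load

raw = [
    "alice, /home, 2012-05-01T10:00",
    "alice, /home, 2012-05-01T10:05",
    "broken line",
    "bob, /pricing, 2012-05-01T11:00",
]
rows = list(reduce_phase(map_phase(raw)))
```

The cleaned rows would then be bulk loaded into the warehouse, which is precisely the division of labor the webinar describes.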

Original title and link: Vertica and Hadoop for Big Data (NoSQL database©myNoSQL)

MySQL at Twitter: Storing 250mil Tweets Daily

Todd Hoff took the time to dissect and extract in a post the interesting bits from Jeremy Cole’s talk “Big and Small Data at @Twitter” from the O’Reilly MySQL conference:

  • MySQL works well enough most of the time that it’s worth using. Twitter values stability over features so they’ve stayed with older releases.
  • MySQL doesn’t work for ID generation and graph storage.
  • MySQL is used for smaller datasets of < 1.5TB, which is the size of their RAID array, and as a backing store for larger datasets.
  • Typical database server config: HP DL380, 72GB RAM, 24 disk RAID10. Good balance of memory and disk.

In my summary of the talk I’ve noted:

  • Use MySQL when it works, something else when not - fortunately MySQL often does work
  • MySQL is used by Twitter because it’s robust, replication works and it’s easy to use and run
  • MySQL doesn’t work well for graphs or auto_increment ID generation, and replication lag is a problem
  • MySQL replication improvements, like a crash-safe, multi-threaded slave, are what they need
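On the ID-generation point: Twitter’s own answer was Snowflake, which it open sourced. The sketch below follows the commonly described layout (41-bit timestamp, 10-bit worker id, 12-bit sequence); treat the constants and code as illustrative, not Twitter’s actual implementation:

```python
import time
import threading

class IdGenerator:
    """Snowflake-style 64-bit IDs: (timestamp << 22) | (worker_id << 12) | sequence.
    Roughly time-ordered and coordination-free across machines; no handling
    of clocks moving backwards in this sketch."""

    EPOCH = 1288834974657  # Snowflake's commonly cited custom epoch (ms)

    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024            # 10 bits for the worker id
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            ms = int(time.time() * 1000)
            if ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit counter
                if self.sequence == 0:          # exhausted this millisecond
                    while ms <= self.last_ms:   # spin until the next one
                        ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = ms
            return ((ms - self.EPOCH) << 22) | (self.worker_id << 12) | self.sequence
```

IDs minted this way sort by creation time and require no central sequence table, which is exactly what auto_increment can’t give a sharded deployment.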

But Twitter is also one of the most prominent use cases of polyglot persistence. While MySQL is an important piece of the Twitter architecture, it is not the only storage or data processing engine.

The following other data solutions get mentioned in Jeremy’s talk:

  • Cassandra is used for high velocity writes, and lower velocity reads. The advantage is Cassandra can run on cheaper hardware than MySQL, it can expand easier, and they like schemaless design.
  • Hadoop is used to process unstructured and large datasets, hundreds of billions of rows.
  • Vertica is being used for analytics and large aggregations and joins so they don’t have to write MapReduce jobs. 

Yet that’s not the whole story. Twitter is using Cassandra and Memcached for real-time URL fetchers, and it also experimented with putting Gizzard in front of Redis. After acquiring BackType, Twitter got Storm, a Hadoop-like real-time data processing tool, which it then open sourced. And who knows what’s in the Twitter labs right now.

I’m embedding below Jeremy Cole’s “Big and Small Data at @Twitter”:

Hadoop vs PIG vs Vertica for Counting Triangles

Very interesting post on the Vertica blog comparing solutions for counting triangles using Hadoop, PIG, and Vertica. As you’d expect, Vertica shows the best results, but this is still a nice example of using different tools to solve a problem. Plus, all the code is available on GitHub.

PIG beat my Hadoop program, so my colleague who wrote the PIG script earned his free lunch. One major factor is PIG’s superior join performance – its uses hash join. In comparison, the Hadoop solution employs a join method very close to sort merge join.

Vertica’s performance wasn’t even close to that of Hadoop – thankfully. It was much much better. In fact Vertica ate PIG’s and Hadoop’s lunch – its best time is 22x faster than PIG’s and 40x faster than the Hadoop program (even without configuration tweaks Vertica beats optimized Hadoop and PIG programs by more than a factor of 9x in comparable tests).

Here are a few key factors in Vertica’s performance advantage:

  • Fully pipelined execution in Vertica, compared to a sequence of MR jobs in the Hadoop and PIG solutions, which incurs significant extra I/O. We quantify the differences in how the disk is used among the solutions below in the “disk usage” study.
  • Vectorization of expression execution, and the use of just-in-time code generation in the Vertica engine
  • More efficient memory layout, compared to the frequent Java heap memory allocation and deallocation in Hadoop / PIG
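The hash-join point is worth making concrete. A SQL-style triangle count is a double self-join of the edge table; in memory, that means building a hash index on the source endpoint and probing it, the access pattern credited for PIG beating plain Hadoop. A sketch for illustration, not the post’s actual code:

```python
from collections import defaultdict

def triangles_by_join(edges):
    """Count triangles via a hash-join formulation of the edge self-join."""
    # Keep each undirected edge once, oriented low id -> high id,
    # so every triangle u < v < w is produced exactly once.
    directed = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    by_src = defaultdict(list)                  # hash index on the source endpoint
    for u, v in directed:
        by_src[u].append(v)
    count = 0
    for u, v in directed:                       # join 1: e1.dst = e2.src
        for w in by_src[v]:
            if (u, w) in directed:              # join 2: does the closing edge exist?
                count += 1
    return count
```

A hash probe per candidate wedge replaces the sort-and-shuffle a sort-merge join needs, which is the performance difference the quote describes.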

The conclusion is interesting too:

Overall, Hadoop and PIG are free in software, but hardware is not included. With a 22x speed-up, Vertica’s performance advantage effectively equates to a 95% discount on hardware. Think about that. You’d need 1000 nodes to run the PIG job to equal the performance of just 48 Vertica nodes, which is a rack and a half of the Vertica appliance.

Original title and link: Hadoop vs PIG vs Vertica for Counting Triangles (NoSQL database©myNoSQL)


BI Pentaho Integrates Hadoop, NoSQL Databases, and Analytic Databases


  • The ability to orchestrate execution of Hadoop related tasks (i.e., executing a Hive Query, Pig Script, or M/R job) as part of a broader IT workflow.
  • The ability to setup dependencies, so if a step fails the job can branch down a recovery path or send a notification, or if it’s a success it goes on to subsequent dependent tasks. Likewise it supports initiating several tasks in parallel.
  • New integration for Pig — so that developers have the ability to execute a Pig job from a PDI Job flow, integrate the execution of Pig jobs in broader IT workflows through PDI Jobs, take advantage of our out of the box scheduler, and so on.

The list of tools Pentaho 4 integrates with is quite long:

  • a long list of traditional RDBMS
  • analytics databases (Greenplum, Vertica, Netezza, Teradata, etc.)
  • NoSQL databases (MongoDB, HBase, etc.)
  • Hadoop variants
  • LexisNexis HPCC

This is the world of polyglot persistence and hybrid data storage.

Original title and link: BI Pentaho Integrates Hadoop, NoSQL Databases, and Analytic Databases (NoSQL database©myNoSQL)

Columnar DBMS Vendor Customer Metrics

Very interesting customer base numbers for Sybase IQ, Vertica, SAND Technology, and Infobright published by Curt Monash: most are in the hundreds, except for Sybase IQ.

This got me thinking about what numbers NoSQL companies would have. Is any of them sharing such numbers? I’d speculate that most of them are in the tens, with 10gen (MongoDB) leading the space with probably a couple of hundred at best.

Original title and link: Columnar DBMS Vendor Customer Metrics (NoSQL database©myNoSQL)

HP CEO about Relational Databases

James Governor reporting from the HP CEO Leo Apotheker keynote at the HP Analyst Summit:

“traditional relational databases are becoming less and less relevant to the future stack”

Even though HP acquired the real-time analytics platform Vertica, I haven’t heard of HP in the NoSQL space, so my first thought was that this is just the usual attack on competitors.

But it could also express HP’s interest in getting into the NoSQL market. The game of speculation about HP’s acquisitions is open.

  1. James Governor: Co-founder of RedMonk, @monkchips  

Original title and link: HP CEO about Relational Databases (NoSQL databases © myNoSQL)