Vertica: All content tagged as Vertica in NoSQL databases and polyglot persistence
I almost always enjoy the lessons-learned style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, were presented at Hadoop Summit. Besides the technical and practical details, there are two things that I really like:
DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”
and then the reality check:
- Your boss says something vague
- You think very hard on how to move the needle
- Where’s the data?
- What’s in this dataset?
- What’s all the f#$#$ crap in the data?
- Clean the data
- Run some off-the-shelf data mining algorithm
- Productionize, act on the insight
- Rinse, repeat
Here is what I jotted down during Vertica’s webinar “Hadoop vs. RDBMS for Big Data Analytics: Why Choose?”
- the webinar focused on clarifying where and how Vertica and Hadoop fit in the Big Data space
- Vertica’s strengths:
- support for SQL, extended SQL, and analytics, making it suitable for interactive investigation of data
- storage space efficiency (I don’t think it’s correct to interpret Hadoop data redundancy as storage space inefficiency)
- analytics SDK (allows customizing in-database analytic functions)
- ease of operation and maintenance (auto-tuning features)
- the following slide is pretty eloquent about Hadoop and Vertica being complementary solutions:
when covering a scenario for using both Hadoop and Vertica, they chose the easy one: Hadoop as ETL. It’s not a bad scenario, but it’s the only one database vendors are using these days when speaking about integration with Hadoop.
other possible Hadoop + Vertica use cases:
- Filter, join, and aggregation in Vertica with intermediate results fed into MR jobs
- parallel import and export to HDFS
- Hadoop MapReduce for data transformation and Vertica for optimized storage and retrieval
- there will be a community edition of Vertica. It was announced in October for the end of 2011, but I don’t think it’s out yet
- there’s a GitHub repo for user defined extensions for Vertica
the following categorization of Big Data tools is interesting, but it feels biased in favor of Vertica, which would be placed somewhere close to the center of the triangle
Original title and link: Vertica and Hadoop for Big Data ( ©myNoSQL)
- MySQL works well enough most of the time that it’s worth using. Twitter values stability over features so they’ve stayed with older releases.
- MySQL doesn’t work for ID generation and graph storage.
- MySQL is used for smaller datasets of < 1.5TB, which is the size of their RAID array, and as a backing store for larger datasets.
- Typical database server config: HP DL380, 72GB RAM, 24 disk RAID10. Good balance of memory and disk.
In my summary of the talk I’ve noted:
- Use MySQL when it works, something else when not - fortunately MySQL often does work
- MySQL is used by Twitter because it’s robust, replication works and it’s easy to use and run
- MySQL doesn’t work well for graphs or auto_increment, and replication lag is a problem
- MySQL replication improvements like crash-safe, multi-threaded slaves are what they need
But Twitter is also one of the most prominent use cases of polyglot persistence. While MySQL is an important piece of the Twitter architecture, it is not the only storage or data processing engine.
The following other data solutions get mentioned in Jeremy’s talk:
- Cassandra is used for high velocity writes, and lower velocity reads. The advantage is that Cassandra can run on cheaper hardware than MySQL, it can expand more easily, and they like the schemaless design.
- Hadoop is used to process unstructured and large datasets, hundreds of billions of rows.
- Vertica is being used for analytics and large aggregations and joins so they don’t have to write MapReduce jobs.
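To make the last point concrete, here is a minimal sketch of the same aggregation written twice: once as hand-rolled map and reduce phases, once as a single SQL statement. Everything in it is an assumption for illustration: the `events` data is made up, the map/reduce functions are plain Python rather than Hadoop code, and SQLite stands in for an analytic database only because it ships with Python. It is not how Twitter or Vertica actually run such queries.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (user_name, clicks) pairs.
events = [("alice", 3), ("bob", 5), ("alice", 4)]


# MapReduce style: the grouping logic has to be written by hand.
def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for user_name, clicks in records:
        yield user_name, clicks


def reduce_phase(pairs):
    """Sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


mr_totals = reduce_phase(map_phase(events))

# SQL style: the same aggregation is one declarative statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_name TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
sql_totals = dict(
    conn.execute("SELECT user_name, SUM(clicks) FROM events GROUP BY user_name")
)
conn.close()

print(mr_totals == sql_totals)
```

Both paths produce the same per-user totals; the difference is how much code has to be written and maintained once joins and larger aggregations enter the picture.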
Yet that’s not the whole story. Twitter is using Cassandra and Memcached for real-time URL fetchers and they also experimented with using Gizzard for Redis. After buying BackType, Twitter acquired and then open sourced Storm, a Hadoop-like real-time data processing tool. And who knows what’s in the Twitter labs right now.
I’m embedding below Jeremy Cole’s “Big and Small Data at @Twitter”:
- The ability to orchestrate execution of Hadoop related tasks (i.e., executing a Hive Query, Pig Script, or M/R job) as part of a broader IT workflow.
- The ability to setup dependencies, so if a step fails the job can branch down a recovery path or send a notification, or if it’s a success it goes on to subsequent dependent tasks. Likewise it supports initiating several tasks in parallel.
- New integration for Pig — so that developers have the ability to execute a Pig job from a PDI Job flow, integrate the execution of Pig jobs in broader IT workflows through PDI Jobs, take advantage of our out of the box scheduler, and so on.
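The dependency handling described above can be sketched in a few lines. This is only an illustration of the idea (each step names a follow-up step for success and another for failure); the `Step` class, the `run` function, and the step names are hypothetical and are not Pentaho's PDI API. Parallel execution of several tasks is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One task in a job flow, with optional success and failure branches."""
    name: str
    action: Callable[[], None]
    on_success: Optional["Step"] = None
    on_failure: Optional["Step"] = None


def run(step: Optional[Step]) -> List[str]:
    """Execute the flow starting at `step` and return the names of the steps visited."""
    visited = []
    while step is not None:
        visited.append(step.name)
        try:
            step.action()
            step = step.on_success   # go on to the dependent task
        except Exception:
            step = step.on_failure   # branch down the recovery path
    return visited


def ok() -> None:
    pass


def fail() -> None:
    raise RuntimeError("pig script failed")


# Hypothetical flow: Hive query, then Pig script, then a load step;
# a failing Pig script sends a notification instead.
notify = Step("notify_ops", ok)
load = Step("load_into_warehouse", ok)
pig = Step("run_pig_script", fail, on_success=load, on_failure=notify)
hive = Step("run_hive_query", ok, on_success=pig)

print(run(hive))  # ['run_hive_query', 'run_pig_script', 'notify_ops']
```

Because the Pig step fails here, the load step is never reached and the flow ends on the notification branch.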
The list of tools Pentaho 4 integrates with is quite long:
- a long list of traditional RDBMS
- analytics databases (Greenplum, Vertica, Netezza, Teradata, etc.)
- NoSQL databases (MongoDB, HBase, etc.)
- Hadoop variants
- LexisNexis HPCC
This is the world of polyglot persistence and hybrid data storage.
Original title and link: BI Pentaho Integrates Hadoop, NoSQL Databases, and Analytic Databases ( ©myNoSQL)
Very interesting customer base numbers for Sybase IQ, Vertica, SAND Technology, Infobright published by Curt Monash—most are in the hundreds, except for Sybase IQ.
This got me thinking about what numbers NoSQL companies would have. Are any of them sharing such numbers? I’d speculate that most of them are in the tens, with 10gen (MongoDB) leading the space with probably a couple of hundred at best.
James Governor reporting from the HP CEO Leo Apotheker keynote at the HP Analyst Summit:
“traditional relational databases are becoming less and less relevant to the future stack”
Even though HP acquired the real-time analytics platform Vertica, I haven’t heard of HP in the NoSQL space, so my first thought was that this is just the usual attack on competitors.
But it could also express HP’s interest in getting into the NoSQL market. The speculation game about HP’s acquisitions is open.