NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



Hadapt: All content tagged as Hadapt in NoSQL databases and polyglot persistence

Main difference between Hadapt and Microsoft Polybase, HAWQ, SQL-H

Daniel Abadi in an email to Curt Monash analyzing a the Microsoft Polybase paper1:

The basic difference between Polybase and Hadapt is the following. With Polybase, the basic interface to the user is the MPP database software (and DBMS storage) that Microsoft is selling. Hadoop is viewed as a secondary source of data — if you have a dataset stored inside Hadoop instead of the database system for whatever reason, then the database system can access that Hadoop data on the fly and include that data in query processing alongside data that is already stored inside the database system. However, the user must be aware that she might want to query the data in Hadoop in advance — she must register this Hadoop data to the MPP database through an external table definition (and ideally statistics should be generated in advance to help the optimizer). Furthermore, the Hadoop data must be structured, since the external table definition requires this (so you can’t really access arbitrary unstructured data in Hadoop). The same is true for SQL-H and Hawq — they all can access data in Hadoop (in particular data stored in HDFS), but there needs to be some sort of structured schema defined in order for the database to understand how to access it via SQL. So, bottom line, Polybase/SQL-H/Hawq let you dynamically get at data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along, but for some reason is being stored in Hadoop instead of the DBMS.

It’s a long paragraph, but the difference Daniel Abadi is emphasizing is critical: “Hadoop/HDFS data that could theoretically have been stored in DBMS all along”.

  1. According to Microsoft GraySystemsLab page on Polybase

    […] the goal of the Polybase project is to allow SQL Server PDW users to execute queries against data stored in Hadoop, specifically the Hadoop distributed file system (HDFS). Polybase is agnostic on both the type of the Hadoop cluster (Linux or Windows) and whether it is a separate cluster or whether the Hadoop nodes are co-located with the nodes of the PDW appliance.

    And here’re my (very) brief thoughts about Polybase when I first learned about it.

Original title and link: Main difference between Hadapt and Microsoft Polybase, HAWQ, SQL-H (NoSQL database©myNoSQL)


Why Database-To-Hadoop Connectors Are Fundamentally Flawed and Entirely Unnecessary

This post by Daniel Abadi dates from July 2012, but it’s so right that I wonder why I didn’t link to it before:

Many people don’t realize that Hadoop and parallel relational databases have an extremely similar design. Both are capable of storing large data sets by breaking the data into pieces and storing them on multiple independent (“shared-nothing”) machines in a cluster. Both scale processing over these large data sets by parallelizing the processing of the data over these independent machines. Both do as much independent processing as possible across individual partitions of data, in order to reduce the amount of data that must be exchanged between machines. Both store data redundantly in order to increase fault tolerance.  The algorithms for scaling operations like selecting data, projecting data, grouping data, aggregating data, sorting data, and even joining data are the same. If you squint, the basic data processing technology of Hadoop and parallel database systems are identical.

There is absolutely no technical reason why there needs to be two separate systems doing the exact same type of parallel processing.

Someone trying to defend his position could counter this perspective by mentioning that Daniel Abadi’s product Hadapt is competing on this exact segment of the market. But by doing so, that someone would just prove how right Abadi is.

Original title and link: Why Database-To-Hadoop Connectors Are Fundamentally Flawed and Entirely Unnecessary (NoSQL database©myNoSQL)


The NoSQL Community Threw Out the Baby With the Bath Water

Monte Zweben interviewed by Gery Menegaz for ZDNet:

“The NoSQL community threw out the baby with the bath water. They got it right with flexible schemas and distributed, auto-sharded architectures, but it was a mistake to discard SQL,”

After reading the above, I was ready to file this together with The Time for NoSQL Standards Is Now; which by the way dates from July 2012.

Then it is only the final paragraph that clarifies what Monte Zweben is talking about: a SQL engine for Hadoop. I don’t know if the order makes any difference, but before taking about thrown babies:

  1. this is not SQL for NoSQL databases
  2. there’s already Hive for SQL on Hadoop
  3. there’s already Spire from Drawn to Scale for ANSI SQL on Hadoop
  4. there’s already Hadapt for SQL and Hadoop

Original title and link: The NoSQL Community Threw Out the Baby With the Bath Water (NoSQL database©myNoSQL)


Big Data Investment Network Map

Very interesting visualization of some of the companies in the Big Data market connected through their venture capital and investment firms by Benedikt Koehler and Joerg Blumtritt over Beautiful Data blog:

Big Data Investment Network Map

Click to see larger size

There’s only one company I couldn’t find on this map: Hortonworks.

Original title and link: Big Data Investment Network Map (NoSQL database©myNoSQL)

12 Hadoop Vendors to Watch in 2012

My list of 8 most interesting companies for the future of Hadoop didn’t try to include anyone having a product with the Hadoop word in it. But the list from InformationWeek does. To save you 15 clicks, here’s their list:

  • Amazon Elastic MapReduce
  • Cloudera
  • Datameer
  • EMC (with EMC Greenplum Unified Analytics Platform and EMC Data Computing Appliance)
  • Hadapt
  • Hortonworks
  • IBM (InfoSphere BigInsights)
  • Informatica (for HParser)
  • Karmasphere
  • MapR
  • Microsoft
  • Oracle

Original title and link: 12 Hadoop Vendors to Watch in 2012 (NoSQL database©myNoSQL)

Improving Hadoop Performance by (Up To) 1000x

LinkedIn’s Adam Silberstein and Daniel Tunkelang provide a fantastic summary of a presentation I wish I could attend: Daniel Abadi’s “Improving Hadoop Performance by (up to) 1000x”.

Overly simplified, Daniel Abadi’s proposal is to create an analytical platform by using the best of two worlds: Hadoop and row-based or column-based relational database storage and query engines.

Hadapt, the company founded by Daniel Abadi, is in my list of the 8 most interesting companies for Hadoop’s future because I think that an interesting product can be built by combining the long optimized and tested storage and query engines of relational databases with Hadoop’s fault tolerance, scalability, and power, topped with a resource management layer.

Original title and link: Improving Hadoop Performance by (Up To) 1000x (NoSQL database©myNoSQL)


8 Most Interesting Companies for Hadoop’s Future

Filtering and augmenting a Q&A on Quora:

  1. Cloudera: Hadoop distribution, Cloudera Enterprise, Services, Training
  2. Hortonworks: Apache Hadoop major contributions, Services, Training
  3. MapR: Hadoop distribution, Services, Training
  4. HPCC Systems: massive parallel-processing computing platform
  5. HStreaming: real-time data processing and analytics capabilities on top of Hadoop
  6. DataStax: DataStax Enterprise, Apache Cassandra based platform accepting real-time input from online applications, while offering analytic operations, powered by Hadoop
  7. Zettaset: Enterprise Data Analytics Suite built on Hadoop
  8. Hadapt: analytic platform based on Apache Hadoop and relational DBMS technology

I’ve left aside names like IBM, EMC, Informatica, which are doing a lot of integration work.

Original title and link: 8 Most Interesting Companies for Hadoop’s Future (NoSQL database©myNoSQL)

Postgres Plus Connector for Hadoop in Private Beta

Not much information available yet on the project page but looks like bidirectional integration of PosgreSQL and Hadoop.

The Postgres Plus Connector for Hadoop provides developers easy access to massive amounts of SQL data for integration with or analysis in Hadoop processing clusters.  Now large amounts of data managed by PostgreSQL or Postgres Plus Advanced Server can be accessed by Hadoop for analysis and manipulation using Map-Reduce constructs.

Posgres Plus Hadoop

When speaking about PostgreSQL and Hadoop, the first thing that comes to my mind is Daniel Abadi’s HadoopDB that became not long ago the technology behind his startup which has already raised $9.5mil.

Original title and link: Postgres Plus Connector for Hadoop in Private Beta (NoSQL database©myNoSQL)

Hadapt Raises $9.5m Series a Financing

Speaking of funding, Hadapt, the company founded by Daniel Abadi, announced that it has closed a $9.5 million Series A round of financing led by Norwest Venture Partners (NVP) and Bessemer Venture Partners.

I haven’t heard much of Hadapt since the initial announcement, so I hope things will change.

Original title and link: Hadapt Raises $9.5m Series a Financing (NoSQL database©myNoSQL)


Hadapt and Why I'm doing a start-up pre-tenure

Daniel Abadi’s[1] reasons for backing his HadoopDB research with the Hadapt startup:

If it wasn’t for the fact that I spent the majority of the last decade soaking up the wisdom of Mike Stonebraker, I might not have chosen option (3). But I watched as my PhD thesis on C-Store was commercialized by Vertica (which was sold last month to HP), and another one of my research projects (H-Store) was commercialized by VoltDB. Thanks to Stonebraker and the first-class engineers at Vertica, I can claim that my PhD research is in use today by Groupon, Verizon, Twitter, Zynga, and hundreds of other businesses. When I come up for tenure, I want to be able to make similar claims about my research at Yale on HadoopDB. So I’m taking the biggest gamble of my career to see that happen.

While this subject would fit better in an entrepreneurship or startups blog, I felt Daniel’s decision reflects the passion of the people involved in the NoSQL and BigData space.


  1. Daniel Abadi: Assistant Professor of Computer Science at Yale University, Chief Scientist and Co-founder Hadapt, @daniel_abadi  

Original title and link: Hadapt and Why I’m doing a start-up pre-tenure (NoSQL databases © myNoSQL)