


SQL: All content tagged as SQL in NoSQL databases and polyglot persistence

How SQL-on-JSON analytics bolstered a business

Alex Woodie (Datanami) reporting on BitYota, a SQL-based data warehouse on top of JSON:

BitYota says it designed its own hosted data warehouse from scratch, and that it’s differentiated by having a JSON access layer atop the data store. “We have some uniqueness where we operate SQL directly on JSON,” says BitYota CEO Dev Patel. “We don’t need to translate that data into a structured format like a CSV. We believe that if you transform the data, you will lose some of the data quality. And once that’s transformed, you won’t get it back.”
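A toy illustration of the "SQL directly on JSON" idea: keep the raw documents and extract fields at query time instead of flattening to CSV first. This sketch uses SQLite's JSON functions purely for illustration — BitYota's engine is of course not SQLite, and the table and field names are made up:

```python
import json
import sqlite3

# Store raw JSON documents in a single column; no upfront transformation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (doc TEXT)")
docs = [
    {"user": "ann", "amount": 30},
    {"user": "bob", "amount": 70},
    {"user": "ann", "amount": 12},
]
conn.executemany("INSERT INTO events VALUES (?)",
                 [(json.dumps(d),) for d in docs])

# SQL reaches into the documents per query; the originals stay intact.
rows = conn.execute(
    "SELECT json_extract(doc, '$.user') AS user, "
    "SUM(json_extract(doc, '$.amount')) AS total "
    "FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('ann', 42), ('bob', 70)]
```

Because nothing is transformed on ingest, any field that was in the documents remains queryable later — which is exactly the data-quality argument Patel makes.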

✚ BitYota’s tagline is Analytics for MongoDB, so I assume it’s safe to say the backend is MongoDB and they are building a SQL layer on top of it. Which SQL flavor they support and how they handle SQL’s quirks would make a very interesting story.

✚ This relates to my earlier Do all roads lead back to SQL?

Original title and link: How SQL-on-JSON analytics bolstered a business (NoSQL database©myNoSQL)


Do all roads lead back to SQL? Some might and some might not

Seth Proctor for Dr. Dobb’s:

Increasingly, NewSQL systems are showing scale, schema flexibility, and ease of use. Interestingly, many NoSQL and analytic systems are now putting limited transactional support or richer query languages into their roadmaps in a move to fill in the gaps around ACID and declarative programming. What that means for the evolution of these systems is yet to be seen, but clearly, the appeal of Codd’s model is as strong as ever 43 years later.

Spend a bit of time reading (really reading) the above paragraph—there are quite a few different concepts put together to make the point of the article.

SQL is indeed getting closer to the NoSQL databases, but mostly to Hadoop. I still stand by my thoughts in The premature return to SQL.

Most NoSQL databases already offer some limited ACID guarantees. And some flavors of transactions are supported or are being added. But only as long as the core principles can still be guaranteed or the trade-offs are made obvious and offered as clear choices to application developers.

The relational model stays with the relational databases. If some of its principles can be applied (e.g. data type integrity, optional schema enforcement), I see nothing wrong with supporting them. Good technical solutions know both what is needed and what is possible.

Original title and link: Do All Roads Lead Back to SQL? | Dr Dobb’s (NoSQL database©myNoSQL)


SQL on Hadoop: An overview of frameworks and their applicability

An overview of the 3 SQL-on-Hadoop execution models — batch (10s of minutes and up), interactive (up to minutes), operational (sub-second) — their applicability to different classes of applications, and the main characteristics of the tools/frameworks in each of these categories:

Within the big data landscape there are multiple approaches to accessing, analyzing, and manipulating data in Hadoop. Each depends on key considerations such as latency, ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs.

The usual suspects are included: Hive, Impala, Presto, Spark/Shark, Drill.


Original title and link: SQL on Hadoop: An overview of frameworks and their applicability (NoSQL database©myNoSQL)


Cloudera shipped a mountain... what can you read between the lines

Cloudera Engineering (@ClouderaEng) shipped a mountain of new product (production-grade software, not just technical previews): Cloudera Impala, Cloudera Search, Cloudera Navigator, Cloudera Development Kit (now Kite SDK), new Apache Accumulo packages for CDH, and several iterative releases of CDH and Cloudera Manager. (And the Cloudera Enterprise 5 Beta release was made available to the world.) Furthermore, as always, a ton of bug fixes and new features went upstream — notably, but not exclusively, HiveServer2 and Apache Sentry (incubating).

How many things can you read in this paragraph?

  1. a not-so-subtle stab at Hortonworks’ series of technical previews.
  2. more and more projects brought under the CDH umbrella. Does more ever become too much? (I cannot explain why, but my first thought was “this feels so Oracle-style”)
  3. Cloudera’s current big bet is Impala. SQL and low latency querying. A big win for the project, but not necessarily a direct financial win for Cloudera, was its addition as a supported service on Amazon Elastic MapReduce.

Original title and link: Cloudera shipped a mountain… what can you read between the lines (NoSQL database©myNoSQL)


How to make an infinitely scalable relational database — a guest post from the author of InfiniSQL:

Benchmarking shows that an InfiniSQL cluster can handle over 500,000 complex transactions per second with over 100,000 simultaneous connections, all on twelve small servers. The methods used to test are documented, and the code is all available so that any practitioner can achieve similar results. There are two main characteristics which make InfiniSQL extraordinary:

  1. It performs transactions with records on multiple nodes better than any clustered/distributed RDBMS
  2. It is free, open source. Not just a teaser “community” version with the good stuff proprietary. The community version of InfiniSQL will also be the enterprise version, when it is ready.

Tell me how quickly you can find, in InfiniSQL’s documentation, that it is memory-only1.

  1. There is nothing wrong with being memory-only. What’s wrong is talking about the speed without mentioning anything about the storage until chapter 3.

Original title and link: InfiniSQL - How to make an infinitely scalable relational database (NoSQL database©myNoSQL)


Hive Cheat Sheet for SQL Users

Nice resource for people familiar with SQL looking into Hive:

Simple Hive Cheat Sheet for SQL Users

Original title and link: Hive Cheat Sheet for SQL Users (NoSQL database©myNoSQL)


SQL JOINs visualized

The best visualization of JOINs, by C.L. Moffatt:
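Moffatt’s diagrams don’t reproduce here, so as a runnable stand-in, here is a minimal sketch of the two most common variants on two throwaway tables (table and column names are made up; run against SQLite for convenience):

```python
import sqlite3

# Two tiny tables: ids 1,2 in `a`; ids 2,3 in `b`. Only id 2 overlaps.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INT, name TEXT);
CREATE TABLE b (id INT, city TEXT);
INSERT INTO a VALUES (1, 'ann'), (2, 'bob');
INSERT INTO b VALUES (2, 'nyc'), (3, 'sfo');
""")

# INNER JOIN: only rows whose id appears in both tables.
inner = conn.execute(
    "SELECT a.id, name, city FROM a JOIN b ON a.id = b.id ORDER BY a.id"
).fetchall()
print(inner)  # [(2, 'bob', 'nyc')]

# LEFT OUTER JOIN: every row of a, NULL-padded where b has no match.
left = conn.execute(
    "SELECT a.id, name, city FROM a LEFT JOIN b ON a.id = b.id ORDER BY a.id"
).fetchall()
print(left)   # [(1, 'ann', None), (2, 'bob', 'nyc')]
```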


Original title and link: SQL JOINs visualized (NoSQL database©myNoSQL)

How Would We Query Such a Database Without Wasting Time With Ugly SQL?

How would we query such a database without wasting time with ugly SQL? We would need an API that will let us define our table schema and then allow us to craft queries using simple abstractions like collection maps, filter, joins, etc. I don’t mean a heavyweight ORM solution either. If we are after simplicity, we’d better forgo dealing with object mappings and the complexity they bring. All we want is a hassle-free way to model our data and read and write it.

After reading this paragraph, I thought: “what a wonderful description of RethinkDB’s data querying language”. Then I switched back to reading the article, which is about SQLAlchemy, one of the most interesting and complete ORMs.
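The abstraction the quote asks for — chainable filters that compile down to SQL, with no object mapping — can be sketched in a few lines. This is a hypothetical toy, not SQLAlchemy’s (or RethinkDB’s) actual API; the `Table` class and its methods are invented for illustration:

```python
import sqlite3

# A minimal, hassle-free query layer: no ORM, no object mapping,
# just chainable calls that build one parameterized SQL statement.
class Table:
    def __init__(self, conn, name):
        self.conn, self.name = conn, name
        self.wheres, self.params = [], []

    def filter(self, clause, *params):
        self.wheres.append(clause)
        self.params += list(params)
        return self  # chainable

    def select(self, *cols):
        sql = f"SELECT {', '.join(cols) or '*'} FROM {self.name}"
        if self.wheres:
            sql += " WHERE " + " AND ".join(self.wheres)
        return self.conn.execute(sql, self.params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INT)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("ann", 25), ("bob", 41)])

print(Table(conn, "people").filter("age > ?", 30).select("name"))  # [('bob',)]
```

The point of the quote, illustrated: the caller never writes a full SQL string or maps rows to objects, yet everything underneath is still plain SQL.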

Original title and link: How Would We Query Such a Database Without Wasting Time With Ugly SQL? (NoSQL database©myNoSQL)


Paper: YSmart - Yet Another SQL-to-MapReduce Translator

Another weekend read, this time from Facebook and The Ohio State University and closer to the hot topic of the last two weeks: SQL, MapReduce, Hadoop:

MapReduce has become an effective approach to big data analytics in large cluster systems, where SQL-like queries play important roles to interface between users and systems. However, based on our Facebook daily operation results, certain types of queries are executed at an unacceptably low speed by Hive (a production SQL-to-MapReduce translator). In this paper, we demonstrate that existing SQL-to-MapReduce translators that operate in a one-operation-to-one-job mode and do not consider query correlations cannot generate high-performance MapReduce programs for certain queries, due to the mismatch between complex SQL structures and simple MapReduce framework. We propose and develop a system called YSmart, a correlation-aware SQL-to-MapReduce translator. YSmart applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators. We have implemented YSmart with intensive evaluation for complex queries on two Amazon EC2 clusters and one Facebook production cluster. The results show that YSmart can outperform Hive and Pig, two widely used SQL-to-MapReduce translators, by more than four times for query execution.
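The paper’s core observation can be boiled down to a few lines: correlated operations that share the same input and the same key can be merged into a single scan (one “MapReduce job”) instead of one job per operation. A hypothetical sketch of the idea, with made-up data (this is an illustration of the principle, not YSmart’s actual implementation):

```python
from collections import defaultdict

records = [("ann", 30), ("bob", 70), ("ann", 20)]

# One-operation-to-one-job: SUM and COUNT each trigger their own pass.
totals, counts = defaultdict(int), defaultdict(int)
for user, amount in records:   # job 1: SUM(amount) GROUP BY user
    totals[user] += amount
for user, _ in records:        # job 2: COUNT(*) GROUP BY user
    counts[user] += 1

# Correlation-aware: both aggregations key on `user`, so one pass suffices.
merged = defaultdict(lambda: [0, 0])
for user, amount in records:
    merged[user][0] += amount  # running SUM
    merged[user][1] += 1       # running COUNT

print(dict(merged))  # {'ann': [50, 2], 'bob': [70, 1]}
```

Scaled up to real MapReduce jobs, each avoided pass saves a full read, shuffle, and write of the input — which is where the reported 4x speedups over Hive and Pig come from.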

A Brief Guide to Pig Latin for the SQL Guy

Cat Miller from Mortar Data offers a quick intro to Pig Latin from a SQLish perspective:

Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.

Pig’s similarities to SQL lie in the operations both support, but the overall model is different: Pig is an imperative data manipulation tool, while SQL is a declarative query language.

Original title and link: A Brief Guide to Pig Latin for the SQL Guy (NoSQL database©myNoSQL)


Playing With Hadoop Pig

Anything missing from Pig?

[…] the following SQL operations can be translated as follows. We put the order in which the operations have to be run in parentheses.

  • SELECT id, name (5): resultData = FOREACH limitData GENERATE id, name
  • FROM Table (1): data = LOAD 'person.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int)
  • WHERE a=1 (2): filteredData = FILTER data BY a == 1
  • ORDER BY age DESC (3): orderedData = ORDER filteredData BY age DESC
  • LIMIT 10 (4): limitData = LIMIT orderedData 10
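Assembled, the five operations above translate a single SQL query. A minimal runnable check of the SQL side, on throwaway data (the filter column a is added to the schema for illustration, since the article’s LOAD only declares id, name, and age):

```python
import sqlite3

# The SQL query the Pig pipeline above translates, run end-to-end on toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INT, name TEXT, age INT, a INT)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)",
                 [(1, "ann", 30, 1), (2, "bob", 20, 1), (3, "cal", 40, 0)])

rows = conn.execute(
    "SELECT id, name FROM person WHERE a = 1 ORDER BY age DESC LIMIT 10"
).fetchall()
print(rows)  # [(1, 'ann'), (2, 'bob')]
```

Note how the declarative query reads top-down in SQL but executes in the Pig order: load, filter, sort, limit, then project.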

One can also use left join and join as follows:

  • JOIN: join_data = JOIN data1 BY id1, data2 BY id2
  • LEFT JOIN: left_join_data = JOIN data1 BY id1 LEFT OUTER, data2 BY id2

Original title and link: Playing With Hadoop Pig (NoSQL database©myNoSQL)


SQL Over HBase With Phoenix

Released by the Salesforce team, Phoenix adds a SQL layer on top of HBase, along with an almost complete JDBC driver.

Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.

The project already has a performance page, and the results look great. For a bullet-list summary, check out James Taylor’s post.

Original title and link: SQL Over HBase With Phoenix (NoSQL database©myNoSQL)