


Spark: All content tagged as Spark in NoSQL databases and polyglot persistence

Spark Summit 2014 roundup

I wasn’t at the Spark Summit, and even though the complete event was streamed online, my schedule didn’t allow me to watch more than a couple of keynotes. Thomas Dinsmore’s notes about the event were quite helpful for getting an idea of what happened there.

One thing that caught my attention immediately:

Last December, the 2013 Spark Summit pulled 450 attendees for a two-day event. Six months later, the Spark Summit 2014 sold out at more than a thousand seats for a three-day affair.

Original title and link: Spark Summit 2014 roundup (NoSQL database©myNoSQL)


Spark for Data Science: A Case Study

A great practical intro to Apache Spark by Casey Stella of Hortonworks:

This sounds like a great challenge and an even greater opportunity to try out a new (to me) analytics platform, Apache Spark. So, I’m going to take you through a little journey doing some simple analysis and illustrate the general steps. We’re going to cover

  • Data Gathering
  • Data Engineering
  • Data Analysis
  • Presentation of Results and Conclusions
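The four steps above can be sketched end to end in a few lines of plain Python. This is only an illustrative toy (the inline CSV and city names are made up, and a real Spark job would do this over a cluster), but it shows the shape of the workflow:

```python
import csv
import io
import statistics

# 1. Data gathering: in practice a download or API call; here a small
#    inline CSV stands in for the raw source (hypothetical data).
raw = """city,temp_f
Columbus,61
Cleveland,55
Cincinnati,
Columbus,64
"""

# 2. Data engineering: parse, drop malformed rows, convert types.
rows = [r for r in csv.DictReader(io.StringIO(raw)) if r.get("temp_f")]
temps = {r["city"]: [] for r in rows}
for r in rows:
    temps[r["city"]].append(float(r["temp_f"]))

# 3. Data analysis: a simple per-city aggregate.
means = {city: statistics.mean(vals) for city, vals in temps.items()}

# 4. Presentation: a small report sorted by value.
for city, m in sorted(means.items(), key=lambda kv: -kv[1]):
    print(f"{city}: {m:.1f}")
```

The point of Spark (and of Stella’s walkthrough) is that the same four-step structure survives when the data no longer fits on one machine.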

Original title and link: Spark for Data Science: A Case Study (NoSQL database©myNoSQL)


SQL on Hadoop: An overview of frameworks and their applicability

An overview of the three SQL-on-Hadoop execution models — batch (tens of minutes and up), interactive (up to minutes), and operational (sub-second) — their fields of application, and the main characteristics of the tools/frameworks in each of these categories:

Within the big data landscape there are multiple approaches to accessing, analyzing, and manipulating data in Hadoop. Each depends on key considerations such as latency, ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs.

The usual suspects are included: Hive, Impala, Presto, Spark/Shark, and Drill.


Original title and link: SQL on Hadoop: An overview of frameworks and their applicability (NoSQL database©myNoSQL)


Using Spark for fast in-memory computing

Justin Kestelyn from Databricks describes the differences between the Hadoop and Spark processing models in a post on Cloudera’s blog:

At its core, Spark provides a general programming model that enables developers to write applications by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. […] In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory.


✚ This looks somewhat similar to the Cascading programming model, combined with the capability of keeping the working dataset for the current computations in memory.
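As a toy illustration of the model the quote describes — not Spark’s actual API, just a Python sketch — a pipeline can be built by composing operators, evaluated lazily, with the option to pin an intermediate result in memory for reuse:

```python
# A minimal lazy pipeline: each operator wraps its upstream source,
# nothing runs until collect(), and cache() materializes one stage
# in memory so downstream operators reuse it instead of recomputing.

class Pipeline:
    def __init__(self, source):
        self._source = source          # callable producing the data
        self._cached = None            # in-memory copy once cache() runs

    def _run(self):
        if self._cached is not None:
            return self._cached
        return list(self._source())

    def map(self, fn):
        return Pipeline(lambda: (fn(x) for x in self._run()))

    def filter(self, pred):
        return Pipeline(lambda: (x for x in self._run() if pred(x)))

    def cache(self):
        self._cached = list(self._source())
        return self

    def collect(self):
        return self._run()

words = Pipeline(lambda: iter(["spark", "shark", "hive", "pig"]))
cached = words.map(str.upper).cache()      # materialized once, reused below
long_names = cached.filter(lambda w: len(w) == 5).collect()
print(long_names)   # ['SPARK', 'SHARK']
```

Real Spark adds the parts that matter at scale — partitioning the data across machines and tracking lineage so a lost in-memory partition can be recomputed — but the composition-plus-caching shape is the same.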

Original title and link: Using Spark for fast in-memory computing (NoSQL database©myNoSQL)


Spark and Shark company Databricks raises $14M from Andreessen Horowitz

Spark and Shark getting wings:

A team of professors who have created the in-memory Spark and Shark platforms for analyzing big data has raised nearly $13.9 million to commercialize those products. The company is still in stealth mode, but it’s called Databricks and Andreessen Horowitz led the round. […] It also lists Databricks’ very impressive board of directors: Co-founder and CEO Ion Stoica (University of California, Berkeley professor and former co-founder and CEO of Conviva); Co-founder and CTO Matei Zaharia (MIT professor); Ben Horowitz (general partner at Andreessen Horowitz and former Opsware co-founder and CEO); and Scott Shenker (University of California, Berkeley professor and former Nicira co-founder and CEO).

You have probably already heard of all these guys.

Original title and link: Spark and Shark company Databricks raises $14M from Andreessen Horowitz (NoSQL database©myNoSQL)


Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark

Sami Badawi enumerates the issues he encountered while trying all these tools (Pig, Scalding, Scoobi, Hive, Spark, Scrunch, Cascalog) for a simple experiment with Hadoop:

The task was to read log files, join them with other data, and do some statistics on arrays of doubles. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data.

  1. Pig : a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. 

  2. Scalding: A Scala API for Cascading 

  3. Scoobi: a Scala productivity framework for Hadoop 

  4. Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. 

  5. Spark: open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write 

  6. Scrunch: a Scala wrapper for Crunch 

  7. Cascalog: a fully-featured Clojure-based data processing and querying library for Hadoop  
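The task Badawi describes — read logs, join with reference data, compute statistics on arrays of doubles — fits in a few lines of plain Python (the log lines, user ids, and regions below are hypothetical), which underscores why hand-written Java MapReduce for it feels like assembly code:

```python
import statistics
from collections import defaultdict

# Hypothetical stand-ins for the log files and the reference data.
log_lines = ["u1 0.92", "u2 0.40", "u1 0.75", "u3 0.10", "u2 0.66"]
user_regions = {"u1": "us", "u2": "eu", "u3": "us"}

# Read the logs, join each record with the reference data on user id,
# and group the double values by region.
by_region = defaultdict(list)
for line in log_lines:
    user, value = line.split()
    region = user_regions.get(user)
    if region is not None:
        by_region[region].append(float(value))

# Statistics on the arrays of doubles, per join key.
stats = {
    region: (statistics.mean(vals), statistics.pstdev(vals))
    for region, vals in by_region.items()
}
for region, (mean, sd) in sorted(stats.items()):
    print(f"{region}: mean={mean:.3f} sd={sd:.3f}")
```

Each of the seven tools above is, in essence, a way to express this read–join–aggregate shape declaratively while Hadoop handles the distribution.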

Original title and link: Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark (NoSQL database©myNoSQL)