bigdata: All content tagged as bigdata in NoSQL databases and polyglot persistence

Project Savanna: Hadoop and OpenStack

Timothy Prickett Morgan for The Register about Project Savanna, a collaboration between Mirantis, Hortonworks, and Red Hat:

Batman and Robin. Peanut butter and chocolate. OpenStack and Hadoop. These are things that go together, with the latter pairing being something that commercial OpenStack distie Mirantis, commercial Hadoop distie Hortonworks, and commercial KVM and Linux distie (and soon to be OpenStack commercializer) Red Hat are putting together under a new OpenStack effort dubbed Project Savanna.

Hadoop is at the age where everyone tries to package it and claims they’ll be the Red Hat of the Hadoop ecosystem. I cannot really dot the i’s and cross the t’s, but my gut feeling is that right now all these efforts look more like the attempts to bring Linux to the desktop.

We know how successful these have been so far.

Original title and link: Project Savanna: Hadoop and OpenStack (NoSQL database©myNoSQL)

via: http://www.theregister.co.uk/2013/04/18/project_savanna_hadoop_on_openstack/


Boundary for Splunk app for correlating alerts

Alex Williams for TechCrunch:

Boundary’s application performance monitoring technology is now integrated into Splunk’s enterprise platform, providing a window into apps that increasingly are distributed across cloud and on-premise virtualized environments.

At first I thought this meant Boundary would use Splunk as the backend for its data. But Boundary is a service, so that’s not the case. Plus, Splunk can already be used for network management and monitoring.

According to the post, “Splunk real-time alerts are tagged as annotations in Boundary’s time-series graphs. Customers can then correlate alerts against application flow and performance data.” So basically this is monitoring your monitoring system, right?


via: http://techcrunch.com/2013/04/25/new-boundary-app-for-splunk-predicts-root-cause-of-app-brownouts/


Project Falcon: Tackling Hadoop Data Lifecycle Management

Venkatesh Seetharam announcing a new Apache incubating project in the Hadoop ecosystem open sourced by InMobi and Hortonworks:

Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.

I think this diagram describes Project Falcon best:

Project Falcon at a Glance

✚ Was there any other project addressing this space?


via: http://hortonworks.com/blog/project-falcon-tackling-hadoop-data-lifecycle-management-via-community-driven-open-source/


3 Big Data Use Cases in Banking

An article on Sys-Con about 3 high-level, generic use cases of Big Data in banking:

  1. Customer experience
  2. Risk management
  3. Operations optimization

The first and the third are common across multiple fields. Risk management is critical to banks’ core business and I assume this is the domain where most of the technology investment happens.


via: http://bigdata.sys-con.com/node/2623407/print


Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo!

Andy Feng wrote a post on the YDN blog about the data processing architecture at Yahoo! for delivering personalized content by analyzing billions of events for 700 million users and 2.2 billion content pieces every day, using a combination of batch processing (Hadoop) and stream processing (Storm):

Enabling low-latency big-data processing is one of the primary design goals of Yahoo!’s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of design patterns applicable to many Yahoo! use cases. In Q1 2013, we added Storm as a new service to our big-data platform. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for stream/micro-batch processing.

✚ I don’t think I’ve seen the term micro-batch processing used before. Any idea why it is used as an alternative to the well-established term stream processing?
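One possible reading of the distinction, as a minimal sketch in plain Python: a stream processor hands each event to the handler as soon as it arrives, while a micro-batch processor groups a few events and hands them over together. None of this is Storm's API; the function names and the count-based `batch_size` are assumptions made for illustration (real systems often cut batches by time window instead).

```python
from itertools import islice


def stream_process(events, handle):
    """Stream processing: pass each event to the handler as soon as it arrives."""
    for event in events:
        handle([event])


def micro_batch_process(events, handle, batch_size=3):
    """Micro-batch processing: group events into small batches, one handler call per batch.

    batch_size is a hypothetical knob; batching by count is an assumption of this sketch.
    """
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        handle(batch)


# Record what each style hands to the handler.
stream_calls = []
stream_process(range(3), stream_calls.append)

batch_calls = []
micro_batch_process(range(7), batch_calls.append, batch_size=3)

print(stream_calls)
print(batch_calls)
```

The trade-off in this sketch is the usual one: per-event handling keeps latency minimal, while small batches amortize per-call overhead at the cost of waiting for a batch to fill.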


via: http://developer.yahoo.com/blogs/ydn/storm-hadoop-convergence-big-data-low-latency-processing-54503.html


Schema on Writes vs Schema on Reads - Apache Hadoop and Data Agility

Ofer Mendelevitch for Hortonworks blog:

Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.

Most often when speaking about Hadoop, people refer to costs (commodity servers), parallelism and scalability. I do not remember how many times I’ve written that the main difference between Hadoop and traditional data warehouses is in the agility it offers.

One Hadoop tagline could be: “Collect data today. Analyse it when and how you want.”
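To make “schema on read” concrete, here is a minimal sketch in plain Python rather than anything Hadoop-specific. The raw events, field names, and the `read_with_schema` helper are all hypothetical; the point is only that the data is stored untouched and each consumer applies its own schema when reading.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no schema enforced on write).
RAW_EVENTS = [
    '{"user": "u1", "action": "click", "ts": 1366900000}',
    '{"user": "u2", "action": "view", "ts": 1366900005, "referrer": "search"}',
    '{"user": "u1", "ts": 1366900010}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields, coerce types, fill in defaults.

    schema maps a field name to a (cast, default) pair.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else default)
            for name, (cast, default) in schema.items()
        }


# Two consumers, two schemas, same untouched raw data.
CLICK_SCHEMA = {"user": (str, None), "action": (str, "unknown")}
TRAFFIC_SCHEMA = {"ts": (int, 0), "referrer": (str, "direct")}

clicks = list(read_with_schema(RAW_EVENTS, CLICK_SCHEMA))
traffic = list(read_with_schema(RAW_EVENTS, TRAFFIC_SCHEMA))

print(clicks)
print(traffic)
```

A new question about the data only needs a new schema dictionary; nothing already written has to be migrated, which is the agility argument in a nutshell.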


via: http://hortonworks.com/blog/hadoop-data-agility/


A Walk Down the Memory Lane of Big Data: Inside IBM’s SAGE, the Largest Computer Ever Built

Fascinating data about the Semi-Automatic Ground Environment system built by IBM in 1957:

SAGE consisted of 20 or so Direction Centers, each of which was a windowless, one-acre-large concrete cube (see below). Inside each DC were two CPUs, each one measuring 7,500 sq ft and consisting of 60,000 vacuum tubes, 175,000 diodes, 13,000 newfangled transistors, and 256KB of magnetic core RAM, consuming a total of 3MW of power and weighing in at 250 tons. Each CPU — only one operated at a time; the other was kept as a hot spare to minimize downtime — was capable of executing 75,000 instructions per second, which was enough to spit out tons of radar data to 150 CRT consoles.


via: http://www.extremetech.com/computing/151980-inside-ibms-67-billion-sage-the-largest-computer-ever-built


Some Interesting Talks and a Panel With People From Cloudera, Platfora, Greylock and Think Big Analytics

Talks recorded at the New York Data Business Meetup and gathered by Matt Turck, featuring, separately and together in a panel:

  • Jeff Hammerbacher from Cloudera
  • DJ Patil from Greylock
  • Ron Bodkin from Think Big Analytics
  • Ben Werther from Platfora


IBM Accelerates Its Big Data Portfolio

Jeff Kelly takes a look at IBM’s data solutions portfolio:

IBM has the broadest and deepest Big Data product and services portfolio in the industry, as well as the market leading revenue to show for it. But IBM’s greatest asset also lies at the heart of its biggest challenge. With such a diverse set of Big Data capabilities, IBM has struggled to unify them into distinct, compelling offerings. How IBM responds to the challenge of bringing together such a broad and deep set of technologies and services - many the result of $16 billion worth of analytics-related acquisitions since 2005 - into consumable and effective product offerings will largely determine the company’s success (or failure) in the Big Data space and will have major implications for enterprise CIOs.

There are two things that I’m not sure I understand:

  1. Is a confusing product portfolio a known strategy for driving more sales?

    Basically, you offer so many products that customers get confused enough to hire your consultants just to make the buying decision.

  2. When ranking companies by sales, wouldn’t it make more sense to compare revenue per employee rather than raw numbers?

    Which company is better: one with 2 salespeople generating $1mil in revenue ($500k per head), or one with 100 salespeople and 100 consultants generating $20mil ($100k per head)?


via: http://wikibon.org/wiki/v/IBM_Accelerates_Its_Big_Data_Portfolio


Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler

Eric Baldeschwieler’s keynote from Hadoop Summit has been published on YouTube. It’s mainly about the goals and effort behind Hadoop 2.0 and the new tools in the Hadoop ecosystem meant to simplify different aspects of a Hadoop deployment (HCatalog, Ambari, Tez, Stinger Initiative).

✚ Datanami has published a summary of the keynote.



Best NoSQL April’s Fool

I know a few people who avoid the Internet completely on April Fools’ Day. After being tricked every year by my dad, I’m very careful with what I post on that day. This year has been easy on me, but that doesn’t mean there weren’t a couple of good ones.

My favorites:



Hadoop Security Design Paper

Speaking of the buzz around Dataguise’s field-level encryption for Apache Hadoop and their 10 best practices for securing sensitive data in Hadoop, after the break you can find the “Hadoop Security Design” paper written by a team at Yahoo.