


bigdata: All content about bigdata in NoSQL databases and polyglot persistence

Hortonworks raises $100M to grow engineering and company's ecosystem globally

Derrick Harris for GigaOm has the scoop:

Hadoop vendor Hortonworks has raised $100 million in a new round of venture capital led by BlackRock and Passport Capital. The company’s existing investors — Dragoneer, Tenaya Capital, Benchmark, Index Ventures and Yahoo — also participated in the latest round. Hortonworks CEO Rob Bearden said in an interview that the new funding will help Hortonworks scale its engineering efforts, grow the company’s ecosystem and scale its global operations.

Last week’s round E for Cloudera turned out to be $160M instead of the $200M rumored by Bloomberg.

These big rounds raised by the Hadoop pure-players confirm the Hadoop market. But I also think they can be explained by the tough competition Cloudera and Hortonworks are facing, at least in terms of budget, from large corporations like IBM, Teradata, Oracle, and Microsoft.

✚ While some of the above-mentioned companies are partnering with at least one pure-play Hadooper (Cloudera, Hortonworks, MapR), that doesn’t mean they are not keeping an eye on the prize.

Original title and link: Hortonworks raises $100M to grow engineering and company’s ecosystem globally (NoSQL database©myNoSQL)


Examples of analytics applications across industries

A great matrix of the different analytics use cases across industries in Hortonworks’s post “Enterprise Hadoop and the Journey to a Data Lake”:

Analytics use cases

The data type column section covers multiple dimensions of data. And the authors took a conservative approach for the structured and unstructured categories (in the sense that they marked very few categories as unstructured).

A couple of interesting exercises that can be done using this matrix as an input:

  1. figure out how adding data from different categories to a specific use case would benefit it. One obvious example: how would telecom companies benefit from adding social data to their infrastructure analysis?

    Building on the above, decide what tools exist to help with this extra scenario.

  2. can one use case from an industry be applied to a different industry to disrupt it?

    What would be the quickest road to accomplish it?


Big doubts on big data: Why I won't be sharing my medical data with anyone

Jo Best (ZDNet) talking about the privacy concerns of having centralized, non-regulated, non-anonymised healthcare data:

If ever there was an open goal for big data, healthcare should be it.

By gathering information from doctors, patients, drug companies, insurers, and charities, and putting the big data machinery to work on analysing it, we should be able to get better insights into a range of conditions and then come up with better ways to treat them.

I’m happy I’m not the only one concerned about all this.



MapR product strategy

Maria Deutscher (SiliconAngle) quoting MapR CMO Jack Norris:

The MapR strategy centers on what chief marketing officer Jack Norris described in an interview as a “proven business model of really focusing on a product, selling a product, making a product enterprise grade, utilizing the innovations of the community but providing some [additional] advantages so customers can be even more successful.”

I thought that part of a proven business model is innovating on the product, and less so utilizing the innovations of the community. Or at least finding some ways to pay back for those community innovations.



Stranger in a strange land: HPC and Big Data

Paul Mineiro sharing his notes and thoughts after attending an HPC event:

My plan was to observe the HPC community, try to get a feel how their worldview differs from my internet-centric “Big Data” mindset, and broaden my horizons. Intriguingly, the HPC guys are actually busy doing the opposite. They’re aware of what we’re up to, but they talk about Hadoop like it’s some giant livin’ in the hillside, comin down to visit the townspeople. Listening to them mapping what we’re up to into their conceptual landscape was very enlightening, and helped me understand them better.

No more ivory towers.



What can Big Data mean to healthcare?

Eric Baldeschwieler:

If you had such a system, learning and hypothesis testing in medicine would accelerate tremendously! Some examples: You could do quality of care of outcome metrics on all procedures over time and ID best practices; You could quickly look for interesting correlations between infected patients to ID causes of medical conditions or at least correlations; You could datamine all reported illnesses against all unusual gene sequences; You could correlate activity, demographics, diet with outcomes and ID best practices; Etc, Etc, Etc. You would see an amazing acceleration of learning, quality of care, patient-specific therapies, …

Unfortunately, on the receiving side of all this extremely private data are entities that, at least in some parts of the world, don’t have the best reputation and are very focused on their own profit.



Pig cheat sheet

Cheat sheet? Check. Pig? Check. Where do I get it?


Some of my favorite data visualization resources

Pretty much everything that contains the words visualization and data in the title gets my attention. Even more so if it promises a list of resources that could help me learn a bit of the art of visualization. Aaron Cordova’s list contains books, sites, and tools:

Visualization is more art than science at this point, although some have used it enough to be able to identify successful techniques for various purposes. Most successful visualizations are perhaps more dependent upon the decision or task at hand than the actual original structure of the data.



The Hadoop as ETL part in migrating from MongoDB to Cassandra at FullContact

While I’ve found the whole post very educational (and very balanced, considering the topic), the part that I’m linking to is about integrating MongoDB with Hadoop. After reading the story of integrating MongoDB and Hadoop at Foursquare, there were quite a few questions bugging me. This post doesn’t answer any of them, but it brings in some more details about existing tools, a completely different solution, and what seems to be an overarching theme when using Hadoop and MongoDB in the same sentence:

We’re big users of Hadoop MapReduce and tend to lean on it whenever we need to make large scale migrations, especially ones with lots of transformation. That fact along with our existing conversion project from before, we used 10gen’s mongo-hadoop project which has input and output formats for Hadoop. We immediately realized that the InputFormat which connected to a MongoDB cluster was ill-suited to our usage. We had 3TB of partially-overlapping data across 2 clusters. After calculating input splits for a few hours, it began pulling documents at an uncomfortably slow pace. It was slow enough, in fact, that we developed an alternative plan.

You’ll have to read the post to learn how they accomplished their goal, but as a spoiler, it was once again more of an ETL process than an integration.
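To make the "ETL rather than integration" distinction concrete, here is a minimal sketch of the extract, transform, and load shape over exported documents. It is not FullContact's actual pipeline, which the excerpt above does not describe. The `contact_id` field, the sample records, and the "later document wins" merge rule are all assumptions made for illustration.

```python
import json
from collections import defaultdict
from typing import Iterable, Iterator


def extract(lines: Iterable[str]) -> Iterator[dict]:
    """Extract: parse one exported JSON document per line, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def transform(docs: Iterable[dict]) -> dict:
    """Transform: merge partially overlapping documents that share a key.

    Later documents win on conflicting fields. This is an assumption made
    for the sketch, not a rule taken from the post.
    """
    merged = defaultdict(dict)
    for doc in docs:
        # "contact_id" is a hypothetical key name.
        merged[doc["contact_id"]].update(doc)
    return dict(merged)


def load(rows: dict) -> list:
    """Load: emit rows ordered by key, standing in for writes to the target store."""
    return [rows[key] for key in sorted(rows)]


# Hypothetical dumps from two clusters with one overlapping contact.
cluster_a = [
    '{"contact_id": "c1", "name": "Ada"}',
    '{"contact_id": "c2", "name": "Bob"}',
]
cluster_b = ['{"contact_id": "c1", "email": "ada@example.com"}']

rows = load(transform(extract(cluster_a + cluster_b)))
print(rows)
```

The point of the shape is that the source database is only read once, as a dump, and nothing in the transform step needs a live connection to it. That is what separates an ETL job from an integration that queries the source cluster while the job runs.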

✚ The corresponding HN thread; it’s focused mostly on the MongoDB-to-Cassandra parts.



Hadoop vs Redshift

This is how Yaniv Mor’s “Hadoop vs. Redshift” ends:

We have a tie! Huh!? Didn’t Hadoop win most of the rounds? Yes, it did, but Big Data’s superheroes are better off working together as a team rather than fighting. Turn on the Hadoop-Signal when you need relatively cheap data storage, batch processing of petabytes, or processing data in non-relational formats. Call out to red-caped Redshift for analytics, fast performance for terabytes, and an easier transition for your PostgreSQL team. As Airbnb concluded in their benchmark: “We don’t think Redshift is a replacement of the Hadoop family due to its limitations, but rather it is a very good complement to Hadoop for interactive analytics”. We Agree.

I’m wondering why anyone would spend 1337 words on an apples-to-oranges comparison.



A guide to write and run Giraph jobs on Hadoop

A good setup guide by Mirko Kämpf:

In this how-to, you will learn how to use Giraph 1.0.0 on top of CDH 4.x using a simple example dataset, and run example jobs that are already implemented in Giraph. You will also learn how to set up your own Giraph-based development environment. The end result will be a setup (not intended for production) for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.


Anatomy of the Giraph data flow



Hadoop and Teradata’s business

Earlier today I posted about Teradata’s take on the evolution of databases. As expected, everything is safe and under control. Now this report from Larry Dignan for ZDNet about Teradata’s Q4 earnings call presents Teradata’s perspective on Hadoop:

Teradata’s fourth quarter earnings were solid, but analysts peppered management with questions about Hadoop as data warehouse revenue worries persist.

Teradata CEO Mike Koehler and CFO Steve Scheppmann talked Hadoop throughout the company’s conference call. Was Hadoop taking Teradata’s business away? What’s the revenue hit? Can Teradata co-exist?

Once again everything is safe with a bright future. Until it isn’t anymore and Hadoop eats the enterprise data warehouse space. In Teradata’s defense, they were one of the first companies to look seriously at Hadoop and come up with a coherent positioning.
