


The state of big data in 2014

The (big) data market through the eyes of a VC, Matt Turck of FirstMark Capital:

Still early: Overall, we’re still in the early innings of this market. Over the last couple of years, some promising companies failed (for example: Drawn to Scale), a number saw early exits (for example: Precog, Prior Knowledge, Lucky Sort, Rapleaf, Nodeable, Karmasphere), and a handful saw more meaningful outcomes (for example: Infochimps, Causata, Streambase, ParAccel, Aspera, GNIP, BlueFin labs, BlueKai).

Original title and link: The state of big data in 2014 (NoSQL database©myNoSQL)


The beauty and challenge of Hadoop

Chad Carson describes, briefly but persuasively, how Hadoop finds its way into companies and the first challenges that follow:

We hear stories like this all the time, though sometimes the urgent email turns out to be from the CEO! These scenarios follow a common pattern in Hadoop adoption: Hadoop is such a flexible, scalable system that it’s easy for an engineer to quickly grab data that could never before be combined in one place, write some jobs, and get interesting results. Sometimes the results are so interesting that other teams start using them, and all of a sudden the company’s business depends on something that started as an experiment.



What versions of Erlang should you use with CouchDB

Russell Branca goes through a list of Erlang versions to identify those that are safe to use with CouchDB:

There has been some discussion on what versions of Erlang CouchDB should support, and what versions of Erlang are detrimental to use. Sadly there were some pretty substantial problems in the R15 line and even parts of R16 that are landmines for CouchDB. This post will describe the current state of things and make some potential recommendations on approach.

Very useful.



Choice of NoSQL databases from Cloudera

Adam Fowler[1] looks at the potential confusion for Cloudera’s customers when talking about NoSQL databases:

As for Cloudera customers I’m not too sure. It may confuse people asking Cloudera about NoSQL. Below is a potential conversation that, as a sales engineer for NoSQL vendor MarkLogic, I can see easily happening:

This announcement struck me as overly publicized. It’s normal for companies with similar interests to partner, but a fair amount of care should go into clearing up any possible confusion, and I don’t think that happened.

Just to summarize: Cloudera provides support for HBase and Accumulo, and it has deals with MongoDB and Oracle. I assume that in the sales process, Cloudera will go with: “we work with whatever you already have in place”. As for recommending a NoSQL solution to their customers, it will probably go as in Adam Fowler’s post. To which we could probably add Oracle too.

  1. Adam Fowler works for MarkLogic. 



White House Report Warns Of 'Big Data' Abuses

Devin Coldewey for NBC News:

To that end, the report offers six major policy recommendations:

  • A Consumer Privacy Bill of Rights that codifies what people can expect when opting in or out of data collection programs.
  • Stringent requirements on preventing and reporting data breaches.
  • Privacy protection for more than just U.S. citizens as a global gesture of good faith.
  • Ensure data collected in schools is used only for educational purposes.
  • Prevent big data from being used as a method of discrimination (so-called “digital redlining”).
  • Update the Electronic Communications Privacy Act (ECPA) to be consonant with an age of cloud computing, mobile data, and email.

Who would oppose such clear recommendations and what would be their arguments?



The White House report recommends that the president take new steps to enhance consumer privacy in the age of big data

Zeke J. Miller for Time:

There are also three recommendations that Podesta is encouraging Obama to order the federal government to take up, including extending existing privacy protections to non-U.S. citizens and people not in the country, and ensuring that data collected in schools is only used for educational purposes. Additionally, the report calls on the federal government to build up the capability to be able to spot discriminatory uses of “big data” by companies and the government. “The detailed personal profiles held about many consumers, combined with automated, algorithm-driven decision-making, could lead—intentionally or inadvertently—to discriminatory outcomes, or what some are already calling ‘digital redlining,’” Podesta warned.



Findings of the Big Data and Privacy Working Group Review

John Podesta, the leader of the group assigned by the White House to look at the present and future of Big Data and privacy:

No matter how quickly technology advances, it remains within our power to ensure that we both encourage innovation and protect our values through law, policy, and the practices we encourage in the public and private sector. To that end, we make six actionable policy recommendations in our report to the President



The future of Big Data and its impact on privacy

Tom Simonite summarized the 5 (big) concerns detailed in a White House report about the potential and risks of big data:

The 68-page report was published today and repeatedly emphasizes that big data techniques can advance the U.S. economy, government, and public life. But it also spends a lot of time warning of the potential downsides, saying in the introduction that:

“A significant finding of this report is that big data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace.”

I can only hope that having such clear warning signs raised at the right level will lead to similarly clear legislation protecting the privacy of all.



Docker, Hadoop and YARN

Jack Clark (The Register) covers the work done to integrate Docker with Hadoop:

“Where Docker makes perfect sense for YARN is that we can use Docker Images to fully describe the entire unix filesystem image for any YARN container,” explained Arun Murthy, a founder and architect at Hortonworks, to El Reg in an email.



MapReduce jobs profiling with R

Only good things can come out of this combination. And the code is available on GitHub:

At SequenceIQ, in order to profile MapReduce jobs, understand (job) internal statistics and create useful graphs, many times we rely on R. The metrics are collected from Ambari and the YARN History Server.

In this blog post we would like to explain and guide you through a simple process of collecting MapReduce job metrics, calculate different statistics and generate easy to understand charts.
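The kind of per-job statistics the post describes can be illustrated with a small sketch. It is not SequenceIQ's code: the original uses R and pulls metrics from Ambari and the YARN History Server, while the example below is plain Python over hand-written records. The record layout (`type`, `elapsed_ms`) and the `summarize` helper are assumptions made up for illustration.

```python
from statistics import mean, median

# Hypothetical task records. In the original post the metrics come from
# Ambari and the YARN History Server; these field names are assumptions.
tasks = [
    {"type": "MAP", "elapsed_ms": 1200},
    {"type": "MAP", "elapsed_ms": 1500},
    {"type": "MAP", "elapsed_ms": 900},
    {"type": "REDUCE", "elapsed_ms": 4000},
    {"type": "REDUCE", "elapsed_ms": 6000},
]


def summarize(task_records):
    """Group elapsed times by task type and compute basic statistics."""
    by_type = {}
    for record in task_records:
        by_type.setdefault(record["type"], []).append(record["elapsed_ms"])
    return {
        task_type: {
            "count": len(values),
            "mean_ms": mean(values),
            "median_ms": median(values),
            "max_ms": max(values),
        }
        for task_type, values in by_type.items()
    }


if __name__ == "__main__":
    for task_type, stats in summarize(tasks).items():
        print(task_type, stats)
```

A summary like this is the usual starting point for charts such as task duration per phase; spotting a reduce task that runs far longer than the median is a common first hint of data skew.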



The essence of Pig

I love this line from Wes Floyd’s slidedeck:

“Essence of Pig: Map-Reduce is too low a level, SQL too high”


Big Data lessons from Netflix

Phil Simon (Wired) covers some details of Netflix’s “Big Data Platform as a Service @ Netflix” talk (alternatively titled “Watching Pigs Fly with the Netflix Hadoop Toolkit”):

At Netflix, comparing the hues of similar pictures isn’t a one-time experiment conducted by an employee with far too much time on his hands. It’s a regular occurrence. Netflix recognizes that there is tremendous potential value in these discoveries. To that end, the company has created the tools to unlock that value. At the Hadoop Summit, Magnusson and Smith talked about how data on titles, colors, and covers helps Netflix in many ways. For one, analyzing colors allows the company to measure the distance between customers. It can also determine, in Smith’s words, the “average color of titles for each customer in a 216-degree vector over the last N days.”

While this is quite fascinating, I’m wondering how one could prove the value of such details. There’s no way you can run an A/B test, a predictive model, or a historical analysis on them.
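To make the idea in the quote concrete, here is a toy sketch of averaging title colors per customer and measuring the distance between two customers. Everything in it is an assumption: the article does not describe Netflix's actual representation or distance measure, so the example uses 3-component RGB tuples and Euclidean distance, and the names `average_color` and `customer_distance` are hypothetical.

```python
import math


def average_color(colors):
    """Component-wise mean of equally sized color vectors."""
    n = len(colors)
    dims = len(colors[0])
    return tuple(sum(color[i] for color in colors) / n for i in range(dims))


def customer_distance(colors_a, colors_b):
    """Euclidean distance between two customers' average title colors."""
    return math.dist(average_color(colors_a), average_color(colors_b))


if __name__ == "__main__":
    # Made-up RGB cover colors of titles watched over "the last N days".
    alice = [(200, 30, 30), (180, 50, 40)]  # mostly red covers
    bob = [(20, 40, 200), (40, 60, 180)]    # mostly blue covers
    print(average_color(alice))
    print(customer_distance(alice, bob))
```

A small distance would suggest two customers gravitate toward similarly colored covers. Whether that signal is useful is exactly the open question raised above.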
