


bigdata: All content tagged as bigdata in NoSQL databases and polyglot persistence

Big Data lessons from Netflix

Phil Simon (Wired) covers some details of Netflix’s “Big Data Platform as a Service @ Netflix” presentation (alternatively titled “Watching Pigs Fly with the Netflix Hadoop Toolkit”):

At Netflix, comparing the hues of similar pictures isn’t a one-time experiment conducted by an employee with far too much time on his hands. It’s a regular occurrence. Netflix recognizes that there is tremendous potential value in these discoveries. To that end, the company has created the tools to unlock that value. At the Hadoop Summit, Magnusson and Smith talked about how data on titles, colors, and covers helps Netflix in many ways. For one, analyzing colors allows the company to measure the distance between customers. It can also determine, in Smith’s words, the “average color of titles for each customer in a 216-degree vector over the last N days.”

While quite fascinating, I’m wondering how one could prove the value of such details: there is no obvious way to validate them with an A/B test, a predictive model, or a historical analysis.
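The “216-degree vector” in the quote most plausibly means a 216-dimensional color vector: quantizing each RGB channel into 6 levels gives 6 × 6 × 6 = 216 bins, the size of the old web-safe palette. Here is a toy sketch of that kind of computation, with invented data and no claim to Netflix’s actual pipeline:

```python
def color_bin(r, g, b):
    # Quantize each 0-255 channel into 6 levels: 6 * 6 * 6 = 216 bins,
    # one plausible reading of the 216-dimensional vector from the talk.
    return (r // 43) * 36 + (g // 43) * 6 + (b // 43)

def average_color_vector(title_colors):
    # title_colors: one dominant (r, g, b) color per title a customer
    # watched over the last N days; returns a normalized 216-bin histogram.
    vec = [0.0] * 216
    for rgb in title_colors:
        vec[color_bin(*rgb)] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]

def customer_distance(u, v):
    # Euclidean distance between two customers' average color vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Two hypothetical customers: one watches dark titles, one light ones.
noir_fan = average_color_vector([(10, 10, 20), (30, 25, 15)])
comedy_fan = average_color_vector([(240, 220, 180), (250, 240, 200)])
print(customer_distance(noir_fan, comedy_fan))
```

With vectors like these, “distance between customers” becomes an ordinary nearest-neighbor problem, which is the kind of thing a Hadoop toolkit computes at scale.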

Original title and link: Big Data lessons from Netflix (NoSQL database©myNoSQL)


Connected devices, side data selling businesses, and privacy

Parmy Olson (Forbes) looks into an alternative route businesses selling connected devices (e.g. Nest, Fitbit, Jawbone) are looking into:

For privacy reasons both self-insured employers and those with group insurance have to bring on a population-management firm such as StayWell or Welltok to manage the data as a neutral third party. Amy McDonough, who oversees Fitbit’s employer program, wouldn’t comment on how Fitbit data would affect pricing negotiations between employers and health care providers, though health insurer Cigna said fitness trackers “may” have an impact on future group insurance pricing. The data are still being tested.

The conclusion is what worries me:

In other words, most people don’t really care about how many steps they’ve taken each day, but they do care about their insurance and energy bills.

How long before we get a series of completely opaque industry metrics, credit-score style, that determine your health and life insurance, your suitability for a job, or your admission to a school? It all starts with a little carrot at the end of a stick. If not accompanied by strict regulation, it will just become another discriminatory cash cow for large corporations.

Original title and link: Connected devices, side data selling businesses, and privacy (NoSQL database©myNoSQL)


A retrospective of two years of Big Data with Andrew Brust

Andrew Brust on his way out from ZDNet to GigaOm Research:

As much as I chide the Hadoop world for having started out artificially siloed and aloof, it did the industry a great service: it took the mostly-ossified world of databases, data warehouses and BI and made it dynamic again.

Suddenly, the incumbent players had to respond, add value to their products, and innovate rapidly. It’s hard to imagine that having happened without Hadoop.

Original title and link: A retrospective of two years of Big Data with Andrew Brust (NoSQL database©myNoSQL)


Hadoop distro for IBM’s Mainframe

IBM and its partner Veristorm are working to merge the worlds of big data and Big Iron with zDoop, a new offering unveiled last week that offers Apache Hadoop running in the mainframe’s Linux environment.

Three hip-hip-hoorays for Hadoop on mainframes.

Original title and link: Hadoop distro for IBM’s Mainframe (NoSQL database©myNoSQL)


Which companies produce more than 10TB of data per day?

There are a couple of interesting answers on Quora, but this part of Michael E. Driscoll’s stands out:

You could even get 100s of daily TBs of data yourself: if you can afford the network bandwidth fees, there are ~100 marketplaces (Twitter’s MoPub, Google’s AdX, Facebook’s FBX to name a few) that surface approximately 200 Billion advertising auctions per day. You can build a bidder, get a seat on their exchanges, and make millions of daily trades — you’ll just need to convince a brand to act as their broker, and take your 20% cut of spend.

Original title and link: Which companies produce more than 10TB of data per day? (NoSQL database©myNoSQL)


Hadoop and big data: Where Apache Slider slots in and why it matters

Arun Murthy for ZDNet about Apache Slider:

Slider is a framework that allows you to bridge existing always-on services and makes sure they work really well on top of YARN without having to modify the application itself. That’s really important.

Right now it’s HBase and Accumulo but it could be Cassandra, it could be MongoDB, it could be anything in the world. That’s the key part.

I couldn’t find the project on the Incubator page.

Original title and link: Hadoop and big data: Where Apache Slider slots in and why it matters (NoSQL database©myNoSQL)


Price Comparison for Big Data Appliance and Hadoop

The main differences between Oracle Big Data Appliance and a DIY approach are:

  1. A DIY system - at list price with basic installation but no optimization - is a staggering $220 cheaper as an initial purchase
  2. A DIY system - at list price with basic installation but no optimization - is almost $250,000 more expensive over 3 years.
  3. The support for the DIY system includes five (5) vendors. Your hardware support vendor, the OS vendor, your Hadoop vendor, your encryption vendor as well as your database vendor. Oracle Big Data Appliance is supported end-to-end by a single vendor: Oracle
  4. Time to value. While we trust that your IT staff will get the DIY system up and running, the Oracle system allows for a much faster “loading dock to loading data” time. Typically a few days instead of a few weeks (or even months)
  5. Oracle Big Data Appliance is tuned and configured to take advantage of the software stack, the CPUs and InfiniBand network it runs on
  6. Any issue we, you or any other BDA customer finds in the system is fixed for all customers. You do not have a unique configuration, with unique issues on top of the generic issues.

This is coming from Oracle. Now, without nitpicking the prices (I’m pretty sure you’ll find better numbers for the different components), how do you sell DIY Hadoop to a potential customer who has taken a look at this?
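Prices aside, the structure of the argument is plain total-cost-of-ownership arithmetic: a slightly lower sticker price can invert once recurring multi-vendor support fees accumulate. A sketch with entirely made-up figures, not Oracle’s or any vendor’s actual pricing:

```python
def three_year_tco(initial, annual_support, years=3):
    # Total cost of ownership: up-front price plus recurring support fees.
    return initial + annual_support * years

# Hypothetical figures, shaped only to mirror the two claims above:
# DIY is $220 cheaper up front but carries heavier support costs.
diy = three_year_tco(initial=499_780, annual_support=250_000)
appliance = three_year_tco(initial=500_000, annual_support=165_000)
print(diy - appliance)  # the DIY premium over three years
```

The lesson is vendor-neutral: any comparison that stops at the purchase order misses where most of the money goes.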

Original title and link: Price Comparison for Big Data Appliance and Hadoop (NoSQL database©myNoSQL)


Hadoop analytics startup Karmasphere sells itself to FICO

Derrick Harris (GigaOm):

The Fair Isaac Corporation, better known as FICO, has acquired the intellectual property of Hadoop startup Karmasphere. Karmasphere launched in 2010, and was one of the first companies to push the idea of an easy, visual interface for analyzing Hadoop data, and even analyzing it using traditional SQL queries.

Original title and link: Hadoop analytics startup Karmasphere sells itself to FICO (NoSQL database©myNoSQL)


We will find the author of the Bitcoin whitepaper even if he doesn’t want us to

Nermin Hajdarbegovic (CoinDesk):

A group of forensic linguistics experts from Aston University believe the real creator of bitcoin is former law professor Nick Szabo.

Dr. Grieve explained:

The number of linguistic similarities between Szabo’s writing and the bitcoin whitepaper is uncanny, none of the other possible authors were anywhere near as good of a match.

Privacy is all gone.

Original title and link: We will find the author of the Bitcoin whitepaper even if he doesn’t want us to (NoSQL database©myNoSQL)

Hortonworks: the Red Hat of Hadoop

However, John Furrier, founder of SiliconANGLE, posits that Hortonworks, with their similar DNA being applied in the data world, is, in fact, the Red Hat of Hadoop. “The discipline required,” he says, “really is a long game.”

It looks like Hortonworks’s positioning has been successful in that they are now perceived as the true (and only) open sourcerers.

Original title and link: Hortonworks: the Red Hat of Hadoop (NoSQL database©myNoSQL)


Apache Hadoop 2.4.0 released with operational improvements

Hadoop 2.4.0 continues that momentum, with additional enhancements to both HDFS & YARN:

  • Support for Access Control Lists in HDFS
  • Native support for Rolling Upgrades in HDFS
  • Smooth operational upgrades with protocol buffers for HDFS FSImage
  • Full HTTPS support for HDFS
  • Support for Automatic Failover of the YARN ResourceManager (a.k.a Phase 1 of YARN ResourceManager High Availability)
  • Enhanced support for new applications on YARN with Application History Server and Application Timeline Server
  • Support for strong SLAs in YARN CapacityScheduler via Preemption
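The HDFS ACL support in the first bullet is exposed through new subcommands of the hdfs CLI. A minimal sketch, assuming a running 2.4.0 cluster with dfs.namenode.acls.enabled set to true in hdfs-site.xml and an existing /data directory:

```shell
# Grant user 'alice' read/write access on /data beyond the POSIX owner/group bits
hdfs dfs -setfacl -m user:alice:rw- /data

# Inspect the resulting ACL entries
hdfs dfs -getfacl /data

# Strip all extended ACL entries again
hdfs dfs -setfacl -b /data
```

Before 2.4.0, granting one extra user access to a path meant juggling groups; ACLs make it a one-liner.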

Original title and link: Apache Hadoop 2.4.0 released with operational improvements (NoSQL database©myNoSQL)


Your Big Data Is Worthless if You Don’t Bring It Into the Real World

Building on exactly the same premise as last week’s article Big data: are we making a big mistake?, Mikkel Krenchel and Christian Madsbjerg write for Wired:

Not only did Google Flu Trends largely fail to provide an accurate picture of the spread of influenza, it will never live up to the dreams of the big-data evangelists. Because big data is nothing without “thick data,” the rich and contextualized information you gather only by getting up from the computer and venturing out into the real world. Computer nerds were once ridiculed for their social ineptitude and told to “get out more.” The truth is, if big data’s biggest believers actually want to understand the world they are helping to shape, they really need to do just that.

While the authors actually mean the above literally, I think the valid point the article could have made is that looking at a data set alone without considering:

  1. possibly missing data,
  2. context data and knowledge,
  3. and field know-how

can lead to incorrect conclusions; the most obvious examples are the causal fallacy and the confusion of correlation with causation.
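The correlation-causation confusion is easy to demonstrate: two series that merely share a trend can correlate almost perfectly without any causal link between them. A toy sketch with invented numbers:

```python
def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Two series that both simply grow over time -- say, yearly ice-cream
# sales and yearly drowning counts (numbers invented) -- correlate
# almost perfectly, yet neither causes the other: both track a third
# factor (here, the passage of time; in the classic example, summer).
years = list(range(10))
ice_cream = [100 + 10 * t for t in years]
drownings = [20 + 3 * t + (1 if t % 2 else -1) for t in years]
print(round(pearson(ice_cream, drownings), 3))
```

Without the context and field know-how listed above, a correlation this strong is exactly the kind of number that invites a wrong conclusion.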

✚ Somewhat related to the “possibly missing data” point, the article How politics makes us stupid brings up some other very interesting points.

Original title and link: Your Big Data Is Worthless if You Don’t Bring It Into the Real World (NoSQL database©myNoSQL)