


The White House report recommends that the president take new steps to enhance consumer privacy in the age of big data

Zeke J. Miller for Time:

There are also three recommendations that Podesta is encouraging Obama to order the federal government to take up, including extending existing privacy protections to non-U.S. citizens and people not in the country, and ensuring that data collected in schools is only used for educational purposes. Additionally, the report calls on the federal government to build up the capability to be able to spot discriminatory uses of “big data” by companies and the government. “The detailed personal profiles held about many consumers, combined with automated, algorithm-driven decision-making, could lead—intentionally or inadvertently—to discriminatory outcomes, or what some are already calling ‘digital redlining,’” Podesta warned.

Original title and link: The White House report recommends that the president take new steps to enhance consumer privacy in the age of big data (NoSQL database©myNoSQL)


Findings of the Big Data and Privacy Working Group Review

John Podesta, the leader of the group assigned by the White House to look at the present and future of Big Data and privacy:

No matter how quickly technology advances, it remains within our power to ensure that we both encourage innovation and protect our values through law, policy, and the practices we encourage in the public and private sector. To that end, we make six actionable policy recommendations in our report to the President.



The future of Big Data and its impact on privacy

Tom Simonite summarized the 5 (big) concerns detailed in a White House report about the potential and risks of big data:

The 68-page report was published today and repeatedly emphasizes that big data techniques can advance the U.S. economy, government, and public life. But it also spends a lot of time warning of the potential downsides, saying in the introduction that:

“A significant finding of this report is that big data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace.”

I can only hope that having such clear warning signs raised at the right level will lead to equally clear legislation protecting everyone's privacy.



Docker, Hadoop and YARN

Jack Clark (The Register) covers the work done to integrate Docker with Hadoop:

“Where Docker makes perfect sense for YARN is that we can use Docker Images to fully describe the entire unix filesystem image for any YARN container,” explained Arun Murthy, a founder and architect at Hortonworks, to El Reg in an email.



MapReduce jobs profiling with R

Only good things can come out of this combination. And the code is available on GitHub:

At SequenceIQ in order to profile MapReduce jobs, understand (job) internal statistics and create useful graphs many times we rely on R. The metrics are collected from Ambari and the YARN History Server.

In this blog post we would like to explain and guide you through a simple process of collecting MapReduce job metrics, calculate different statistics and generate easy to understand charts.
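The "calculate different statistics" step can be illustrated with a minimal sketch. The original post works in R; Python is used here only for illustration. The function name `summarize_durations` and the sample values are hypothetical, and nothing below reflects the actual SequenceIQ code or the format of the metrics returned by Ambari or the YARN History Server.

```python
import math
import statistics


def summarize_durations(durations_ms):
    """Return basic summary statistics for a list of task durations (ms)."""
    if not durations_ms:
        raise ValueError("durations_ms must not be empty")
    ordered = sorted(durations_ms)
    # Nearest-rank 95th percentile: the smallest value with at least
    # 95% of the samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical map-task durations in milliseconds (made-up values,
# not taken from any real job).
sample = [1200, 1350, 1280, 4100, 1310, 1295, 1400, 1330]
summary = summarize_durations(sample)
print(summary)
```

With a single slow task in the sample, the mean is pulled well above the median, which is the kind of skew a per-job chart would make visible.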



The essence of Pig

I love this line from Wes Floyd’s slidedeck:

“Essence of Pig: Map-Reduce is too low a level, SQL too high”


Big Data lessons from Netflix

Phil Simon (Wired) covers some details of Netflix’s “Big Data Platform as a Service @ Netflix” talk (alternatively titled “Watching Pigs Fly with the Netflix Hadoop Toolkit”):

At Netflix, comparing the hues of similar pictures isn’t a one-time experiment conducted by an employee with far too much time on his hands. It’s a regular occurrence. Netflix recognizes that there is tremendous potential value in these discoveries. To that end, the company has created the tools to unlock that value. At the Hadoop Summit, Magnusson and Smith talked about how data on titles, colors, and covers helps Netflix in many ways. For one, analyzing colors allows the company to measure the distance between customers. It can also determine, in Smith’s words, the “average color of titles for each customer in a 216-degree vector over the last N days.”

While this is quite fascinating, I wonder how one could prove the value of such details. There’s no way you can run an A/B test or a predictive model or a historic model analysis.
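To make the quoted idea concrete, here is a minimal sketch of an "average color per customer" and a "distance between customers". The article does not describe Netflix's actual representation or metric, so everything here is an assumption: three-component RGB tuples stand in for the 216-degree vector, Euclidean distance stands in for whatever measure Netflix uses, and the function names and sample data are hypothetical.

```python
import math


def average_color(title_colors):
    """Component-wise mean of a list of equal-length color vectors."""
    if not title_colors:
        raise ValueError("title_colors must not be empty")
    dims = len(title_colors[0])
    return [
        sum(color[i] for color in title_colors) / len(title_colors)
        for i in range(dims)
    ]


def distance(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical cover colors (RGB) of titles two customers watched recently.
alice_titles = [(255, 0, 0), (205, 50, 0)]   # mostly red covers
bob_titles = [(0, 0, 255), (50, 0, 205)]     # mostly blue covers

alice_avg = average_color(alice_titles)
bob_avg = average_color(bob_titles)
print(alice_avg, bob_avg, distance(alice_avg, bob_avg))
```

Under these assumptions, customers who watch similarly colored titles end up close together, and the distance grows as their average colors diverge.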



Amazon Web Services Global Infrastructure Graph

Super-smart and impressive application of a graph database to a real domain:

Wouldn’t it be nice if you could slice and dice through the entire AWS domain of services, data centres and prices all in one spot to optimise your AWS bill? Enter the AWS Global Infrastructure Graph!



Spark for Data Science: A Case Study

A great practical intro to Apache Spark by Casey Stella of Hortonworks:

This sounds like a great challenge and an even greater opportunity to try out a new (to me) analytics platform, Apache Spark. So, I’m going to take you through a little journey doing some simple analysis and illustrate the general steps. We’re going to cover:

  • Data Gathering
  • Data Engineering
  • Data Analysis
  • Presentation of Results and Conclusions



Project Secor: Long-term S3 storage for Kafka logs

A new project open sourced by Pinterest, Secor:

Project Secor was born from the need to persist messages logged to Kafka to S3 for long-term storage. Data lost or corrupted at this stage isn’t recoverable so the greatest design objective for Secor is data integrity.



Upcoming Webinar: Practical Guide to SQL - NoSQL Migration [sponsor]

This is a reminder for the upcoming webinar organized by myNoSQL supporters at Aerospike:

Avoid common pitfalls of NoSQL deployment with the best practices in this May 8 webinar with Anton Yazovskiy of Thumbtack Technology. He will review key questions to ask before migration, and differences in data modeling and architectural approaches. Finally, he will walk you through a typical application based on RDBMS and will migrate it to NoSQL step by step.

Register now for the webinar.


If HANA fails, SAP dies: Teradata CTO

Ben Rossi (InformationAge) writes about SAP HANA and the risk taken on by database companies that bet everything on in-memory solutions:

Naturally, Oracle, IBM and Microsoft have pushed in-memory technology as extensions to their existing database products using hybrid architectures, while SAP, new to the space altogether, has championed an ‘all or nothing’ architecture.

SAP is not the only company that made this bet. In-memory databases will always be around, so the only risky part is going all-in.
