
Complex data manipulation in Cascalog, Pig, and Hive

Bruno Bonacci brings up some very good points about why using a single, coherent solution for manipulating data results in higher productivity, by comparing it with what Pig and Hive require:

In languages like Pig and Hive, in order to perform complex manipulations of your data you have to write User Defined Functions (UDFs). UDFs are a great way to extend the basic functionality, but for Hive and Pig you have to use a different language to write them, as SQL and Pig Latin have only a handful of built-in functions and lack basic control structures. Both offer the possibility of writing UDFs in a number of different languages (which is great); however, this requires a programming paradigm switch by the developer. Pig allows you to write UDFs in Java, Jython, JavaScript, Groovy, Ruby, and Python; for Hive you need to write them in Java (good article here).

I won’t use Java UDFs as the example, as the comparison wouldn’t be fair (life is too short to write them in Java), so let’s assume that you want to write a UDF for Pig and you want to use Python. If you go for the JVM version (Jython) you won’t be able to use existing modules from the Python ecosystem (unless they are pure Python). The same goes for Ruby and JavaScript. If you decide to use native Python, you will have the setup burden of installing Python, and all the modules you intend to use, on every Hadoop task node.

So: you start with a language such as Pig Latin or SQL; you have to write, compile, and bundle UDFs in a different language; you are constrained to the plain language without imported modules, or you face the extra burden of additional setup; and, as if that were not enough, you have to smooth over the type differences between the two languages as data flows back and forth with the UDF. For me that’s enough to say that we can do better. Cascalog is a Clojure DSL, so your main language is Clojure, your custom functions are Clojure, the data is represented in Clojure data types, and the runtime is the JVM: no switch required, no additional compilation required, no installation burden, and you can use all the available libraries of the JVM ecosystem.
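To make the paradigm switch concrete, here is roughly what the Java side of even a trivial Pig UDF looks like. A minimal sketch: EvalFunc and exec(Tuple) are Pig's actual extension API, while the UpperCase function itself is only a hypothetical example.

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // A trivial Pig UDF that upper-cases a chararray field. Even this
    // much requires a separate Java source file, compilation, packaging
    // into a jar, and a REGISTER statement in the Pig Latin script.
    public class UpperCase extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }

And on the Pig Latin side you would still have to REGISTER the jar and invoke the function inside a FOREACH … GENERATE: two languages, two toolchains, one pipeline.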

I’m not a big fan of SQL, except in the cases where it truly belongs; SQL-on-Hadoop is my least favorite topic, second perhaps only to the overall complexity of the ecosystem. In the space of multi-format/unstructured data I’ve always liked the pragmatism and legibility of Pig. But the OP is definitely right about the added complexity.

This also reminded me of the Python vs. R “war”.

Original title and link: Complex data manipulation in Cascalog, Pig, and Hive (NoSQL database©myNoSQL)

via: http://blog.brunobonacci.com/2014/06/01/cascalog-by-examples-part1/


Time to regulate Big Data?

I had a conversation recently on this subject. As someone born and raised in a communist country, I find the prospect of having no control over what data exists about you, and who owns it, deeply concerning. Terrifying, even.

For years, data brokers have been collecting and selling billions of pieces of your personal information — from your income to your shopping habits to your medical ailments. Now federal regulators say it’s time you have more control over what’s collected and whether it will be used at all.

After reading this post I was finally close to tears of joy. Then I realized that this bill would need to pass first. And with the right lobbying, that might actually never happen (as in: “But so far Rockefeller’s bill has gone nowhere”).

Original title and link: Time to regulate Big Data? (NoSQL database©myNoSQL)

via: http://money.cnn.com/2014/05/27/pf/ftc-big-data/


Cascading components for Big Data applications

Jules S. Damji in a quick intro to Cascading:

At the core of most data-driven applications is a data pipeline through which data flows, originating from Taps and Sources (ingestion) and ending in a Sink (retention) while undergoing transformation along a pipeline (Pipes, Traps, and Flows). And should something fail, a Trap (exception) must handle it. In the big data parlance, these are aspects of ETL operations.
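To make those components concrete, here is a minimal sketch of Cascading's well-known word-count assembly in Java. The Taps, Pipes, and Flow are Cascading's real building blocks (2.x-style imports); the input and output paths are hypothetical.

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCountFlow {
        public static void main(String[] args) {
            // Source Tap (ingestion) and Sink Tap (retention)
            Tap source = new Hfs(new TextLine(), "hdfs:/input/docs");
            Tap sink = new Hfs(new TextLine(), "hdfs:/output/wordcount");

            // The Pipe assembly: emit one "word" per regex match,
            // group by word, then count each group
            Pipe assembly = new Pipe("wordcount");
            assembly = new Each(assembly, new Fields("line"),
                    new RegexGenerator(new Fields("word"), "\\S+"));
            assembly = new GroupBy(assembly, new Fields("word"));
            assembly = new Every(assembly, new Count(new Fields("count")));

            // The Flow ties Taps and Pipes together and runs them as Hadoop jobs
            Flow flow = new HadoopFlowConnector()
                    .connect("wordcount", source, sink, assembly);
            flow.complete();
        }
    }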

You have to agree that, compared with the MapReduce model, these components could bring a lot of readability to your code. On the other hand, at first glance the Cascading API still feels verbose.

Original title and link: Cascading components for Big Data applications (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/cascading-hadoop-big-data-whatever/


Cloudera, Hadoop, Data warehouses and SLR camera

Amr Awadallah in an interview with Dan Woods for Forbes:

Our advantage is that we can encompass more data and run more workloads with less friction than any other platform. The analogy I use most often is the difference between the SLR camera and the camera on your smart phone. Almost everyone takes more pictures on their smart phone than on their SLR.

The SLR camera is like the enterprise data warehouse. The SLR camera is really, really good at taking pictures, in the same sense that an enterprise data warehouse is really, really good at running queries. But that’s the only thing it does. The data it picks is only exposed to that workload. The system we provide, the enterprise data hub, is more like the smartphone. It can take decent pictures—they won’t be as good as the SLR camera, and in this I’m referring to the Impala system. So Impala will run queries. The queries won’t run at the same interactive OLAP speeds that you get from a high-end data warehouse. However, for many use cases, that performance might be good enough, given that the cost is 10 times lower.

I’ve linked in the past to Ben Thompson’s visualizations of the innovator’s dilemma:

[Image: Ben Thompson’s visualization of the innovator’s dilemma]

The explanation goes like this: incumbents’ products usually over-serve consumer needs, leaving room for new entrants’ good-enough, lower-priced products.

Original title and link: Cloudera, Hadoop, Data warehouses and SLR camera (NoSQL database©myNoSQL)

via: http://www.forbes.com/sites/danwoods/2014/05/09/clouderas-strategy-for-conquering-big-data-the-enterprise/


The state of big data in 2014

The (big) data market through the eyes of a VC, Matt Turck of FirstMark Capital:

Still early: Overall, we’re still in the early innings of this market. Over the last couple of years, some promising companies failed (for example: Drawn to Scale), a number saw early exits (for example: Precog, Prior Knowledge, Lucky Sort, Rapleaf, Nodeable, Karmasphere), and a handful saw more meaningful outcomes (for example: Infochimps, Causata, Streambase, ParAccel, Aspera, GNIP, BlueFin labs, BlueKai).

Original title and link: The state of big data in 2014 (NoSQL database©myNoSQL)

via: http://venturebeat.com/2014/05/11/the-state-of-big-data-in-2014-chart/


The beauty and challenge of Hadoop

Chad Carson describes in a short but persuasive way how Hadoop gets inside companies and the first challenges that follow:

We hear stories like this all the time, though sometimes the urgent email turns out to be from the CEO! These scenarios follow a common pattern in Hadoop adoption: Hadoop is such a flexible, scalable system that it’s easy for an engineer to quickly grab data that could never before be combined in one place, write some jobs, and get interesting results. Sometimes the results are so interesting that other teams start using them, and all of a sudden the company’s business depends on something that started as an experiment.

Original title and link: The beauty and challenge of Hadoop (NoSQL database©myNoSQL)

via: http://pepperdata.com/blog/2014-05/when-hadoop-sneaks-out-of-the-sandbox/


White House Report Warns Of 'Big Data' Abuses

Devin Coldewey for NBC News:

To that end, the report offers six major policy recommendations:

  • A Consumer Privacy Bill of Rights that codifies what people can expect when opting in or out of data collection programs.
  • Stringent requirements on preventing and reporting data breaches.
  • Privacy protection for more than just U.S. citizens as a global gesture of good faith.
  • Ensure data collected in schools is used only for educational purposes.
  • Prevent big data from being used as a method of discrimination (so-called “digital redlining”).
  • Update the Electronic Communications Privacy Act (ECPA) to be consonant with an age of cloud computing, mobile data, and email.

Who would oppose such clear recommendations and what would be their arguments?

Original title and link: White House Report Warns Of ‘Big Data’ Abuses (NoSQL database©myNoSQL)

via: http://www.nbcnews.com/tech/security/white-house-report-warns-big-data-abuses-n95081


The White House report recommends that the president take new steps to enhance consumer privacy in the age of big data

Zeke J. Miller for Time:

There are also three recommendations that Podesta is encouraging Obama to order the federal government to take up, including extending existing privacy protections to non-U.S. citizens and people not in the country, and ensuring that data collected in schools is only used for educational purposes. Additionally, the report calls on the federal government to build up the capability to spot discriminatory uses of “big data” by companies and the government. “The detailed personal profiles held about many consumers, combined with automated, algorithm-driven decision-making, could lead—intentionally or inadvertently—to discriminatory outcomes, or what some are already calling ‘digital redlining,’” Podesta warned.

Original title and link: The White House report recommends that the president take new steps to enhance consumer privacy in the age of big data (NoSQL database©myNoSQL)

via: http://time.com/84338/obama-eyes-enhanced-privacy-protections-in-big-data-era/


Findings of the Big Data and Privacy Working Group Review

John Podesta, the leader of the group assigned by the White House to look at the present and future of Big Data and privacy:

No matter how quickly technology advances, it remains within our power to ensure that we both encourage innovation and protect our values through law, policy, and the practices we encourage in the public and private sector. To that end, we make six actionable policy recommendations in our report to the President.

Original title and link: Findings of the Big Data and Privacy Working Group Review (NoSQL database©myNoSQL)

via: http://www.whitehouse.gov/blog/2014/05/01/findings-big-data-and-privacy-working-group-review


The future of Big Data and its impact on privacy

Tom Simonite summarized the 5 (big) concerns detailed in a White House report about the potential and risks of big data:

The 68-page report was published today and repeatedly emphasizes that big data techniques can advance the U.S. economy, government, and public life. But it also spends a lot of time warning of the potential downsides, saying in the introduction that:

“A significant finding of this report is that big data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace.”

I can only hope that having all these clear warning signs raised at the right level will lead to similarly clear legislation protecting the privacy of all.

Original title and link: The future of Big Data and its impact on privacy (NoSQL database©myNoSQL)

via: http://www.technologyreview.com/view/527071/five-things-obamas-big-data-experts-warned-him-about/


Docker, Hadoop and YARN

Jack Clark (The Register) covers the work done to integrate Docker with Hadoop:

“Where Docker makes perfect sense for YARN is that we can use Docker Images to fully describe the entire unix filesystem image for any YARN container,” explained Arun Murthy, a founder and architect at Hortonworks, to El Reg in an email.

Original title and link: Docker, Hadoop and YARN (NoSQL database©myNoSQL)

via: http://www.theregister.co.uk/2014/05/02/docker_hadoop/


The essence of Pig

I love this line from Wes Floyd’s slide deck:

“Essence of Pig: Map-Reduce is too low a level, SQL too high”
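For anyone who hasn’t felt that first-hand, here is the canonical word-count mapper from the Hadoop MapReduce tutorial. And this is just the mapper; a reducer and a driver class are still needed, whereas Pig Latin expresses the entire job in roughly five declarative lines.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // The canonical word-count mapper: emits (word, 1) for every token
    // in the input line. Pig Latin states this step in a single
    // FOREACH ... GENERATE FLATTEN(TOKENIZE(line)) line.
    public class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }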

Original title and link: The essence of Pig (NoSQL database©myNoSQL)