mapreduce: All content tagged as mapreduce in NoSQL databases and polyglot persistence

LinkedIn's Hourglass: Incremental data processing in Hadoop

Matthew Hayes introduces Hourglass, a very interesting new framework from LinkedIn:

Hourglass is designed to make computations over sliding windows more efficient. For these types of computations, the input data is partitioned in some way, usually according to time, and the range of input data to process is adjusted as new data arrives. Hourglass works with input data that is partitioned by day, as this is a common scheme for partitioning temporal data.

Hourglass is available on GitHub.

We have found that two types of sliding window computations are extremely common in practice:

  • Fixed-length: the length of the window is set to some constant number of days and the entire window moves forward as new data becomes available. Example: a daily report summarizing the number of visitors to a site from the past 30 days.
  • Fixed-start: the beginning of the window stays constant, but the end slides forward as new input data becomes available. Example: a daily report summarizing all visitors to a site since the site launched.
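
✚ To make the two window types concrete, here is a minimal Java sketch of how both aggregates can be updated incrementally when a new day’s partition arrives. It illustrates the idea only; the class, the method names, and the sample numbers are invented, and this is not Hourglass’s actual API.

    import java.time.LocalDate;
    import java.util.TreeMap;

    // Conceptual sketch only; Hourglass itself is a library of Hadoop jobs, not this API.
    public class SlidingWindowSketch {

        // Fixed-length window: add the newest day's partition and drop the day that fell out.
        static long fixedLength(TreeMap<LocalDate, Long> daily, long previousTotal,
                                LocalDate newDay, int days) {
            long total = previousTotal + daily.getOrDefault(newDay, 0L);
            return total - daily.getOrDefault(newDay.minusDays(days), 0L);
        }

        // Fixed-start window: the start never moves, so only the newest day is added.
        static long fixedStart(TreeMap<LocalDate, Long> daily, long previousTotal,
                               LocalDate newDay) {
            return previousTotal + daily.getOrDefault(newDay, 0L);
        }

        public static void main(String[] args) {
            TreeMap<LocalDate, Long> daily = new TreeMap<>();
            daily.put(LocalDate.of(2013, 12, 1), 120L);  // day that drops out of the 30-day window
            daily.put(LocalDate.of(2013, 12, 31), 95L);  // newly arrived partition

            // Reusing yesterday's totals instead of rescanning every partition is the
            // point of incremental processing over day-partitioned input.
            System.out.println("last 30 days: "
                    + fixedLength(daily, 3000L, LocalDate.of(2013, 12, 31), 30));
            System.out.println("since launch: "
                    + fixedStart(daily, 4500L, LocalDate.of(2013, 12, 31)));
        }
    }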

Original title and link: LinkedIn’s Hourglass: Incremental data processing in Hadoop (NoSQL database©myNoSQL)

via: http://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop


Hadoop will be made better through engineering

Dan Woods prefacing an interview with Scott Gnau of Teradata:

In this vision, because Hadoop can store unstructured and structured information, because it can scale massively, because it is open source, because it allows many forms of analysis, because it has a thriving ecosystem, it will become the one repository to rule them all.

In my view, the most extreme advocates for Hadoop need to sober up and right size both expectations and time frames. Hadoop is important but it won’t replace all other repositories. Hadoop will change the world of data, but not in the next 18 months. The open source core of Hadoop is a masterful accomplishment, but like many open source projects, it will be made better through engineering.

You have to agree: there’s no engineering behind Hadoop. Just a huge number of intoxicated… brogrammers.

Original title and link: Hadoop will be made better through engineering (NoSQL database©myNoSQL)

via: http://www.forbes.com/sites/danwoods/2013/12/03/teradatas-scott-gnau-praises-hadoop-for-the-right-reasons/


Pigs can build graphs too for graph analytics

An extremely interesting use and extension of Pig at Intel:

Pigs eat everything and Pig can ingest many data formats and data types from many data sources, in line with our objectives for Graph Builder. Also, Pig has native support for local file systems, HDFS, and HBase, and tools like Sqoop can be used upstream of Pig to transfer data into HDFS from relational databases. One of the most fascinating things about Pig is that it only takes a few lines of code to define a complex workflow comprised of a long chain of data operations. These operations can map to multiple MapReduce jobs and Pig compiles the logical plan into an optimized workflow of MapReduce jobs. With all of these advantages, Pig seemed like the right tool for graph ETL, so we re-architected Graph Builder 2.0 as a library of User-Defined Functions (UDFs) and macros in Pig Latin.
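
✚ To give a feel for what a library of UDFs and macros in Pig Latin looks like, here is a hedged Java sketch of a user-defined function built on Pig’s EvalFunc API. The UDF itself (ExtractEdge, its fields, and the edge label) is invented for illustration and is not Graph Builder code.

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Hypothetical UDF: turns a (user, page) input record into a labeled edge tuple.
    // Graph Builder's real UDFs and macros do far more; this only shows the shape of a Pig UDF.
    public class ExtractEdge extends EvalFunc<Tuple> {
        private static final TupleFactory FACTORY = TupleFactory.getInstance();

        @Override
        public Tuple exec(Tuple input) throws IOException {
            if (input == null || input.size() < 2) {
                return null; // emit null for malformed records; they can be filtered out downstream
            }
            Tuple edge = FACTORY.newTuple(3);
            edge.set(0, input.get(0));   // source vertex, e.g. a user id
            edge.set(1, input.get(1));   // target vertex, e.g. a page id
            edge.set(2, "VIEWED");       // edge label, invented for this sketch
            return edge;
        }
    }

In a Pig Latin script such a UDF is registered with REGISTER and applied in a FOREACH … GENERATE statement, which is how a long chain of data operations stays at only a few lines.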

Original title and link: Pigs can build graphs too for graph analytics (NoSQL database©myNoSQL)

via: http://blogs.intel.com/intellabs/2013/12/17/pigs-can-build-graphs-too/


Datameer raises $19 Million

Announced yesterday:

“This funding is entirely about allowing us to meet the nonstop global demand for our product. Across every industry, companies are moving past Hadoop science projects and realizing they need a proven big data analytics tool that finally frees them from schemas and ETL,” said Stefan Groschupf, CEO of Datameer.

Funding in the Hadoop space is at a higher level than in the pure NoSQL database market. In the Big Data/BI market it’s easier to grasp the competitors and the market potential they’re fighting for. In the NoSQL market, many are still afraid to believe that some of these players will actually make (big) dents in incumbents’ market segments.

Original title and link: Datameer raises $19 Million (NoSQL database©myNoSQL)

via: http://www.datameer.com/company/news/press-releases/datameer-secures-19million-global-demand-self-service-bigdata-analytics-hadoop.html


Challenges and Opportunities for Big Data - an interview with Actian's CTO Mike Hoskins

Roberto V. Zicari interviews Actian’s CTO Mike Hoskins:

Until recently, most data projects were solely focused on preparation. Seminal developments in the big data landscape, including Hortonworks Data Platform (HDP) 2.0 and the arrival of YARN (Yet Another Resource Negotiator) – which takes Hadoop’s capabilities in data processing beyond the limitations of the highly regimented and restrictive MapReduce programming model – provide an opportunity to move beyond the initial hype of big data and instead towards the more high-value work of predictive analytics.

Original title and link: Challenges and Opportunities for Big Data - an interview with Actian’s CTO Mike Hoskins (NoSQL database©myNoSQL)

via: http://www.odbms.org/blog/2013/12/challenges-and-opportunities-for-big-data-interview-with-mike-hoskins/


Picking the Right Platform: Big Data or Traditional Warehouse?

Stephen Swoyer (TDWI) summarizes Richard Winter’s research on the cost efficiency of Hadoop versus data warehouses:

“Under what circumstances, in fact, does Hadoop save you a lot of money, and under what circumstances does a data warehouse save you a lot of money?”

The conversation happened at a Teradata event, so you might already guess some of the findings. Anyway, without seeing the data it’s difficult to agree or disagree:

In fact, he argued that misusing Hadoop for some types of decision support workloads could cost up to 2.8x more than a data warehouse.

Original title and link: Picking the Right Platform: Big Data or Traditional Warehouse? (NoSQL database©myNoSQL)

via: http://tdwi.org/Articles/2013/12/17/Picking-Right-DW-Platform.aspx?Page=1&p=1


The three most common ways data junkies are using Hadoop

Shaun Connolly (Hortonworks) lists the 3 most common uses of Hadoop in a guest post on GigaOm:

  1. Data refinery
  2. Data exploration
  3. Application enrichment

Nothing new here, except the new buzzwords used to describe those Hadoop use cases that were slowly but steadily establishing themselves as patterns. And even if they sound nicer than ETL, analytics, etc., I doubt anyone needed new terms.

Original title and link: The three most common ways data junkies are using Hadoop (NoSQL database©myNoSQL)


Hadoop in Fedora 20

Being included in the default Fedora distro is yet another big step for Hadoop.

The hardest part about getting Hadoop into Fedora? “Dependencies, dependencies, dependencies!” says Farrellee. […]

For Hadoop? It was more difficult than usual. “There were some dependencies that were just missing and we had to work through those as you’d expect - there were a lot of these. Then there were dependencies that were older than what upstream was using - rare, I know, for Fedora, which aims to be on the bleeding edge. The hardest to deal with were dependencies that were newer than what upstream was using. We tried to write patches for these, but we weren’t always successful. […]”

On the other hand, one thing continues to puzzle me: how many different people, coming from different backgrounds, need to say that Hadoop is crazy complex?

Original title and link: Hadoop in Fedora 20 (NoSQL database©myNoSQL)

via: http://www.linux.com/news/featured-blogs/196-zonker/752637-focus-on-fedora-20-features-hadoop-in-heisenbug


Minimal MapReduce Algorithms - Paper

Abstract of the paper authored by a team from universities in Hong Kong, Korea, and Singapore:

MapReduce has become a dominant parallel computing paradigm for big data, i.e., colossal datasets at the scale of terabytes or higher. Ideally, a MapReduce system should achieve a high degree of load balancing among the participating machines, and minimize the space usage, CPU and I/O time, and network transfer at each machine. Although these principles have guided the development of MapReduce algorithms, limited emphasis has been placed on enforcing serious constraints on the aforementioned metrics simultaneously. This paper presents the notion of minimal algorithm, that is, an algorithm that guarantees the best parallelization in multiple aspects at the same time, up to a small constant factor. We show the existence of elegant minimal algorithms for a set of fundamental database problems, and demonstrate their excellent performance with extensive experiments.

Start with the definition of minimal MapReduce algorithms and you’ll find yourself diving into the paper (even if the proofs are complex).
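
✚ For orientation, the way such minimality conditions are usually formalized looks roughly like the sketch below, with n the input size, t the number of machines, and T_seq the cost of the best sequential algorithm. This is my paraphrase for illustration, not the paper’s exact statement.

    % Hedged paraphrase of the minimality criteria; consult the paper for the precise definition.
    \begin{itemize}
      \item Minimum footprint: every machine stores $O(n/t)$ words at all times.
      \item Bounded net-traffic: every machine sends and receives $O(n/t)$ words over all rounds.
      \item Constant round number: the algorithm finishes in $O(1)$ MapReduce rounds.
      \item Optimal computation: every machine performs $O(T_{seq}/t)$ computation in total.
    \end{itemize}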


HAWK: Performance monitoring tool for Hive

JunHo Cho’s slides introduce HAWK, a performance monitoring tool for Hive.

✚ I couldn’t find a link for HAWK. The slides are pointing to NexR.

Original title and link: HAWK: Performance monitoring tool for Hive (NoSQL database©myNoSQL)


Apache Ambari is now an Apache Top Level Project

Hortonworks:

We are very excited to announce that Apache Ambari has graduated out of Incubator and is now an Apache Top Level Project!

Ambari is a framework for provisioning, managing, and monitoring Hadoop clusters.

✚ Such a tool is usually part of Hadoop distributions, and in some cases it comes in proprietary form.

Original title and link: Apache Ambari is now an Apache Top Level Project (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/apache-ambari-graduates-to-apache-top-level-project/


A quick guide to using Sentry authorization in Hive

A guide to Apache Sentry:

Sentry brings in fine-grained authorization support for both data and metadata in a Hadoop cluster. It is already being used in production systems to secure the data and provide fine-grained access to its users. It is also integrated with the version of Hive shipping in CDH (upstream contribution is pending), Cloudera Impala, and Cloudera Search.
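
✚ As a concrete, but hypothetical, illustration of what fine-grained authorization means under the policy-file model Sentry used at the time: groups map to roles, and roles map to privileges scoped down to a server, database, table, and action. The group, role, and object names below are invented.

    # Hypothetical Sentry policy-file fragment; names are illustrative only.
    [groups]
    # Hadoop/OS group -> Sentry role
    analysts = analyst_role

    [roles]
    # Role -> privilege: read-only access to one table in one database
    analyst_role = server=server1->db=sales->table=orders->action=select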

Original title and link: A quick guide to using Sentry authorization in Hive (NoSQL database©myNoSQL)

via: https://blogs.apache.org/sentry/entry/getting_started