mapreduce: All content tagged as mapreduce in NoSQL databases and polyglot persistence
Shaun Connolly (Hortonworks) lists the 3 most commons usages of Hadoop in a guest post on GigaOm:
- Data refinery
- Data exploration
- Application enrichment
Nothing new here, except the new buzzwords used to describe those Hadoop use cases that were slowly, but steadily establishing as patterns. And even if they sound nicer than ETL, analytics, etc. I doubt anyone needed new terms.
Original title and link: The three most common ways data junkies are using Hadoop ( ©myNoSQL)
MapReduce has become a dominant parallel computing paradigm for big data, i.e., colossal datasets at the scale of tera-bytes or higher. Ideally, a MapReduce system should achieve a high degree of load balancing among the participating machines, and minimize the space usage, CPU and I/O time, and network transfer at each machine. Although these principles have guided the development of MapReduce algorithms, limited emphasis has been placed on enforcing serious constraints on the aforementioned metrics simultaneously. This paper presents the notion of minimal algorithm, that is, an algorithm that guarantees the best parallelization in multiple aspects at the same time, up to a small constant factor. We show the existence of elegant minimal algorithms for a set of fundamental database problems, and demonstrate their excellent performance with extensive experiments.
Start with the definition of the minimal MapReduce algorithms and you’ll find yourself diving into the paper (even if the proof parts are complex).