Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
Or, from [2]:
Internals of MAP: […] MAP splits up the input value into words. […] MAP is meant to associate each given key/value pair of the input with potentially many intermediate key/value pairs.

Internals of REDUCE: […] [REDUCE] performs imperative aggregation (say, reduction): take many values, and reduce them to a single value.
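To make the quoted descriptions concrete, here is a minimal, single-machine word-count sketch in Python. The helper names (map_fn, group_by_key, reduce_fn) are purely illustrative and not taken from the paper or any particular framework; a real implementation distributes the map and reduce calls across many machines.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # MAP: turn one input key/value pair into many intermediate key/value pairs.
    for word in text.split():
        yield word, 1

def group_by_key(pairs):
    # Stand-in for the framework's shuffle phase: collect all values per key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_fn(word, counts):
    # REDUCE: aggregate the many values for one key into a single value.
    yield word, sum(counts)

documents = {"d1": "to be or not to be", "d2": "to do is to be"}
intermediate = [kv for doc_id, text in documents.items() for kv in map_fn(doc_id, text)]
result = dict(kv for key, values in group_by_key(intermediate) for kv in reduce_fn(key, values))
print(result)  # {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}
```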
I cannot help but think: this is divide & conquer (in the sense of Mergesort), plain and simple! So, is there (conceptual) novelty in MapReduce somewhere, or is it just a new implementation of old ideas useful in certain scenarios?
- US Patent 7,650,331: “System and method for efficient large-scale data processing” (2010)
- “Google’s MapReduce programming model — Revisited” by R. Lämmel (2007)
Asked By : Raphael
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/3019
Answered By : Mike Samuel
I cannot help but think: this is divide & conquer, plain and simple!
M/R is not divide & conquer. It does not involve the repeated application of an algorithm to a smaller subset of the previous input. It’s a pipeline (a function specified as a composition of simpler functions) where pipeline stages are alternating map and reduce operations. Different stages can perform different operations.
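As a rough illustration of that pipeline view (my own sketch, not code from the answer or the paper), the following chains two map/reduce stages that do different things: the first counts each word within each document, the second counts how many documents contain each word. Nothing recurses on a smaller instance of the original problem; each stage is just a map pass followed by a reduce pass over the previous stage's output.

```python
from collections import defaultdict

def mapreduce_stage(pairs, mapper, reducer):
    # One pipeline stage: a map pass, a shuffle by key, then a reduce pass.
    intermediate = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in mapper(key, value):
            intermediate[out_key].append(out_value)
    return [kv for k, vs in intermediate.items() for kv in reducer(k, vs)]

# Stage 1: count occurrences of each word within each document.
def map1(doc, text):
    for word in text.split():
        yield (doc, word), 1

def reduce1(doc_word, ones):
    yield doc_word, sum(ones)

# Stage 2: a different operation entirely -- count how many documents contain each word.
def map2(doc_word, count):
    doc, word = doc_word
    yield word, doc

def reduce2(word, docs):
    yield word, len(set(docs))

inputs = [("d1", "to be or not to be"), ("d2", "to do is to be")]
stage2 = mapreduce_stage(mapreduce_stage(inputs, map1, reduce1), map2, reduce2)
print(dict(stage2))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1, 'do': 1, 'is': 1}
```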
So, is there (conceptual) novelty in MapReduce somewhere, or is it just a new implementation of old ideas useful in certain scenarios?
MapReduce does not break new ground in the theory of computation — it does not show a new way of decomposing a problem into simpler operations. It does show that particular simpler operations are practical for a particular class of problem. The MapReduce paper’s contributions were:
- evaluating a pipeline of two well-understood, orthogonal operators that can be distributed efficiently and fault-tolerantly, applied to a particular problem: creating a text index of a large corpus
- benchmarking map-reduce on that problem to show how much data is transferred between nodes and how latency differences in stages affect overall latency
- showing how to make the system fault-tolerant so that machine failures during computation can be compensated for automatically (a simplified sketch of this re-execution idea follows this list)
- identifying specific useful implementation choices and optimizations
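As a deliberately simplified sketch of that fault-tolerance point (the worker names, failure model, and helper functions below are invented for illustration; the real system's master tracks task state and worker liveness with far more machinery), a map task lost to a machine failure can simply be re-executed elsewhere, because the map function is treated as deterministic and side-effect free:

```python
def run_map_task(worker_id, records, mapper, dead_workers):
    # Simulate assigning a map task to a worker machine that may have failed.
    if worker_id in dead_workers:
        raise RuntimeError(f"worker {worker_id} is unreachable")
    return [kv for record in records for kv in mapper(record)]

def run_with_reexecution(records, mapper, workers, dead_workers):
    # The master notices the failure and reschedules the same task on another
    # worker; since the mapper is deterministic, re-execution yields the same
    # intermediate output, so downstream reduce tasks are unaffected.
    for worker_id in workers:
        try:
            return run_map_task(worker_id, records, mapper, dead_workers)
        except RuntimeError as err:
            print(f"{err}; rescheduling task on another worker")
    raise RuntimeError("no healthy worker available")

word_pairs = lambda line: [(word, 1) for word in line.split()]
output = run_with_reexecution(["to be or not to be"], word_pairs,
                              workers=["w1", "w2", "w3"], dead_workers={"w1"})
print(output)  # [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```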
Some of the critiques fall into these classes:
- “Map/reduce does not break new ground in theory of computation.” True. The original paper’s contribution was that these well-understood operators with a specific set of optimizations had been successfully used to solve real problems more easily and fault-tolerantly than one-off solutions.
- “This distributed computation doesn’t easily decompose into map & reduce operations.” Fair enough, but many others do.
- “A pipeline of n map/reduce stages requires latency proportional to the number of reduce steps of the pipeline before any results are produced.” Probably true. The reduce operator does have to receive all its input before it can produce a complete output.
- “Map/reduce is overkill for this use-case.” Maybe. When engineers find a shiny new hammer, they tend to go looking for anything that looks like a nail. That doesn’t mean that the hammer isn’t a well-made tool for a certain niche.
- “Map/reduce is a poor replacement for a relational DB.” True. If a relational DB scales to your data set, then wonderful for you — you have options.