MapReduce is a software framework implemented by Google to support parallel computations over large (greater than 100 terabyte) data sets on unreliable clusters of computers. This framework is largely taken from map and reduce functions commonly used in functional programming.[1]

MapReduce implementations have been written in C++, Java and other languages.

MapReduce implementations have been written in [C++], [Java (programming language)|Java] and other languages.

Dataflow

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

an input reader a Map function a partition function a compare function a Reduce function an output writer

Input reader

The input reader divides the input into 16MB to 128MB splits and the framework assigns one split to each Map function. The input reader reads the data from stable storage (typically a distributed file system like Google File System) and generates key/value pairs.

A common example will read a directory full of text files and return each line as a record.

Map function

Each Map function gets series of key/value pairs; processes each; and generates 0 or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.

If the application is doing a word count, the map function would break the line into words and output the word as the key and 1 as the value.

Partition function

The output of all of the maps is allocated to particular reduces by the applications's partition function. The partition function is given the key and the number of reduces and returns the index of the desired reduce.

A typical default is to hash the key and modulo the number of reduces.

Comparison function

The input for each reduce is pulled from the machine where the map ran and sorted using the application's comparison function.

Reduce function

The framework calls the application's reduce function once for each unique key in the sorted order. The reduce can iterate through the values that are associated with that key and output 0 or more key/value pairs.

In the word count example, the reduce function takes the input values, sums them and generates a single output of the word and the final sum.

Output writer

The Output Writer writes the output of the reduce to stable storage, usually a distributed file system, such as Google File System.