MapReduce CAT


The MapReduce CAT is a programming model for processing large amounts of data. To use it, the user provides a Map and a Reduce function; these functions define the computation on the data. Parallelization over distributed machines, failure handling, and inter-machine communication are handled by the framework. Although the bodies of the Map and Reduce functions are user-customized, the model restricts their input and output types: the input of the Map function is a <key,value> pair and it returns a set of <key,value> pairs. Afterward, all values stored under the same key are combined into a list stored under that key. These <key,list of values> pairs serve as input for the Reduce function, which typically returns zero or one value as output.
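
Expressed as illustrative Java-style signatures, the type restriction looks roughly as follows. The type parameters and the Pair type are hypothetical names for illustration, not part of any concrete framework API:

import java.util.List;

// Hypothetical pair type for illustration.
record Pair<K, V>(K key, V value) {}

interface Mapper<K1, V1, K2, V2> {
    // Consumes one input <key,value> pair and emits a set of
    // intermediate <key,value> pairs.
    List<Pair<K2, V2>> map(K1 key, V1 value);
}

interface Reducer<K2, V2> {
    // Consumes one intermediate key together with the list of all values
    // stored under that key; typically returns zero or one value.
    List<V2> reduce(K2 key, List<V2> values);
}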


Example

Figure: Example CloudScale Architectural Template instantiation for MapReduce.

Consider the book shop example shown in Simplified_SPOSAD_CAT. One provided functionality could be counting how many times each word occurs across all books in the system. The user would provide the following implementations of the Map and Reduce functions:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(key, result);

The Map function emits a pair of a key and a value. The key is the word itself and the value is "1" for each occurrence of the word; e.g., for each "the" in the books that are part of the input set, <the,1> would be emitted by the Map function. Afterward, the Reduce function takes as input a key and the associated list of values. It sums up the values and returns the total number of occurrences of the word across all books.
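
For illustration, assume a made-up input set whose books contain the words "the book", "the shop", and "the". The data would flow through the phases as follows:

Map output:      <the,1>, <book,1>, <the,1>, <shop,1>, <the,1>
After grouping:  <book,[1]>, <shop,[1]>, <the,[1,1,1]>
Reduce output:   <book,1>, <shop,1>, <the,3>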

The roles' constraints ensure the characteristics of MapReduce:

  • The output of the Map function has to be a list of <key,value> pairs.
  • The input of the Reduce function has to match the output of the Map function.

Note that the input and output types of the Map and Reduce functions depend on the framework; e.g., Hadoop does not use the built-in types of Java but its own types (such as Text and IntWritable), because they are optimized for network serialization.
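
As a sketch, the word count example could be written against Hadoop's Java API roughly as follows. This mirrors the WordCount example shipped with Hadoop, but treat the details as illustrative rather than a tested implementation; note the Hadoop-specific Text and IntWritable types in place of String and int:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// With Hadoop's TextInputFormat, the input key is the byte offset of the
// line, not the document name as in the pseudocode above.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit <word,1> for every word in the line.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum up all counts emitted for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}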

Context

A large amount of data has to be processed. The data can be partitioned into independent blocks, so that parallel processing is facilitated. Furthermore, partial results can be combined into an overall result.

Problem

Imagine you have to implement a system that must process a large amount of data that can be distributed and processed in parallel. Implementing a system for this special purpose is feasible. However, the system has to deal with parallelization over a large cluster of machines, failure handling, load balancing, and so on, which makes its implementation much more complex. As the system has to process large data sets, achieving high performance is important.

Solution

Use MapReduce in a cloud environment that provides the framework.

Structure

In the following figure, we provide the CAT type for MapReduce:

Figure: MapReduce CloudScale Architectural Template type.

Dynamics

Assume user-defined Map and Reduce functions, as in the word count example.

  1. At first, the input set is split into a set of M splits. These M splits can be distributed over multiple machines and processed in parallel.
  2. Each of the M splits is read by a Map task that parses the input into <key,value> pairs and passes each pair to the user-defined Map function.
  3. The code of the Map function is executed and its output, called intermediate data, is buffered in memory.
  4. Afterward, the intermediate data is sorted by key. A combiner function can optionally be applied for local aggregation of the values; e.g., in the word count example, it is very likely that a Map task emits multiple identical outputs, like <the,1>. The combiner function would be called after the Map function returns its output; it sums up the values of the identical outputs and stores the key together with the sum. Note that the combiner function is provided by the user as well, but it is not mandatory (see the note after this list).
  5. The intermediate data is partitioned depending on the number of Reduce tasks (see the partition sketch after this list). Then the data is written to local disk.
  6. After being notified that a Map task has completed, the Reduce task reads the intermediate data and sorts it by key. Afterward, values stored under the same key are grouped together.
  7. The Reduce task iterates over the sorted intermediate data and calls the Reduce function for each <key, list of values> pair. The result is written to a final output file dedicated to this Reduce task.
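
Regarding step 4: for reductions that are associative and commutative, like summing counts, the combiner can simply be the Reduce function itself; the Implementation section below shows how this is wired up in Hadoop. Regarding step 5: the default partition function in the original MapReduce design is hash(key) mod R, where R is the number of Reduce tasks, and Hadoop's default HashPartitioner behaves the same way. A minimal sketch, assuming string keys (the method name is illustrative):

// Assigns an intermediate key to one of numReduceTasks partitions (step 5).
// Mirrors the default "hash(key) mod R" scheme from Dean and Ghemawat;
// masking with Integer.MAX_VALUE keeps the hash non-negative.
static int partition(String key, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}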

Implementation

Use Hadoop, an open-source implementation of the MapReduce programming model.
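
A minimal job setup, assuming the WordCountMapper and WordCountReducer classes from the sketch in the Example section (modeled on the WordCount driver shipped with Hadoop; treat it as illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Reusing the reducer as combiner is valid here because summing
        // is associative and commutative (step 4 of the Dynamics).
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}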

Known Uses

Google, Facebook, Yahoo

Consequences

We identified the following benefits for MapReduce:

  • Systems that have to process large data sets in parallel become scalable, because the framework distributes the Map and Reduce tasks over the available machines.

We are aware of the following liabilities for MapReduce:

  • Only applicable when the input can be partitioned into independent blocks.
  • The input and output types of the Map and Reduce functions are restricted.


References

  • White, T. "Hadoop: The Definitive Guide". O'Reilly Media, Inc., 2009.
  • Dean, J. and Ghemawat, S. "MapReduce: Simplified Data Processing on Large Clusters". Communications of the ACM 51(1), January 2008, 107-113.
  • Apache Hadoop (hadoop.apache.org)
  • Apache Hadoop Best Practices (developer.yahoo.com)