todo: this has been copied from Notion, and has to be split into multiple documents
MapReduce is a programming model designed to address the main challenges of cluster architecture. It allows big datasets to be processed in parallel, distributing and balancing the load across the nodes in the cluster.
This pattern is useful when there are many read operations and few updates on large files, and mainly when the jobs are batch ones, i.e. neither interactive nor real-time.
It’s not suitable for problems that require random access to data, since files are split into contiguous blocks, nor for working with graphs or otherwise interdependent data.
In an implementation of this model, each node has both its own local file system and access to a distributed file system (HDFS in the case of Hadoop).
There are three main components:
- Chunk Servers: large data files are split into contiguous chunks of a fixed size, usually 16 to 64 MB. Each chunk is also replicated across multiple nodes, usually with 2 or 3 replicas per chunk, each replica on a different node and at least one replica on a different rack, to avoid losing data if a whole rack fails. The computation is moved to where the data is, in order to reduce network communication.

- Master Node: its purpose is to store metadata about the files in the distributed file system, for example how many chunks each file is split into and on which node each chunk is located. There is more than one master node, so that the whole system does not fail if a single master node crashes.
- Client API: it allows clients to access the data stored on the chunk servers, by asking the master node through the API where a particular chunk is located. Once the master node replies with the needed information, the client and the chunk server communicate directly, without involving the master node again.
MapReduce process
The MapReduce paradigm takes as input a set of key-value pairs, and outputs another set of key-value pairs.
The programmer has to define just two methods, the map and the reduce. The framework provides an intermediate step between the two (grouping the intermediate pairs by key), which most of the time is crucial for solving the problem.
Let $(k_i, v_i)$ be the input pairs; the two methods are then defined as follows:
- $\text{map}(k_i, v_i) \to \{(k'_j, v'_j)\}^*$, meaning it outputs a sequence of zero or more intermediate pairs; the function is called for each input key-value pair.
- $\text{reduce}(k'_i, \{v'_i\}^*) \to \{(k'_i, v''_i)\}^*$, meaning that all values with the same key are reduced together; the function is called for each unique key $k'_i$.
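The two phases can be sketched with a toy, single-process runner in Python (`run_mapreduce`, `map_fn` and `reduce_fn` are illustrative names, not part of any real framework; the shuffle is simulated with an in-memory dictionary):

```python
from collections import defaultdict

def run_mapreduce(pairs, map_fn, reduce_fn):
    # Map phase: map_fn(k, v) yields zero or more intermediate (k', v') pairs.
    groups = defaultdict(list)
    for k, v in pairs:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)  # shuffle: group values by intermediate key
    # Reduce phase: reduce_fn(k', [v', ...]) is called once per unique key.
    return {k2: reduce_fn(k2, vs) for k2, vs in groups.items()}

# Example: sum the values associated with each key.
pairs = [("a", 1), ("b", 2), ("a", 3)]
result = run_mapreduce(pairs,
                       map_fn=lambda k, v: [(k, v)],
                       reduce_fn=lambda k, vs: sum(vs))
print(result)  # {'a': 4, 'b': 2}
```

In a real framework the map and reduce calls run on different nodes and the shuffle moves data over the network; the signatures, however, stay the same.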
Example of a MapReduce scenario
Suppose you have to count the words in a very large text document (tens of TB). Since it will not fit entirely in main memory, we can use MapReduce to tackle the problem.
The text file can be split into different blocks of text, so in this case the input will be a list of key-value pairs of the form (document name, text block).
The code that solves this problem, and nicely fits the MapReduce philosophy, is the following shell pipeline (where print_words is a hypothetical command that emits one word per line):
print_words doc.txt | sort | uniq -c
- The map function is print_words, since it takes as input a key-value pair and outputs a list of key-value pairs of the form (word, 1). Note that in this list the value is always 1, meaning that if a word appears multiple times, there are multiple instances of the same key-value pair.
- The intermediate group-by step is sort, which puts together all the pairs that have the same key.
- The reduce function is uniq -c, which takes as input a list of key-value pairs of the form (word, 1) and outputs (word, count), where count is the sum of the values.
Note that both the input and the output are stored in the distributed file system, while all the intermediate results are stored on each node’s local file system, and are written to the distributed one only after the reduce function has been applied.
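The same job can be written as explicit map and reduce functions. A minimal single-process sketch in Python (function and variable names are illustrative; a real run would distribute the blocks across nodes):

```python
import re
from collections import defaultdict

def map_words(doc_name, text):
    # map: emit (word, 1) for every word occurrence in the text block.
    for word in re.findall(r"\w+", text.lower()):
        yield (word, 1)

def reduce_counts(word, ones):
    # reduce: sum the 1s emitted for this word.
    return sum(ones)

# Input pairs of the form (document name, text block).
blocks = [("doc.txt:0", "the quick brown fox"),
          ("doc.txt:1", "the lazy dog")]

groups = defaultdict(list)  # shuffle: group intermediate pairs by word
for name, text in blocks:
    for word, one in map_words(name, text):
        groups[word].append(one)

counts = {w: reduce_counts(w, ones) for w, ones in groups.items()}
print(counts["the"])  # 2
```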
Physical Block vs Logical Split
A physical block is the fixed size into which files are divided in the file system. Since each mapper is assigned a unit of input, we can define a logical split, which can be a fraction of the physical block or a multiple of it, depending on the situation. Each logical split is assigned to one mapper.
Choosing the size of the logical splits is a trade-off: smaller splits give better load balancing and faster recovery from failures, but more task-scheduling overhead; larger splits give the opposite.
The reducer is then applied to each group of pairs that share the same key.
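The trade-off can be made concrete with a quick calculation (the file and split sizes below are illustrative):

```python
import math

# Illustrative numbers: a 10 TB input file, 128 MB physical blocks.
file_size = 10 * 1024**4   # bytes
block_size = 128 * 1024**2  # bytes

# Each logical split becomes one map task.
for split_size in (block_size // 2, block_size, 4 * block_size):
    n_tasks = math.ceil(file_size / split_size)
    print(split_size // 1024**2, "MB split ->", n_tasks, "map tasks")
```

Halving the split size doubles the number of map tasks: more parallelism and finer-grained recovery, but also more work for the scheduler.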

Failure Detection
The master node is in charge of detecting failures: it periodically pings mappers and reducers to check whether they are still up. Whenever a map task is completed, the mapper sends a notification to the master node.
In case of failure, we have to distinguish different cases, depending on which node fails:
- A map worker node fails: all the tasks that were completed or in progress on the failed worker are reset to idle, and will eventually be reassigned to another node;
- A reduce worker node fails: in this case only the in-progress tasks are reset to idle, since the results of completed tasks are already in the distributed file system;
- The master node fails: in this case the whole process is aborted.
How many map and reduce tasks should we have?
Usually, we should have $M \gg n$ and $R < M$, where $M$ is the number of map tasks, $R$ is the number of reduce tasks and $n$ is the number of nodes in the cluster.
This is because $M \gg n$ ensures fast recovery from failures: a failed node’s many small tasks can be spread across the remaining nodes.
Combiners
A map task may produce many identical key-value pairs, and it may not be efficient to pass all of them to the reducer.
Combiners work just like reducers, but on the node where the map task is executed. In this way, the node will send fewer key-value pairs through the network to the reducers.
This can be done only when the reducer is able to aggregate already-aggregated values, meaning the operation has to be commutative and associative. For example, sum has those properties, but the average does not. For the latter there is a trick: instead of averaging, the combiner emits the total of the values together with the number of instances, so that the reducer can compute the final average.
Some operations, though, are simply not compatible, like taking the median.
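The (sum, count) trick for averaging can be sketched as follows (`combine_partial` and `reduce_avg` are illustrative names, not framework API):

```python
def combine_partial(values):
    # Combiner, run on the map node: collapse the local values for a key
    # into a single (sum, count) pair before sending over the network.
    return (sum(values), len(values))

def reduce_avg(partials):
    # Reducer: merge the partial (sum, count) pairs, then divide once.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Values for one key, as seen on two different map nodes:
node_a = [2, 4, 6]
node_b = [10]
partials = [combine_partial(node_a), combine_partial(node_b)]
print(reduce_avg(partials))  # 5.5, the average of [2, 4, 6, 10]
```

Note that averaging the two local averages (4.0 and 10.0) would give 7.0, the wrong answer; carrying the counts along is what makes the combiner safe here.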
Partition Function
The partition function controls how the intermediate key-value pairs produced by the map tasks are distributed across the reducers.
Let $R$ be the number of reducers; the default partition function can be $\text{hash}(k') \bmod R$.
If the hash function is well made, the pairs will be evenly distributed, and pairs with the same key go to the same reducer node.
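A sketch of this default partitioner, using Python’s built-in hash as a stand-in for the framework’s hash function:

```python
R = 4  # number of reducers (illustrative)

def partition(key, num_reducers=R):
    # Map an intermediate key to a reducer index in [0, num_reducers).
    return hash(key) % num_reducers

# The same key always lands on the same reducer within one run:
assert partition("apple") == partition("apple")
print(0 <= partition("apple") < R)  # True
```

Custom partitioners are useful when the default hashing would skew the load, e.g. when a few keys dominate the data.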
Limitations of the MapReduce paradigm
The two major limitations are that:
- It’s difficult to program directly;
- The I/O communication bottlenecks can cause performance issues.