

MapReduce is a framework with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. It is both a processing technique and a programming model for distributed computing; Hadoop's implementation is based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job. The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But once we write an application in the MapReduce form, scaling it to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.

Generally, the MapReduce paradigm is based on sending the computation to where the data resides. A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage. Map stage − The mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data. Reduce stage − This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper.
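
To make the flow of key/value pairs concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps, using word counting as the example. It is an illustration only and does not use the Hadoop API: the function names (`map_stage`, `shuffle_stage`, `reduce_stage`, `run_job`) and the word-count task are assumptions chosen for this example, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_stage(line):
    """Map: turn one input line into (word, 1) key/value tuples."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_stage(pairs):
    """Shuffle: group every value emitted by the mappers under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_stage(key, values):
    """Reduce: combine all values for one key into a single tuple."""
    return key, sum(values)


def run_job(lines):
    """Run the three stages in order on a list of input lines."""
    mapped = []
    for line in lines:  # the input is passed to the mapper line by line
        mapped.extend(map_stage(line))
    grouped = shuffle_stage(mapped)
    return dict(reduce_stage(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    counts = run_job(["the quick fox", "the lazy dog", "The fox"])
    for word, count in sorted(counts.items()):
        print(word, count)
```

In a real cluster the framework would run many mapper and reducer instances on different nodes and move the grouped data between them; here the same steps are simply chained as ordinary function calls.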
