MapReduce Word Count Example
MapReduce is a programming model for processing large datasets in parallel across clusters of machines. It is widely used in distributed frameworks such as Apache Hadoop. The model consists of two main steps:
- Map: Processes input data and converts it into a set of key-value pairs.
- Reduce: Aggregates the values for each key into a smaller set of results.
Below is a walkthrough of the canonical “word count” example: counting the occurrences of each word in a text file.
Example of Word Count in MapReduce
1. Map Step
The map step processes the input data and generates key-value pairs. Each word in the text file is mapped to a key-value pair where the key is the word itself and the value is 1 (representing one occurrence of that word).
Input (example text):
Hello world
Hello MapReduce
MapReduce is awesome
Mapper function (Python; yield stands in for the framework's emit call):
def mapper(text):
    # Split the line into whitespace-separated words
    for word in text.split():
        # Emit the key-value pair (word, 1)
        yield (word, 1)
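Running this mapper over every line of the example input reproduces the pairs listed below. This is a minimal local sketch; in a real framework the mapper is invoked once per input record, potentially on many machines:
lines = ["Hello world", "Hello MapReduce", "MapReduce is awesome"]
# Collect every (word, 1) pair emitted by the mapper
pairs = [pair for line in lines for pair in mapper(line)]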
Output of the map step:
("Hello", 1)
("world", 1)
("Hello", 1)
("MapReduce", 1)
("MapReduce", 1)
("is", 1)
("awesome", 1)
2. Shuffle and Sort
After the map step, all key-value pairs are shuffled and sorted by key. This groups all the values that share the same key (word) together; a local sketch of this step follows the listing below.
Grouped pairs after shuffle and sort:
("Hello", [1, 1])
("world", [1])
("MapReduce", [1, 1])
("is", [1])
("awesome", [1])
3. Reduce Step
The reduce step processes each key together with its list of associated values. It aggregates the values (here, by summing the per-occurrence counts) and outputs the final result for each word.
Reducer function (Python):
def reducer(word, values):
    # Sum up the counts of the word
    total_count = sum(values)
    yield (word, total_count)
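Continuing the local sketch, applying the reducer to each group produced by the shuffle function above yields the counts listed next:
# pairs is the list of (word, 1) tuples from the map step
results = [pair for word, values in shuffle(pairs) for pair in reducer(word, values)]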
Output of the reduce step:
("Hello", 2)
("world", 1)
("MapReduce", 2)
("is", 1)
("awesome", 1)
Final Output:
The final output of the MapReduce process is the set of word counts shown above, one pair per distinct word in the input text.
Hadoop Example (Java)
To implement the WordCount example in Hadoop MapReduce, you would typically write the following components:
1. Mapper Class (Java)
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into whitespace-separated tokens
        String[] words = value.toString().split("\\s+");
        // Emit each non-empty word with a count of 1
        for (String str : words) {
            if (str.isEmpty()) {
                continue; // leading whitespace produces an empty first token
            }
            word.set(str);
            context.write(word, one);
        }
    }
}
2. Reducer Class (Java)
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Sum up all counts for the word
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
3. Driver Class (Java)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // Set the classes for the map and reduce operations
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Set the output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Wait for job completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Steps to Run the Job:
- Prepare the Input Data: Place the text data file (e.g., input.txt) into the Hadoop Distributed File System (HDFS), for example with hdfs dfs -put input.txt /input.
- Compile the Code: Compile the Java classes and package them into a JAR file.
- Submit the Job: Run the job using the hadoop jar command:
hadoop jar wordcount.jar WordCountDriver /input /output
Here, /input is the path to the input data in HDFS, and /output is the directory where the results (word counts) will be written, typically as files named part-r-00000, part-r-00001, and so on.
Conclusion:
This is a basic example of the MapReduce model applied to the word count problem. The Mapper splits the input text into words and emits a (word, 1) pair for each occurrence; the Reducer then sums the counts for each word and outputs the final totals. The same program scales to very large datasets when run on a distributed framework like Hadoop, because many mappers and reducers execute in parallel across the cluster.