Saturday, January 18, 2025

What is the Example of Word Count in MapReduce?

MapReduce Word Count Example

MapReduce is a programming model used for processing large datasets in a distributed manner across clusters of computers. It’s widely used in distributed systems like Apache Hadoop. The MapReduce model consists of two main steps:

  1. Map: Processes input data and converts it into a set of key-value pairs.
  2. Reduce: Aggregates the key-value pairs into a smaller set of results.

Below is the classic “word count” example, a MapReduce job that counts the occurrences of each word in a large text file.

Example of Word Count in MapReduce

1. Map Step

The map step processes the input data and generates key-value pairs. Each word in the text file will be mapped to a key-value pair where the key is the word itself, and the value is 1 (representing one occurrence of that word).

Input (example text):

Hello world
Hello MapReduce
MapReduce is awesome

Mapper function (in pseudocode):

def mapper(text):
    # Split the text into words
    words = text.split()
    for word in words:
        # Output the key-value pair (word, 1)
        emit(word, 1)

Output of the map step:

("Hello", 1)
("world", 1)
("Hello", 1)
("MapReduce", 1)
("MapReduce", 1)
("is", 1)
("awesome", 1)
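The mapper pseudocode above can be turned into a small runnable Python sketch. This is an in-memory simulation only; a real MapReduce framework would call the mapper once per input record and collect the emitted pairs itself:

```python
def mapper(text):
    """Emit a (word, 1) pair for each whitespace-separated word."""
    return [(word, 1) for word in text.split()]

# Apply the mapper to each input line, as the framework would
lines = ["Hello world", "Hello MapReduce", "MapReduce is awesome"]
pairs = [pair for line in lines for pair in mapper(line)]
print(pairs[:2])  # [('Hello', 1), ('world', 1)]
```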

2. Shuffle and Sort

After the map step, all key-value pairs are shuffled and sorted by key. This groups all the values with the same key (word) together.


Shuffling and Sorting:

("Hello", [1, 1])
("MapReduce", [1, 1])
("awesome", [1])
("is", [1])
("world", [1])
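In a single-process simulation, the shuffle-and-sort phase amounts to grouping values by key and ordering the keys. A minimal sketch using a dictionary (note that plain string sorting, like Hadoop's byte-order comparison of Text keys, places capitalized words before lowercase ones):

```python
from collections import defaultdict

def shuffle_and_sort(pairs):
    """Group the mapper's (word, 1) pairs by word and sort the keys."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return dict(sorted(groups.items()))

pairs = [("Hello", 1), ("world", 1), ("Hello", 1),
         ("MapReduce", 1), ("MapReduce", 1), ("is", 1), ("awesome", 1)]
grouped = shuffle_and_sort(pairs)
print(grouped["Hello"])  # [1, 1]
```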

3. Reduce Step

The reduce step processes each key with its associated list of values (frequencies). It aggregates the values (counts the number of occurrences of each word) and outputs the final result.

Reducer function (in pseudocode):

def reducer(word, values):
    # Sum up the counts of the word
    total_count = sum(values)
    emit(word, total_count)

Output of the reduce step:

("Hello", 2)
("MapReduce", 2)
("awesome", 1)
("is", 1)
("world", 1)
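The reducer pseudocode maps directly onto Python. A sketch that applies it to the grouped pairs from the shuffle step:

```python
def reducer(word, values):
    """Sum the occurrence counts for a single word."""
    return (word, sum(values))

grouped = {"Hello": [1, 1], "MapReduce": [1, 1],
           "awesome": [1], "is": [1], "world": [1]}
counts = dict(reducer(word, values) for word, values in grouped.items())
print(counts["Hello"])  # 2
```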

Final Output:

The final output of the MapReduce job is exactly the set of key-value pairs produced by the reduce step: one (word, count) pair for each distinct word in the input text.
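Putting the three phases together, the whole job can be simulated in a few lines of plain Python (a single-process sketch only; Hadoop runs the same logic distributed across many machines). Here `collections.Counter` collapses the shuffle and reduce phases into one grouping-and-summing step:

```python
from collections import Counter

def word_count(lines):
    """Map, shuffle, and reduce collapsed into one in-memory pass."""
    counts = Counter()
    for line in lines:               # map: one occurrence per word...
        counts.update(line.split())  # ...Counter groups and sums them
    return dict(counts)

lines = ["Hello world", "Hello MapReduce", "MapReduce is awesome"]
print(word_count(lines)["MapReduce"])  # 2
```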

Hadoop Example (Java)

To implement the WordCount example in Hadoop MapReduce, you would typically write the following components:

1. Mapper Class (Java)

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line on whitespace
        String[] words = value.toString().split("\\s+");

        // Emit each non-empty token with a count of 1
        for (String str : words) {
            if (str.isEmpty()) {
                continue; // split() yields an empty token for leading whitespace
            }
            word.set(str);
            context.write(word, one);
        }
    }
}

2. Reducer Class (Java)

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        
        // Sum up all counts for the word
        for (IntWritable val : values) {
            sum += val.get();
        }
        
        result.set(sum);
        context.write(key, result);
    }
}

3. Driver Class (Java)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        
        // Set the classes for the map and reduce operations
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Use the reducer as a combiner to pre-aggregate counts on each node
        // (valid here because summing counts is associative and commutative)
        job.setCombinerClass(WordCountReducer.class);

        // Set the output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Wait for job completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Steps to Run the Job:

  1. Prepare the Input Data: Place the text data file (e.g., input.txt) into the Hadoop Distributed File System (HDFS).
  2. Compile the Code: Compile the Java classes and package them into a JAR file.
  3. Submit the Job: Run the job using the hadoop jar command:

hadoop jar wordcount.jar WordCountDriver /input /output

Here, /input is the path to the input file, and /output is the path where the output (word counts) will be stored.


Conclusion:

This is a basic example of the MapReduce model applied to a word count problem. The Mapper splits the input text into words and emits key-value pairs (word, 1). The Reducer then aggregates the counts for each word and outputs the final word counts. This example can scale to very large datasets when running on a distributed system like Hadoop.
