Testing your Hadoop program with Maven on IntelliJ

Frazy Nondo · Published in Analytics Vidhya · Jan 8, 2020 · 5 min read

In this tutorial we will look at a way to write and test a Hadoop program with Maven on IntelliJ, without configuring a Hadoop environment on your own machine or using any cluster. This is not a word-count MapReduce code tutorial; a basic understanding of MapReduce functionality is assumed.

REQUIREMENTS

  • Java SDK (JDK)
  • IntelliJ IDEA
  • Linux or macOS

CREATING A NEW PROJECT

Click Create New Project, choose Maven, then click Next.

Set your project name, project location, groupId, and artifactId. Leave the version untouched and click Finish.

Now we are ready to configure our project dependencies.

Configuring Dependencies

Open the pom.xml file; it is often the screen IntelliJ shows after you click Finish. Click Enable Auto-Import, or choose Import Changes if you prefer to be prompted every time you edit your pom.xml file.

In your pom.xml file, paste the following blocks just before the closing tag </project>:

<repositories>
    <repository>
        <id>apache</id>
        <url>http://maven.apache.org</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.2.0</version>
    </dependency>
</dependencies>

The final pom.xml file should look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.Word</groupId>
    <artifactId>WordCount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <repositories>
        <repository>
            <id>apache</id>
            <url>http://maven.apache.org</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.2.0</version>
        </dependency>
    </dependencies>

</project>

Now we are ready to create classes for our sample test project WordCount.

Creating a WordCount class

Go to src → main → java and create a new class.

Name the class wordCount and press Enter.

Paste the following Java code into your wordCount class:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class wordCount {

    // Mapper: splits each input line into tokens and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(wordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input folder
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output folder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The wordCount class includes the main method, the mapper class, and the reducer class. It scans all text files in the folder named by the first argument and outputs the frequency of every word to a folder named by the second argument.
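The mapper splits each line with java.util.StringTokenizer, which breaks on whitespace. A quick standalone sketch of just that tokenization step (the sample sentence is made up):

```java
import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        // Same whitespace tokenization the mapper applies to each input line;
        // each token printed here would be emitted as (word, 1) by the mapper.
        StringTokenizer itr = new StringTokenizer("Hello Hadoop Hello");
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken());
        }
    }
}
```

Running this prints Hello, Hadoop, Hello on separate lines; the reducer would then sum the two Hello pairs into a count of 2.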

We are almost ready to run the program….

First we must create our text input file. In your project, create a new folder and name it input. Within the input folder, create a .txt file, or drag one in if you already have one.

Copy and paste some text into this file.
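If you prefer to create the sample file programmatically instead of through the IDE, a minimal sketch (the file name input/sample.txt and its contents are just examples):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class MakeInput {
    public static void main(String[] args) throws IOException {
        // Create the input folder the job will read from, then write
        // a small sample file into it.
        new File("input").mkdirs();
        try (FileWriter w = new FileWriter("input/sample.txt")) {
            w.write("Hello Hadoop Hello World\n");
        }
        System.out.println("wrote input/sample.txt");
    }
}
```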

Almost ready, be patient…

We have not yet set our program arguments. Select Run → Edit Configurations.

Add a new Application configuration by selecting "+", then Application.

Set the Main class to wordCount and the Program arguments to input output. This lets the program read from the input folder and save its results to the output folder. Do not create the output folder yourself: Hadoop creates it automatically, and raises an exception if it already exists. When done, select Apply, then OK.

Now we are ready to run our program….

Select Run → Run 'wordCount' to run the Hadoop program. If you re-run the program, delete the output folder first.
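Because Hadoop refuses to overwrite an existing output folder, a small plain-Java helper (a sketch, not part of the original tutorial) can clear it between runs:

```java
import java.io.File;

public class CleanOutput {
    // Recursively deletes a directory tree. Hadoop raises an exception if
    // the job's output folder already exists, so remove it before re-running.
    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }

    public static void main(String[] args) {
        File out = new File(args.length > 0 ? args[0] : "output");
        deleteRecursively(out);
        System.out.println(out.getPath() + " exists: " + out.exists());
    }
}
```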

An output folder will appear. On each run, the results are saved in output/part-r-00000.
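With Hadoop's default TextOutputFormat, each line of part-r-00000 is a word and its count separated by a tab. A minimal sketch of parsing one such line (the sample line is made up, not real job output):

```java
public class ParseOutputLine {
    public static void main(String[] args) {
        // part-r-00000 lines look like "word<TAB>count" under TextOutputFormat.
        String line = "Hello\t2";          // hypothetical sample line
        String[] parts = line.split("\t");
        System.out.println(parts[0] + " appeared " + Integer.parseInt(parts[1]) + " times");
    }
}
```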

Possible issues on Mac

If you have the latest version of Java running on your Mac, you may encounter a compilation error.

Solution

My system uses javac version 9 to compile the program, so I set the following settings to match my javac compiler version:

  • File → Project Structure → Project → Project SDK → 9
  • File → Project Structure → Project → Project language level → 9
  • File → Project Structure → Project → Modules → → Sources → 9
  • In the project, press Ctrl+Alt+S → Build, Execution, Deployment → Compiler → Java Compiler → Project bytecode version → 9
  • IntelliJ IDEA → Build, Execution, Deployment → Compiler → Java Compiler → Module → 1.9
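As an alternative to clicking through the IDE settings, you can pin the compiler level in pom.xml itself using the standard maven.compiler properties, which both Maven and IntelliJ respect on re-import (a sketch; match the version to your installed JDK):

```xml
<properties>
    <!-- Pin source and target bytecode level so the IDE and CLI builds agree -->
    <maven.compiler.source>9</maven.compiler.source>
    <maven.compiler.target>9</maven.compiler.target>
</properties>
```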
