Hello friends,

Today we are going to implement the very famous WordCount code in Spark in spark-shell.

For folks who are not familiar with WordCount, in this implementation, we count the occurrences of each word and as a result present a pair of word and their respective count.
For example, if my input is as follows

————————–
Hi this is Milind
Hi Big Data
————————–

Then WordCount output will look something like this
Hi => 2
this => 1
is => 1
Milind => 1
Big => 1
Data => 1

You can clearly see that the left-hand side of the arrow indicates each word in the input whereas the right-hand side indicates the respective count of each word.

Now that we know what WordCount is, we will proceed with the implementation in Spark Shell.
In this, we are going to follow below steps.

Steps
Steps

Now we will look into these steps one by one.

CREATE INPUT FILE

We are going to need an input file for implementing the WordCount logic.
We will use nano command to create a file.
Following screenshots will guide you regarding creating a text file with the help of nano command.

nano command syntax to create input.txt

input.txt creation
input.txt creation

writing contents in input.txt file

input.txt contents
input.txt contents

verifying contents written successfully in input.txt

input.txt contents confirmation
input.txt contents confirmation

OPEN SPARK SHELL

opening spark shell to execute spark WordCount commands one by one

starting spark shell
starting spark shell

EXECUTE WORDCOUNT COMMANDS LINE BY LINE

input data from input.txt
input data from input.txt

As you can see in above screenshot, we are taking input.txt as an input file.
Once input.txt file is loaded in variable called input_file, then we are printing the contents of that input_file variable to verify that the file got loaded successfully.
Now, the next step as discussed is to split the input data in words and then count each word those many numbers of time.

wordcount logic
wordcount logic

As you can see from above picture, WordCount logic is divided into three parts.
We will see those parts like follows.

  • Step 1 : Split each line with SPACE (” “) as a delimiter.
  • Step 2 : Map each word with a default count of 1.
  • Step 3 : Apply reduce operation which will by default make each word as key and sum the values for all keys.

At last we are printing the value of output variable called step3.
Printing the output variable, we are now certain that WordCount executed successfully.

Now it is time to store this output variable in a file.
This is done with the help of saveAsTextFile() command shown in the screenshot.

VIEW AND VERIFY OUTPUT

final wordcount output
final wordcount output

Once you execute store command, an output directory will be created in the same directory.
You can run ls command to check the output directory content and you will see the _SUCCESS file and the part file.
You can print the contents of the part file to view the final output.
All these steps are shown in the screenshot above.

The entire file is uploaded to my GitHub profile which looks something like this.

I believe the explaination and screenshots help.
Hope you have a great read.

Kindly let me know your thoughts.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s