Today we are going to implement the well-known WordCount program in Spark, using spark-shell.
For readers who are not familiar with WordCount: the program counts the occurrences of each word in the input and returns each word paired with its count.
For example, suppose the input is as follows:
Hi this is Milind
Hi Big Data
Then the WordCount output will look something like this:
Hi => 2
this => 1
is => 1
Milind => 1
Big => 1
Data => 1
The left-hand side of each arrow is a word from the input, and the right-hand side is the number of times that word occurs.
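Before moving to Spark, the same idea can be sketched with plain Scala collections. This is only an illustration, not the Spark implementation that follows; the names lines and counts are chosen here for the example.

```scala
// The two sample input lines from the example above.
val lines = Seq("Hi this is Milind", "Hi Big Data")

// Split every line on spaces, group identical words together,
// and count how many times each word appears.
val counts: Map[String, Int] =
  lines
    .flatMap(_.split(" "))
    .groupBy(identity)
    .map { case (word, occurrences) => (word, occurrences.size) }

// Print each pair in the "word => count" form used above.
// A Map has no guaranteed order, so the lines may appear in any order.
counts.foreach { case (word, count) => println(s"$word => $count") }
```

Because this uses only the Scala standard library, it can be pasted into spark-shell or any Scala REPL.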
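Now that we know what WordCount is, we will proceed with the implementation in spark-shell.
We are going to follow the steps below.
- Create the input file
- Open spark-shell
- Execute the WordCount commands line by line
- View and verify the output
Now we will look into these steps one by one.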
CREATE INPUT FILE
We are going to need an input file for implementing the WordCount logic.
We will use the nano command to create the file.
The following screenshots show how to create a text file with nano.
Screenshot: nano command syntax to create input.txt
Screenshot: writing the contents of input.txt
Screenshot: verifying that the contents were written successfully to input.txt
OPEN SPARK SHELL
Screenshot: opening spark-shell to execute the WordCount commands one by one
EXECUTE WORDCOUNT COMMANDS LINE BY LINE
As you can see in the screenshot above, we take input.txt as the input file.
Once input.txt is loaded into a variable called input_file, we print the contents of input_file to verify that the file was loaded successfully.
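For readers who cannot see the screenshot, the load-and-print step might look roughly like the sketch below. It assumes the commands run inside spark-shell, where the SparkContext is already available as sc, and that input.txt sits in the directory from which spark-shell was started.

```scala
// Assumption: running inside spark-shell, so `sc` is predefined.
// Load input.txt as an RDD with one element per line of the file.
val input_file = sc.textFile("input.txt")

// collect() brings the lines back to the driver so they can be printed.
// This is fine for a small file, but should be avoided on large data.
input_file.collect().foreach(println)
```

Note that sc.textFile is lazy: the file is only read when an action such as collect() runs, so a wrong path shows up at the print step rather than at the load step.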
The next step is to split the input data into words and then count how many times each word occurs.
As you can see in the screenshot above, the WordCount logic is divided into three parts.
Those parts are as follows.
- Step 1 : Split each line with SPACE (" ") as a delimiter.
- Step 2 : Map each word to a default count of 1.
- Step 3 : Apply a reduce operation that treats each word as a key and sums the values for each key.
Finally, we print the contents of the output variable called step3.
The printed result confirms that WordCount executed successfully.
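The three steps can be sketched in spark-shell as shown below. To keep the sketch runnable without the file, input_file is built here from the two sample lines with sc.parallelize instead of sc.textFile("input.txt"). The names step1 and step2 are assumed by analogy with step3, and reduceByKey is assumed to be the reduce operation described in Step 3.

```scala
// Assumption: running inside spark-shell, so `sc` is predefined.
// Stand-in for sc.textFile("input.txt"): the two sample lines from the introduction.
val input_file = sc.parallelize(Seq("Hi this is Milind", "Hi Big Data"))

// Step 1: split each line on a space; flatMap flattens the results into one RDD of words.
val step1 = input_file.flatMap(line => line.split(" "))

// Step 2: pair each word with a default count of 1.
val step2 = step1.map(word => (word, 1))

// Step 3: treat each word as the key and sum the counts for each key.
val step3 = step2.reduceByKey(_ + _)

// Print the (word, count) pairs; the order of the pairs is not guaranteed.
step3.collect().foreach(println)
```

Using flatMap rather than map in Step 1 matters: map would give one array of words per line, whereas flatMap gives a single collection of words that Step 2 can pair with counts.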
Now it is time to store this output in a file.
This is done with the saveAsTextFile() method shown in the screenshot.
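A minimal sketch of the save step is shown below. The directory name "output" is hypothetical, and a tiny stand-in RDD is used for step3 so that the sketch runs on its own in spark-shell. It also assumes Spark is writing to the local file system.

```scala
// Assumption: running inside spark-shell, so `sc` is predefined.
// Tiny stand-in for the step3 RDD of (word, count) pairs.
val step3 = sc.parallelize(Seq(("Hi", 2), ("this", 1), ("Data", 1)))

// "output" is a hypothetical name. Spark creates a directory with this name,
// not a single file, and fails if the directory already exists.
step3.saveAsTextFile("output")
```

Each element is written using its string form, so a pair is saved as a line such as (Hi,2).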
VIEW AND VERIFY OUTPUT
Once you execute the save command, an output directory will be created in the same directory.
You can run the ls command to check the contents of the output directory, where you will see the _SUCCESS file and the part file.
You can print the contents of the part file to view the final output.
All these steps are shown in the screenshot above.
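As an alternative to ls and printing the part file, the saved result can also be read back from within spark-shell. This is a sketch only; it assumes the output directory was named "output", which is a hypothetical name, and that it already exists from the save step.

```scala
// Assumption: running inside spark-shell, so `sc` is predefined,
// and "output" (hypothetical name) is the directory written by saveAsTextFile().
// Pointing textFile at a directory reads every part file in it;
// marker files such as _SUCCESS are skipped.
val saved = sc.textFile("output")

// Each saved line is the text form of a (word, count) pair.
saved.collect().foreach(println)
```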
The entire file is uploaded to my GitHub profile which looks something like this.
I believe the explanation and screenshots help.
I hope you had a great read.
Kindly let me know your thoughts.