Hey everyone, it’s been a while. I’ve been busy with final exams and presentations for the last couple of weeks. Now that the semester is over, you can expect more frequent and more detailed updates on this blog. So, let us get started.

In the last tutorial, we saw the GROUP operation in Apache Pig. This tutorial focuses on removing records that contain NULL values, a common data preprocessing step in text mining.

Let us take a look at the steps we are going to follow to achieve it.

operation flowchart

The flowchart above shows the step-by-step process of achieving this objective in the HDPCD certification. We will perform each task in the following sequence.

  • INPUT FILE CREATION

The input file is created in the local file system with the help of the vi editor. For demonstration purposes, I have deliberately put NULL values in the file. I have uploaded this file to my GitHub profile in the HDPCD repository, and you can download it from here.

Once you download the file, the following commands can be used to create it in your local file system.

vi post16.csv

#####

PASTE THE CONTENTS HERE

#####

(esc):wq(enter)

cat post16.csv

The following screenshot might be helpful for you.

input file creation

This input file looks something like this.
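If you would rather not paste into vi, or just want a stand-in while you fetch the real file from GitHub, you can generate a file of the same shape with printf. Note that the values below are hypothetical placeholders; only the line counts (6 lines in total, 2 of them empty) match the tutorial's file.

```shell
# Create a hypothetical stand-in for post16.csv:
# 6 lines in total -- 4 data lines and 2 empty lines (the values are made up).
printf '%s\n' 'a,1' 'b,2' '' 'c,3' '' 'd,4' > post16.csv

# Quick sanity check: total lines vs. non-empty lines.
wc -l < post16.csv     # total lines: 6
grep -c . post16.csv   # non-empty lines: 4
```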

  • PUSHING FILE TO HDFS

Now, the file is in the local file system. Let us push it to HDFS.

We are going to use the following set of commands to do this.

hadoop fs -mkdir /hdpcd/input/post16

hadoop fs -put post16.csv /hdpcd/input/post16

hadoop fs -cat /hdpcd/input/post16/post16.csv

The following screenshot will give you an idea about the execution of the above commands.

input file to HDFS

Now that the file is in HDFS, it is time to create the Pig script. I have uploaded this script to my GitHub profile under the HDPCD repository, and you can download it from here.

Let us look at each command in the script.

input_data = LOAD '/hdpcd/input/post16/post16.csv' USING PigStorage() AS (line:chararray);

The command above loads the data in the post16.csv file into the Pig relation input_data. Each record consists of a single field named line, of type chararray; in other words, every line of the file becomes one chararray value.

filtered_data = FILTER input_data BY line IS NOT NULL;

The command above removes the records for which line is NULL. Because each empty line in the input loads as a NULL value, this effectively drops the empty lines. If you remember, our input file contains a total of 6 lines, of which 2 are empty and 4 are non-empty. Therefore, once we execute this command, the 4 non-empty lines remain and the 2 empty lines are removed.
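You can preview the effect of this filter without running Pig at all: removing the empty lines from a local copy of the file leaves the same 4 surviving records. This is just a local sketch, using a made-up 6-line sample in place of the real post16.csv.

```shell
# Hypothetical 6-line sample standing in for post16.csv (2 empty lines).
printf '%s\n' 'a,1' 'b,2' '' 'c,3' '' 'd,4' > post16.csv

# Local analogue of `FILTER input_data BY line IS NOT NULL`:
# drop lines matching the empty-line pattern '^$'.
grep -v '^$' post16.csv > filtered.csv

wc -l < filtered.csv   # 4 lines remain
```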

STORE filtered_data INTO '/hdpcd/output/post16';

This command stores the filtered relation filtered_data into the HDFS output directory /hdpcd/output/post16.
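Putting the three statements together, the whole post16.pig script is just these three lines:

```pig
-- load each line of the input file as a single chararray field
input_data = LOAD '/hdpcd/input/post16/post16.csv' USING PigStorage() AS (line:chararray);

-- drop the records that loaded as NULL (the empty lines)
filtered_data = FILTER input_data BY line IS NOT NULL;

-- write the surviving records to the HDFS output directory
STORE filtered_data INTO '/hdpcd/output/post16';
```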

I hope the above explanation makes sense.

Once you download the post16.pig file, you can use the following commands to create it with the vi editor.

vi post16.pig

#####

PASTE THE CONTENTS HERE

#####

(esc):wq(enter)

cat post16.pig

The following screenshot might come handy.

pig script for NULL removal

So, we are ready to run this pig script.

  • RUNNING PIG SCRIPT

Now, it is time to run the Pig script we just created. We will use the following command.

pig -x tez post16.pig

The below screenshot shows us the output of the above command.

running pig script

Once the script finishes, we get some output, and it looks like this.

pig script output

From the above screenshot, we can clearly see that our Pig script ran successfully and that a total of 4 records (4 lines) were stored in the output file under the /hdpcd/output/post16 HDFS directory.

After this, let us look at the output.

  • OBSERVE THE OUTPUT

We will list the HDFS directory /hdpcd/output/post16 and print the contents of the output file to see the results.

For doing this, we will run the following commands.

hadoop fs -ls /hdpcd/output/post16

hadoop fs -cat /hdpcd/output/post16/part-v000-o000-r-00000

Let us observe the output of the above two commands.

HDFS output file

The above screenshot confirms that the output is as per our expectations and we can conclude this tutorial here.

I hope you liked the content. In the next tutorial, we are going to see how to store the output data in HDFS, as we do in every post.

Please follow my blog for further updates. You can click here to subscribe to my YouTube channel, like my Facebook page here, follow me on Twitter here, and check out my LinkedIn profile here.

Have fun people. Cheers!


