Post 20 | HDPCD | Removing Duplicate tuples from a PIG Relation

Hi everyone, welcome to one more tutorial in this HDPCD certification series. As you might have noticed, I have changed the blog layout a little bit; I hope you like it. Kindly let me know your feedback on this in the COMMENT SECTION.

In the last tutorial, we saw how to perform the SORT OPERATION in Apache PIG. In this tutorial, we are going to remove the duplicate tuples from a Pig Relation.

Let us start with the tutorial then.

We are going to follow the below steps.

Tutorial: Big Picture Walkthrough

As you can see, we are following the same pattern as most of the tutorials in this series.

Let us get started with the first step then.

  • CREATING INPUT CSV FILE IN LOCAL FILE SYSTEM

We are going to use the vi editor to create this input CSV file.

I have uploaded this input CSV file to my GitHub profile under HDPCD repository with the name “24_input_for_removing_duplicates.csv” and you can download it by clicking here. This file looks as follows.
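Just to set expectations, the sample below is made up and is NOT the actual GitHub file; it simply shows what a 7-row input with 2 duplicate tuples could look like. Any comma-separated file with a few repeated rows will work just as well.

1,john,pune
2,maria,mumbai
3,david,delhi
2,maria,mumbai
4,sana,chennai
1,john,pune
5,peter,kolkata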

Please follow the below commands to create this input CSV file in the local file system.

vi post20.csv

######

PASTE THE ABOVE CONTENTS HERE

######

cat post20.csv

The following screenshot shows the execution of the above commands.

step 1: creating input file in local file system

The above screenshot indicates that the input CSV file was created successfully in the local file system.

Now it is time to push this input file to HDFS.

  • PUSHING INPUT CSV FILE FROM LOCAL FILE SYSTEM TO HDFS

We are going to use the following commands to load this post20.csv from the local file system to HDFS.

hadoop fs -mkdir /hdpcd/input/post20
hadoop fs -put post20.csv /hdpcd/input/post20
hadoop fs -cat /hdpcd/input/post20/post20.csv
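A quick note in case the run fails at the -mkdir step: if the parent directory /hdpcd/input does not already exist, you can use hadoop fs -mkdir -p /hdpcd/input/post20 instead, which creates the missing parent directories in one go.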

The below screenshot shows the output of the above commands.

step 2: pushing input file from local file system to HDFS

The above screenshot indicates that the input CSV file was successfully pushed to HDFS.

The next thing to do is to create the pig script.

  • CREATE PIG SCRIPT TO REMOVE DUPLICATE TUPLES FROM PIG RELATION

Once the input CSV file is ready in HDFS, it is time to create the pig script responsible for removing the duplicate tuples from the pig relation.

I have uploaded this pig script to my GitHub profile under HDPCD repository with the name 25_removing_duplicates.pig and you can download it by clicking here. This file looks as follows.
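For a consolidated view, the complete script consists of the following three statements, each of which is explained below.

input_data = LOAD '/hdpcd/input/post20/post20.csv' USING PigStorage(',');
unique_data = DISTINCT input_data;
STORE unique_data INTO '/hdpcd/output/post20';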

Let me explain this script briefly.

input_data = LOAD '/hdpcd/input/post20/post20.csv' USING PigStorage(',');

The LOAD command loads the data stored in the post20.csv file into the input_data pig relation. We are not passing any custom schema while creating this relation.

unique_data = DISTINCT input_data;

The DISTINCT command removes the duplicate tuples from a pig relation. The filtered data is then stored in a new pig relation named unique_data.
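For instance, if input_data contained the tuples (1,apple), (2,banana), and (1,apple), then unique_data would contain only (1,apple) and (2,banana). Keep in mind that DISTINCT compares complete tuples, so two rows are treated as duplicates only when every field matches.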

STORE unique_data INTO '/hdpcd/output/post20';

Finally, the data in the unique_data relation is written to the /hdpcd/output/post20 HDFS directory with the help of the STORE command.

Hope this explanation helps.

Please follow the below commands to create this pig script.

vi post20.pig

#####

PASTE THE COPIED CONTENTS HERE

#####

cat post20.pig

The below screenshot comes in handy here.

step 3: creating PIG script to remove duplicate tuples

The above screenshot shows that the pig script was created successfully.

It is time now to run this pig script.

  • RUN PIG SCRIPT

Please use the below command to run this pig script.

pig -x tez post20.pig
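As a side note, in case Tez is not set up in your environment, the same script should also run on the default MapReduce execution engine with pig post20.pig, or locally with pig -x local post20.pig. These alternatives are not part of the original steps, and the paths would need adjusting for local mode, since it reads from the local file system rather than HDFS.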

The following screenshot shows the process of running this pig script.

step 4: running pig script for removing duplicates

And the output of this script execution looks like this.

step 4: pig script execution output

As you can see from the above screenshot, the pig script executed successfully.

A total of 7 records were sent as input to the script and the output contained only 5 records, as expected, since the duplicate records in the input CSV file were removed.

This confirms that we were able to perform this objective successfully.

Let us take a look at the output HDFS directory.

  • OBSERVE THE OUTPUT FOR REMOVAL OF DUPLICATE TUPLES

The following commands are used to check the output HDFS directory.

hadoop fs -ls /hdpcd/output/post20
hadoop fs -cat /hdpcd/output/post20/part-v001-o000-r-00000
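One thing to watch out for when reading the output: since the STORE statement does not specify a storage function, Pig writes the output with its default tab delimiter rather than commas. If you want to keep the comma-separated format, a small variation of the script (not part of the original) would be STORE unique_data INTO '/hdpcd/output/post20' USING PigStorage(',');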

The below screenshot indicates the execution of the above commands.

step 5: HDFS output directory contents

As you can see from the above screenshot, the duplicate tuples were removed from the input file and only the unique records appear in the output HDFS file.

We can conclude this tutorial right here. Hope you guys like the content and explanation. In the next tutorial, we are going to see how to specify the number of reduce tasks for a pig MapReduce job.

Stay tuned for the updates.

Please follow my blog for receiving regular updates. You can subscribe to my YouTube channel for the video tutorials by clicking here. You can like my Facebook page here. You can check out my LinkedIn profile here and follow me on twitter here.

 
