Post 19 | HDPCD | Sort the output of a Pig Relation

Hi everyone, thanks for coming back again to continue with this tutorial series. We are almost there with this section, and once we are done with this, we will jump into Hive, which will not take much time.

In the last tutorial, we saw the process of storing data from PIG to HIVE using HCatalog. That was a lengthy tutorial, and based on the feedback I received, this one is going to be shorter. In this tutorial, we are going to see how to perform the SORT operation on the data stored in a PIG relation.

Let us see the global picture of this tutorial.

Tutorial: Global Picture

Let us get started with each step.

  • CREATING INPUT FILE IN LOCAL FILE SYSTEM

We are going to use the vi editor, as we have done in the past, to create the input CSV file.

I have uploaded this input file to my GitHub profile under the HDPCD repository with the name 22_input_for_sort.csv, and you can download it by clicking here. This input file looks as follows.

We have used this input file in the past, so I assume that you are familiar with the structure. For the sake of this post, we are going to perform the SORT operation in DESCENDING order on the maximum temperature column, which is the 6th column.
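For readers new to this series, the file follows a comma-separated layout of station name, year, month, day of month, precipitation, maximum temperature, and minimum temperature. The rows below are purely illustrative placeholders showing that shape; the actual data is in 22_input_for_sort.csv on GitHub.

```csv
STATION_A,2017,1,15,0,45,28
STATION_B,2017,1,16,12,38,25
```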

The following commands can be used for creating this input file in local file system.

vi post19.csv

######

PASTE CONTENTS HERE

######

cat post19.csv

The following screenshot shows the execution of the above commands.

Step 1: creating input file in local file system

The file is ready in local file system. It is time to put it in HDFS.

  • PUSHING FILE FROM LOCAL FILE SYSTEM TO HDFS

Once the file is created in the local file system, it needs to be pushed to HDFS, since Apache PIG is going to read it from the HDFS location.

The following commands are used for achieving this.

hadoop fs -mkdir /hdpcd/input/post19
hadoop fs -put post19.csv /hdpcd/input/post19
hadoop fs -cat /hdpcd/input/post19/post19.csv

The following screenshot shows the output of the above commands.

Step 2: pushing local file into HDFS

The above screenshot shows that the file was successfully pushed to HDFS.

Let us work on the PIG script now.

  • CREATING PIG SCRIPT

The objective of the PIG script is to perform DESCENDING SORTING on the maximum temperature column. For doing this, we are going to use the following PIG script. I have uploaded this script to my GitHub profile under the HDPCD repository with the name 23_sort_in_pig.pig, and you can download it by clicking here. The script looks as follows.

As you can see, the above PIG script contains three commands. Let us go through them one by one.

input_data = LOAD '/hdpcd/input/post19/post19.csv' USING PigStorage(',') AS (station_name:chararray, year:int, month:int, dayofmonth:int, precipitation:int, maxtemp:int, mintemp:int);

The above LOAD command loads the data stored in the post19.csv file into the PIG relation called input_data. We are passing a custom schema along with this LOAD statement, in which the maxtemp column indicates the maximum temperature.
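Before sorting, it can be handy to sanity-check the load from the Grunt shell. The two statements below are a quick interactive check, not part of the script itself.

```pig
-- prints the schema attached by the AS clause of the LOAD statement
DESCRIBE input_data;

-- prints the loaded tuples to the console (use only on small files,
-- since DUMP triggers a full job over the relation)
DUMP input_data;
```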

sorted_data = ORDER input_data BY maxtemp DESC;

The ORDER command is responsible for performing the SORT operation. We pass the column on which the SORT operation is to be performed, i.e. maxtemp in this case. The keyword DESC indicates that the SORT should be performed in descending order; therefore, higher temperatures will appear at the top, whereas lower temperatures will appear at the bottom.
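For reference, ORDER also accepts multiple columns with a direction per column, which is useful for breaking ties. The sketch below is a variation on the script above, not something used in this tutorial.

```pig
-- sort by maximum temperature descending, breaking ties
-- by station name in ascending (alphabetical) order
sorted_data = ORDER input_data BY maxtemp DESC, station_name ASC;
```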

STORE sorted_data INTO '/hdpcd/output/post19';

The STORE command writes the data held in the sorted_data PIG relation to the HDFS directory /hdpcd/output/post19. The default field delimiter is the TAB character.
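If you prefer comma-delimited output instead of the default TAB, PigStorage accepts a delimiter argument on STORE as well. This is a variation, not what this tutorial uses.

```pig
-- write the sorted relation as comma-separated values instead of TSV
STORE sorted_data INTO '/hdpcd/output/post19' USING PigStorage(',');
```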

With this explanation in place, let us have a look at the process of creating this PIG script. The vi editor comes to the rescue again, and we follow the commands below.

vi post19.pig

######

PASTE CONTENTS HERE

######

cat post19.pig

The following screenshot may come in handy while executing these commands.

Step 3: creating pig script to sort data

Once the PIG script is created, it is time to run it.

  • RUNNING PIG SCRIPT

We use the following command to run this PIG script.

pig -x tez post19.pig

The initial execution of this command looks as follows.

Step 4: running pig script to sort the data

And the final output of this command execution looks like this.

Step 4: pig script execution output

As you can see from the above screenshot, the SORT operation was performed successfully. A total of 5 records were read from /hdpcd/input/post19/post19.csv, and the same number of records was written to the output directory /hdpcd/output/post19. This indicates that the operation was successful.
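A closely related pattern worth knowing: combining ORDER with LIMIT to keep only the top N records. The sketch below uses the same schema as our script; the output path post19_top3 is just an illustrative name, not part of this tutorial.

```pig
-- keep only the three days with the highest maximum temperature
sorted_data = ORDER input_data BY maxtemp DESC;
top3 = LIMIT sorted_data 3;
STORE top3 INTO '/hdpcd/output/post19_top3';
```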

Before concluding, let us take a look at the output HDFS directory.

  • OUTPUT

We can check the output HDFS directory with the help of the following commands.

hadoop fs -ls /hdpcd/output/post19
hadoop fs -cat /hdpcd/output/post19/part-v003-o000-r-00000

The execution of these commands looks as follows.

Step 5: output HDFS directory contents

As you can see from the above screenshot, the output data is sorted in DESCENDING order of the maximum temperature. This proves that our objective was achieved successfully, and we can conclude this tutorial here.

I hope the screenshots, explanations, and textual information are helping you understand each tutorial. Please stay tuned for further updates.

In the next tutorial, we are going to see how to remove the duplicate tuples (records) from the PIG relation.

Please follow my blog for receiving regular updates. You can subscribe to my YouTube channel for the video tutorials by clicking here. You can like my Facebook page here. You can check out my LinkedIn profile here and follow me on Twitter here.

Hope to see you soon. Cheers!

 
