Hi everyone, thanks for coming back to continue with this tutorial series. We are almost done with this section, and once we finish it, we will jump into Hive, which will not take much time.
In the last tutorial, we saw the process of storing data from PIG to HIVE using HCatalog. That was a lengthy tutorial, and based on the feedback I received, this one is going to be shorter. In this tutorial, we are going to see how to perform the SORT operation on data stored in a PIG relation.
Let us see the global picture of this tutorial.
Let us get started with each step.
- CREATING INPUT FILE IN LOCAL FILE SYSTEM
We are going to use the vi editor, which we have used in the past, for creating the input CSV file.
We have used this input file in the past, so I assume that you are familiar with its structure. For the sake of this post, we are going to perform the SORT operation in DESCENDING order on the maximum temperature column, which is the 6th column.
The following commands can be used for creating this input file in the local file system.
PASTE CONTENTS HERE
The following screenshot shows the execution of the above commands.
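In case the pasted commands above are not visible, a minimal sketch of creating a comparable file is shown below. The five rows are hypothetical sample values, not the actual tutorial data, but they follow the same seven-column structure (station name, year, month, day of month, precipitation, maximum temperature, minimum temperature).

```shell
# Create post19.csv with hypothetical sample rows (7 comma-separated columns;
# the 6th column is the maximum temperature that we will sort on)
cat > post19.csv << 'EOF'
STATION_A,2016,1,1,10,45,20
STATION_B,2016,1,2,0,60,30
STATION_C,2016,1,3,5,55,25
STATION_D,2016,1,4,15,40,18
STATION_E,2016,1,5,8,50,22
EOF

# Quick sanity check: the file should contain 5 records
wc -l post19.csv
```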
The file is ready in local file system. It is time to put it in HDFS.
- PUSHING FILE FROM LOCAL FILE SYSTEM TO HDFS
Once the file is created in the local file system, it is pushed to HDFS, since Apache PIG is going to read it from an HDFS location.
The following commands are used for achieving this.
hadoop fs -mkdir /hdpcd/input/post19
hadoop fs -put post19.csv /hdpcd/input/post19
hadoop fs -cat /hdpcd/input/post19/post19.csv
The following screenshot shows the output of the above commands.
The above screenshot shows that the file was successfully pushed to HDFS.
Let us work on the PIG script now.
- CREATING PIG SCRIPT
The objective of the PIG script is to perform the DESCENDING SORT on the maximum temperature column. For doing this, we are going to use the following PIG script. I have uploaded this script to my GitHub profile under the HDPCD repository with the name "23_sort_in_pig.pig" and you can download it by clicking here. The script looks as follows.
As you can see, the above PIG script contains 3 commands. Let us go through these commands one by one.
input_data = LOAD '/hdpcd/input/post19/post19.csv' USING PigStorage(',') AS (station_name:chararray, year:int, month:int, dayofmonth:int, precipitation:int, maxtemp:int, mintemp:int);
The above LOAD command loads the data stored in the post19.csv file into the PIG relation called input_data. We pass a custom schema along with this LOAD statement, in which the maxtemp column holds the maximum temperature.
sorted_data = ORDER input_data BY maxtemp DESC;
The ORDER command is responsible for performing the SORT operation. We pass the column name on which the SORT operation is performed, i.e. maxtemp in this case. The keyword DESC indicates that the SORT should be performed in descending order, so higher temperatures appear at the top and lower temperatures at the bottom.
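As a quick local preview of what ORDER ... BY maxtemp DESC produces, the same descending numeric sort on the 6th comma-separated column can be approximated with the Unix sort command. This is only a sketch with hypothetical sample rows; Pig itself performs the sort on the cluster.

```shell
# Hypothetical sample data mirroring the post19.csv structure
printf '%s\n' \
  'STATION_A,2016,1,1,10,45,20' \
  'STATION_B,2016,1,2,0,60,30' \
  'STATION_C,2016,1,3,5,55,25' > sample.csv

# Numeric (-n) reverse (-r) sort on the 6th comma-delimited field,
# mimicking ORDER input_data BY maxtemp DESC
sort -t',' -k6,6nr sample.csv
# The STATION_B row (maxtemp 60) comes first, STATION_A (maxtemp 45) last
```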
STORE sorted_data INTO '/hdpcd/output/post19';
The STORE command writes the data held in the sorted_data PIG relation to the HDFS directory /hdpcd/output/post19. The default field delimiter is the TAB character.
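Because STORE uses the TAB character as its default delimiter, each output record holds the same seven fields separated by tabs instead of commas. A rough local illustration of that transformation, using one hypothetical row (Pig itself writes the actual files in HDFS):

```shell
# Convert a hypothetical comma-separated record to the TAB-separated
# form that STORE writes by default
echo 'STATION_B,2016,1,2,0,60,30' | tr ',' '\t'
```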
With that explanation in place, let us have a look at the process of creating this PIG script. The vi editor comes to the rescue again, and we follow the below commands.
PASTE CONTENTS HERE
The following screenshot may come in handy while executing these commands.
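As an alternative to vi, the same script can also be created non-interactively with a heredoc. This is just a sketch; the three statements are exactly the ones discussed above.

```shell
# Write the three PIG statements discussed above into post19.pig
cat > post19.pig << 'EOF'
input_data = LOAD '/hdpcd/input/post19/post19.csv' USING PigStorage(',') AS (station_name:chararray, year:int, month:int, dayofmonth:int, precipitation:int, maxtemp:int, mintemp:int);
sorted_data = ORDER input_data BY maxtemp DESC;
STORE sorted_data INTO '/hdpcd/output/post19';
EOF
```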
Once the PIG script is created, it is time to run it.
- RUNNING PIG SCRIPT
We use the following command to run this PIG script.
pig -x tez post19.pig
The initial execution of this command looks as follows.
And the final output of this command execution looks like this.
As you can see from the above screenshot, the SORT operation was performed successfully. A total of 5 records were read from /hdpcd/input/post19/post19.csv and the same number of records was written to the output directory /hdpcd/output/post19. This indicates that the operation was successful.
Before concluding, let us take a look at the output HDFS directory.
We can check the output HDFS directory with the help of the following commands.
hadoop fs -ls /hdpcd/output/post19
hadoop fs -cat /hdpcd/output/post19/part-v003-o000-r-00000
The execution of these commands looks as follows.
As you can see from the above screenshot, the output data is sorted in DESCENDING order of the maximum temperature. This proves that our objective was achieved successfully, and we can conclude this tutorial here.
I hope the screenshots, explanations, and the textual information are helping you understand each tutorial. Please stay tuned for further updates.
In the next tutorial, we are going to see how to remove the duplicate tuples (records) from the PIG relation.
Please follow my blog for receiving regular updates. You can subscribe to my YouTube channel for the video tutorials by clicking here. You can like my Facebook page here. You can check out my LinkedIn profile here and follow me on twitter here.
Hope to see you soon. Cheers!