Post 15 | HDPCD | Group Data in one or more PIG Relations

Hello everyone, thanks for coming back for one more tutorial in this HDPCD certification series. In the last tutorial, we saw how to transform the input data to match a Hive schema. This tutorial focuses on the next functionality provided by Apache Pig: the GROUP operation on one or more Pig relations.

The GROUP operation in Apache Pig is quite similar to SQL's GROUP BY. Grouping can be performed on one or more column values. The output Pig relation contains one record per distinct value of the grouping column(s), pairing that value with a bag of all the input records that share it. To explain this with an example, we are going to GROUP the weather data by station name. Let us start, and you will soon see what I am talking about.
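To make those semantics concrete, here is a minimal Python sketch of what GROUP does. The station names mirror the tutorial's data, but the readings are made up purely for illustration; this simulates the behavior, it is not how Pig implements it.

```python
from collections import defaultdict

# Hypothetical rows: (station, year, month, day, precipitation, max_temp, min_temp)
rows = [
    ("LAX", "2017", "01", "01", "0.00", "68", "52"),
    ("DEN", "2017", "01", "01", "0.20", "35", "18"),
    ("SFO", "2017", "01", "01", "0.10", "58", "48"),
    ("LAX", "2017", "01", "02", "0.00", "70", "54"),
    ("DEN", "2017", "01", "02", "0.40", "30", "15"),
]

# GROUP weather BY $0: one output tuple per distinct key, pairing the
# key (which Pig calls "group") with a bag of every matching input row.
grouped = defaultdict(list)
for row in rows:
    grouped[row[0]].append(row)

for station in sorted(grouped):
    print(station, "->", len(grouped[station]), "rows")
```

Five input rows collapse into three output tuples, one per station, which is exactly the shape of result we will expect from the Pig job below.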

The input data is uploaded to my GitHub profile under the HDPCD repository, and you can download it by clicking here. For your reference, the input file post15.csv looks as follows.

The CSV file shown above can be created with the vi editor using the following commands.

vi post15.csv


PASTE post15.csv contents here

save this by pressing (esc):wq(enter)


cat post15.csv

The following screenshot might help you understand the above commands.

pig group operation input file

Now, since we are going to run the Pig script in Tez mode, we must put this file into HDFS. We will use the following commands to do so.

hadoop fs -mkdir /hdpcd/input/post15

hadoop fs -put post15.csv /hdpcd/input/post15

hadoop fs -cat /hdpcd/input/post15/post15.csv

Please have a look at the following screenshot to get an idea of the execution of the above commands.

pushing input file in HDFS

This completes the input csv file loading operation.

As you can see, this input file contains seven columns, which are explained as follows.

  • Column 1: Station Name
  • Column 2: Year
  • Column 3: Month
  • Column 4: Day
  • Column 5: Precipitation
  • Column 6: Maximum Temperature
  • Column 7: Minimum Temperature
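Since the actual values live in the screenshot, here is a hedged sketch of what a file with this seven-column layout could look like and how to sanity-check it. Only the column order comes from the list above; the readings are invented.

```python
import csv
import io

# Hypothetical CSV content following the seven-column schema:
# station, year, month, day, precipitation, max temp, min temp.
sample = """\
LAX,2017,01,01,0.00,68,52
DEN,2017,01,01,0.20,35,18
SFO,2017,01,01,0.10,58,48
LAX,2017,01,02,0.00,70,54
DEN,2017,01,02,0.40,30,15
"""

records = list(csv.reader(io.StringIO(sample)))

# Every row should carry exactly seven comma-separated fields.
assert all(len(r) == 7 for r in records)
print(len(records), "records,", len(records[0]), "columns each")
```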

From the above list of columns, we will perform the GROUP operation on the first column, Station Name. If you observe the stations in the input data file, LAX and DEN each appear twice, whereas SFO occurs only once. So, to set the output expectations right, we should get three records in the output file, one each for SFO, LAX, and DEN.

Now that the expectations are set for the output, let us start writing the Pig script for the GROUP operation. I have uploaded this script to my GitHub profile under the HDPCD repository with the name 13_group_in_pig.pig. You can click here to download this Pig script. It looks something like this.

The above snippet shows that a total of 4 commands carry out this operation. Let us go through them one by one.

weather = LOAD '/hdpcd/input/post15/post15.csv' USING PigStorage(',');

The above command loads the input CSV file from HDFS into the weather Pig relation. Since it is a comma-separated file, we use PigStorage(',') to parse it correctly.

grouped_data = GROUP weather BY $0;

This is the command that performs the grouping in Pig. The GROUP operator groups the input data by some column; in this case we group by the first column, which is why we pass $0 in the above command.

output_data = FOREACH grouped_data GENERATE group, weather;

The above command creates the output Pig relation. Each output record contains the group, i.e. the station name, along with the bag of corresponding weather records.

STORE output_data INTO '/hdpcd/output/post15/';

The STORE command is responsible for storing the contents of the output_data relation in HDFS under the /hdpcd/output/post15/ directory.

This means our output files should appear under /hdpcd/output/post15/.
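The four Pig commands above can be sketched end to end in plain Python. This is only an illustration of the data flow through LOAD, GROUP, FOREACH ... GENERATE, and STORE, not how Pig actually executes on Tez, and the rows are hypothetical.

```python
from collections import defaultdict

# LOAD ... USING PigStorage(','): split each line on commas (hypothetical rows).
raw = [
    "LAX,2017,01,01,0.00,68,52",
    "DEN,2017,01,01,0.20,35,18",
    "SFO,2017,01,01,0.10,58,48",
    "LAX,2017,01,02,0.00,70,54",
    "DEN,2017,01,02,0.40,30,15",
]
weather = [tuple(line.split(",")) for line in raw]

# GROUP weather BY $0: collect rows into bags keyed by the first field.
bags = defaultdict(list)
for t in weather:
    bags[t[0]].append(t)

# FOREACH grouped_data GENERATE group, weather: emit (key, bag) tuples.
output_data = [(station, bags[station]) for station in sorted(bags)]

# STORE ... INTO: Pig would write one line per (group, bag) tuple.
for group, bag in output_data:
    print(group, bag)
```

Five input rows go in, and three (group, bag) tuples come out, matching the three stations we expect in the output directory.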

Now, let us create this pig script and run it.

We use the same vi editor to create this Pig script, and then the pig command to run it.

vi post15.pig


PASTE post15.pig contents here

save this by pressing (esc):wq(enter)


cat post15.pig

For your reference, please have a look at the below screenshot.

creating pig script for group operation

Now that the Pig script is ready, it is time to run it, and we are going to use the following command to do so.

pig -x tez post15.pig

The execution of this command looks as follows.

running pig script for group operation

The output of this pig script looks as follows.

pig script output for group operation

As you can see from the above screenshot, a total of 5 records were read from the input post15.csv file and 3 records were written to the defined HDFS output directory /hdpcd/output/post15, as expected.

We have also received the success message in the above image. Therefore, now it is time to check the output records.

The following commands are used to check the contents of the HDFS output directory.

hadoop fs -ls /hdpcd/output/post15

hadoop fs -cat /hdpcd/output/post15/part-v001-o000-r-00000

The output of the above commands is shown in the below screenshot.

group operation output in pig

As you can see in the above screenshot, there are a total of 3 records, grouped according to the station names, i.e. DEN, LAX, and SFO.

This confirms that the output is coming as expected and we can conclude this tutorial here.

Hope you are getting the concepts which I want to convey through these tutorials.

In the next tutorial, we are going to see how to remove the NULL values from a pig relation.

You can click here to subscribe to my YouTube channel for the video tutorials. You can like my Facebook page here and follow me on Twitter here.

Stay tuned for the further updates. Thanks for having a read.


