Post 22 | HDPCD | Join two datasets using Apache Pig

Hey, everyone. Thanks for the overwhelming response the blog posts have been receiving over the last week. I really appreciate it, and I will keep posting interesting and innovative content for you.

In the last tutorial, we saw two ways to use the parallel features of Apache Pig. In this tutorial, we are going to focus on the JOIN operation and walk through the process of joining two datasets to produce the required output.

Let us start then.

Big Picture: join-two-datasets-using-apache-pig

As you can see from the above picture, the process we are following here is quite similar to the one we have used in almost all the tutorials so far.

We will start off by creating the input CSV files first.

  • CREATING INPUT CSV FILES IN LOCAL FILE SYSTEM

As in earlier tutorials, the input CSV files are created with the vi editor in the local file system.

Since two datasets are involved in this tutorial, we are going to create two input CSV files in the local file system.

I have uploaded both of these files to my GitHub profile under the HDPCD repository. The names of these two files are 29_customers_input.csv and 30_orders_input.csv.

The file post22_customers.csv looks as follows, and you can download it by clicking here.

We can use the traditional vi editor for creating this file. Please use the following commands for doing this.

vi post22_customers.csv

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post22_customers.csv

The following screenshot might come in handy for running these commands.

Step 1: creating customers input file in local file system

Now, it is time for the second input file.

The file post22_orders.csv looks as follows, and you can download it by clicking here.

The same vi editor is used for creating this input file.

vi post22_orders.csv

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post22_orders.csv

Please have a look at the below screenshot for the execution of the above commands.

Step 2: creating orders input file in local file system
  • PUSHING INPUT CSV FILES TO HDFS

Now is the time to push these input CSV files to HDFS.

We are going to use the following set of commands for doing this.

hadoop fs -mkdir /hdpcd/input/post22
hadoop fs -put post22_customers.csv /hdpcd/input/post22
hadoop fs -put post22_orders.csv /hdpcd/input/post22
hadoop fs -cat /hdpcd/input/post22/post22_customers.csv
hadoop fs -cat /hdpcd/input/post22/post22_orders.csv
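
If you would like to confirm that both files actually landed in HDFS before moving on, a quick listing of the directory does the trick.

hadoop fs -ls /hdpcd/input/post22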

The following screenshot might come in handy to keep track of the execution of the above commands.

Step 3: pushing input CSV files from local file system to HDFS

Now that both input CSV files have been pushed to HDFS, it is time to create the pig script responsible for performing the JOIN operation between the customers and orders data.

  • CREATING PIG SCRIPT FOR JOIN OPERATION

I have uploaded this file to my GitHub profile under the HDPCD repository. The name of this file is 31_join_operation.pig.

This pig script looks as follows and you can download this file by clicking here.

Let us go through each line in this pig script.

customers = LOAD '/hdpcd/input/post22/post22_customers.csv' USING PigStorage(',');

The above LOAD command loads the data stored in the post22_customers.csv file into a pig relation called customers.

orders = LOAD '/hdpcd/input/post22/post22_orders.csv' USING PigStorage(',');

The above LOAD command loads the data stored in the post22_orders.csv file into a pig relation called orders.
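
Note that neither LOAD statement declares a schema, which is why the rest of the script refers to columns by position ($0, $1, and so on). If you prefer named fields, you can attach a schema with an AS clause. The sketch below is only an illustration; the column names are assumptions and should be adjusted to match the actual file.

-- Hypothetical schema; field names are assumed, and only the first three columns are named here.
customers = LOAD '/hdpcd/input/post22/post22_customers.csv' USING PigStorage(',')
    AS (customer_id:int, fname:chararray, lname:chararray);

With a schema in place, later statements can refer to fields by name (for example, customers::fname after the join) instead of by index.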

joined_data = JOIN customers BY $0, orders BY $2;

The above JOIN statement contains the gist of this tutorial. The JOIN operator performs the JOIN operation in Apache Pig, and the syntax is quite simple: you list every relation that should take part in the join, along with the column on which each relation is joined. Here, the join columns are referred to by their positional indices in the customers and orders relations. The result is stored in a pig relation called joined_data.
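
Before projecting the final columns, it can help to peek at the joined relation. Here is a minimal sketch using the built-in DESCRIBE, LIMIT, and DUMP operators, which you can run from the Grunt shell or add temporarily to the script.

DESCRIBE joined_data;
preview = LIMIT joined_data 5;
DUMP preview;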

output_data = FOREACH joined_data GENERATE $1 AS fname, $2 AS lname, $8 AS orderid, $12 AS payment_status;

The above FOREACH statement iterates over the joined_data pig relation. Since the final output should contain only the first name, last name, order ID, and payment status, just those columns are extracted from joined_data. The result is stored in the output_data pig relation.
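
For reference, the positional indices work because a JOIN output tuple contains all of the columns of the first relation followed by all of the columns of the second. The layout below is inferred from the indices used in this script (the customers file appearing to have eight columns), so treat it as an assumption and verify it against your own files.

-- Assumed layout of joined_data (inferred, not confirmed):
-- $0 to $7   : columns from customers ($0 = customer ID, $1 = fname, $2 = lname, ...)
-- $8 to $12  : columns from orders    ($8 = order ID, ..., $12 = payment status)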

STORE output_data INTO '/hdpcd/output/post22/';

The output_data pig relation is then stored in the HDFS directory /hdpcd/output/post22 with the help of the STORE command.
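
Two small caveats about STORE are worth remembering: the target directory must not already exist (the job fails otherwise), and PigStorage writes tab-delimited output by default. If you want the result back in CSV form, you can pass a delimiter explicitly, as in this variation:

STORE output_data INTO '/hdpcd/output/post22/' USING PigStorage(',');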

I believe the above explanation is sufficient to move forward with the creation and execution of the pig script.

You can use the following commands to create this pig script.

vi post22.pig

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post22.pig

The below screenshot might come in handy for doing this.

Step 4: creating pig script for join operation

It is time to run this pig script.

  • RUNNING THE PIG SCRIPT

You can use the following command to run this pig script.

pig -x tez post22.pig
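
The -x tez flag asks Pig to run the script on the Tez execution engine. If Tez is not set up in your environment, the same script can run on the default MapReduce engine, or in local mode for quick testing (in local mode the paths refer to the local file system rather than HDFS):

pig -x mapreduce post22.pig
pig -x local post22.pig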

The pig script starts the execution in the following way.

Step 5: pig script execution command

And the output of this pig script looks as follows.

Step 5: pig script execution output

You can see from the above screenshot that the operation was successful. A total of 12 records were read from the customers file and 13 records from the orders file. Since JOIN performs an inner join by default, only the 4 records with matching join-key values in both files made it to the output HDFS directory.

Now, the last thing remaining is to view the data stored in HDFS output directory.

  • OUTPUT HDFS DIRECTORY

For viewing the data stored in the output HDFS directory, we are going to use the following two commands.

hadoop fs -ls /hdpcd/output/post22
hadoop fs -cat /hdpcd/output/post22/part-v002-o000-r-00000
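
A small note on the file name: part-v002-o000-r-00000 is how a Tez run names its output; a MapReduce run would produce something like part-r-00000 instead. Using a glob avoids hard-coding the exact name:

hadoop fs -cat /hdpcd/output/post22/part-*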

The following screenshot shows the output of these two commands.

Step 6: output HDFS directory contents

As you can see from the above screenshot, the output file contains a total of 4 records with the first name, last name, order ID, and payment status, exactly as expected.

With this, we can say that the objective of this tutorial is met and we can conclude this section here.

I hope the explanations and screenshots are helping you understand the concepts in each tutorial. Stay tuned for further updates.

In the next tutorial, we are going to perform the replicated join using Apache Pig.

Please visit my LinkedIn profile here. Like my Facebook page here and follow me on Twitter here. You can subscribe to my YouTube channel here.

See you soon.

Cheers!

 
