Hey everyone, thank you once again for coming back to follow these tutorials.

In the last tutorial, we saw how to perform a simple JOIN operation. In this tutorial, we are going to perform a REPLICATED JOIN operation. The process is nearly identical and differs in only one place, so there is not much new to learn.

In the certification, they will specifically mention when you have to perform the Replicated JOIN. If they do not mention any type of join, then please use the previous tutorial to perform the simple JOIN operation.

The following infographic shows the process of performing the Replicated JOIN operation.

The Big Picture: REPLICATED JOIN in Apache PIG

The above picture shows the process of performing a Replicated JOIN in Apache Pig. Let us go through the steps one at a time.

  • CREATING INPUT CSV FILES IN LOCAL FILE SYSTEM

We are going to use the traditional vi editor to create the input CSV files in the local file system.

The first file contains the customers' data and is named post23_customers.csv. I have uploaded this input file to my GitHub profile under the HDPCD repository with the name "32_customers_input.csv". This input CSV file can be downloaded by clicking here and it looks something like this.

As you can see from the above snippet, this file is identical to the one we saw in the previous tutorial.

Please use the following commands to create this input CSV file.

vi post23_customers.csv

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post23_customers.csv

The following screenshot shows the output of the above commands.

Step 1: Creating customers input file in local file system

The above screenshot shows that the customers’ input CSV file was created successfully in the local file system.
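If you prefer not to paste into vi, a heredoc can create the file in one step, and a quick field count catches rows that lost a comma during copy and paste. The sketch below is only an illustration: the file name sample_customers.csv and the three rows are made-up placeholders, not the real 25 records, and the three-column layout is an assumption.

```shell
# Work in a scratch directory so the real tutorial files are not overwritten.
workdir="$(mktemp -d)"
sample="$workdir/sample_customers.csv"   # hypothetical file name

# A quoted heredoc writes the rows verbatim, with no shell expansion.
# These rows are made-up placeholders, not the real customer data.
cat > "$sample" <<'EOF'
1,Alice,Boston
2,Bob,Denver
3,Carol,Austin
EOF

# Count the rows, then count how many different field counts appear.
row_count="$(wc -l < "$sample" | tr -d ' ')"
field_shapes="$(awk -F',' '{ print NF }' "$sample" | sort -u | wc -l | tr -d ' ')"

echo "rows: $row_count, distinct field counts: $field_shapes"
```

A value of 1 for the distinct field counts means every row has the same number of columns, which is what the positional references in the join rely on.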

It is time to create the data file containing the orders information.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository with the name "33_orders_input.csv". It looks as follows and you can download this file by clicking here.

You can use the following commands to create this input CSV file in the local file system.

vi post23_orders.csv

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post23_orders.csv

And the output of these commands looks as follows.

Step 2: creating orders input file in local file system

The above screenshot clearly shows that this input CSV file was created successfully in the local file system.

The next logical step is to push these two files to HDFS.

  • PUSHING CUSTOMERS AND ORDERS DATA TO HDFS

Please use the following commands to load these two input CSV files to HDFS.

hadoop fs -mkdir /hdpcd/input/post23
hadoop fs -put post23_customers.csv /hdpcd/input/post23
hadoop fs -put post23_orders.csv /hdpcd/input/post23
hadoop fs -cat /hdpcd/input/post23/post23_customers.csv
hadoop fs -cat /hdpcd/input/post23/post23_orders.csv

The following screenshot shows the output of the above commands.

Step 3: pushing input csv files to HDFS

The above screenshot shows that these two files were pushed to HDFS successfully.
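If you need to repeat this step, for example after fixing a typo in one of the files, a slightly more defensive version of the upload can help. This is a sketch rather than part of the original workflow: it assumes the same paths as above, uses -mkdir -p so missing parent directories are created, and uses -put -f so an older copy in HDFS is overwritten. The variable names are my own.

```shell
# Paths taken from the commands above.
hdfs_input_dir="/hdpcd/input/post23"
local_files="post23_customers.csv post23_orders.csv"

if command -v hadoop >/dev/null 2>&1; then
    # -p creates missing parent directories and does not fail if the directory exists.
    hadoop fs -mkdir -p "$hdfs_input_dir"
    for f in $local_files; do
        if [ -f "$f" ]; then
            # -f overwrites a copy left in HDFS by an earlier run.
            hadoop fs -put -f "$f" "$hdfs_input_dir"
        else
            echo "skipping $f: not found in the current directory"
        fi
    done
    hadoop fs -ls "$hdfs_input_dir"
else
    echo "hadoop client not found on PATH; nothing was uploaded"
fi
```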

The next step is to create the pig script to perform the REPLICATED JOIN between these two input CSV files.

  • CREATING PIG SCRIPT TO PERFORM THE REPLICATED JOIN

The pig script for this tutorial is nearly identical to the previous one, with two additional keywords in the JOIN command.

This pig script is uploaded to my GitHub profile under the HDPCD repository with the name "34_replicated_join.pig". You can download this pig script by clicking here and it looks as follows.

The explanation and functionality of all the commands are the same as in the last tutorial. You can refer to that tutorial for the explanation of this pig script.

The only different command is as follows.

joined_data = JOIN customers BY $0, orders BY $2 USING 'replicated';

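In the above command, the keywords USING 'replicated' tell Pig to perform a REPLICATED JOIN instead of the normal JOIN. In a replicated join, Pig loads the relation listed last (orders here) into memory and streams the first relation (customers) past it, so the last relation must be small enough to fit in memory. Note that the quotes around replicated must be plain single quotes.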
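To get a feel for what a replicated join does, here is a small local emulation with awk. It is not how Pig is implemented, only a sketch of the same idea: hold the last-listed relation in memory and stream the other one past it. The file names, rows, and column layouts are made up for illustration; the only thing borrowed from the tutorial is that the key is the first field of customers ($0 in Pig) and the third field of orders ($2 in Pig).

```shell
workdir="$(mktemp -d)"

# Made-up rows: the customer id is field 1 here (Pig's $0).
cat > "$workdir/customers.csv" <<'EOF'
1,Alice
2,Bob
3,Carol
EOF

# Made-up rows: the customer id is field 3 here (Pig's $2).
cat > "$workdir/orders.csv" <<'EOF'
101,2017-01-05,1
102,2017-01-06,3
103,2017-01-07,1
EOF

# Pass 1 (NR == FNR): hold every orders row in memory, grouped by customer id.
# Pass 2: stream customers and emit one joined row per matching order.
joined="$(awk -F',' '
    NR == FNR { n[$3]++; held[$3, n[$3]] = $0; next }
    ($1 in n) { for (i = 1; i <= n[$1]; i++) print $0 "," held[$1, i] }
' "$workdir/orders.csv" "$workdir/customers.csv")"

printf '%s\n' "$joined"
```

With these sample rows, customer 2 has no orders and is dropped, which matches the inner-join behaviour of the Pig command. Customer 1 appears twice, once per order.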

I hope this explanation is enough to go ahead and create the pig script.
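If you do not have the GitHub file at hand, the following sketch writes a minimal script of the same shape. Please treat it as my reconstruction under assumptions, not as the exact script from the repository: it assumes both inputs are comma-delimited and loaded without a schema, and that the result is stored in /hdpcd/output/post23. Only the JOIN line is taken verbatim from this post. The file name post23_sketch.pig is hypothetical, chosen so that your post23.pig is left untouched.

```shell
workdir="$(mktemp -d)"
script="$workdir/post23_sketch.pig"   # hypothetical name, so post23.pig is untouched

# Quoted heredoc: $0 and $2 must reach the file literally, not be expanded by the shell.
cat > "$script" <<'EOF'
-- Assumed: both inputs are comma-delimited and loaded without a schema,
-- so fields are addressed by position ($0, $1, ...).
customers = LOAD '/hdpcd/input/post23/post23_customers.csv' USING PigStorage(',');
orders    = LOAD '/hdpcd/input/post23/post23_orders.csv'    USING PigStorage(',');

-- The relation listed last (orders) is the one held in memory.
joined_data = JOIN customers BY $0, orders BY $2 USING 'replicated';

-- Assumed output location and delimiter.
STORE joined_data INTO '/hdpcd/output/post23' USING PigStorage(',');
EOF

# Confirm that the join line uses plain single quotes.
grep -n "USING 'replicated'" "$script"
```

The heredoc is quoted on purpose. Without the quotes around EOF, the shell would replace $0 and $2 before the text ever reached the file.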

We can use the vi editor to create this pig script by using the following commands.

vi post23.pig

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post23.pig

The following screenshot shows the execution of these commands.

Step 4: creating pig script to perform replicated join

Now that the pig script is created, it is time to run this pig script.

  • RUNNING PIG SCRIPT TO PERFORM THE REPLICATED JOIN

Please use the following command to run this pig script.

pig -x tez post23.pig

The following screenshot shows the execution process of this pig script.

Step 5: running pig script to perform replicated join

And the output window of the pig script execution looks as follows.

Step 5: pig script execution output

From the above screenshot, you can see that the operation completed successfully. A total of 25 records from the post23_customers.csv file and 13 records from the post23_orders.csv file were read, and the output contains 4 records, as expected.

Now, let us go to HDFS and view the contents of the output HDFS directory.

  • HDFS OUTPUT DIRECTORY CONTENTS

Please use the following commands to check the output contents stored in the HDFS directory /hdpcd/output/post23.

hadoop fs -ls /hdpcd/output/post23
hadoop fs -cat /hdpcd/output/post23/part-v001-o000-r-00000

And the output of these two commands looks as follows.

Step 6: output HDFS directory contents

From the above screenshot, you can see that a total of 4 records were successfully created in the output HDFS directory.
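If you would rather confirm the record count from the command line than read it off the screen, something like the following should work. It is a small sketch: the helper name count_lines is my own, and the part-* glob is used because the exact part file name depends on the execution engine.

```shell
# Count lines on standard input and print a bare number.
count_lines() {
    wc -l | tr -d ' '
}

hdfs_output_dir="/hdpcd/output/post23"

if command -v hadoop >/dev/null 2>&1; then
    # The quoted glob is expanded by HDFS, not by the local shell,
    # and matches every part file in the output directory.
    hadoop fs -cat "$hdfs_output_dir/part-*" | count_lines
else
    echo "hadoop client not found on PATH; skipping the HDFS check"
fi
```

On the cluster used in this tutorial, the printed number should match the 4 records reported by the Pig job.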

This concludes the tutorial. I hope you were able to follow all the steps and achieve the objective of this tutorial.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on twitter here and subscribe to my YouTube channel here for the video tutorials.

Stay tuned. Cheers!
