Post 24 | HDPCD | Run a Pig job using TEZ

May 29, 2017 by milindjagre

Hey, everyone. Thank you for keeping me company on this journey toward the HDPCD certification. We are almost done with the Data Transformation section of the certification, and only the Data Analysis section, which uses Apache Hive, remains. In my opinion, Data Analysis is easier than this section, so you can say that the difficult part of the certification is over and only Hive is left to prepare.

Let us get started then.

In the previous tutorial of this series, we saw how to perform a Replicated JOIN in Apache Pig. This tutorial is different from the last one: here we are going to see how to run a Pig script in the TEZ execution mode.

To do this, we are going to follow the steps below.

Apache Pig TEZ execution mode

As you can see from the above screenshot, we are not doing anything extraordinary in this tutorial. Our focus is to run this pig script using TEZ as the execution mode.

  • CREATING INPUT FILE IN LOCAL FILE SYSTEM

We are going to use the traditional vi editor to create the input file in the local file system.
I have uploaded this file to my GitHub profile, in the HDPCD repository, under the name 35_input_TEZ_mode.txt. You can download it from there.

You can use the following commands to create this input file.

vi post24.txt

################################
PASTE THE COPIED CONTENTS HERE
################################

cat post24.txt

The following screenshot shows the execution of the above commands.

Step 1: creating input file in local file system

As you can see from the above screenshot, this input file was successfully created in the local file system.
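
If you prefer not to open an editor, the same file can be created non-interactively. The snippet below is only a sketch: the three tab-separated rows are placeholders that I made up for illustration and are not the real contents of 35_input_TEZ_mode.txt, so swap in the actual records before using it.

```shell
# Non-interactive alternative to the vi step (sketch).
# ASSUMPTION: the three rows below are hypothetical placeholders,
# not the real records from the GitHub file.
# printf reuses its format string for every pair of arguments,
# so this writes three tab-separated lines.
printf '%s\t%s\n' \
  1 placeholder_one \
  2 placeholder_two \
  3 placeholder_three > post24.txt

# Show the file and count its lines.
cat post24.txt
wc -l < post24.txt
```

The line count printed at the end is a quick way to confirm that the file holds the number of records you expect before pushing it to HDFS.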

Let us push this file to HDFS now.

  • PUSHING THE INPUT FILE FROM LOCAL FILE SYSTEM TO HDFS

We are going to use the following set of commands to push this input file from the local file system to HDFS.

hadoop fs -mkdir /hdpcd/input/post24
hadoop fs -put post24.txt /hdpcd/input/post24
hadoop fs -cat /hdpcd/input/post24/post24.txt

And the output of these commands is shown in the following screenshot.

Step 2: pushing the input file from local file system to HDFS

This confirms that the file was successfully pushed to HDFS.
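
The upload step can also be written so that it is safe to run more than once. This is a hedged sketch rather than the exact commands from the screenshot: the variable names are my own, and it assumes the hadoop client is available on the PATH.

```shell
# Re-runnable version of the upload step (sketch).
# ASSUMPTION: the hadoop client is on PATH; variable names are illustrative.
HDFS_DIR=/hdpcd/input/post24
LOCAL_FILE=post24.txt

if command -v hadoop >/dev/null 2>&1; then
  # -p creates missing parent directories and does not fail if the
  # directory already exists.
  hadoop fs -mkdir -p "$HDFS_DIR"
  # -f overwrites a copy left behind by an earlier attempt.
  hadoop fs -put -f "$LOCAL_FILE" "$HDFS_DIR"
  # -test -e returns success only when the path exists in HDFS.
  if hadoop fs -test -e "$HDFS_DIR/$LOCAL_FILE"; then
    echo "upload ok"
  fi
else
  echo "hadoop client not found; run this on a cluster node" >&2
fi
```

Checking the exit status of hadoop fs -test gives a yes/no answer that is easier to script around than reading the output of -cat.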

Let us create the pig script now.

  • CREATING PIG SCRIPT TO RUN IN TEZ EXECUTION MODE

This Pig script will be executed in TEZ execution mode.
I have uploaded this Pig script to my GitHub profile, in the HDPCD repository, under the name 36_pig_script_tez_mode.pig. You can download it from there.

Let me explain this pig script in brief, though there is not much to explain.

input_data = LOAD '/hdpcd/input/post24/post24.txt' USING PigStorage();

The LOAD command loads the contents of the file post24.txt into the Pig relation input_data. PigStorage() is called without arguments, so fields are split on the default tab delimiter.

STORE input_data INTO '/hdpcd/output/post24';

The STORE command writes the data held in the input_data relation to the HDFS directory /hdpcd/output/post24. This directory must not exist before the script runs, otherwise the STORE fails.

That is the simplest explanation of this Pig script. We are going to create it with the help of the following commands.

vi post24.pig

################################
PASTE THE COPIED CONTENTS HERE
################################

cat post24.pig

The following screenshot shows the execution of the above commands.

Step 3: Creating the pig script to run in TEZ mode

The above screenshot confirms that the pig script was created successfully.
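
As an alternative to pasting into vi, the two statements explained above can be written straight from the shell with a here-document. This is a small sketch; note the plain ASCII single quotes around the paths, which Pig expects.

```shell
# Write the two-statement Pig script without opening an editor.
# The quoted 'EOF' delimiter stops the shell from expanding anything
# inside the here-document, so the Pig code is written verbatim.
cat > post24.pig <<'EOF'
input_data = LOAD '/hdpcd/input/post24/post24.txt' USING PigStorage();
STORE input_data INTO '/hdpcd/output/post24';
EOF

# Print the script back to confirm its contents.
cat post24.pig
```

Creating the script this way also avoids the curly quotes that some editors and web pages introduce when text is copied.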

Let us go ahead and execute this pig script in TEZ execution mode.

  • RUNNING THE PIG SCRIPT IN TEZ EXECUTION MODE

We are going to run this pig script in the TEZ execution mode with the help of the following command.

pig -x tez post24.pig

The pig script starts the execution in the following way.

Step 4: Running the pig script with TEZ execution mode

As you can see in the above screenshot, the execution mode is TEZ.
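
One thing to keep in mind while practising: STORE refuses to write into a directory that already exists, so running the script a second time fails unless the old output is removed first. The wrapper below is a minimal sketch of a re-runnable version. The variable names are illustrative, and it assumes that both the pig and hadoop clients are on the PATH.

```shell
# Re-runnable wrapper around the TEZ run (sketch, not the exact steps
# shown in the screenshots).
# ASSUMPTION: pig and hadoop are on PATH; variable names are illustrative.
SCRIPT=post24.pig
OUT_DIR=/hdpcd/output/post24

if command -v pig >/dev/null 2>&1 && command -v hadoop >/dev/null 2>&1; then
  # Remove output left by an earlier run; -f suppresses the error
  # when the directory is absent.
  hadoop fs -rm -r -f "$OUT_DIR"
  # -x selects the execution engine; tez submits the job to the cluster.
  pig -x tez "$SCRIPT"
else
  echo "pig/hadoop clients not found; run this on a cluster node" >&2
fi
```

Pig also accepts -x tez_local, which invokes the TEZ runtime against the local file system instead of the cluster. The Pig documentation describes that mode as experimental, so treat it as a convenience for quick checks only.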

The output of the above command looks as follows.

Step 4: Pig Script TEZ mode output window

From the above screenshot, it is clear that the Pig script ran successfully with the expected output. It shows that a total of 3 records were read from the input HDFS file post24.txt and that 3 records were written to the output directory /hdpcd/output/post24.

Let us go to HDFS and view these output records.

  • HDFS OUTPUT DIRECTORY

The following commands are used to view the records stored in the output HDFS directory.

hadoop fs -ls /hdpcd/output/post24
hadoop fs -cat /hdpcd/output/post24/part-v000-o000-r-00000

The following screenshot shows the output of the above commands.

Step 5: Output HDFS directory contents

As you can see from the above screenshot, the output HDFS file contains the expected number of records in the expected format. This confirms that the Pig script ran successfully in the TEZ execution mode.
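
If you would rather not compare the records by eye, the input and output line counts can be compared from the shell. This is a rough sketch under the same assumptions as before: the hadoop client is on the PATH and the variable names are my own.

```shell
# Compare input and output record counts (sketch).
# ASSUMPTION: the hadoop client is on PATH; variable names are illustrative.
IN_FILE=/hdpcd/input/post24/post24.txt
OUT_DIR=/hdpcd/output/post24

if command -v hadoop >/dev/null 2>&1; then
  # The glob is quoted so that hadoop, not the local shell, expands it.
  in_count=$(hadoop fs -cat "$IN_FILE" | wc -l | tr -d ' ')
  out_count=$(hadoop fs -cat "$OUT_DIR/part-*" | wc -l | tr -d ' ')
  echo "input=$in_count output=$out_count"
  if [ "$in_count" -eq "$out_count" ]; then
    echo "record counts match"
  else
    echo "record counts differ" >&2
  fi
else
  echo "hadoop client not found; run this on a cluster node" >&2
fi
```

Using the part-* pattern avoids hard-coding the TEZ part file name, which helps if a later script produces more than one output file.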

I hope the content and the screenshots make sense and help you understand the concepts to the core. Please follow my blog for further updates.
In the next tutorial, we are going to see how to register a jar file in an Apache Pig session.

Stay tuned for upcoming posts.

Cheers!
