Post 21 | HDPCD | Specify the number of reduce tasks for a Pig MapReduce job

Hello, everyone. Thanks for coming back to one more tutorial in this HDPCD certification series. In the last tutorial, we saw how to remove the duplicate tuples from a pig relation. In this tutorial, we are going to see how to specify the number of reduce tasks for a Pig MapReduce job.

Let us get started then.

There are two ways of doing this, and both are shown in the following infographic.

 

Two ways: parallel features of Apache Pig

 

As you can see from the above figure, we can set the number of reduce tasks in two ways: at the Global Level and at the Task (per-operation) Level.
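
To set the context, here is a minimal sketch of what the two approaches look like inside a Pig script (the relation name some_relation is only a placeholder; the actual scripts are built later in this tutorial):

-- Global Level: every reduce phase in the script runs with 4 reduce tasks
SET default_parallel 4;

-- Task Level: only this ORDER statement runs with 6 reduce tasks
sorted = ORDER some_relation BY $0 DESC PARALLEL 6;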

We will walk through each level one by one. Let us start with the preparations.

  • CREATING INPUT CSV FILE IN LOCAL FILE SYSTEM

We are going to use the same approach as in most of the previous tutorials, i.e. the vi editor.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository with the name "26_input_parallel_tasks.csv". You can download this file by clicking here, and it looks as follows.

Please use the following commands to create this file in local file system.

vi post21.csv

######

PASTE THE COPIED CONTENTS HERE

######

cat post21.csv

The following screenshot gives an idea about the execution of the above commands.

Step 1: creating input file in local file system
  • PUSHING INPUT CSV FILE FROM LOCAL FILE SYSTEM TO HDFS

Once the file is created successfully in the local file system, we push it to HDFS. We are going to use the following commands for executing this task.

hadoop fs -mkdir /hdpcd/input/post21
hadoop fs -put post21.csv /hdpcd/input/post21
hadoop fs -cat /hdpcd/input/post21/post21.csv

The above commands’ execution is shown in the following screenshot.

Step 2: pushing file from local file system to HDFS

From the above screenshot, we can see that the file was successfully pushed to HDFS.

  • CREATING PIG SCRIPT

Now it is time to create the pig scripts.

We are going to create two pig scripts in this case, as already mentioned in the introductory part.

Just a tip: while performing this task in the certification exam, please pay close attention to the question and use only the approach it asks for.

If the question does not mention which one to use, then go with the GLOBAL APPROACH, i.e. the SET command.

The first pig script is going to use the SET command to define the number of parallel reduce tasks.

I have uploaded this pig script to my GitHub profile under the HDPCD repository with the name "27_SET_multiple_reducers.pig". You can download this file by clicking here, and it looks as follows.

We can use the following commands to create this file.

vi post21_1.pig

######

PASTE THE COPIED CONTENTS HERE

######

cat post21_1.pig

The following screenshot gives an idea about the execution of these commands.

 

Step 3: pig script creation – 1

 

Now that we have created the first pig script, let us create the second one as well. We will then compare the two scripts, since they differ by only one line.

I have uploaded the second pig script to my GitHub profile under the HDPCD repository with the name "28_PARALLEL_multiple_reducers.pig". You can download this file by clicking here, and it looks as follows.

The same commands as above are used to create this pig script.

vi post21_2.pig

######

PASTE THE COPIED CONTENTS HERE

######

cat post21_2.pig

The following screenshot shows this process of creating the pig script.

 

Step 4: pig script creation – 2

 

Now that we have created both pig scripts, let us look at the commonalities and differences between them.

The common part between these two scripts is as follows.

input_data = LOAD '/hdpcd/input/post21/post21.csv' USING PigStorage(',');

The above LOAD command loads the data from post21.csv into the input_data pig relation. Since no schema is specified, fields are referenced by position, so $6 in the ORDER statements below refers to the seventh column.

Now, let us look at the differences between these two scripts.

PIG SCRIPT 1: SET default_parallel 4;
PIG SCRIPT 1: sorted_data = ORDER input_data BY $6 DESC;

PIG SCRIPT 2: sorted_data = ORDER input_data BY $6 DESC PARALLEL 6;

As you can see from the above code snippet, the first script sets the number of reduce tasks globally, in this case to 4. No matter what operations the script performs, once the number of reduce tasks is set globally, every reduce phase uses that many reducers, and that many part files are created in the output.

In the second script, on the other hand, we specify the number of reduce tasks for one particular operation, in this case the ORDER operation. This makes that ORDER step launch 6 reduce tasks in parallel, creating 6 part files in the output HDFS directory. Note that the PARALLEL clause only affects operators that trigger a reduce phase, such as GROUP, ORDER, DISTINCT, and JOIN.

I hope this explanation makes sense and gives you an overall idea of the parallel functionality in pig script execution.
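
As a quick aside (this is not part of the scripts in this tutorial), the PARALLEL clause can be attached in the same way to other operators that trigger a reduce phase; the statements below are only illustrative:

grouped_data = GROUP input_data BY $0 PARALLEL 3;    -- this GROUP phase would run with 3 reduce tasks
distinct_data = DISTINCT input_data PARALLEL 2;      -- and this DISTINCT with 2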

Lastly, the output is stored in HDFS with the help of the STORE command. The two scripts use different output directories, and the field delimiters are also different.

The STORE command looks as follows for these two scripts.

SCRIPT 1: STORE sorted_data INTO '/hdpcd/output/post21_1' USING PigStorage(':');

SCRIPT 2: STORE sorted_data INTO '/hdpcd/output/post21_2' USING PigStorage(';');
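
Putting the pieces together, the two scripts should look roughly as follows (a sketch assembled from the snippets above; the exact contents are in the files on GitHub).

post21_1.pig (global SET approach):

SET default_parallel 4;
input_data = LOAD '/hdpcd/input/post21/post21.csv' USING PigStorage(',');
sorted_data = ORDER input_data BY $6 DESC;
STORE sorted_data INTO '/hdpcd/output/post21_1' USING PigStorage(':');

post21_2.pig (per-operation PARALLEL approach):

input_data = LOAD '/hdpcd/input/post21/post21.csv' USING PigStorage(',');
sorted_data = ORDER input_data BY $6 DESC PARALLEL 6;
STORE sorted_data INTO '/hdpcd/output/post21_2' USING PigStorage(';');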

This completes the pig script creation.

Now is the time to execute both of these pig scripts. We will see the execution of these two scripts one by one.

  • PIG SCRIPT EXECUTION – 1

You can use the following command to run the first pig script post21_1.pig.

pig post21_1.pig

The execution of this command looks as follows.

Step 5: running pig script type 1

And the output of this script looks as follows.

Step 5: pig script execution output type 1

From the above output screenshot, we can see that this pig script execution was successful. We did not lose any records and a total of 5 records were written to the HDFS directory /hdpcd/output/post21_1.

This completes the execution part of the first pig script.

  • PIG SCRIPT EXECUTION – 2

Coming to the second pig script execution, it is similar to the first one. You can use the following command to run this pig script.

pig post21_2.pig

The execution of the above command looks as follows.

Step 6: running pig script type 2

And the output of this script is as follows.

Step 6: pig script execution output type 2

As you can see from the above screenshot, this operation was also successful and the output HDFS directory /hdpcd/output/post21_2 is loaded with 5 records.

This completes the pig script execution.

Now let us go to HDFS and see how many part files are created in each HDFS directory.

  • OUTPUT HDFS DIRECTORY – 1

We can use the following commands to view the content of the first output HDFS directory.

hadoop fs -ls /hdpcd/output/post21_1

hadoop fs -cat /hdpcd/output/post21_1/*

The above commands give the following output.

Step 7: HDFS output directory type 1

As we can see, a total of 4 part files and a _SUCCESS file were created, as expected. This confirms that the SET default_parallel 4 command worked perfectly.

  • OUTPUT HDFS DIRECTORY – 2

The following commands are used for checking the second output HDFS directory.

hadoop fs -ls /hdpcd/output/post21_2
hadoop fs -cat /hdpcd/output/post21_2/*

The screenshot of the above commands’ execution is as follows.

Step 8: HDFS output directory type 2

The above screenshot shows that a total of 6 part files and a _SUCCESS file were created after executing the second pig script, as expected.
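
As an optional sanity check (not shown in the screenshots above), you can count the part files in each output directory directly; this assumes the default part-r-xxxxx naming used by MapReduce reduce outputs:

hadoop fs -ls /hdpcd/output/post21_1 | grep part- | wc -l     # should print 4
hadoop fs -ls /hdpcd/output/post21_2 | grep part- | wc -l     # should print 6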

We can conclude this tutorial by saying that both the SET and PARALLEL commands work as expected. It is up to you which one to use during the certification exam. If you want my opinion, I would go with the SET command, as I would not have to worry about where to put the PARALLEL clause in the pig script.

I hope this tutorial is helpful and makes sense to you.

Please reach out to me for any help that you might need.

Please follow my blog for further updates. You can check out my LinkedIn profile here. Please subscribe to my YouTube channel here. You can like my Facebook page here and follow me on Twitter here.

Hope to see you soon and in the next tutorial, we are going to perform the most-awaited JOIN operation. Stay tuned.

Cheers!
