Hello, everyone. Thanks for coming back to one more tutorial in this HDPCD certification series. In the last tutorial, we saw how to remove the duplicate tuples from a pig relation. In this tutorial, we are going to see how to specify the number of reduce tasks for a Pig MapReduce job.
Let us get started then.
There are two ways of doing this. Both are shown in the following infographic.
As you can see from the above figure, we can set the number of reduce tasks at two levels: the first is the Global Level and the second is the Task Level.
We will look at each level one by one. Let us get into the details now.
- CREATING INPUT CSV FILE IN LOCAL FILE SYSTEM
We are going to use the traditional approach that we have used in most of these tutorials, i.e. the vi editor.
Please use the following commands to create this file in local file system.
PASTE THE COPIED CONTENTS HERE
The following screenshot gives an idea about the execution of the above commands.
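If you prefer the commands in text form, here is a minimal sketch of the same step done non-interactively. Please note that the five rows below are hypothetical placeholder data; the actual contents of post21.csv are the ones shown in the screenshot.

```shell
# In the tutorial this file is created interactively with: vi post21.csv
# A heredoc achieves the same result non-interactively.
# NOTE: these five seven-column rows are hypothetical placeholders,
# not the real contents of post21.csv.
cat > post21.csv <<'EOF'
1,Alice,Smith,IT,NY,USA,75000
2,Bob,Jones,HR,LA,USA,82000
3,Carol,Lee,Finance,SF,USA,91000
4,Dave,Khan,IT,TX,USA,68000
5,Eve,Wong,Sales,WA,USA,77000
EOF

# Verify the file contents
cat post21.csv
```

The seventh column is numeric here so that the ORDER BY $6 DESC step later in the tutorial has something sensible to sort on.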
- PUSHING INPUT CSV FILE FROM LOCAL FILE SYSTEM TO HDFS
Once the file is created successfully in the local file system, we push it to HDFS. We are going to use the following commands for executing this task.
hadoop fs -mkdir /hdpcd/input/post21
hadoop fs -put post21.csv /hdpcd/input/post21
hadoop fs -cat /hdpcd/input/post21/post21.csv
The above commands’ execution is shown in the following screenshot.
From the above screenshot, we can see that the file was successfully pushed to HDFS.
- CREATING PIG SCRIPT
Now is the time to create the pig script.
We are going to create two pig scripts in this case, as already mentioned in the introductory part.
Just a tip: while performing this task in the certification exam, please pay close attention to the question and use only the approach it asks for.
If the question does not mention which one to use, then please go with the GLOBAL APPROACH, i.e. the SET command.
The first pig script is going to use the SET command to define the number of parallel reduce tasks.
We can use the following commands to create this file.
PASTE THE COPIED CONTENTS HERE
The following screenshot gives an idea about the execution of these commands.
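Assembling the LOAD, SET, ORDER, and STORE statements discussed later in this post, the first script should look roughly like this (treat it as a sketch, not the verbatim file):

```pig
-- Script 1: set the number of reduce tasks globally
SET default_parallel 4;

-- Load the comma-delimited input file
input_data = LOAD '/hdpcd/input/post21/post21.csv' USING PigStorage(',');

-- Sort by the seventh column in descending order
sorted_data = ORDER input_data BY $6 DESC;

-- Store the sorted output using ':' as the field delimiter
STORE sorted_data INTO '/hdpcd/output/post21_1' USING PigStorage(':');
```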
Now that we have created the first pig script, let us create the second script as well. We will then compare the two scripts, since they differ by only a single line.
Talking about the second pig script, I have uploaded it to my GitHub profile under the HDPCD repository with the name “28_PARALLEL_multiple_reducers.pig”. You can download this file by clicking here, and it looks as follows.
The same commands as shown above are used to create this pig script.
PASTE THE COPIED CONTENTS HERE
The following screenshot shows this process of creating the pig script.
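Assembled from the statements discussed later in this post, the second script should look roughly like this (again, a sketch rather than the verbatim file):

```pig
-- Load the comma-delimited input file
input_data = LOAD '/hdpcd/input/post21/post21.csv' USING PigStorage(',');

-- Sort by the seventh column, requesting 6 reduce tasks
-- for this operation only via the PARALLEL clause
sorted_data = ORDER input_data BY $6 DESC PARALLEL 6;

-- Store the sorted output using ';' as the field delimiter
STORE sorted_data INTO '/hdpcd/output/post21_2' USING PigStorage(';');
```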
Now that we have created both pig scripts, let us look at the commonalities and differences between the two.
The common part between these two scripts is as follows.
input_data = LOAD '/hdpcd/input/post21/post21.csv' USING PigStorage(',');
The above LOAD command is used for loading the data from post21.csv. The input data is stored in the input_data pig relation.
Now, let us look at the differences between these two scripts.
PIG SCRIPT 1: SET default_parallel 4
PIG SCRIPT 1: sorted_data = ORDER input_data BY $6 DESC;
PIG SCRIPT 2: sorted_data = ORDER input_data BY $6 DESC PARALLEL 6;
As you can see from the above code snippet, the first script uses the global SET command to define the number of reduce tasks, in this case 4. No matter what else you do in the script, if you have set the number of reduce tasks globally, that many part files will eventually be created.
In the case of the second script, on the other hand, we specify the number of reduce tasks for one particular operation, in this case the ORDER operation. This makes the pig script launch 6 reduce tasks in parallel, creating 6 part files in the output HDFS directory.
I hope this explanation makes sense and gives you an overall idea of how parallelism is controlled in pig script execution.
Lastly, the output is stored in an output HDFS directory with the help of the STORE command. We are using a different output directory for each of the two scripts, and the field delimiter also differs between them.
The STORE command looks as follows for these two scripts.
SCRIPT 1: STORE sorted_data INTO '/hdpcd/output/post21_1' USING PigStorage(':');
SCRIPT 2: STORE sorted_data INTO '/hdpcd/output/post21_2' USING PigStorage(';');
This completes the creation of the pig scripts.
Now is the time to execute both of these pig scripts. We will see the execution of these two scripts one by one.
- PIG SCRIPT EXECUTION – 1
You can use the following command to run the first pig script post21_1.pig.
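Assuming post21_1.pig is in the current local directory, the command is simply the standard pig launcher invocation:

```shell
pig post21_1.pig
```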
The execution of this command looks as follows.
And the execution of this script looks as follows.
From the above output screenshot, we can see that this pig script execution was successful. We did not lose any records and a total of 5 records were written to the HDFS directory /hdpcd/output/post21_1.
This completes the execution part of the first pig script.
- PIG SCRIPT EXECUTION – 2
Coming to the second pig script execution, it is similar to the first one. You can use the following command to run this pig script.
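Assuming the second script was saved locally as post21_2.pig (a hypothetical name; adjust it to whatever you named the file), the command would be:

```shell
pig post21_2.pig
```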
The execution of the above command looks as follows.
And the output of this script is as follows.
As you can see from the above screenshot, this operation was also successful and the output HDFS directory /hdpcd/output/post21_2 is loaded with 5 records.
This completes the pig script execution.
Now let us go to HDFS and see how many part files are created in each HDFS directory.
- OUTPUT HDFS DIRECTORY – 1
We can use the following commands to view the content of the first output HDFS directory.
hadoop fs -ls /hdpcd/output/post21_1
hadoop fs -cat /hdpcd/output/post21_1/*
The above commands give the following output.
As we can see, a total of 4 part files and a _SUCCESS file were created, as expected. This confirms that the SET default_parallel 4 command worked perfectly.
- OUTPUT HDFS DIRECTORY – 2
The following commands are used for checking the second output HDFS directory.
hadoop fs -ls /hdpcd/output/post21_2
hadoop fs -cat /hdpcd/output/post21_2/*
The screenshot of the above commands’ execution is as follows.
The above screenshot shows that a total of 6 part files and a _SUCCESS file were created after executing the second pig script, as expected.
With this, we can conclude the tutorial by saying that both the SET and PARALLEL commands work as expected. It is up to you which one you want to use during the certification exam. If you want my opinion, I would go with the SET command, as I would not have to worry about where to put the PARALLEL keyword in the pig script.
I hope this tutorial was helpful and made sense to you.
Please reach out to me for any help that you might need.
Hope to see you soon and in the next tutorial, we are going to perform the most-awaited JOIN operation. Stay tuned.