This is the first post in the Data Transformation category, which is essential for clearing the HDPCD certification offered by Hortonworks Inc.
In the last eight tutorials, we focused on Data Ingestion tasks. The next twenty-one tutorials (yes, that's right, twenty-one, including this one) will focus on the Data Transformation category of the certification. I know it might sound like a daunting task at first, but trust me, if you do all these tasks sincerely and one at a time, you should be able to clear the certification comfortably.
We are going to take baby steps and focus on one task at a time. So, let us start with the first objective in this category.
In this tutorial, we are going to write a simple Pig script and run it. We will discuss how to execute it, as there are several ways to execute a Pig script, each of them available in different execution and file system modes. So sit back and pay close attention to this tutorial. 🙂
The example here that we are looking at is to print the current working directory. Yeah, that’s right. The objective here is to execute the pig script, therefore we will focus on execution of the pig script instead of what we are going to write in the pig script.
The command used to print the current directory in Pig is the pwd command, typed at the grunt prompt.
The expected output of this command is the default present working directory for that mode.
So, let us see how we can execute this command.
There are eight ways to execute this command, split into two categories of four modes each.
- Log into Pig Terminal and execute pwd command
  - Login using Local Mode
  - Login using MapReduce Mode
  - Login using Tez Mode
  - Login using Local Tez Mode
- Run Pig Script containing this pwd command
  - Run Pig Script in Local Mode
  - Run Pig Script in MapReduce Mode
  - Run Pig Script in Tez Mode
  - Run Pig Script in Local Tez Mode
These eight options can be shown with a simple infographic as follows.
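The same two-by-four grid can also be sketched in shell. This is only an illustration, not output from Pig: the function name list_pig_commands is made up for this post, print.pig is the script file created later in this tutorial, and mapreduce is spelled out with -x even though, as explained below, leaving -x out normally selects the same default mode.

```shell
#!/bin/sh
# Print the eight command lines: four execution modes x two ways of running.
# list_pig_commands is a hypothetical helper name used only for this sketch.
list_pig_commands() {
    for mode in local mapreduce tez tez_local; do
        echo "pig -x $mode"                 # log into the grunt shell
    done
    for mode in local mapreduce tez tez_local; do
        echo "pig -x $mode -f print.pig"    # run a script file in one shot
    done
}

list_pig_commands
```

Nothing here starts Pig; the sketch only echoes the command lines that the rest of this tutorial walks through.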
We are going to see these eight modes of Pig script execution one by one. As you might guess, all eight print the same kind of output on the terminal window (the working directory for the mode in use), and that output is not our focus in this tutorial. This tutorial focuses on knowing the different modes and ways in which we can execute a Pig command or Pig script. So, let us get started.
- LOGIN USING LOCAL MODE
The command used to log into the Pig terminal in local mode is as follows.
pig -x local
This will launch the grunt shell in local mode, as shown in the screenshot below.
Once you are in the grunt shell, execute the pwd command to print the current directory on the Pig terminal window.
This command prints the output as shown in the screenshot below.
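If you only want the output of a single command, Pig's -e (execute) option passes the quoted text to Pig as if it had been typed at the grunt prompt, so you do not have to stay in the interactive shell. The sketch below wraps that in a small function; the name grunt_once is an assumption made for this example and is not part of Pig.

```shell
#!/bin/sh
# Hypothetical helper, not part of Pig: run one grunt command and return.
grunt_once() {
    mode="$1"           # execution mode, e.g. local
    command_text="$2"   # grunt command to run, e.g. pwd
    # -x selects the execution mode, -e executes the given command text
    pig -x "$mode" -e "$command_text"
}

# Usage (requires a working pig client on the PATH):
# grunt_once local pwd
```

The usage line is left commented out so that the snippet can be sourced on a machine without Pig installed.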
This completes the first mode which we used to execute our pig command.
Let us continue to the second mode.
- LOGIN USING MAPREDUCE MODE
This is the default mode when you run any Pig script or query. The default can be changed by tweaking the Pig configuration file, which is out of the scope of this tutorial. To log into Pig in MapReduce mode, you run the pig command on its own, with no -x option.
Since it is the default mode, you don’t have to specify the mode name, as you did in the case of Local Mode. Pig will automatically consider the mode as the MapReduce mode and will launch the grunt shell using this mode.
Once you are in the grunt shell, just execute the same pwd command to generate the output.
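It gives the same kind of output as the local mode, as we are running the same command, although the path now refers to the working directory in HDFS rather than on the local file system.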
This concludes the Pig Execution using MapReduce Login Mode.
Let us see what the third mode has in store for us.
- LOGIN USING TEZ MODE
Tez is one of the modes in which you can run a Pig script. More broadly, Tez is an execution engine that can be used in place of the MapReduce engine.
One of the most important advantages of Tez mode over the default MapReduce mode is that Tez is usually much faster when it comes to running complex queries and performing operations on Big Data. The simple explanation is that a complex query in MapReduce becomes a chain of separate jobs, each writing its intermediate results to disk, whereas Tez runs the whole query as a single graph of tasks and avoids most of those intermediate writes.
We are going to use the following command to start the grunt shell using the Tez execution mode.
pig -x tez
It will start the grunt shell using the Tez execution engine, which is evident from the log entries that are going to get printed on the terminal window.
Then, again, we use the same command to print the current directory in the grunt shell window.
Run the pwd command and you will see the same kind of output as in the last case.
The performance gain of the different modes won't be visible with commands like "pwd". To actually see the performance gain, you must run some complex queries over a large amount of data.
This completes this mode.
In the next mode, we will log into the Local Tez mode and run the same command.
- LOGIN USING LOCAL TEZ MODE
This is an extension of the Tez mode. It is similar to the Tez mode, but instead of working on the data present in HDFS, this mode works on the data present in the local file system.
You run the following command to log into the Local Tez mode.
pig -x tez_local
Once you run this command, you get the same grunt prompt. You run the pwd command to get the output on the grunt command prompt.
This completes all four modes in the first category, "Log into Pig Terminal and execute pwd command".
We will now see how to run these Pig commands without actually logging into the Pig grunt shell. In these modes, we write all the Pig commands in a Pig script file and then run that script file in one shot.
Before we look at the modes we can use to run a Pig script file, let us create a simple script file named print.pig with the vi command, write the pwd command in it, and save it.
We use the following commands to perform the above activities.
The following screenshot shows how these commands work.
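If you would rather not open an editor for a one-line file, the same script can be created from the shell. This is just an alternative sketch, not the vi session described above; it assumes you are in a directory where you are allowed to write print.pig.

```shell
#!/bin/sh
# Write the one-line Pig script without opening an editor.
printf 'pwd\n' > print.pig

# Show the file to confirm that it contains only the pwd command.
cat print.pig
```

Either way, the result should be a file named print.pig whose only content is the pwd command.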
We will use this script for all four of the coming modes. Let us see those modes one at a time.
- RUN PIG SCRIPT IN LOCAL MODE
We use the following command to run the print.pig file in the local mode.
pig -x local -f print.pig
Once you run this command, it prints several lines of log entries and then finally prints the output on the terminal window.
The following screenshot might help to clear up any doubts about how this command executes.
If you compare the output of this command with the output of the commands we ran previously, you will notice that Pig behaves as if it had logged into the grunt shell itself: it runs the commands present in the print.pig file on the fly and prints the output returned by those commands.
This concludes the pig local mode of script execution.
Let us have a look at the next mode of script execution.
- RUN PIG SCRIPT IN MAPREDUCE MODE
As already mentioned, MapReduce is the default mode for Pig command execution, so we do not have to pass the mode name while running these commands. Please have a look at the following command, which we are going to run in this mode.
pig -f print.pig
As you can see, we have not mentioned any mode in the above command, so Pig will automatically consider MapReduce as the mode type and run the commands present in the print.pig file in MapReduce mode.
You can see the output of the above command as follows.
Now, let us see how a Pig script runs in the Tez mode.
- RUN PIG SCRIPT IN TEZ MODE
We use the following command to run the pig script in the Tez mode.
pig -x tez -f print.pig
This runs the script in the Tez mode and gives the output shown in the following screenshot, which is no different from the output in the last mode.
Finally, let us see the last mode in this tutorial.
- RUN PIG SCRIPT IN LOCAL TEZ MODE
We use the following command to run the Pig script in Local Tez mode.
pig -x tez_local -f print.pig
The above command executes in the same way as the command in the last mode. You can see that from the following output.
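To recap the four script modes in one go, the commands above can be wrapped in a small loop. This is a rough sketch under a few assumptions: run_in_all_modes is a hypothetical name chosen for this post, print.pig is in the current directory, and the MapReduce and Tez runs need a working cluster behind the pig client.

```shell
#!/bin/sh
# Hypothetical helper, not part of Pig: run one script file in each of the four modes.
run_in_all_modes() {
    script="$1"
    for mode in local mapreduce tez tez_local; do
        echo "== mode: $mode =="
        # -x picks the execution mode, -f names the script file to run
        pig -x "$mode" -f "$script" || echo "pig exited with an error in $mode mode" >&2
    done
}

# Only attempt the run when a pig client is actually installed.
if command -v pig >/dev/null 2>&1; then
    run_in_all_modes print.pig
fi
```

Each run prints its own log entries followed by the working directory for that mode, so the labelled sections make it easy to compare the four runs side by side.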
We have finally reached the end of this tutorial. I hope this is useful information for the upcoming tutorials on Apache Pig.