Post 17 | HDPCD | Storing Pig Relation in HDFS Directory

Thanks for coming back for the next tutorial in the HDPCD certification series. In the last tutorial, we saw how to remove records with NULL values, while in this tutorial, we are going to see the process of storing the output of a Pig relation in an HDFS directory.

This is one of the simplest tasks in this certification and, therefore, I have kept this tutorial as simple as possible. We are going to store the input file, as is, in an HDFS directory, without performing any other operation on it.

We are going to follow this step-by-step process.

workflow: storing a Pig relation in an HDFS directory

If you take a close look at the picture above, you will see that we are not doing anything to the input file; we are storing it directly into an HDFS directory.

We are going to perform these tasks one by one as shown below.

  • INPUT FILE CREATION IN LOCAL FILE SYSTEM

So, let us take a look at the input file. The input file looks as follows. I have uploaded this file to my GitHub profile under the HDPCD repository, and you can download it by clicking here.

We can use the vi editor to create this file. The following commands help us do that.

vi post17.txt

#####

PASTE THE CONTENTS HERE

#####

(esc):wq(enter)

cat post17.txt
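As a non-interactive alternative to vi, the same file can be created with a shell heredoc. The sample lines below are placeholders only; the actual contents of post17.txt come from the GitHub repository.

```shell
# Create post17.txt without opening an editor.
# NOTE: the two lines below are placeholder data -- paste the real
# contents of post17.txt from the HDPCD repository here instead.
cat > post17.txt <<'EOF'
sample line 1
sample line 2
EOF

# Confirm what was written.
cat post17.txt
```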

The following screenshot gives an idea of the above command execution.

input file

Once we have the file in the local file system, it is time to push this file to HDFS.

  • PUSH INPUT FILE TO HDFS

We can push this input file from the local file system to HDFS with the help of the put command. The commands are as follows.

hadoop fs -mkdir /hdpcd/input/post17

hadoop fs -put post17.txt /hdpcd/input/post17

hadoop fs -cat /hdpcd/input/post17/post17.txt

The execution of these commands is shown in the screenshot below.

input file to HDFS

As you can see, the file is loaded successfully into HDFS.

The next step is to create the Pig script that will store this input file into an HDFS directory.

  • PIG SCRIPT CREATION

Let us create the Pig script for achieving the objective of this tutorial. I have uploaded this Pig script to my GitHub profile under the HDPCD repository, and you can download it by clicking here. The Pig script looks as follows.
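Assembled from the two statements explained below, the whole of post17.pig reads:

```pig
-- post17.pig: load the input file and store it back out to HDFS unchanged
input_data = LOAD '/hdpcd/input/post17/post17.txt' USING PigStorage() AS (line:chararray);
STORE input_data INTO '/hdpcd/output/post17';
```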

Let us go through each command in the above Pig script.

input_data = LOAD '/hdpcd/input/post17/post17.txt' USING PigStorage() AS (line:chararray);

The above command loads the input file into a Pig relation called input_data. Each line of the file is represented by a field called line, with the datatype chararray.

STORE input_data INTO '/hdpcd/output/post17';

The above STORE command is the one we are studying in this tutorial. STORE writes the data held in a Pig relation out to an HDFS directory. You can specify a field delimiter with the USING PigStorage() clause, which I have not used in the above command, so the default tab delimiter applies.
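As an illustration of that delimiter option, not part of this tutorial's script, a relation with multiple fields could be written out comma-separated like this (the output path here is hypothetical):

```pig
-- Hypothetical example: write fields separated by commas instead of
-- the default tab. With our single chararray field, no delimiter
-- would actually appear in the output.
STORE input_data INTO '/hdpcd/output/post17_csv' USING PigStorage(',');
```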

Hope the explanation makes sense.

It is time to run the above script.

  • RUN PIG SCRIPT

We must run this script against HDFS rather than in local mode, as we have to store the output in an HDFS directory. We can use either the default MapReduce mode or the Tez mode. To make this script run faster, we are going to use the Tez mode and run the following command.

pig -x tez post17.pig

Let us take a look at the execution of the above command.

running pig script

The following screenshot shows the output of the command we ran.

pig script output

As can be seen from the above screenshot, the Pig script ran successfully and stored its output under the HDFS directory /hdpcd/output/post17. Let us inspect this output directory.

  • OUTPUT OBSERVATION

We will use the following commands to check the output directory.

hadoop fs -ls /hdpcd/output/post17

hadoop fs -cat /hdpcd/output/post17/part-m-00000

The following screenshot gives us an idea about the output.

HDFS output directory

As you can see from the above screenshot, the output file part-m-00000 contains exactly the contents of the input file. This concludes the tutorial.

Hope the text and the screenshots make sense. Kindly comment and share it with your friends and network. In the next tutorial, we are going to see how to store a Pig relation in a Hive table.

Please follow my blog for further updates. Kindly click here to like my Facebook page. You can follow me on Twitter here. You can subscribe to my YouTube channel by clicking here to get updates regarding the video tutorials. You can check out my LinkedIn profile here.

Cheers!
