Post 12 | HDPCD | Load data from Hive to Pig

Hello, everyone. Thanks for coming back! I hope these tutorials are inspiring you to take each task seriously and to understand why we perform each step.

In the last tutorial, we saw how to create a Pig relation with a defined schema. This tutorial is also about creating a Pig relation, but instead of loading the data from a flat file, we will load it from an existing Hive table. So, let us take a quick look at the steps involved in this operation.

The picture below gives you a clear idea of the steps we are going to follow to load data from Apache Hive into Apache Pig.

Load data from Hive to Pig

The first thing we are going to do is check the Hive table and its schema. The schema tells us what structure the imported Pig relation will have. We will log into the Hive shell and look at the schema of the products table, since that is the table whose data we want to import into the Pig relation.

We use the following command to enter the Hive shell and look for the products table.

hive

Once we are in the Hive shell, we can run the following command to get the list of tables.

show tables;

For your reference, the following screenshot shows the output of the commands above.

Hive tables list

Once we confirm that the products table exists in the Hive database, let us look at its structure. To do this, we can use the describe and select commands shown below.

describe products;

select * from products limit 10;

The following screenshot shows the output of the above two commands.

Table structure and data sample

Once we know the table structure and have seen some sample data, it is time to write a Pig script that imports this Hive data into a Pig relation.

The script file to load data from Apache Hive into Apache Pig is uploaded to my GitHub profile and looks as follows.
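For convenience, here is the complete post12.pig script, reassembled from the three statements that are explained one by one below:

```pig
-- Load the products table from Hive into a Pig relation via HCatalog
hive_data = LOAD 'products' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Show the schema of the relation
DESCRIBE hive_data;

-- Print the contents of the relation to confirm the load worked
DUMP hive_data;
```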

Now, let us go through each line to understand what is going on here.

hive_data = LOAD 'products' USING org.apache.hive.hcatalog.pig.HCatLoader();

EXPLANATION: The above line is the heart of this tutorial's objective. It loads the data from the products table in Hive into a Pig relation called hive_data. As you can see, a loader class is involved in this import operation; its fully qualified name is "org.apache.hive.hcatalog.pig.HCatLoader".

The above-mentioned class resides in one of the jar files in the HCatalog directory, and when you run the above statement, that jar file is used to execute the operation. This is the sole reason we run the post12.pig file with the -useHCatalog flag.

DESCRIBE hive_data;

EXPLANATION: As you might be aware by now, the DESCRIBE command is used for viewing the columns and data types of the Pig relation hive_data.

DUMP hive_data;

EXPLANATION: The DUMP command prints the contents of the hive_data Pig relation. This command is not required as part of the objective, but we execute it anyway to confirm that the Hive table's data was loaded into the Pig relation successfully.

We use the vi command to create this script file in the terminal window. Once the contents of the Pig script file are written, we run the cat command to verify that the file was created successfully. The following screenshot gives you a clear idea of this.
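If you prefer not to type the script interactively in vi, the same file can be created non-interactively with a heredoc; this is just an equivalent shortcut, not a required step:

```shell
# Create post12.pig without opening an editor; the three statements
# match the script walked through above.
cat > post12.pig <<'EOF'
hive_data = LOAD 'products' USING org.apache.hive.hcatalog.pig.HCatLoader();
DESCRIBE hive_data;
DUMP hive_data;
EOF

# Verify the file was created correctly, as the tutorial does with cat
cat post12.pig
```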

Creating the Pig script file

Once the Pig script is ready, we can run it. Let us see what happens if we use the traditional pig command, without any extra flags, to run this script.

Error in Pig script

As you can see from the above screenshot, if you don't use the -useHCatalog flag with the pig command, the command fails with the error "Could not resolve org.apache.hive.hcatalog.pig.HCatLoader using imports". This error clearly indicates that Pig was not able to find the jar files required to kick off the HCatalog functionality.

To resolve this issue, we should run the above command with the -useHCatalog flag. Once we use this flag, Pig picks up the jar files required to run the HCatalog services and import the Hive data into the Pig relation. For your reference, the following is the correct command used for this tutorial.

pig -useHCatalog -f post12.pig

The following screenshot shows that the script ran successfully and we got to see the output as well.

Pig script run

The following is the output of the DUMP command.

Dump of the Pig relation

And the structure of the Pig relation looks as follows.

Describe of the Pig relation

The above screenshot shows that we got the expected output.

I hope these tutorials are helping you understand the requirements for clearing the certification. In the next tutorial, we are going to see how to format data into a specified format using a Pig relation.

Please stay tuned for further updates.

You can click here to subscribe to my YouTube channel. Please like my Facebook page here and follow on twitter here.

Thanks for having a read.

Cheers!
