Wassup everyone. Thanks for coming back once again. This section is coming to an end with less than 10 tutorials remaining. Once we are done with this section (Apache Pig), we will start with the next section which focuses on Apache Hive.
In the last tutorial, we saw the process of storing the data stored in pig relation into an HDFS directory. This tutorial focuses on storing the same data in the hive table, instead of storing it in HDFS.
Let us start with this tutorial then.
We are going to follow the below steps to perform this activity.
We are going to refer to the above image for performing the objective of this tutorial.
Let us start, then.
- INPUT FILE CREATION IN HDFS
As always, we are going to use the vi editor to create this input file. I have uploaded this input file to my GitHub profile under HDPCD repository with name “19_input_pig_to_hive.csv” and you can download this file by clicking here. This input file looks as follows.
Once you download this file, please use the following commands to create it in local file system.
COPY & PASTE THE DOWNLOADED CONTENTS
The following screenshot shows how the above commands work.
Above screenshot confirms that the file was created in the local file system successfully.
Now, it is time to put this file in HDFS. For doing this, we are going to use the following set of commands.
hadoop fs -mkdir /hdpcd/input/post18
hadoop fs -put post18.csv /hdpcd/input/post18
hadoop fs -cat /hdpcd/input/post18/post18.csv
The following screenshot is there to help you with the output of the above commands.
The above screenshot confirms that the file was pushed to HDFS successfully and now we can proceed to the Pig Script part of the tutorial.
- PIG SCRIPT CREATION
Now is the time to create the pig script.
We are going to use the vi editor to create this pig script. I have uploaded this input file to my GitHub profile under HDPCD repository with name “20_pig_to_hive.pig” and you can download this script file by clicking here. This pig script looks as follows.
Let me explain the commands in the above pig script.
input_data = LOAD ‘/hdpcd/input/post18/post18.csv’ USING PigStorage(‘,’) AS (station_name:chararray, year:int, month:int, dayofmonth:int, precipitation:int, temperature_max:int, temperature_min:int);
The above command is used for storing the data stored in post18.csv HDFS file into a pig relation called input_data. As you can see, the custom schema is passed while creating this pig relation. The schema contains a total of 7 columns as present in the input file, with the defined data types for each column.
STORE input_data INTO ‘sfo_weather’ USING org.apache.hive.hcatalog.pig.HCatStorer();
This command is responsible for storing the data stored in the pig relation input_data into the hive table with name sfo_weather. For doing this, we are using one of the classes present in the HCatalog library. The fully qualified class name of this class is “org.apache.hive.hcatalog.pig.HCatStorer()“.
Now, as the script is clear, let us try to create it using the vi editor.
We use the following set of commands.
COPY & PASTE THE DOWNLOADED CONTENTS
The following screenshot shows the output of the above commands.
- RUNNING PIG SCRIPT – I
We are going to use the “TEZ mode” for running this pig script.
pig -x tez post18.pig
The below screenshot shows the output of this command.
As you can see, Pig was not able to identify the class we have given in the script file. The simple reason behind this is that we must pass -useHCatalog flag while executing this script.
- RUNNING PIG SCRIPT – II
Since the reason of the failure is clear to use, let us try to incorporate and execute this pig script with the “-useHCatalog” flag. We will use the following command to run the pig script.
pig -x tez post18.pig -useHCatalog
The below screenshot shows us the output of the above command.
As you can see from the above screenshot, the pig script failed again to execute. The reason behind this is the non-existence of the hive table. HIVE TABLE MUST EXIST WITH EXACT SCHEMA to run this pig script.
Therefore, now it is time to create the hive table with the same schema as the input_data pig relation.
- CREATING HIVE TABLE
I have uploaded this hive table creation command to my GitHub profile under HDPCD repository with name “21_hive_table_creation.hql” and you can download this script file by clicking here. This command looks as follows.
Before running this command, let us check whether sfo_weather table already exists or not. For doing this we will log into hive terminal window by running the “hive” command and then run “show tables;” command (do not forget the semicolon at the end).
AFTER LOGGING INTO HIVE TERMINAL WINDOW
hive> show tables;
The following screenshot might be able to help you out regarding this.
As you can see the default database in hive does not contain the sfo_weather table.
It means, we can go ahead and create the sfo_weather table.
Following is the screenshot of the hive create table command.
It is shown that the hive table was created successfully.
Now seems to be a good time to finally execute the pig script with some expectations about the output.
- RUNNING PIG SCRIPT – III
We will use the same command as already mentioned to run this pig script.
pig -x tez post18.pig -useHCatalog
The following screenshot shows the initial execution of the above command.
And the output of this command looks as follows.
As you can see from the above image, this operation is SUCCESSFUL.
A total of 5 records were imported from the input file post18.csv and then hive table sfo_weather was loaded with those 5 records. Please note the timestamp of this execution as we are going to run this command one more time in some time. It is 12:10 PM on April 29th.
Let us check the structure and the total number of records in the hive table.
- OBSERVE THE OUTPUT IN HIVE TABLE I
We will use the following command to observe the output in the hive table.
hive> select * from sfo_weather;
The following is the output of this above command.
As you can see that there are a total of 5 records in the hive table and the structure suggests that there are a total of 7 columns in this hive table.
This clearly indicates that the command ran successfully and it did the job what it was supposed to do.
Now, you may think that what if the hive table already contains some records before running this command. Will the existing records be overwritten or appended?
It’s time to find it out then.
- RUNNING PIG SCRIPT – IV
We will execute the pig script one more time to see it’s effect on the hive table.
The following is the output of the pig script.
As you can see from the above image, this operation like the older one is also SUCCESSFUL.
A total of 5 records were imported from the input file post18.csv and then hive table sfo_weather was loaded with those 5 records. The timestamp of this execution seems to be later than the previous operation, as expected. It is 12:15 PM on April 29th.
So, what’s the deal with the hive table? Let us find out.
- OBSERVE THE OUTPUT IN HIVE TABLE II
We will log into the hive terminal and see the total number of records in the hive table sfo_weather.
As you can see the total number of records increased from 5 to 10. It indicates that the new records were appended to the existing ones and there is no overlapping of the information.
We can conclude this tutorial here.
Before closing, let us take a look at the conclusion section below.
- THE CONCLUSION
We can conclude the following key findings from this tutorial.
- We must use the -useHCatalog flag to execute the pig script.
- Hive table must exist before running this operation.
- If the hive table already exists (with or without any records), the new records will be appended to the existing records and no overwriting will occur.
Hope you guys liked the content. Please follow my blog for the further updates. You can like my Facebook page here, follow me on twitter here. Please check out my LinkedIn profile here and subscribe to my YouTube channel here.
See you soon.