Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.
In the last tutorial, we saw how to create the hive external table. In this tutorial, we are going to see how to create a partitioned Hive table.
For doing this, we are going to follow the following process.
As you can see from the above infographics, it follows the same process as we have seen already in the previous tutorials.
Let us start off with the partitioned table creation in Apache Hive.
- CHECK WHETHER PARTITIONED HIVE TABLE ALREADY EXISTS
We do check the pre-existence of the partitioned Hive table with the help of “show tables;” command.
The following screenshot shows the output of the above command.
As you can see from the above screenshot, the table post31 does not exist, therefore we are free to go ahead and create the partitioned Hive table post31.
- TABLE DEFINITION FOR THE PARTITIONED HIVE TABLE
Now, it is very important to decide on the table definition, especially in the case of Partitioned Hive Table. In this case, we need to think about the columns on which the table is going to be partitioned and accordingly place those column names in the PARTITIONED BY clause of the Hive CREATE command syntax.
As you have seen in the previous tutorial, we are using the student information to create the Hive tables. In this case, as well, we are going to create Hive tables by considering that we are going to store the student information.
Let me give you an idea about the input data file.
I have uploaded this input file to my GitHub Profile under HDPCD repository with name “42_input_partition_hive_table.csv“. You can download this file by clicking here and it looks something like this.
As you can see from the above file snippet, the input file contains a total of 5 columns. These columns are as follows.
Column 1: ID (Integer)
Column 2: Name (String)
Column 3: Gender (String)
Column 4: Year (Integer)
Column 5: Month (Integer)
Now, our aim is to create a Partitioned Hive Table in such a way that the first three columns i.e. Column 1, 2, and 3 should be the main columns of the table whereas the rest i.e. Columns 4 and 5 should be the partitioned columns of the Hive table.
You might ask, what is the need of the Partitioned Columns or Tables. In the simplest words, we can answer this question by saying that if a column is Partitioned, then it takes less time to execute a WHERE CLAUSE as compared to the non-Partitioned columns.
If this is clear, then let us design the CREATE TABLE command for the Partitioned Hive Table post31.
I have uploaded this CREATE COMMAND to my GitHub Profile under HDPCD repository with name “43_hive_partitioned_table.sql“. You can download this file by clicking here and it looks something like this.
If you have a close look at the above CREATE command, you will see that the NON-PARTITIONED COLUMNS i.e. ID, Name, and Gender are in the main body of the CREATE command, whereas the PARTITIONED COLUMNS i.e. YEAR and MONTH are in the PARTITIONED BY clause of the CREATE command.
This tells Hive which columns to treat PARTITIONED COLUMNS and which to not.
Rest of the commands are self-explanatory similar to the last tutorial.
Now, let us create this post31 table in Hive’s “default” database.
- CREATING HIVE PARTITIONED TABLE – POST31
We are going to use the same command as mentioned above to create the Hive Partitioned Table.
The following screenshot shows the process of the Partitioned Hive Table creation.
As you can see from the above screenshot, the Partitioned Hive Table post31 was successfully created and we got the “OK” message at the end of the command execution.
Once, the table is created, it is time confirm for the same.
- CONFIRMING THE EXISTENCE OF THE PARTITIONED HIVE TABLE – POST31
We are going to use the same “show tables;” command to confirm the existence of the Partitioned Hive Table post31.
The command looks like as follows.
The execution of the above command gives us the following output.
As you can see from the above screenshot, the Partitioned Hive Table post31 was successfully created in Hive’s “default” database.
This confirms that the table post31 was created.
Now, it is time to check whether the schema matches correctly or not.
- CHECKING THE SCHEMA OF THE PARTITIONED HIVE TABLE – POST31
As we have seen in the last tutorial, we are going to use “desc” command to check the schema of the Partitioned Hive Table post31.
The command looks like as follows.
And the output of the above command looks as follows.
As you can see from the above screenshot, the schema of the created Hive Partitioned Table post31 matches with the defined schema.
Please note that the information regarding the PARTITIONED COLUMNS is given out separately in the output of the desc command. This confirms that Hive was able to identify the difference between the PARTITIONED COLUMNS and NON-PARTITIONED COLUMNS.
This completes the process of creating the Partitioned Hive Table.
Hope you people like the content. Please share it with your network.
This is going to be the LAST FORTNIGHT of my STUDENT LIFE as I am graduating in a couple of weeks. I wish to spread my learnings to a larger audience and more frequently. Therefore, I will take few decisions once I get a full-time job offer from a good employer.
Hope that day is not much away when I will be able to associate myself with a good company. Fingers crossed. 🙂
You can check out my website at www.milindjagre.com