Post 32 | HDPCD | Defining a Bucketed Hive Table

Hello everyone to the next tutorial in the HDPCD certification series.

In the last tutorial, we saw how to create a Partitioned Hive Table. In this tutorial, we are going to see how to create a Bucketed Hive table.

The process is depicted in the following infographics.

Apache Hive: Creating a BUCKETED table
Apache Hive: Creating a BUCKETED table

As you can see from the above picture, it follows the same process like the previous tutorial, with one additional step.

Let us get started, then.

  • CHECKING WHETHER TABLE ALREADY EXISTS IN HIVE

We have and will use the “show tables;” command to check the pre-existence of the hive table.

show tables;

If you run the above command in the hive terminal window, you will get the list of tables that already exist in the “default” database. The following picture shows the execution of the above command.

Step 1: Checking the pre-existence of hive table
Step 1: Checking the pre-existence of hive table

As you can see from the above screenshot, the table post32 does not exist. This indicates that we can go ahead and create the table with name post32.

  • DEFINING NUMBER OF BUCKETS FOR THE HIVE TABLE

It is the most important step while creating the BUCKETED hive table. As you have seen in the previous tutorial to partition the data based on some column, in this tutorial, we do follow the same concept.

The main difference between PARTITIONING and BUCKETING is that PARTITIONING can be DYNAMIC in nature whereas BUCKETING is always STATIC.

The above sentence indicates that once you define the NUMBER OF BUCKETS, you cannot change it. If you want to change it, then you will have to drop the table and redo the entire process.

Also, defining the number of buckets and the column to be bucketed depend on the data. Let us consider, we want to do bucketing on column ID with values from 1 to 1000. Then, if we decide to go ahead and create 4 buckets, then the distribution of records will be as follows.

Bucket1: 1 to 250
Bucket2: 251 to 500
Bucket3: 501 to 750
Bucket4:  751 to 1000

And once these buckets are formed, if you execute a WHERE query, then it will be redirected to the specific bucket, saving time and resources.

I hope this explains how the buckets work in Hive.

Now, let us start designing the schema for this BUCKETED Hive table.

  • DESIGN SCHEMA FOR THE BUCKETED HIVE TABLE

We are going to store the same student information in this BUCKETED hive table. Therefore we are going to follow the same schema.

Column 1: ID (Int)
Column 2: Name (String)
Column 3: Gender (String)

Bucketed Column: ID

I have uploaded this schema to my GitHub Profile under HDPCD repository with name “44_hive_bucketed_table.sql“. You can download this schema by clicking here and it looks as follows.

As you can see from the above schema design, there is a total of 3 columns: ID, Name, and Gender. The number of buckets is defined with a command like “clustered by (id) into 4 buckets” which indicates that there should be 4 buckets created on the column ID. And the rest of the things are similar to the usual hive tables.

Now, if the schema looks good to you, let us go ahead and execute this schema design.

  • CREATE THE BUCKETED HIVE TABLE USING SCHEMA DEFINED ABOVE

You can copy and paste the above schema on the Hive terminal window.

The execution of the above command looks like this.

Step 2: Creating Hive Bucketed Table
Step 2: Creating Hive Bucketed Table

As you can see from the above screenshot, we are getting “OK” message, which indicates that the table post32 was created successfully and there was no error while doing so.

Let us confirm the existence of the Hive BUCKETED table post32.

  • CONFIRMING EXISTENCE OF HIVE BUCKETED TABLE – POST32

We are going to use the “show tables;” command to confirm the existence of the Hive Bucketed table.

show tables;

The execution of the above command looks like this.

Step 3: Confirming the existence of Bucketed Hive Table
Step 3: Confirming the existence of Bucketed Hive Table

As you can see from the above screenshot, the table post32 was created successfully. This confirms that our command worked fine and the destined “post32” table was created.

Now, it is time to check the schema and see the number of buckets for table post32.

  • CHECKING SCHEMA AND NUMBER OF BUCKETS OF BUCKETED HIVE TABLE

We use an extensive version of the “desc” command to check the full schema along with the number of buckets and other properties of the Hive tables.

This command goes something like this.

desc formatted post32;

The keyword “formatted” tells hive to show the detailed information of the table post32 in the formatted manner or readable format.

The output of the above command looks like this.

 

Step 4: Checking schema and Number of Buckets of Bucketed Hive table
Step 4: Checking schema and Number of Buckets of Bucketed Hive table

As can be seen from the above screenshot, the schema of the post32 hive table matches with the defined schema which we used while creating this table. Also, you can see at the bottom, the number of buckets is shown to be 4 and the column on which bucketing is applied is the ID column.

This confirms that Hive was able to process the information that we passed while creating the hive table. This concludes the process of creating a PARTITIONED HIVE table.

Hope you guys like the contents. Keep supporting my website and blog like this.

I am graduating in a week’s time and looking for full-time opportunities in the field of Hadoop and Big Data Analytics.

You can check out my website at www.milindjagre.com.

 

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Advertisements

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s