Post 45 | HDPCD | Join two Hive tables

 

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to insert a new row into a Hive table.

In this tutorial, we are going to see how to join two Hive tables.

Let us begin, then.

Apache Hive: Performing join operation
Apache Hive: Performing join operation

The above info-graphics show the step by step process of performing the JOIN operation in Apache Hive.

Let’s start doing the JOIN, then.

  • CREATING INPUT CSV FILES IN THE LOCAL FILE SYSTEM

We use the vi editor to create input CSV files in the local file system.

Here, we are going to create two input files, since, we are joining two tables in this tutorial.

The first file, named, post45_a.csv, contains the customer information, such as, customer ID, customer name, and customer gender.

I have uploaded this input CSV file to my GitHub profile under HDPCD repository. The name of this file is “56_first_input_file_for_join.csv“. You can download/save this file by clicking here and it looks as follows.

You can use the following commands for creating this file.

vi post45_a.csv

###############################
# PASTE COPIED CONTENTS HERE #
###############################

cat post45_a.csv

The output of the above commands look as follows.

Step 1: Creating first input file in the local file system
Step 1: Creating first input file in the local file system

As you can see from the above screenshot, input CSV file post45_a.csv was created successfully in the local file system.

We will follow the same process for creating the second input CSV file.

The name of this file is post45_b.csv and it contains the order information by the customers. It contains fields such as customer ID, timestamp, order amount, and the order status.

I have uploaded this input CSV file to my GitHub profile under HDPCD repository. The name of this file is “57_second_input_file_for_join.csv“. You can download/save this file by clicking here and it looks as follows.

You can use the following commands for creating this file.

vi post45_b.csv

###############################
# PASTE COPIED CONTENTS HERE #
###############################

cat post45_b.csv

The output of the above commands look as follows.

Step 2: Creating second input file in the local file system
Step 2: Creating second input file in the local file system

As you can see from the above screenshot, input CSV file post45_b.csv was created successfully in the local file system.

Now, we will move on to the Hive operations.

  • CHECKING PRE-EXISTENCE OF THE HIVE TABLES

We use the “show tables;” command to get the list of tables in the Hive database. This list reveals whether our tables already exist or not.

show tables;

The output of the above command is as follows.

Step 3: Checking pre-existence of the Hive table
Step 3: Checking pre-existence of the Hive table

As you can see from the above screenshot, Hive does not contain tables with names post45_a and post45_b. Therefore, we can go ahead and create those tables with same schema as the input CSV files.

  • CREATING TABLES IN ACCORDANCE WITH THE INPUT CSV FILES

We are going to create the two tables as we want to perform the JOIN operation.

The first table is post45_a, containing customer information and having three columns.

I have uploaded this SQL file to my GitHub profile under HDPCD repository. The name of this file is “58_first_hive_table_for_join.sql“. You can download/save this file by clicking here and it looks as follows.

The output of the above commands look as follows.

Step 4: Creating the first Hive table with customer information
Step 4: Creating the first Hive table with customer information

As you can see from the above screenshot, the Hive table post45_a was created successfully. We will follow the same process for creating the other table post45_b.

The second table is post45_b, containing order information and having four columns.

I have uploaded this SQL file to my GitHub profile under HDPCD repository. The name of this file is “59_second_hive_table_for_join.sql“. You can download/save this file by clicking here and it looks as follows.

The output of the above commands look as follows.

Step 5: Creating second table with order information
Step 5: Creating second table with order information

As you can see from the above screenshot, the Hive table post45_b was created successfully. Now, it is time to confirm that the Hive tables were created and that also with the correct schema.

  • CONFIRM TABLE EXISTENCE AND SCHEMA

We use the same “show tables;” command to check the existence of the Hive tables.

show tables;

The output of the above command is as follows.

Step 6: Confirming that the Hive tables got created successfully
Step 6: Confirming that the Hive tables got created successfully

As you can see from the above screenshot, both post45_a and post45_b tables were created successfully.

Let us check their schema as well. We use the following commands for doing this.

desc formatted post45_a;
desc formatted post45_b;
The output of the above command is as follows.
Step 7: Checking the schema of the customer information table
Step 7: Checking the schema of the customer information table

The above screenshot shows that the Hive table post45_a was created with the correct schema and it is safe to do the data load operation.

 

Step 8: Checking the schema of the order information table
Step 8: Checking the schema of the order information table

The above screenshot shows that the Hive table post45_b was created with the correct schema and it is safe to do the data load operation.

Let us load the data in these Hive tables, now.

  • LOAD INPUT CSV FILES IN HIVE TABLES

We use the “LOAD” command for loading the data into the Hive tables.

load data local inpath ‘post45_a.csv’ into table post45_a;
The output of the above command is as follows.
Step 9: Loading the data into the customer information table
Step 9: Loading the data into the customer information table

The above screenshot shows that the data was loaded successfully into the Hive table post45_a.

load data local inpath ‘post45_b.csv’ into table post45_b;
The output of the above command is as follows.
Step 10: Loading the data into the order information table
Step 10: Loading the data into the order information table

The above screenshot shows that the data was loaded successfully into the Hive table post45_b.

Let us check records stored in these Hive tables.

  • CONFIRM DATA GOT LOADED SUCCESSFULLY

We use the “SELECT” command to check the records in the Hive tables.

select * from post45_a;
The output of the above command is as follows.
Step 11: Checking the data got loaded successfully into the customer information table
Step 11: Checking the data got loaded successfully into the customer information table

The above screenshot indicates that the data was loaded successfully into the Hive table post45_a and there was no data loss.

And for the second table,

select * from post45_b;
The output of the above command is
Step 12: Checking the data got loaded successfully into the order information table
Step 12: Checking the data got loaded successfully into the order information table

The above screenshot indicates that the data was loaded successfully into the Hive table post45_b and there was no data loss.

This brings us to the last step in this tutorial and that is to perform the JOIN operation.

  • PERFORM THE JOIN OPERATION

We use the “JOIN” keyword for performing the JOIN operation. Based on the join type, you can use INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN keywords. For more information, please click here.

In this case, we are performing the JOIN operation based on the CUSTOMER ID which is the first column in case of both the tables.

Our objective is to extract order details of only those customers whose data is present in the customer (post45_a) table. Therefore, we should get order details of customers with Customer ID from 1 to 9 (both inclusive).

The command to do this is as follows.

select a.name,a.gender,b.order_date,b.order_status from post45_a a join post45_b b on (a.id = b.order_id);
The output of the above command is as follows.
Step 13: Performing the JOIN operation between Hive tables - customer and order
Step 13: Performing the JOIN operation between Hive tables – customer and order

As you can see, the above “JOIN” query triggers a MapReduce/TeZ job.

We get the result of this query after this query executes successfully. The output is shown in the above screenshot.

The customers with customer ID from 1 to 9 is shown in the screenshot, as expected, along with their order details.

This enables us to safely say that the objective of this tutorial is met and we can conclude this tutorial here.

In the next tutorial, we are going to see how to run a Hive query using TEZ as the execution engine.

If you liked the content, share it with your friends. Please hit the Like button.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

 

 

Advertisements

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s