Hello everyone, in this tutorial, we are going to see the second objective in the data ingestion category.

The objective is listed on the Hortonworks website under data ingestion and looks like this.

Objective 2

In the previous objective, we imported entire records from a MySQL table, whereas in this post, we are going to import data based on the result of a query. This can be any SQL query that executes successfully.

Our objective today is to import data based on some condition.

Let us look at the input data.

It looks something like this.

MySQL Input Records

From the above screenshot, you can see that we are targeting nine records. These nine records are fetched with the help of the following SQL query.

select * from customers WHERE customer_zipcode=78521

Now, it is time to take a look at the Sqoop command.
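In the original post the command appears only as a screenshot, so here is a hedged transcription of what the seven-line command likely looks like; the JDBC URL, username, and password are assumptions carried over from the previous post and should be adapted to your environment.

```shell
# Hedged reconstruction of the Sqoop import command (connection details are assumptions)
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password hadoop \
--query 'select * from customers WHERE customer_zipcode=78521 AND $CONDITIONS' \
--split-by customer_id \
--target-dir /user/root/custom_query
```

Note that the query is wrapped in single quotes so that the shell does not expand `$CONDITIONS` before Sqoop sees it.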

Let us examine each line individually.

Lines 1, 2, 3, and 4: Same as explained in the previous post.

Line 5: --query 'select * from customers WHERE customer_zipcode=78521 AND $CONDITIONS' \

Explanation: This line contains the query whose output will actually get written to HDFS. You might notice that, along with the query, there is an extra token: "AND $CONDITIONS". This placeholder is required in the --query clause. At run time, Sqoop replaces $CONDITIONS with a range predicate for each mapper so that the import can be split into parallel pieces. Long story short, you must include these two keywords in the WHERE clause whenever you perform a free-form query import using the sqoop import command.
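To make the role of $CONDITIONS concrete, here is a small shell sketch of the substitution Sqoop performs internally; the boundary values in the predicate are invented for illustration.

```shell
# The free-form query as passed to Sqoop, with the $CONDITIONS placeholder intact
query='select * from customers WHERE customer_zipcode=78521 AND $CONDITIONS'

# At run time, Sqoop replaces $CONDITIONS with a per-mapper range predicate
# on the --split-by column (the boundaries below are illustrative):
cond='customer_id >= 1 AND customer_id < 4'
echo "${query/\$CONDITIONS/$cond}"
```

Each mapper receives the same query with a different range substituted in, so together the mappers cover all matching rows exactly once.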

Line 6: --split-by customer_id \

Explanation: The --split-by flag is followed by the name of the column Sqoop uses to divide the rows among the parallel mappers of the MapReduce job that gets triggered. Sqoop finds the minimum and maximum values of this column and slices that range into one interval per mapper, so each mapper imports a distinct portion of the result set. This flag is a must while performing a free-form query import. If you want to skip this option, then you must set the number of mappers to one with the help of the -m or --num-mappers flag, as we saw in the previous post.
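As a rough illustration of how the split-by column drives parallelism, the sketch below mimics the way Sqoop slices the [min, max] range of the column into one interval per mapper; the boundary values and mapper count here are invented for the example, not taken from the actual table.

```shell
# Illustrative only: Sqoop runs SELECT MIN(customer_id), MAX(customer_id)
# and divides that range among the mappers. Assumed boundaries: 1..12, 4 mappers.
min=1; max=12; mappers=4
chunk=$(( (max - min + mappers) / mappers ))   # size of each slice (rounded up)
lo=$min
for i in $(seq 1 "$mappers"); do
  hi=$(( lo + chunk - 1 ))
  [ "$hi" -gt "$max" ] && hi=$max              # clamp the last slice to max
  echo "mapper $i: customer_id >= $lo AND customer_id <= $hi"
  lo=$(( hi + 1 ))
done
```

This is why an evenly distributed numeric column (such as a primary key) makes a good --split-by choice: the slices end up roughly equal in size.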

Line 7: --target-dir /user/root/custom_query

Explanation: The --target-dir flag is followed by the path of a non-existent HDFS directory. This is also a must-have flag while performing a free-form query import: because the query may involve multiple tables, Sqoop cannot derive a default directory name, so you must specify the target explicitly. The directory must not already exist; if it does, Sqoop throws an error saying the output directory already exists.
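If you need to rerun the import, one common workaround (an assumption on my part, not shown in the original post) is to delete the target directory first using the standard HDFS shell:

```shell
# Remove the target directory (if present) before rerunning the import;
# -f suppresses the error message when the path does not exist
hadoop fs -rm -r -f /user/root/custom_query
```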

The Sqoop Import command execution looks like this.

Sqoop Import Command

And the output window appears like this.

Sqoop Import Output

The last line clearly indicates that nine records were imported, which is what we expected.

Now it is time to see actual output records.

For that, we will use the following commands.

hadoop fs -ls /user/root/custom_query

hadoop fs -cat /user/root/custom_query/*

The output screenshot is shown below.

HDFS Output

The screenshot clearly shows the expected output records. You can compare it with the first image.

This completes the free-form query import.

I hope the content is helpful. Please subscribe to my YouTube channel here.

Stay tuned for further updates.


