Hello everyone, in this tutorial, we are going to see the 2nd objective in data ingestion category.
The objective is listed on Hortonworks website under data ingestion and looks like this.
In the previous objective, we imported entire records from a MySQL table, whereas in this post, we are going to import data based on the result of a query. This query can be any SQL query, which executes successfully.
Our objective today is to import data based on some condition.
Let us look at the input data.
It looks something like this.
From above screenshot, you can see that we are targeting these nine records. These 9 records are fetched with the help of following SQL query.
select * from customers WHERE customer_zipcode=78521
Now, it is time to take a look at the Sqoop command.
Now, let us examine each line individually.
Line 1, 2, 3, 4: Same as explained in previous post.
Line 5: – – query ‘select * from customers WHERE customer_zipcode=78521 AND $CONDITIONS’ \
Explanation: This line includes the query whose output will actually get written to HDFS. You might notice, along with the query there are two extra words “AND $CONDITIONS”. These two words are expected to be written along with the –query clause. Sqoop expects these keywords to replace this with an inbuilt condition to perform the import operation. Long story short, you must put these two keywords while performing the free-form query import using sqoop import command.
Line 6: – – split-by customer_id \
Explanation: –split-by flag is followed by the column name which becomes the key while doing the importing data using the MapReduce job that gets triggered. As we all might know, MapReduce uses key-value pairs to transfer data from input to mapper, from mapper to either reducer or final output and reducer to final output. This flag makes that column the key of this transfer operation and it is a must while performing free-form query import. If you want to skip this option, then you must pass number of mappers with the help of -m or –num-mappers flag, as we saw in the previous post.
Line 7:- – target-dir /user/root/custom_query
Explanation: –target-dir flag is followed by the non-existing HDFS directory path. Again, this is also a must have flag while performing free-form query import. This option indicates that the target directory for the current import opearation should not be the default directory as multiple tables might be involved in the import operation. The directory should not be already existing. If the directory already exists, then Sqoop throws and error saying the output directory already exists.
The Sqoop Import command execution looks like this.
And the output window appears like this.
The last line clearly indicates that 9 records were imported, which is what we expected.
Now it is time to see actual output records.
For that, we will use following commands.
hadoop fs -ls /user/root/custom_query
hadoop fs -cat /user/root/custom_query/*
The output screenshot is shown below.
The screenshot clearly shows the expected output records. You can compare it with the first image.
This completes free-form query import.
I hope the content is helpful. Please subscribe to my YouTube channel here.
Stay tuned for further updates.