Post 27 | HDPCD | Invoke a User Defined Function in Apache Pig

Hello everyone, thanks for coming back to the last tutorial in the DATA TRANSFORMATION category of the HDPCD certification. We are going to pick-off things from the last tutorial, in which, we saw how to define an ALIAS to a function present in the JAR file. In this tutorial, we are going to see how to invoke/call that ALIAS to perform the desired operation.

Let us start then.

We are going to follow the process mentioned in the below infographics.

The big picture: Invoking UDF in Apache PIG
The big picture: Invoking UDF in Apache PIG

Please follow the below steps one by one.

  • CREATING INPUT CSV FILE IN LOCAL FILE SYSTEM

We can create the input CSV file using the VI Editor which we have done numerous times in the past.

I have uploaded this input CSV file to my GitHub profile in HDPCD repository with name 37_input_UDF_invoke.csv“. You can download this file by clicking here and it looks something like this.



1 Richard Hernandez XXXXXXXXX XXXXXXXXX 6303 Heather Plaza Brownsville TX 78521
526 Kimberly Barrett XXXXXXXXX XXXXXXXXX 7988 High Jetty Brownsville TX 78521

view raw

post27.csv

hosted with ❤ by GitHub

You can use the following commands to create this input CSV file in local file system.

vi post27.csv

############################

PASTE THE COPIED CONTENTS HERE

############################

cat post27.csv

The following screenshot shows the execution of the above commands.

Step 1: creating input csv file in local file system
Step 1: creating input csv file in local file system

As the input CSV file is ready in the local file system, it is time to push it into HDFS.

  • PUSHING INPUT CSV FILE FROM LOCAL FILE SYSTEM TO HDFS

You can use the following commands to push this file from the local file system to HDFS.

hadoop fs -mkdir /hdpcd/input/post27
hadoop fs -put post27.csv /hdpcd/input/post27
hadoop fs -cat /hdpcd/input/post27/post27.csv

The output of the above commands is shown in the following screenshot.

Step 2: pushing input csv file from local to HDFS
Step 2: Pushing input csv file from local to HDFS

Above screenshot confirms that the post27.csv file was successfully created to the HDFS directory /hdpcd/input/post27.

Let us start working on the pig script creation now.

  • CREATING PIG SCRIPT TO INVOKE UDF

The objective of this pig script is to invoke the UDF present in one of the JAR files. Since we have to invoke the UDF, we first need to register the jar file. Define an ALIAS for the function and then invoke that ALIAS.

I have uploaded this input CSV file to my GitHub profile in HDPCD repository with name 38_UDF_invocation.pig“. You can download this file by clicking here and it looks something like this.


— this pig script demonstrates how to invoke an UDF in Apache PIG
— LOAD command is used to load data from post27.csv into input_data pig relation
input_data = LOAD '/hdpcd/input/post27/post27.csv' USING PigStorage(',');
— we need to register the jar file first with REGISTER command
REGISTER /usr/hdp/2.3.0.0-2557/pig/piggybank.jar;
— defining an alias for the fully qualified class name UPPER
DEFINE upper org.apache.pig.piggybank.evaluation.string.UPPER;
— invoking upper ALIAS for UPPER() function in piggybank.jar file
— only first name and last name is extracted from this input file into upper_data pig relation
upper_data = FOREACH input_data GENERATE upper($1) as fname, upper($2) as lname;
— STORE command is used to store upper_data in HDFS directory /hdpcd/output/post27
STORE upper_data INTO '/hdpcd/output/post27' USING PigStorage(' ');

view raw

post27.pig

hosted with ❤ by GitHub

Let us go through this pig script one command at a time.

input_data = LOAD ‘/hdpcd/input/post27/post27.csv’ USING PigStorage(‘,’);

LOAD command loads the data stored in post27.csv file into a pig relation called input_data.

REGISTER /usr/hdp/2.3.0.0-2557/pig/piggybank.jar;

REGISTER command is used for registering the piggybank.jar file to the pig session.

DEFINE upper org.apache.pig.piggybank.evaluation.string.UPPER;

DEFINE command is used for defining an ALIAS for the fully qualified class name present in the REGISTERed jar file.

upper_data = FOREACH input_data GENERATE upper($1) as fname, upper($2) as lname;

The above command is used for invoking the ALIAS “upper” on columns column number 2 and 3 (index $1 and $2).

Now, let us create this pig script with the help of VI editor.

vi post27.pig

############################

PASTE THE COPIED CONTENTS HERE

############################

cat post27.pig

You can use the following screenshot as the reference for creating this pig script.

Step 3: creating pig script to demo UDF invocation
Step 3: creating pig script to demo UDF invocation

The above screenshot confirms that the pig script was created successfully.

Let us run this pig script now.

  • RUN PIG SCRIPT

We can use the following command to run this pig script.

pig post27.pig

The execution of this pig script looks as follows.

Step 4: Running pig script to invoke UDF
Step 4: Running pig script to invoke UDF

And the output of this pig script looks like this.

Step 4: Pig script execution output
Step 4: Pig script execution output

As you can see from the above screenshot, pig script was successful in execution. A total of 2 records were read from /hdpcd/input/post27 directory and 2 records were stored in /hdpcd/output/post27 directory.

It shows that there is no record/data loss, but we need to make sure that upper ALIAS worked perfectly and we got the expected output.

Let us log into HDFS for cross-checking the output records.

  • CHECK OUTPUT HDFS DIRECTORY

You can use the following commands to check the contents of the output HDFS directory.

hadoop fs -ls /hdpcd/output/post27
hadoop fs -cat /hdpcd/output/post27/part-m-00000

The execution of the above commands looks as follows.

Step 5: output HDFS directory contents
Step 5: output HDFS directory contents

As you can see from the above screenshot, both the first name and last name were printed in the UPPERCASE LETTERS which was the expected output of the upper ALIAS defined in the pig script.

This concludes the tutorial here. And with this tutorial, I am happy to announce that the DATA TRANSFORMATION section of the HDPCD certification is over and from the next tutorial onwards, we are going to start off with DATA ANALYSIS section, which is the last one in the HDPCD certification series.

I hope the contents are making sense and you are getting what I want to convey.

You can check out my website at www.milindjagre.com

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

See you soon.

Cheers!

 

Published by milindjagre

Please reach out to me at milindjagre@gmail.com for further information.

One thought on “Post 27 | HDPCD | Invoke a User Defined Function in Apache Pig

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.