In this tutorial, we are going to look at the different ways in which we can pass functions to Spark using the Python API.

I have shown two ways in which a function can be created and passed to Spark: as a user-defined function and as an inline lambda.

We are going to do the comparison based on the filtering capabilities of Spark. For this, I have created a user-defined function called containsMilind(), which you can see on line 4 of the Python code below.

Apart from the containsMilind() function, I am using one more function, filter(), which you can see on line 7 of the Python code below.

I would request you to go through the code and understand what I am doing with each function.

Let me explain the code a little bit.

In the above code, I am trying to filter out all the lines that contain the keyword “Milind“. I am doing this in two ways.

  • The first way is to create a user-defined function called containsMilind(). Inside containsMilind(), I use plain Python to return either True or False. If the input string (each line of the input file) contains the keyword “Milind“, that line appears in the output; otherwise it is omitted.
  • The second way is to pass an inline lambda to filter(). It is simpler to write and use: instead of a named function, filter() receives a lambda as its argument, and that lambda is applied to each line of the Spark RDD. Each line is checked for the “Milind” keyword; lines containing “Milind” get printed while the rest are omitted. A sketch of both approaches follows this list.
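Before looking at the screenshots, here is a minimal sketch of what the code does. The file name, variable names, and exact line layout are my assumptions and will not match the screenshot line for line; only containsMilind() and filter() come from the original code.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("PassingFunctions")
sc = SparkContext(conf=conf)

# Way 1: a user-defined function passed to filter()
def containsMilind(line):
    # True if the line contains the keyword "Milind", False otherwise
    return "Milind" in line

lines = sc.textFile("input.txt")            # input file name is assumed
udfLines = lines.filter(containsMilind)     # pass the named function

# Way 2: an inline lambda passed to filter()
lambdaLines = lines.filter(lambda line: "Milind" in line)

# Both RDDs now contain only the lines with "Milind"; print them
for line in udfLines.collect():
    print(line)

for line in lambdaLines.collect():
    print(line)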

The following screenshot shows the Python code written in Notepad++.

Spark + Python : Passing Function : Python Code in Notepad++


The following is the command we use to run this Python code.

Spark + Python : Passing Function : Running Python Code
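If you want to try this yourself, PySpark scripts are typically launched with spark-submit; the script name below is an assumption, not the one from the screenshot.

spark-submit passing_functions.py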


The below screenshot shows the output of the lambda approach, i.e. filter() with an inline lambda.

Spark + Python : Passing Function : LAMBDA Function Output


The below screenshot shows the output of the user-defined function approach, i.e. the containsMilind() function.

Spark + Python : Passing Function : USER Function Output


I hope you found this article useful and interesting.

Have a good one. Cheers!
