Spark + Python – Filter Operation

This is the first program in the Spark + Python series. In this tutorial, we are going to look at the Filter operation.

The objective of this tutorial is to print only those lines that contain a specified keyword.

But before diving into the actual operations, please have a look at the tools we are going to use with the help of this link.

We will follow the steps below to achieve this goal.

  1. Create an input file in the local file system.
  2. Put the file in HDFS.
  3. Write the Python code.
  4. Run the Python code.
  5. Verify the output.

STEP 1 : CREATE INPUT FILE IN LOCAL FILE SYSTEM

As discussed in the Spark + Python : Tools Setup tutorial, we are going to use Notepad++ to create an input file in the local file system.

Please follow the screenshots below to execute this step successfully.

  • Right click on the directory name in the NppFTP window in which you want to create a new file. In my case, I want to create a new file under the spark_with_python directory, so I right clicked on that directory and then clicked on Create new file.

[Screenshot: creating the input.txt file]

  • It will open a new prompt asking for the file name. I am naming it input.txt. It looks something like this.

[Screenshot: giving the file name]

  • Then open the file in Notepad++ and write the content in it, just like a normal file. Once you are done typing the content, save the file by pressing Ctrl+S.

[Screenshot: input file content]

  • Once you save this file, you can see it reflected in the same directory. This confirms the creation of input.txt.

[Screenshot: file created]

STEP 2 : PUT FILE IN HDFS

Once the file is created in the local file system, our next task is to put this file in HDFS.

For this, we will use the hadoop fs -put command.

The syntax looks like this:

$ hadoop fs -put input.txt /

[Screenshot: input file copying to HDFS]

The above screenshot shows the successful transfer of the file from the local file system to HDFS.

After successful transfer, our HDFS looks like this.

[Screenshot: HDFS overview]


STEP 3 : WRITE PYTHON CODE

For writing the Python code, we are going to use Notepad++. I have uploaded this code to my GitHub profile, and you can see it below.

It looks as follows in the Notepad++ window.

[Screenshot: Python file in Notepad++]


STEP 4 : RUN THE PYTHON CODE

We run the above Python code using the spark-submit command, and the syntax looks as follows.

$ spark-submit filter.py

The screenshot below shows the start of the execution.

[Screenshot: running the Python file]


STEP 5 : VERIFY THE OUTPUT

The output is verified by carefully examining the output trace.

Since our input file contains only one line containing the Milind keyword, the output should also show us only one line.
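This expectation can be illustrated locally with plain Python. The sample lines below are hypothetical stand-ins, since the real file content appears only in a screenshot:

```python
# Hypothetical stand-in for the contents of input.txt
lines = [
    "Spark is a cluster computing framework",
    "This tutorial is written by Milind",
    "Filter keeps only the matching lines",
]

# The same per-line predicate that the Spark filter() transformation applies
matches = [line for line in lines if "Milind" in line]

print(matches)  # only one line contains the keyword
```

Whatever the actual file contents are, the number of output lines always equals the number of input lines containing the keyword.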

Below screenshot shows the output window.

[Screenshot: Python file output]

As you can see, it printed only one line, so we can say that our code is working fine and we got the expected output.

Hope this helps.

Have a good one.
