In this tutorial, we are going to see how the Union operation works.

In everyday English, a union means combining two things. We are going to do the same here: the Union operation combines the elements of two RDDs into a single RDD.

We are using the same input.txt file we used in the last tutorial. To demonstrate the Union operation, we first filter this file on two keywords, “Milind” and “fun”, which gives us one RDD per keyword. We then union these two RDDs and store the result in a third RDD, our final output RDD, and print its contents.

We are using the following Python code for this task.
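The listing itself is not reproduced in the text here, so the following is a minimal sketch of the steps described above rather than the exact code from the screenshots. The function name `union_of_matches`, the script name `union_example.py`, the application name and the idea of passing the file path on the command line are all assumptions made for illustration.

```python
import sys


def union_of_matches(sc, path, first_keyword="Milind", second_keyword="fun"):
    """Return an RDD holding the lines of `path` that contain either keyword."""
    lines = sc.textFile(path)  # one element per line of the file
    first = lines.filter(lambda line: first_keyword in line)
    second = lines.filter(lambda line: second_keyword in line)
    # union() keeps duplicates: a line containing both keywords appears
    # twice in the result unless distinct() is applied afterwards.
    return first.union(second)


def main(path):
    # Imported here so the helper above can be read without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="UnionExample")  # hypothetical app name
    try:
        result = union_of_matches(sc, path)
        # The numbering refers to the final RDD, not to the original file.
        for number, line in enumerate(result.collect(), start=1):
            print("LINE %d: %s" % (number, line))
    finally:
        sc.stop()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: spark-submit union_example.py <path-to-input.txt>")
```

Both filters read from the same source RDD, so the third RDD is simply the matches for the first keyword followed by the matches for the second.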

The following are step-by-step screenshots of the code execution.

The code looks like the following when written in Notepad++.

Union Operation in Spark RDD

After writing the above code, we execute it with the following command.

$ spark-submit

The screenshot below shows this.

Union Operation Execution command

The above command gives us the following output.

Union Operation Output

As you can see, lines 1 and 4 of the original file contain the keywords “Milind” and “fun” respectively, and these lines are printed on the terminal. The labels LINE 1 and LINE 2 in the terminal window show the line numbers in the final RDD, not in the original file. Hope this clears up any confusion about the output.

In this way, we implement the Union operation in Apache Spark with the Python API.

Hope you had a great read.
