Spark : map() and flatMap()

Hi guys,

Hope you are finding the tutorials helpful.

In this tutorial, we are going to look at two transformations that we will use a lot while learning Spark: map() and flatMap(). We will discuss them one by one, then look at the similarities between the two, followed by the differences.

  • map() transformation

We can use the map() function to do a number of things. It can transform any kind of data, for example numbers and strings.

Both the input and the output of this transformation are RDDs.

The input and output data types of map() need not be the same. If the input is RDD[String], the output does not have to be RDD[String]; it can just as well be RDD[Int]. map() is called a transformation because it builds a new RDD from an existing one instead of changing the original.

each element in RDD -> map() -> output of each element in new RDD
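
As a rough illustration of this one-in, one-out behaviour, here is a plain-Python sketch that mimics map() on an ordinary list rather than a real RDD. The helper name rdd_map and the sample words are made up for this example; with PySpark you would call rdd.map(len) on an actual RDD instead.

```python
# Plain-Python model of RDD.map(): not Spark itself, just the same idea.
# rdd_map is a hypothetical helper name used only for this sketch.

def rdd_map(func, elements):
    """Apply func to every element and keep exactly one result per element."""
    return [func(element) for element in elements]


words = ["spark", "map", "flatMap"]   # stands in for an RDD[String]
lengths = rdd_map(len, words)         # stands in for an RDD[Int]

print(lengths)  # [5, 3, 7]
```

Note that the input holds strings and the output holds integers, and that both have the same number of elements.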

  • flatMap() transformation

The flatMap() function is similar to the map() function in some ways and different in others.

Like map(), flatMap() is called on each element in an RDD, but it can produce zero, one, or more output elements for each element in the input RDD.
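
To make the "zero, one, or many outputs per element" idea concrete, here is a small plain-Python sketch that models flatMap() on an ordinary list. The helper name rdd_flat_map and the sample numbers are assumptions for illustration only; with PySpark the equivalent call would be rdd.flatMap(...) on a real RDD.

```python
from itertools import chain

# Plain-Python model of RDD.flatMap(): not Spark itself, just the same idea.
# rdd_flat_map is a hypothetical helper name used only for this sketch.

def rdd_flat_map(func, elements):
    """Apply func to every element and flatten the iterables it returns."""
    return list(chain.from_iterable(func(element) for element in elements))


numbers = [0, 1, 3]
# Each number n is turned into a list holding n copies of itself:
# 0 gives no output, 1 gives one element, 3 gives three elements.
repeated = rdd_flat_map(lambda n: [n] * n, numbers)

print(repeated)  # [1, 3, 3, 3]
```

The output has four elements even though the input had three, which map() could never do.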

  • SIMILARITIES between map() and flatMap()

Both map() and flatMap() are transformations.

Both map() and flatMap() are called on an RDD and return a new RDD.

  • DIFFERENCES between map() and flatMap()

The function passed to map() is applied to each element and produces exactly one new value per element in the output RDD. The function passed to flatMap(), instead of returning a single value, returns an iterator (or any other sequence) of values for each element.

The output of map() is an RDD with one element per input element, whereas the output of flatMap() is an RDD containing the elements of all those iterators.

The picture below shows the basic difference between map() and flatMap() when applied to the same input RDD.

As you can see, mapRDD contains one list per input element, each holding the pieces obtained by splitting on SPACE as the delimiter. flatMapRDD contains all the elements in one single flat list; there are no nested lists in the resulting RDD.
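
The same contrast can be reproduced without Spark. The sketch below is a plain-Python model, not the file from this tutorial: the sample lines are invented, and ordinary lists stand in for the RDDs. The comments name the PySpark calls being imitated.

```python
from itertools import chain

# Invented sample data; stands in for an RDD of text lines.
lines = ["hello world", "how are you"]

# Model of rdd.map(lambda line: line.split(" ")):
# one list of words per input line.
map_result = [line.split(" ") for line in lines]

# Model of rdd.flatMap(lambda line: line.split(" ")):
# the per-line lists are flattened into one sequence of words.
flat_map_result = list(chain.from_iterable(line.split(" ") for line in lines))

print(map_result)       # [['hello', 'world'], ['how', 'are', 'you']]
print(flat_map_result)  # ['hello', 'world', 'how', 'are', 'you']
```

The map-style result keeps two elements (one list per line), while the flatMap-style result has five elements (one per word).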

Visual presentation : difference between map() and flatMap()

We are going to demonstrate the above-mentioned difference programmatically with the help of the following Python file.

Here is the same file as it appears in Notepad++.

python file in notepad++

We are going to run the above file with the following command

running python file

and it gives us the following output

map() output

map output

flatMap() output

flatMap output

As you can see from the above two screenshots, map() returns an RDD whose elements are lists, whereas flatMap() returns a single flat RDD with no nested lists.

If we look into the details, the output of map() and flatMap() looks something like this

map function output

flatMap function output

This clearly shows us the difference between map() and flatMap() transformations.

Hope this makes sense to you guys.

Thanks for having a read.

Suggestions are welcome. Please do share this in your network if you like it.

Till then, Cheers!
