In this series, we are going to talk about simple concepts and basic Spark programming with the Python API. To make our development work faster and easier, we are going to use a few basic tools.
The tools that we are talking about are
- Putty
- Notepad++
We use Putty to connect to the remote system on which Spark is installed.
I have installed Spark 1.6.1 with Hadoop 2.6.0 on an Oracle VM VirtualBox virtual machine. My laptop has 16 GB of RAM, out of which I have allocated 8 GB to this virtual machine. Once I start the VM, instead of using the VM window, I use Putty to connect to it. You can see the screenshot below for reference.
The screenshot above shows the VM window. Below is the Putty window that I use for connecting to this VM.
As you can see, you have to enter the hostname and port number before you click Open to start the session. Once you click Open, you will be prompted for a username and password. After typing the correct username and password, you will be logged in to the VM.
I use Putty very often and would recommend it to you too if you are learning Hadoop or Spark.
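If Putty cannot connect, it can help to first check whether the VM is reachable on the given hostname and port at all. Here is a minimal sketch using only the Python standard library. The function name `is_port_open` and the host and port values are placeholders of mine, not something taken from the setup above, so replace them with your own VM details.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


if __name__ == "__main__":
    # Hypothetical values: replace with your VM's address and SSH port
    vm_host = "127.0.0.1"
    vm_port = 22
    print(is_port_open(vm_host, vm_port))
```

This only tells you that something is listening on that port. It does not check the username or password, which Putty asks for after the connection is made.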
Now, it is time to talk about Notepad++.
Why do we use Notepad++ when we have Putty and the command line?
Many developers do not like using command-line editors such as nano to create or edit a file. I am one of those people (wicked smiley).
I use Notepad++ to write Python and bash scripts. Notepad++ is wonderful when it comes to editing files directly on a remote system. All you need is the hostname, port number, username and password. Once you have all these things, you will be able to access files on the remote system.
Notepad++ does not come with a plugin for remote system connections by default, so you have to install one manually.
You can follow the steps below to install the NppFTP plugin.
- Open Notepad++
- Click on Plugins -> Plugin Manager -> Show Plugin Manager
- Scroll down to NppFTP
- Check the square box before NppFTP
- Click on Install
- Wait for the installation to finish
Once it is done, you will get an NppFTP entry under Plugins as shown below.
Click on NppFTP -> Show NppFTP Window and you will get the following window on the right-hand side of Notepad++.
You can click on the button marked with the green circle in the screenshot to add a new connection.
Click on Settings (green circle) -> Profile settings -> Add new. You will get the following window.
Enter the hostname, port number, username, password and initial remote directory as per your VM configuration. Click Close once you are done.
Then click on the button marked with the red circle to connect to the VM, and you will have access to its files, which looks something like this.
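To keep track of what a profile needs, here is a small Python sketch that holds the same five fields and flags obvious mistakes before you type them into the dialog. The class name `RemoteProfile`, its field names and the sample values are my own illustration. They are not part of NppFTP, and the absolute-path check assumes a Linux VM.

```python
from dataclasses import dataclass


@dataclass
class RemoteProfile:
    """Fields a remote-editing profile needs (names here are illustrative)."""
    hostname: str
    port: int
    username: str
    password: str
    initial_dir: str = "/"

    def problems(self):
        """Return a list of problems; an empty list means the profile looks usable."""
        issues = []
        if not self.hostname.strip():
            issues.append("hostname is empty")
        if not 1 <= self.port <= 65535:
            issues.append("port must be between 1 and 65535")
        if not self.username.strip():
            issues.append("username is empty")
        # Assumption: the remote system is Linux, so paths start with "/"
        if not self.initial_dir.startswith("/"):
            issues.append("initial remote directory should be an absolute path")
        return issues


if __name__ == "__main__":
    # Hypothetical values: replace with your own VM configuration
    profile = RemoteProfile("192.168.56.101", 22, "hadoop", "secret", "/home/hadoop")
    print(profile.problems())
```

An empty list from `problems()` only means the values are well-formed. Whether the VM accepts them is something you find out when you connect.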
So these are the components that we are going to use while exploring Spark use case development with the Python API.
I will add more components if required down the line.
Thanks for the read. Cheers.
Have a great weekend.