Post 10 | HDPCD | Load Pig Relation WITHOUT schema

 

Hello everyone, I hope you are finding the tutorials useful. In the previous tutorial, we started the Data Transformation category of the HDPCD certification. This tutorial covers the second objective in that category: creating a sample Pig relation without a schema. Before starting the actual process, let us define what a relation and a schema are in Apache Pig.

Pig Relation: In the simplest terms, a relation in Apache Pig is comparable to a table in a relational database. A relation holds data loaded from a file in either the local file system or HDFS. When you load data into a relation, defining a schema is optional. If you do not define one, Pig creates the relation with its default behavior, which is what this tutorial demonstrates.

Pig Schema: A Pig schema defines the name and the datatype of each field in a Pig relation. Together, these field names and datatypes make up the schema. To reiterate, if you do not define a schema for a relation, Pig falls back to its defaults: the fields have no names and are referenced by position ($0, $1, and so on), and each field is treated as a bytearray, as we will see in just a few minutes.
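To make the distinction concrete, here is a minimal sketch that writes a small Pig script contrasting the two ways of loading. The file name schema_demo.pig, the relation names, and the two-column layout (id, name) are assumptions made purely for illustration; the HDFS path is the one used later in this tutorial.

```shell
# Write a hypothetical Pig script that loads the same file twice:
# once with an explicit schema and once without.
cat > schema_demo.pig <<'EOF'
-- With a schema: field names and datatypes are declared in the AS clause.
-- (The id/name columns are assumed for illustration.)
with_schema = LOAD '/hdpcd/input/post10/input.csv' USING PigStorage(',')
              AS (id:int, name:chararray);
DESCRIBE with_schema;

-- Without a schema: no AS clause. Fields have no names, are referenced
-- by position ($0, $1, ...), and are treated as bytearray by default.
without_schema = LOAD '/hdpcd/input/post10/input.csv' USING PigStorage(',');
DESCRIBE without_schema;   -- Pig reports the schema as unknown
first_field = FOREACH without_schema GENERATE $0;
EOF
```

The only difference between the two LOAD statements is the AS clause; leaving it out is what makes the relation schema-less.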

Let us get started, then.

Global Picture: apache-pig-schema-less-relation
  • CREATING INPUT CSV FILE IN LOCAL FILE SYSTEM

We are going to use the vi editor to create this input file.

vi input.csv

######
PASTE COPIED CONTENTS HERE
######

cat input.csv
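If you prefer not to open an editor, the same file can be created with a heredoc. This is only a sketch: the three rows below are made-up sample data, since the tutorial's actual file contents appear only in the screenshot.

```shell
# Create input.csv without opening an editor.
# The rows below are hypothetical sample data, not the tutorial's actual file.
cat > input.csv <<'EOF'
1,alice,hadoop
2,bob,pig
3,carol,hive
EOF

# Confirm the file contents, as in the step above.
cat input.csv
```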

The following screenshots illustrate this step.

Step 1: Creating Input File Content

And

Step 1: Input File Content
  • PUSHING INPUT CSV FILE TO HDFS

Please use the following commands to push this input.csv from the local file system to HDFS.

hadoop fs -mkdir /hdpcd/input/post10

hadoop fs -put input.csv /hdpcd/input/post10

hadoop fs -cat /hdpcd/input/post10/input.csv
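The three commands above can also be wrapped in a small function that is safe to run more than once. This is a sketch under a few assumptions: push_input and HADOOP_CMD are hypothetical names introduced here, and input.csv is expected to be in the current directory. The -p and -f flags are standard options of hadoop fs -mkdir and hadoop fs -put.

```shell
# HADOOP_CMD is a hypothetical override so the commands can be previewed
# (for example HADOOP_CMD=echo) on a machine without a Hadoop client.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"
HDFS_DIR=/hdpcd/input/post10

push_input() {
  # -p: create missing parent directories; no error if the directory exists.
  $HADOOP_CMD fs -mkdir -p "$HDFS_DIR" || return 1
  # -f: overwrite a copy left over from an earlier run.
  $HADOOP_CMD fs -put -f input.csv "$HDFS_DIR" || return 1
  # Print the file back to verify the upload.
  $HADOOP_CMD fs -cat "$HDFS_DIR/input.csv"
}
```

Calling push_input then performs the upload; without -p, the mkdir step fails if /hdpcd/input does not exist yet.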

The following screenshot shows these commands being run.

Step 2: Pushing Input File to HDFS

Now it is time to create the Pig script.

  • PIG SCRIPT CREATION

Please use the following command to create this pig script.

vi post10.pig

######

PASTE THE CONTENTS HERE

######

cat post10.pig
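Since the script contents appear only in the screenshot, here is a minimal sketch of what a schema-less load could look like. The relation name input_data and the DESCRIBE and DUMP statements are assumptions for illustration, not necessarily the original script.

```shell
# Write a hypothetical post10.pig; the original script is only shown in a screenshot.
cat > post10.pig <<'EOF'
-- Load the CSV from HDFS without an AS clause, i.e. without a schema.
-- PigStorage(',') splits each line on commas.
input_data = LOAD '/hdpcd/input/post10/input.csv' USING PigStorage(',');

-- With no schema defined, DESCRIBE reports the schema as unknown.
DESCRIBE input_data;

-- Print every tuple of the relation to the console.
DUMP input_data;
EOF

# Confirm the script contents, as in the step above.
cat post10.pig
```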

The following screenshot helps you understand this.

Step 3: Pig Script Creation
  • RUNNING PIG SCRIPT

The following command is used for running this pig script.

pig -f post10.pig

It looks as follows.

Step 4: Running Pig Script

And here is the output of the Pig script.

Step 5: Pig Script Output

This concludes the tutorial.

 

 
