Hadoop - Pig

Pig is a scripting platform used in Hadoop to fetch data and generate different kinds of result sets. Here I would like to explain the basic use of a Pig script. Let's consider a table called census_education, which contains census education data for each county.

Fig 1

The concept here is slightly different from normal SQL. Filtering, grouping, and so on are not done in a single statement but in multiple steps, each of which stores its result in a variable that the next step reads.

In the first statement, LOAD pulls the records from the table census_education (highlighted in red) and stores the result in a variable called a. The USING keyword names the loader, which reads the table metadata from HCatalog while pulling the records from the table.
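A minimal sketch of such a LOAD statement is shown below. It assumes the HCatLoader class shipped with Hive's HCatalog (older HCatalog releases used the package org.apache.hcatalog.pig instead) and that Pig was started with HCatalog support, for example with pig -useHCatalog. The DESCRIBE line is not part of the steps described here; it is added only to show the schema that came from the HCatalog metadata.

```pig
-- Load every record of the table census_education through HCatalog.
-- No schema is declared here: HCatLoader takes the column names and
-- types from the table metadata stored in HCatalog.
a = LOAD 'census_education' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Print the schema of a to confirm which columns were picked up.
DESCRIBE a;
```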

The second statement pulls only the data required for the analysis and stores it in another variable, b. Similarly, the third statement groups the records and stores them in another variable, c.

FOREACH iterates over every record in the output of c and finds the average for highly educated people for each city (city is a dimension column in the output of c).

Finally, we can print the output data to the console by using DUMP. To push it to a table or file instead, Pig provides STORE.
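Put together, the steps above might look like the sketch below. It is not the original script. The column names city and highly_educated and the output table census_education_avg are hypothetical, the second statement is assumed to be a simple projection, and the HCatalog class names assume a Hive release that ships org.apache.hive.hcatalog.pig.

```pig
-- Step 1: load the table through HCatalog (schema comes from the metadata).
a = LOAD 'census_education' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Step 2: keep only the columns needed for the analysis.
-- city and highly_educated are hypothetical column names.
b = FOREACH a GENERATE city, highly_educated;

-- Step 3: group the records by city. Each record of c has two fields:
-- "group" (the city value) and "b" (a bag of all b records for that city).
c = GROUP b BY city;

-- Step 4: produce one output record per city with the average of the
-- highly_educated column (assumed numeric) inside that city's bag.
d = FOREACH c GENERATE group AS city, AVG(b.highly_educated) AS avg_highly_educated;

-- Step 5: print the result to the console.
DUMP d;

-- Alternative to DUMP: write the result to an existing table
-- (hypothetical name) through HCatalog.
-- STORE d INTO 'census_education_avg' USING org.apache.hive.hcatalog.pig.HCatStorer();
```

Pig only runs the pipeline when it reaches DUMP or STORE, so the earlier statements just define what each variable should contain.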

