Showing posts from January, 2014

Hadoop - Hive - Load data from csv/xls files

Hadoop stores massive amounts of data and can process and analyze it very quickly. Extracting and analyzing data from the Hadoop framework is not easy for non-Java developers, but with Hive, database developers who do not know Java can do data analysis quickly. Hive was developed at Facebook, and its syntax is similar to the SQL dialects used by Oracle, SQL Server, MySQL, etc. Pig, which I explained in my previous post, can be considered an alternative to Hive. Pig was developed at Yahoo! at about the same time Facebook was developing Hive. Hive is used for OLAP rather than OLTP, whereas Pig is considered an ETL language for Hadoop. To illustrate Hive syntax and usage, I thought it best to load data from the AdventureWorks DW database. I followed the steps below to load data from the AdventureWorks database to a file and then into the Hadoop ecosystem. 1. Pull the records from the required tables to xlsx files 2. Browse the csv file from Hadoo

Hadoop - Pig

Pig is a scripting language used in Hadoop to fetch data and generate different kinds of result sets. Here I would like to explain the basic use of a Pig script. Let's consider a table called census_education, which contains census education data for each county. Fig 1 The concept here is slightly different from normal SQL: filtering, grouping, etc. are not done in one step but in multiple steps. In the first statement, LOAD pulls the records from the table census_education (highlighted in red) and stores the result in a variable called a. The USING keyword refers to the table metadata from HCatalog while pulling the records from the table. The second statement pulls only the data required for the analysis and stores it in another variable, b. Similarly, the third statement groups the records and stores the result in another variable, c. FOREACH iterates through all the records in c and finds the average for highly educated people for each city ( city is a dimension c
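The step-by-step flow described in this excerpt (load into a, keep the needed fields in b, group into c, then average per city) can be mirrored in plain Python to make the data flow concrete. This is only a rough sketch and not the Pig script from the post: the column names city and highly_educated and the sample rows are invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows standing in for the census_education table.
# Column names and values are assumptions made up for this sketch.
census_education = [
    {"county": "County A", "city": "X", "highly_educated": 120},
    {"county": "County B", "city": "X", "highly_educated": 100},
    {"county": "County C", "city": "Y", "highly_educated": 50},
]

# Step 1 (like LOAD): pull all records from the table into a.
a = list(census_education)

# Step 2: keep only the fields needed for the analysis in b.
b = [(row["city"], row["highly_educated"]) for row in a]

# Step 3 (like GROUP): collect the values for each city in c.
c = defaultdict(list)
for city, highly_educated in b:
    c[city].append(highly_educated)

# Step 4 (like FOREACH over the groups): average per city.
d = {city: sum(values) / len(values) for city, values in c.items()}

print(d)
```

Each intermediate variable holds the output of one step, which is the main difference from writing a single SQL statement that filters, groups, and aggregates at once.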