Showing posts from 2020

Schema on dataframe

Providing a schema when reading data from a file is one small step toward improving your Databricks application's performance. With the schema supplied in the read statement, the Spark engine knows the data types of the fields in advance, so it does not need to scan through the data to infer them.
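Here is a minimal sketch of that idea in PySpark. The column names, types, and function name are made up for illustration, and I am assuming a CSV file with a header row. `spark` is the SparkSession that a Databricks notebook already provides.

```python
# Hypothetical column names and types; replace them with the real layout of your file.
ORDERS_SCHEMA_DDL = "order_id INT, customer STRING, amount DOUBLE, order_date DATE"


def read_csv_with_schema(spark, path, schema_ddl=ORDERS_SCHEMA_DDL):
    """Read a CSV file with an explicit schema instead of asking Spark to infer one.

    `spark` is a SparkSession (predefined as `spark` in a Databricks notebook).
    """
    return (
        spark.read
        .schema(schema_ddl)        # accepts a DDL string or a StructType
        .option("header", "true")  # first line of the file holds the column names
        .csv(path)
    )
```

In a notebook this would be called as `df = read_csv_with_schema(spark, path)`, where `path` points at your own file. Because the schema is given up front, the reader does not need the extra pass over the file that `inferSchema` would trigger.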

Dataframe introduction

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. In this post I explain how to create a DataFrame from a CSV file, manipulate the data, and store the processed records in another file or table for further processing. Data transformation with Spark DataFrames is very easy, and Spark provides various functions to help with it. Please go through the Spark documentation for more detail. I used Databricks Community Edition for this demo.
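The read, transform, and store flow could look roughly like the sketch below. It is not the exact notebook from the demo: the function name, the `customer` and `amount` columns, and the Parquet output format are my assumptions for illustration.

```python
def summarize_sales(spark, in_path, out_path):
    """Read a CSV file, aggregate it, and write the result as Parquet.

    `spark` is a SparkSession; the `customer` and `amount` columns are hypothetical.
    """
    # Create the DataFrame from a CSV file with a header row.
    raw = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")  # fine for a demo; an explicit schema is faster
        .csv(in_path)
    )

    # Manipulate the data: drop non-positive amounts, rename a column, total per customer.
    result = (
        raw.filter("amount > 0")
        .withColumnRenamed("customer", "customer_name")
        .groupBy("customer_name")
        .sum("amount")
    )

    # Store the processed records in another location for further processing.
    result.write.mode("overwrite").parquet(out_path)
    return result
```

To land the output in a table instead of a file, `result.write.mode("overwrite").saveAsTable(...)` with a table name of your choice is the usual alternative.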

Azure Data Factory - Pull files from SFTP

Recently I had a request from my customer to pull data from a Linux-based SFTP server. ADF was not able to connect to the SFTP server due to firewall settings, and we had a discussion with Microsoft to find a solution. Unfortunately, Microsoft said we would need to wait a couple of months for one. Since the business was not ready to wait until Microsoft could help us, I came up with another solution. We already had an SSIS license, so I thought of making use of it. Below is the high-level architecture that I proposed. SSIS pulls the files from the Linux SFTP server and downloads them into a local folder. A separate folder is created for each feed, and the files are downloaded based on their last modified date. ADF picks up all the files from the Windows FTP based on the date, loops through each file, and loads it into the Azure Data Lake Store RAW layer and then later into the analytic layer. The RAW layer to analytic layer processing is done by a Databricks script that is called from inside ADF. EST_GET_FEEDS task hit

Databricks - incorrect header check

In this post I would like to show how to fix the "incorrect header check" error received while fetching data from a Hive table. The actual message: "SparkException: Job aborted due to stage failure: Task 0 in stage 63.0 failed 4 times, most recent failure: Lost task 0.3 in stage 63.0 (TID 3506,, executor 28): incorrect header check." To better understand the scenario, let me explain how the data got loaded into the Hive table. We store the data from the source systems in the RAW layer. H