Posts

Schema on dataframe

Providing a schema while pulling data from a file is one small step that can improve your Databricks application's performance. Supplying the schema in the read statement tells the Spark engine the data types of the fields in the file in advance, so it does not need to go through the data to infer them.
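A minimal sketch of this in PySpark (the file path and column names are assumptions made up for illustration; spark is the session Databricks provides in a notebook):

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

    # Declare the schema up front so Spark does not scan the file to infer types.
    schema = StructType([
        StructField("customer_id", IntegerType(), True),
        StructField("customer_name", StringType(), True),
        StructField("order_amount", DoubleType(), True),
    ])

    # Hypothetical file location; replace with your own path.
    df = (spark.read
          .option("header", "true")
          .schema(schema)
          .csv("/mnt/raw/customers.csv"))

With inferSchema instead, the read would make an extra pass over the data just to work out these types; passing the schema avoids that pass.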

Dataframe introduction

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. In this post I explain how to create a DataFrame from a CSV file, manipulate the data, and store the processed records in another file or table for further processing. Data transformation with Spark DataFrames is very easy, and Spark provides various functions to help with the transformation; please go through the Spark documentation for more detail. I used Databricks Community Edition for this demo.
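A hedged sketch of that flow in PySpark (the input path, column names, and output locations are assumptions for this example, not the actual demo):

    from pyspark.sql.functions import col, upper

    # Create a DataFrame from a CSV file (hypothetical path and columns).
    orders = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("/mnt/raw/orders.csv"))

    # A few simple transformations: filter rows, derive a column, rename a column.
    processed = (orders
                 .filter(col("amount") > 0)
                 .withColumn("country", upper(col("country")))
                 .withColumnRenamed("id", "order_id"))

    # Store the processed records in a file or a table for further processing.
    processed.write.mode("overwrite").parquet("/mnt/processed/orders")
    processed.write.mode("overwrite").saveAsTable("processed_orders")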

Azure Data Factory - Pull files from SFTP

Recently I had a request from my customer to pull data from a Linux-based SFTP server. ADF was not able to connect to the SFTP server due to firewall settings, and we had a discussion with Microsoft to get a solution. Unfortunately Microsoft said we would need to wait a couple of months, and the business was not ready to wait, so I came up with another solution. We already have an SSIS license, and I thought of making use of it. Below is the high-level architecture that I proposed. SSIS pulls the files from the Linux SFTP server and downloads them into a local folder; a separate folder is created for each feed, and the files are downloaded based on their last modified date. ADF then picks all the files from the Windows FTP share based on the date, loops through each file, and loads it into the Azure Data Lake Store RAW layer, and later into the analytic layer. The RAW layer to analytic layer processing is done with a Databricks script that is called inside ADF. EST_GET_FEEDS task hit…
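The RAW-to-analytic processing mentioned above is a Databricks notebook called from ADF; a rough, hedged sketch of what such a notebook can look like (the widget name, mount paths, and feed name are purely illustrative assumptions, not the actual pipeline):

    # Hypothetical parameter passed in from the ADF pipeline run.
    dbutils.widgets.text("load_date", "")
    load_date = dbutils.widgets.get("load_date")

    # Read the files landed in the RAW layer for that date (illustrative path).
    raw_df = (spark.read
              .option("header", "true")
              .csv(f"/mnt/datalake/raw/feed_a/{load_date}/"))

    # Minimal cleanup before promoting the data to the analytic layer.
    analytic_df = raw_df.dropDuplicates()

    analytic_df.write.mode("overwrite").parquet(f"/mnt/datalake/analytic/feed_a/{load_date}/")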

Databricks - incorrect header check

In this post I would like to show you how we can fix the "incorrect header check" error received while fetching data from a Hive table. The actual message:

"SparkException: Job aborted due to stage failure: Task 0 in stage 63.0 failed 4 times, most recent failure: Lost task 0.3 in stage 63.0 (TID 3506, 10.1.1.7, executor 28): java.io.IOException: incorrect header check
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
    at java.io.InputStream.read(InputStream.java:101)"

To better understand the scenario, let me explain how the data got loaded into the Hive table. We are storing the data from the source systems in the RAW layer. H…
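One hedged way to investigate this on Databricks is sketched below; the table location is an assumption. Hadoop chooses the decompressor from the file extension, and gzip content starts with the magic bytes 0x1f 0x8b, so a file named .gz that is not really gzip-compressed produces exactly this "incorrect header check" error.

    # List the files backing the table (illustrative location).
    files = dbutils.fs.ls("/mnt/datalake/raw/my_table/")

    for f in files:
        if f.name.endswith("/"):
            continue  # skip sub-directories
        # DBFS paths are also exposed locally under /dbfs, so read the first two bytes.
        local_path = "/dbfs" + f.path.replace("dbfs:", "")
        with open(local_path, "rb") as fh:
            head = fh.read(2)
        looks_gzip = head == b"\x1f\x8b"
        print(f.path, "gzip" if looks_gzip else "NOT gzip")
        if f.path.endswith(".gz") and not looks_gzip:
            # One possible fix: drop the misleading extension so Hadoop stops
            # trying to gunzip the file (verify before uncommenting).
            # dbutils.fs.mv(f.path, f.path[:-3])
            pass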

SSIS: Dealing with temp tables

Temporary tables are temp space available in the database for storing data for further modification. They are used extensively in T-SQL programming and are very convenient for programmers. When it comes to SSIS we need to deal with them differently, because a temporary table's lifetime is only one session, and SSIS throws an error if a data flow task uses T-SQL with a temporary table. For example, create one data flow task with the query below, which uses the temporary table #ContactAddress to store data from the Person.Contact table of AdventureWorks. #ContactAddress is the source table for ContactAddress, into which the data comes from the SSIS package. Put the query below in one SSIS task and load the data into the table:

    create table #ContactAddress
    (
        contactId int,
        firstname nvarchar(100)
    )

    insert into #ContactAddress
    select ContactID, FirstName
    from Person.Contact

ERROR: SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80040E14. An OLE DB record is available…

Migrating Azure data lake Gen1 to Gen2

In this post, I would like to explain how we can migrate data from an existing Data Lake Gen1 to Gen2. Azure Data Lake Gen1 is optimized for big data analytics, but its storage is costlier than Gen2. Before Gen2, the only option for storing archival data was blob storage, because of its low pricing. Gen2 is built on top of blob storage, and hence the cost of storing data in Gen2 is very low; Gen2 is effectively the combination of blob storage and ADLS Gen1, which Microsoft merged into one platform. Please see the video below to understand the migration details.