Showing posts from March, 2020

Schema on dataframe

Providing a schema when reading data from a file is one small step toward improving the performance of your Databricks application. An explicit schema tells the Spark engine the data types of the fields in advance, so it does not need to scan the data to infer them.
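
As a rough illustration of the idea, here is a minimal sketch of reading a CSV file with an explicit schema. The column names, the helper name read_csv_with_schema, and the file path are hypothetical; spark is assumed to be the SparkSession that a Databricks notebook already provides.

```python
# Hypothetical column layout for an employee CSV file; adjust it to your data.
EMPLOYEE_SCHEMA = "id INT, name STRING, department STRING, salary DOUBLE"


def read_csv_with_schema(spark, path, schema=EMPLOYEE_SCHEMA):
    """Read a CSV file with an explicit schema instead of inferring it."""
    return (
        spark.read
        .option("header", "true")  # first line of the file holds column names
        .schema(schema)            # DDL string here; a StructType also works
        .csv(path)
    )


# Example call in a notebook (the path is an assumption):
# df = read_csv_with_schema(spark, "/FileStore/tables/employees.csv")
```

The alternative, option("inferSchema", "true"), makes Spark read through the file to guess the types before the actual load, which is the extra work the explicit schema avoids.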

Dataframe introduction

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. In this post I explain how to create a DataFrame from a CSV file, manipulate the data, and store the processed records in another file or table for further processing. Data transformation with Spark DataFrames is straightforward, and Spark provides many built-in functions to help. Please see the Spark documentation for more detail. I used Databricks Community Edition for this demo.
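
The read, transform, and store flow might look roughly like the sketch below. It is not the code from the demo itself: the column names (id, name, salary), the salary threshold, the function names, and the paths are all hypothetical, and spark is again assumed to be the session a Databricks notebook provides.

```python
def transform(df, min_salary=50000):
    """Keep rows above a salary threshold and derive an adjusted salary column."""
    return (
        df.filter(f"salary > {min_salary}")  # SQL-style filter expression
          .selectExpr("id", "name", "salary * 1.1 AS adjusted_salary")
    )


def run_pipeline(spark, in_path, out_path):
    """Read a CSV file, transform it, and write the result as Parquet."""
    raw = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")  # or pass an explicit schema instead
        .csv(in_path)
    )
    processed = transform(raw)
    # To store into a table instead: processed.write.saveAsTable("some_table")
    processed.write.mode("overwrite").parquet(out_path)
    return processed


# Example call in a notebook (both paths are assumptions):
# run_pipeline(spark, "/FileStore/tables/employees.csv", "/FileStore/out/employees")
```

Keeping the transformation in its own function makes it easy to try on a small DataFrame before pointing the pipeline at a full file.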