Posts

Showing posts from 2019

SSIS: Dealing with temp tables

Temporary tables provide temporary space in the database to store data for further modification. They are used extensively in T-SQL programming and are very convenient for programmers. In SSIS, however, they need to be handled differently because a temporary table lives only for the duration of a single session, and SSIS throws an error if a data flow task uses T-SQL with a temporary table. For example, create a data flow task with the query below, which uses the temporary table #ContactAddress to store data from the Person.Contact table of AdventureWorks. #ContactAddress is the source for the ContactAddress table, which is loaded by the SSIS package. Run the query below in one SSIS task and load the data into the table.

create table #ContactAddress
(
    contactId int,
    firstname nvarchar(100)
)

insert into #ContactAddress
select ContactID, FirstName from Person.Contact

ERROR: SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80040E14. An OLE DB record is available

Migrating Azure data lake Gen1 to Gen2

In this post, I would like to explain how we can migrate data from an existing Data Lake Gen1 account to Gen2. Azure Data Lake Gen1 is optimized for big data analytics, but its storage is costlier than Gen2. Before Gen2, the only option for storing archival data was blob storage because of its low pricing. Gen2 is built on top of blob storage, and hence the cost of storing data in Gen2 is very cheap. Gen2 is the combination of blob storage and ADLS: Microsoft combined blob storage and ADLS Gen1 into one platform to come up with ADLS Gen2. Please see the video below to understand the migration details.

Call Databricks scripts from Azure Data Factory

Azure Data Factory is a wonderful tool to coordinate the entire EL and TL process. Using ADF, we can pull data from files stored in ADLS, FTP, web services, blob storage, etc., transform the data using Databricks/U-SQL/HDInsight clusters, and load it into SQL DB, Azure SQL DW, ADLS, etc. Please go through the video below to understand how we can call a Databricks script from ADF and load the transformed output into Azure SQL DB.
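For reference, here is a minimal sketch (not the exact notebook from the video) of what the Databricks side of such a pipeline could look like in Spark Scala. The mount path, column names, JDBC URL, and secret scope/key names are placeholders, not values from the project.

// Read the file that ADF has landed in a mounted ADLS folder (path and schema are illustrative)
import org.apache.spark.sql.functions.col
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/mnt/raw/contacts.csv")

// A simple transformation before loading
val transformed = raw.filter(col("country") === "US").select("contactId", "firstName", "country")

// Write the result to Azure SQL DB over JDBC (credentials pulled from a secret scope; names are placeholders)
import java.util.Properties
val props = new Properties()
props.put("user", dbutils.secrets.get(scope = "dev-cluster-scope", key = "dev-sql-user"))
props.put("password", dbutils.secrets.get(scope = "dev-cluster-scope", key = "dev-sql-pwd"))
transformed.write
  .mode("append")
  .jdbc("jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>", "dbo.ContactAddress", props)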

Azure Databricks - bulk insert to Azure SQL

Recently I had a request to load processed data into the Azure SQL database from Databricks. The Databricks program processes around 300 to 400 million records and aggregates those records into certain buckets. Even after the aggregation, the total number of records going into the Azure SQL database is 40 million. We found that the insertion was happening row by row and hence thought of doing the same using the bulk insert option provided by Databricks. Databricks provides excellent documentation on bulk insert, and I just followed it. To achieve this we need to add the Spark connector library, which can be done using the upload option from the cluster. Click on the library option and provide the coordinate to create the library as mentioned in the figure below. Once the library is created, we used the code below to execute the bulk insert. The database name, user name, password, and table name mentioned here are for illustration purposes only. We had total
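As a rough sketch of the bulk insert path described above, following the pattern in the Azure SQL Spark connector documentation: the library coordinate version, connection details, batch size, and DataFrame name below are placeholders for illustration, not the project's actual values.

// Spark connector library attached to the cluster by coordinate, e.g. com.microsoft.azure:azure-sqldb-spark:1.0.2 (version illustrative)
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

// aggregatedDF stands for the ~40 million row DataFrame produced by the aggregation step
val bulkCopyConfig = Config(Map(
  "url"               -> "<server>.database.windows.net",
  "databaseName"      -> "<database>",
  "dbTable"           -> "dbo.AggregatedBuckets",
  "user"              -> "<user>",
  "password"          -> "<password>",
  "bulkCopyBatchSize" -> "100000",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"))

// bulkCopyToSqlDB performs a SQL Server bulk copy instead of row-by-row inserts
aggregatedDF.bulkCopyToSqlDB(bulkCopyConfig)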

Databricks - Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow

We were trying to load data from the Azure Data Lake Store raw layer to the Azure Data Lake Store analytics layer using a Hive query. The raw layer has tables in text format, whereas in the analytics layer they are in ORC format. This code is written inside a notebook, and the notebook is called from Azure Data Factory. The raw layer gets its data from FTP; this layer is like a landing area, and hence all the data coming in from source systems is stored as-is without any modification. But at the time of loading the data, Databricks throws an error and says it is caused by java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow. After some research we found that this error is due to the difference in the SerDe properties of the source and destination tables. Before: CREATE TABLE analytics_data.ay_contact ( contactsystemnumber bigint, currencycode string, mailingaddr
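A minimal sketch of the kind of fix this points to, run from a notebook via spark.sql (the DDL in the post is truncated above, so the columns are abbreviated and the exact statements are assumptions): declaring the analytics table explicitly as ORC so that the ORC SerDe is used on write, while the raw-layer text table is converted on insert.

// Destination table declared explicitly as ORC (columns abbreviated / illustrative)
spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics_data.ay_contact (
    contactsystemnumber BIGINT,
    currencycode        STRING,
    mailingaddr         STRING
  )
  STORED AS ORC
""")

// Source stays in text format in the raw layer; the INSERT ... SELECT converts it to ORC on write
spark.sql("""
  INSERT OVERWRITE TABLE analytics_data.ay_contact
  SELECT contactsystemnumber, currencycode, mailingaddr
  FROM raw_data.ay_contact
""")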

Azure Databricks: Read and write data into SQL database

Here in this post I would like to explain how we can connect to a SQL Server database from Databricks to read and write. One of my customer projects needs this, as the processed data moves from the Azure Data Lake layer to the aggregate layer, which is a SQL Server database. The steps to connect to SQL Server from Databricks are clearly written in the Azure documentation, but I would like to describe my experience. The code is developed in Spark Scala.

Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
val jdbcUsername = dbutils.secrets.get(scope = "dev-cluster-scope", key = "dev-sql-user")
val jdbcPassword = dbutils.secrets.get(scope = "dev-cluster-scope", key = "dev-sql-pwd")
val jdbcHostname = "SQL Server name here"
val jdbcPort = 1433
val jdbcDatabase = "database name"
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties()
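The excerpt cuts off at the Properties object; a sketch of how that connection setup is usually completed for reading and writing (the table names below are placeholders):

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", jdbcUsername)
connectionProperties.put("password", jdbcPassword)

// Read a SQL Server table into a DataFrame
val sourceDF = spark.read.jdbc(jdbcUrl, "dbo.SourceTable", connectionProperties)

// Write a processed DataFrame to the aggregate layer
sourceDF.write.mode("append").jdbc(jdbcUrl, "dbo.TargetTable", connectionProperties)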

Azure Databricks : Mounting to ADLS

The Databricks File System (DBFS) allows us to store all processed or unprocessed records in its file system. My customer is not ready to keep any data in DBFS, as they believe it is not as secure as Azure Data Lake Store (ADLS). ADLS is not mounted to Databricks by default, and hence it is my turn to mount ADLS as the source layer where Databricks stores the data it processes. In order to continue with mounting ADLS to Databricks, make sure the steps below have been completed successfully. 1. Install Databricks. 2. Install Azure Data Lake Store. 3. Register Databricks with Azure Active Directory, which is required to link Databricks with AD. Once you register the Databricks app, you will get a service principal ID, and this ID should be provided at the time of mounting. Let's go through the app registration process first. Steps: Click on Azure Active Directory and select App registrations from the left side of the window. Now clic
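For context, a sketch of the mount command that this setup leads to for ADLS Gen1 with OAuth client credentials; the application ID, tenant ID, secret scope/key, store name, and mount point are placeholders to be replaced with the values from the app registration described above.

// OAuth configuration using the service principal from the app registration
val configs = Map(
  "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
  "dfs.adls.oauth2.client.id"                  -> "<application-id>",
  "dfs.adls.oauth2.credential"                 -> dbutils.secrets.get(scope = "dev-cluster-scope", key = "adls-client-secret"),
  "dfs.adls.oauth2.refresh.url"                -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

// Mount the ADLS account so notebooks can read and write it under /mnt/adls
dbutils.fs.mount(
  source = "adl://<datalakestore>.azuredatalakestore.net/",
  mountPoint = "/mnt/adls",
  extraConfigs = configs)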

Azure - Databricks (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation)

I was trying to access a file stored in Azure Data Lake Store from Databricks. This is required to implement the business logic on top of the stored file. Databricks helps to apply business logic to big data quickly, and I believe Spark is the best tool available today for this. I followed the documentation provided by Microsoft as well as Databricks to complete the app registration and the mounting of the data lake store from Databricks. App registration - https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse Mount ADLS - https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html I received the error message ("ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation") when I tried to execute the command below: sparkDF = spark.read.format('csv').options(header='true', inferSchema='true').load('/mnt/mount
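The read itself is standard; this error usually means the registered application (service principal) behind the mount has not been granted ACLs on the folder being read. As a hedged sketch, assuming the ADLS account grants the service principal read and execute access on the folder chain, the equivalent read in Scala would look like this (the mount path is a placeholder, not the truncated path above):

// Succeeds only once the service principal behind the mount has read + execute ACLs
// on every folder in the path and read access on the file itself (paths are illustrative)
val sparkDF = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/mnt/<mount-name>/<folder>/<file>.csv")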