Posts

Showing posts from March, 2019

Azure Databricks - bulk insert to Azure SQL

Image
Recently I had request to load the processed data into the azure SQL database from databricks. databricks program processing around 300 to 400 million records and aggregating those records into certain buckets. Even after the aggregation total number of records going inside the azure SQL database is 40 million. we found that the insertion is happening raw by raw and hence thought of doing the same using bulk insert option provided by the databricks. databricks provided super documentation on bulk insert and I just followed the same. to achieve this we need to create Spark connector library which can be done by using the upload option from the cluster. Click on the library option and provide the coordinate and create the library as mentioned in the below figure. once the library is created we used below code to execute the bulk insert. database name, user name, password, table name mentioned here are only for illustration purpose only. we had total

Databricks - Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow

We were trying to load data from Azure data lake store Raw layer to Azure data lake store analytics layer using hive query. Raw layer is having tables with text format whereas in analytics layer it is ORC format. This code is written inside the notebook and the notebook is called from Azure data factory. Raw layer gets the data from FTP and this layer is like landing area and hence all the data coming in from source systems are stored as it is without any modification. But at the time of loading the data, databricks throws the error and it says the error is caused by  java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow after some research we fond that this error is due to the difference in serede properties of source and destination tables. before  CREATE TABLE analytics_data.ay_contact (   contactsystemnumber                bigint ,   currencycode                              string ,   mailingaddr