Databricks - Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow

We were trying to load data from Azure data lake store Raw layer to Azure data lake store analytics layer using hive query. Raw layer is having tables with text format whereas in analytics layer it is ORC format. This code is written inside the notebook and the notebook is called from Azure data factory.

Raw layer gets the data from FTP and this layer is like landing area and hence all the data coming in from source systems are stored as it is without any modification.

But at the time of loading the data, databricks throws the error and it says the error is caused by  java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow


after some research we fond that this error is due to the difference in serede properties of source and destination tables.

before 

CREATE TABLE analytics_data.ay_contact
(
  contactsystemnumber                bigint,
  currencycode                              string,
  mailingaddresscountrycode       string,
  mailingadpoliticaldivision1cd   string,
  file_load_date                            date
)
PARTITIONED BY (
  file_dt date)
 ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','

STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  '/mnt/analytics/contact'
TBLPROPERTIES (

  'transient_lastDdlTime'='1527633997');
error went away once we changed the row format to serede from delimited.
after

CREATE TABLE analytics_data.ay_contact
(
  contactsystemnumber                bigint,
  currencycode                              string,
  mailingaddresscountrycode       string,
  mailingadpoliticaldivision1cd   string,
  file_load_date                            date
)
PARTITIONED BY (
  file_dt date)
 ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  '/mnt/analytics/contact'
TBLPROPERTIES (

  'transient_lastDdlTime'='1527633997');





Comments

  1. This is really nice post, I found and love this content. I will prefer this, thanks for sharing. Data Analytics For Business.

    ReplyDelete

Post a Comment

Popular posts from this blog

Microsoft BI Implementation - Cube back up and restore using XMLA command

Databricks - incorrect header check