Databricks - incorrect header check

In this post I would like to show how to fix the "incorrect header check" error received while fetching data from a Hive table.

Actual message:
SparkException: Job aborted due to stage failure: Task 0 in stage 63.0 failed 4 times, most recent failure: Lost task 0.3 in stage 63.0 (TID 3506,, executor 28): incorrect header check.

 at Method)





To better understand the scenario, let me explain how the data was loaded into the Hive table. We store data from the source systems in a RAW layer. The RAW layer is a landing area: we pick up each file from the source system and load it as is. The only difference between the source and RAW is that files in RAW are stored in compressed format.

Now look at the table script below.

drop table if exists raw_data.rw_aa_addisplay;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
create table raw_data.rw_addisplay (
    day           string,
    url           string,
    url_clean     string,
    page_country  string
)
PARTITIONED BY (file_dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar"     = "\""
);


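Hive can read CSV files from the table's storage location even when the files are compressed in gzip format. It chooses the decompression codec from the file extension, so a file ending in .gz is treated as gzip.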

Here comes the issue: while loading the file from the source to the RAW location, the file was pulled and stored as a plain CSV file, but with the extension .gz, which indicates gzip format. The file was not actually gzip-compressed; only its extension said so. Hive therefore tried to decompress plain text as gzip, which produced the error above.
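A quick way to confirm this kind of mismatch is to look at the first two bytes of the file: every real gzip stream starts with the bytes 0x1f 0x8b. The following is a minimal sketch in plain Python, not part of the original pipeline. The function names are hypothetical, and it assumes the file is reachable through a local or mounted path. The second helper uses Python's zlib module to show that inflating plain text as gzip yields the same "incorrect header check" text.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def gzip_error_for(data):
    """Try to inflate `data` as gzip; return the zlib error text, or None on success."""
    try:
        # wbits=31 tells zlib to expect a gzip header and trailer
        zlib.decompress(data, wbits=31)
    except zlib.error as exc:
        return str(exc)
    return None
```

Running looks_like_gzip over the files in the RAW location before querying the table would flag any .gz file that is really plain CSV.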

Once the file was actually compressed with gzip and stored in the location, the issue was resolved.
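For illustration, here is one way the compression step could look in Python, using only the standard library. This is a sketch under assumptions, not the load job used here: the function name gzip_file and both paths are hypothetical, and the files are assumed to be accessible as ordinary local paths.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path):
    """Compress the file at src_path into a real gzip file at dst_path."""
    # Stream the copy so large files are not read into memory at once.
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

For example, gzip_file("addisplay.csv", "addisplay.csv.gz") writes a file whose content matches its .gz extension, so the gzip codec can read it.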

