Azure - DWH PolyBase (Access file data from DWH)

PolyBase is one of the main features of the DWH that amazed me the most, because of how easy it makes exploring data from blob storage.

This helps keep the raw data as-is in the blob/data lake folders and analyse that data inside the DWH. The diagram below illustrates how data is consumed from the sources all the way to the visualization tools.

All the raw files can be stored in one folder, and PolyBase enables the user to query the data in all these files as a single unit.

The first step is to create the master key; create it only if it does not already exist.

Keyword
CREATE MASTER KEY;
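
To make the script safely re-runnable, the creation can be guarded with a check against the catalog; a minimal sketch, assuming the default database master key name:

IF NOT EXISTS (SELECT 1 FROM sys.symmetric_keys WHERE name = '##MS_DatabaseMasterKey##')
    CREATE MASTER KEY;    -- only created when no master key exists yet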

The next step is to create the credential used to connect to blob storage. This step accepts the storage key and abstracts it away from the user.

Keyword
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = '<Name of the identity>',
    SECRET   = '<storage key from storage account>';
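
If the storage account key is rotated later, the credential can typically be updated in place instead of being recreated; a minimal sketch using the credential name above:

ALTER DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = '<Name of the identity>',
    SECRET   = '<new storage key from storage account>';   -- swap in the rotated key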

The next step is to create the data source that points to the storage account. PolyBase uses Hadoop APIs to access data in Azure blob storage. The credential created above is used here.

Keyword

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE       = HADOOP,
    LOCATION   = 'wasbs://adf-data@jkbstorage.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
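
To confirm the data source has been registered, the catalog view for external data sources can be queried:

SELECT name, type_desc, location
FROM sys.external_data_sources;    -- should list AzureStorage with its wasbs location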


The next step is to configure the file format. This is like defining the metadata of the files stored inside the blob/data lake storage.

Keyword

CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '',
        DATE_FORMAT      = 'yyyy-MM-dd HH:mm:ss.fff',
        USE_TYPE_DEFAULT = FALSE
    )
);
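
The file format can be verified the same way from its catalog view:

SELECT name, format_type, field_terminator, date_format
FROM sys.external_file_formats;    -- should list TextFileFormat with the options above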
The next step is to create the schema and define the external table within it. External tables act as pointers to the files in storage.
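
If the [asb] schema does not already exist, create it before defining the table; a minimal sketch:

CREATE SCHEMA [asb];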

Keyword

CREATE EXTERNAL TABLE [asb].Transactions (
    [LoadDate] varchar(100) NULL,
    [Amount]   varchar(100) NULL
)
WITH (
    LOCATION     = '/Transactions/',
    DATA_SOURCE  = AzureStorage,
    FILE_FORMAT  = TextFileFormat,
    REJECT_TYPE  = VALUE,
    REJECT_VALUE = 0
);
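
External tables also appear in the catalog, which is a quick way to confirm the location and reject settings:

SELECT name, location, reject_type, reject_value
FROM sys.external_tables;    -- should list Transactions pointing at /Transactions/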

Now the DWH is ready to query the data from the external tables. Please note that the table definition must reference the FILE_FORMAT and DATA_SOURCE created before the table creation.

Here is the object explorer showing the AzureStorage data source, the TextFileFormat file format and the [asb] schema.

Query the files using a normal SQL statement.

SELECT * FROM asb.Transactions;   -- this query goes to Azure storage through the data source and reads the data back using the file format defined when the table was created
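
Because the external columns are declared as varchar, any numeric work needs an explicit cast; a small illustrative aggregation over the same table:

SELECT [LoadDate],
       SUM(CAST([Amount] AS decimal(18, 2))) AS TotalAmount   -- cast the varchar column before summing
FROM asb.Transactions
GROUP BY [LoadDate];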


Please note that querying external tables directly can be time consuming when the data runs into petabytes, so to improve query performance, load the file data into DWH tables. It is easy to load the data into DWH tables using CTAS (CREATE TABLE AS SELECT); CTAS is faster than a SELECT INTO statement as it is minimally logged.

CREATE TABLE [dbo].[Fct_Transactions]
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM asb.Transactions;
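
After the load, creating statistics on the columns used for filtering or joining can help the optimizer choose better plans; a minimal sketch on the column above:

CREATE STATISTICS stat_Fct_Transactions_LoadDate
ON [dbo].[Fct_Transactions] ([LoadDate]);   -- statistics on the load date column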
Reference:  https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-load-from-azure-blob-storage-with-polybase
