Thursday, 15 December 2022

Cannot bulk load because the file could not be opened. Operating System Error Code 5

Azure SQL bulk insert error

Issue - 

Cannot bulk load because the file could not be opened. Operating System Error Code 5


Analysis
I followed the steps below to do the bulk load but was getting the above error:

-- Create a database master key to protect the scoped credential
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'test1223$';

-- Database scoped credential holding the SAS token
CREATE DATABASE SCOPED CREDENTIAL blobcred1
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2022-12-28T18:26:28Z&st=2022-12-15T10:26:28Z&spr=https&sig=PLZSqBdl%2BHAu5eu0xMSKsy8sc3cQ%2FRKVh3JAX3y8amQ%3D';

-- External data source pointing at the blob container
CREATE EXTERNAL DATA SOURCE CustomerDS4
    WITH (
        TYPE = BLOB_STORAGE,
        LOCATION = 'https://loyaltyhyperscale.blob.core.windows.net/loyalty',
        CREDENTIAL = blobcred1
    );

-- Load the CSV from the container into the target table
BULK INSERT [dbo].[CustomerTable] FROM
'Customer.csv'
WITH (DATA_SOURCE = 'CustomerDS4', FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');


Resolution - To resolve the issue, remove the leading ? from the SAS token and Bingo... issue resolved.

i.e. use sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2022-12-28T18:26:28Z&st=2022-12-15T10:26:28Z&spr=https&sig=PLZSqBdl%2BHAu5eu0xMSKsy8sc3cQ%2FRKVh3JAX3y8amQ%3D

instead of 
?sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2022-12-28T18:26:28Z&st=2022-12-15T10:26:28Z&spr=https&sig=PLZSqBdl%2BHAu5eu0xMSKsy8sc3cQ%2FRKVh3JAX3y8amQ%3D
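
Note that to change the secret of an existing credential, one option is to drop and recreate it; a minimal sketch reusing the names from the example above (the external data source referencing the credential has to be dropped first):

DROP EXTERNAL DATA SOURCE CustomerDS4;
DROP DATABASE SCOPED CREDENTIAL blobcred1;

CREATE DATABASE SCOPED CREDENTIAL blobcred1
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
-- same SAS token as above, starting at 'sv=' rather than '?sv='
SECRET = 'sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2022-12-28T18:26:28Z&st=2022-12-15T10:26:28Z&spr=https&sig=PLZSqBdl%2BHAu5eu0xMSKsy8sc3cQ%2FRKVh3JAX3y8amQ%3D';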

Azure Storage Gen2 error - You do not have permissions to list the data using your user account with Azure AD. Click to learn more about authenticating with Azure AD.

 Issue - You do not have permissions to list the data using your user account with Azure AD. Click to learn more about authenticating with Azure AD.

Resolution

If you are using an Azure AD token, this error is telling you that a role assignment needs to be added for the user. Go to Storage account -> Access Control (IAM) -> Add -> Add role assignment, then assign the Storage Blob Data Owner role to your login account.


Add the user name under the Storage Blob Data Owner role.

Friday, 25 November 2022

SSIS - The package file is missing from the project directory

 Hi friends,

I got the above error when the system restarted; upon reopening the SSIS project in Visual Studio, my saved package was not found.

To resolve this issue:

1. Open the file location by right-clicking on the project solution.

2. Right-click on the package, click Add Package, and choose your missing package.


This error occurs if the name of the SSIS package was recently changed.


Bingo! Issue fixed

Thursday, 24 November 2022

SSIS - conditional split output evaluated to NULL and it should return Boolean results

 Hi All,

Encountered the above error when a conditional split was used to implement SCD Type 2 in SSIS using a Lookup.


To resolve this, use the code below:


(ISNULL(email) ? "1" : email) != (ISNULL(L_email) ? "1" : L_email) || (ISNULL(phone) ? "1" : phone) != (ISNULL(L_phone) ? "1" : L_phone)



i.e. apply ISNULL to both the source and lookup (destination) columns so that the expression never returns NULL and can always evaluate to a Boolean result.
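
The same NULL-safe change detection can be written in T-SQL for comparison; a hedged sketch against hypothetical staging and dimension tables:

-- Rows whose email or phone changed, with NULL on either side replaced by '1'
SELECT s.customer_id
FROM dbo.StagingCustomer AS s
JOIN dbo.DimCustomer AS d
  ON d.customer_id = s.customer_id
WHERE ISNULL(s.email, '1') <> ISNULL(d.email, '1')
   OR ISNULL(s.phone, '1') <> ISNULL(d.phone, '1');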

Friday, 18 November 2022

SSIS - Exception deserializing the package

 Hi Folks!

We get this error often with the latest version of SSIS. To resolve it, follow the steps below:


1. Go to Task Manager.
2. Open the Details tab.
3. Locate the process "DtsDebugHost.exe".
4. Kill this process. There might be multiple instances of it; kill all of them.
5. Re-execute the SSIS package.


Ah ha...issue resolved

Wednesday, 14 September 2022

Power BI - Month is not getting sorted even when Sort by Month Number is applied

 Hi Guys,

My team encountered the issue below:

Power BI - Month is not getting sorted even when Sort by Month Number is applied


Resolution - Check the data type of Month Number. In our case it was of "Text" type; when we changed it to "Whole Number", the sort worked.
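
If the month number comes from a SQL source, one way to avoid the "Text" type altogether is to produce it as an integer in the source query; a hedged sketch against a hypothetical Orders table:

-- MONTH() returns an integer, so the column arrives as Whole Number in Power BI
SELECT
    DATENAME(MONTH, OrderDate) AS MonthName,
    MONTH(OrderDate) AS MonthNumber
FROM dbo.Orders;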


Bingo!

Power BI - the column " " of the table wasn't found... while importing data from Azure Data Lake

 Hi Folks,

I got the below error while trying to import data from Azure Blob Storage:


the column " " of the table wasn't found


Resolution - Check if there are any "Removed Columns" steps in your Power Query steps. If there are, try removing them and reloading the data; it should work.

2nd step - If that does not fix it, try refreshing the table you are trying to load.

Thursday, 2 June 2022

You can't schedule refresh for this dataset because one or more sources currently don't support refresh

 

Issue –

You can't schedule refresh for this dataset because one or more sources currently don't support refresh

 

 

Resolution –

Downloaded the report, opened it in Power BI Desktop, and clicked Refresh - the refresh succeeded.

Published this report and replaced the old one.

Difference between Azure Analysis Services and Azure Synapse

Upgraded version
- Azure Analysis Services: an upgraded version of SQL Server Analysis Services in the cloud.
- Azure Synapse: Azure Synapse Analytics is an upgraded version of SQL DW + ML + Power BI that integrates with ADLS and Databricks.

Concurrent users
- Azure Analysis Services: more than 1,000
- Azure Synapse: 128

DAX query support
- Azure Analysis Services: Yes
- Azure Synapse: No

Semantic model
- Azure Analysis Services: Yes
- Azure Synapse: Yes, through the Power BI model in Synapse Analytics

Dashboard development
- Azure Analysis Services: Power BI
- Azure Synapse: Power BI dataset

Data mining
- Azure Analysis Services: No
- Azure Synapse: Yes

Power BI connection type
- Azure Analysis Services: Live connection
- Azure Synapse: Direct/Import mode

 

Difference between Azure Data Lake and Delta Lake

Definition
- Azure Data Lake: Data Lake Storage Gen2 allows you to easily manage massive amounts of data. A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage, which organizes objects/files into a hierarchy of directories for efficient data access.
- Delta Lake: an open-source storage layer for Spark which runs on top of an Azure Data Lake. Its core functionality brings reliability to big data lakes by ensuring data integrity with ACID transactions.

Data format
- Azure Data Lake: all the raw data coming from different sources can be stored in an Azure Data Lake without pre-defining a schema for it. It can contain all types of data from many different sources without any need to process it first.

Data integrity
- Azure Data Lake: usually has multiple data pipelines reading and writing data concurrently, and it is hard to keep data integrity due to how big data pipelines work.

ACID transactions
- Azure Data Lake: if a pipeline fails while writing to a data lake, it causes the data to be partially written or corrupted, which badly affects data quality. Writes are append-only.
- Delta Lake: Delta is ACID compliant, which means we can guarantee that a write operation either finishes completely or not at all, so corrupted data is never written.

Unified batch and stream sources and sinks
- Azure Data Lake: there is no possibility of having concurrent jobs reading and writing from/to the same data.
- Delta Lake: with Delta, the same functions can be applied to both batch and streaming data, and with any change in the business logic we can guarantee that the data is consistent in both sinks.

Schema enforcement & schema evolution
- Azure Data Lake: the incoming data can change over time; in a data lake this can result in data type compatibility issues, corrupted data entering the lake, etc.
- Delta Lake: incoming data with a different schema can be prevented from entering the table, avoiding corruption. If enforcement isn't needed, users can easily change the schema of the data to intentionally adapt to data changing over time.

Time travel
- Azure Data Lake: data is constantly modified, so if a data scientist wants to reproduce an experiment with the same parameters from a week ago, it is not possible unless the data is copied multiple times.
- Delta Lake: users can go back to an older version of the data for experiment reproduction, for fixing wrong updates/deletes or other transformations that resulted in bad data, for auditing, etc.

Prevent data corruption
- Delta Lake: a Delta Lake table may include NOT NULL constraints on columns, which cannot be enforced on a regular Parquet table; this prevents records from being loaded with NULL values in columns that require data. By using a MERGE statement, a pipeline can be configured to INSERT new records or ignore records that are already present in the Delta table (see the sketch after this table).

Query execution
- Azure Data Lake: an expensive LIST operation on the blob storage for each query.
- Delta Lake: the Delta transaction log serves as the manifest. The transaction log not only keeps track of the Parquet file names but also centralizes their statistics: the min and max values of each column, found in the Parquet file footers. This allows Delta Lake to skip the ingestion of files if it can determine that they do not match the query predicate.

Processing engine
- Azure Data Lake: Apache Spark.
- Delta Lake: Delta engine. It has an intelligent caching layer that caches data on the SSD/NVMe drives of a cluster as it is ingested, making subsequent queries on the same data faster, and an enhanced query optimizer to speed up common query patterns. It also includes Photon, a native vectorization engine written in C++. Because it is optimized with performance features like indexing, Delta Lake customers have seen ETL workloads execute up to 48% faster.

Achieve compliance
- Delta Lake: new laws such as GDPR and CCPA require that companies be able to purge data pertaining to a customer should the individual request it. Delta Lake includes DELETE and UPDATE actions for the easy manipulation of data in a table.
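
To illustrate the MERGE and time travel rows above, a sketch in Spark SQL against a hypothetical Delta table customers, with customer_updates as an assumed staging table:

-- Insert only the records that are not already present in the Delta table
MERGE INTO customers AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: read the table as it was at an older version
SELECT * FROM customers VERSION AS OF 5;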

 

Difference between Dedicated SQL Pool, Serverless SQL Pool and Apache Spark Pool

Architecture
- Dedicated SQL pool: uses a node-based architecture where applications connect and issue T-SQL to a control node, which is the single point of entry in Synapse SQL. It is built on an MPP (Massively Parallel Processing) engine; compute power and storage are decoupled, so they can be scaled independently. The control node on top is the brain of the MPP engine and coordinates all compute activity.
- Serverless SQL pool: a node-based design built on a DDP (Distributed Data Processing) system. Queries are split into smaller queries and executed on the compute nodes, and the control node uses the distributed query engine to optimize the queries for parallel processing. Each small query is called a "task" and represents a distributed execution unit.
- Apache Spark pool: Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. Spark provides primitives for in-memory cluster computing: a Spark job can load and cache data into memory and query it repeatedly, which is much faster than disk-based processing.

Single source of truth
- Dedicated SQL pool: Yes
- Serverless SQL pool: No

Scale up and down / compute
- Dedicated SQL pool: manual scaling - yes; no autoscale.
- Serverless SQL pool: no autoscale, as compute is managed by Microsoft.
- Apache Spark pool: autoscale.

Use cases
- Dedicated SQL pool: Power BI DirectQuery; when you need a consistent level of performance and cost.
- Serverless SQL pool: exploring data in ad-hoc mode - you can analyze CSV, Parquet, Delta and other file formats stored in Azure Storage using simple T-SQL queries (see the sketch after this table) - and building a logical data warehouse. "Pay-per-query": you only pay for the amount of data that is processed.

Payment model
- Dedicated SQL pool: cost per hour (DWU); charges are made for the resources reserved.
- Serverless SQL pool: cost per TB of data processed; charges are made for the data processed by each query.

Data retrieval
- Dedicated SQL pool: fast - distribution strategies can be implemented (round-robin, hash, …) and query caching is available.
- Serverless SQL pool: mediocre - results cannot be cached, and data needs to be fetched from storage.

Ingest data back to the data lake
- Dedicated SQL pool: Yes
- Serverless SQL pool: No - you can only query the data lake.

Data storage
- Dedicated SQL pool: relational database
- Serverless SQL pool: data lake
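
To illustrate the serverless ad-hoc pattern mentioned in the use cases above, a minimal T-SQL sketch (the storage account, container and folder names are hypothetical):

-- Query Parquet files in the data lake directly, paying only for the data processed
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/mycontainer/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS result;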

 

Tuesday, 10 May 2022

Difference between Azure Synapse and Azure SQL DB

Use case
- Synapse: the Azure dedicated SQL pool (formerly SQL DW) is the best fit for an online database query answering system (OLAP and data warehousing) and for big data solutions, while Azure Synapse Analytics as a whole is a better fit for data mining.
- Azure SQL DB: optimized for the CRUD operations (Create, Read, Update and Delete) that you typically perform from an application; basically a solution for an online database modifying system (OLTP), and the best fit for transactional processing.

Architecture
- Synapse: uses massively parallel processing (MPP); a query is processed by dedicated compute nodes, each with its own CPU and memory. A dedicated SQL pool can be spread across multiple machines, which serves scale demands better.
- Azure SQL DB: uses symmetric multiprocessing (SMP); the database is stored on a single machine.

Active geo-replication
- Synapse: No
- Azure SQL DB: Yes

Dynamic Data Masking
- Synapse: No
- Azure SQL DB: Yes

Data encryption at rest
- Synapse: Yes
- Azure SQL DB: Yes

PolyBase T-SQL queries
- Synapse: Yes
- Azure SQL DB: Yes

Automatic tuning
- Synapse: No
- Azure SQL DB: Yes

Ability to pause and resume
- Synapse: Yes
- Azure SQL DB: No

Max size per database
- Synapse: up to 1 PB with compression (240 TB on disk)
- Azure SQL DB: 4 TB

Max concurrent open sessions
- Synapse: 1,024
- Azure SQL DB: 30,000

Max concurrent queries
- Synapse: 32
- Azure SQL DB: 6,400

Use for an application database
- Synapse: No
- Azure SQL DB: Yes

Use for a data warehouse with large amounts of data and a small number of users
- Synapse: Yes
- Azure SQL DB: No

Use for a data warehouse with at most 4 TB of data and a large number of users
- Synapse: No
- Azure SQL DB: Yes

Triggers
- Synapse: No
- Azure SQL DB: Yes

Partitioning methods (see the sketch after this table)
- Synapse: sharding, horizontal partitioning
- Azure SQL DB: tables can be distributed across several files (horizontal partitioning); sharding through federation

Foreign keys
- Synapse: No
- Azure SQL DB: Yes

Scalability
- Synapse: scale out (horizontal)
- Azure SQL DB: scale up (vertical)

Units of performance
- Synapse: Data Warehouse Units (DWU) and Compute DWU (cDWU)
- Azure SQL DB: Database Transaction Units (DTU)

Pricing range per month (approx.)
- Synapse: £900 - £130k, but you are able to pause it to reduce cost
- Azure SQL DB: £4 - £15k

Backup & recovery
- Synapse: geo-redundant backup every 24 hours
- Azure SQL DB: geo-redundant backup every 5-10 minutes, point-in-time recovery, readable replicas

JSON, XML and graph support
- Synapse: No
- Azure SQL DB: Yes

Encryption
- Synapse: TDE
- Azure SQL DB: Always Encrypted and TDE

In-Memory OLTP
- Synapse: No
- Azure SQL DB: Yes

Cross-database queries
- Synapse: No
- Azure SQL DB: Yes
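
To illustrate the sharding/horizontal partitioning row above, a hedged sketch of a hash-distributed table in a dedicated SQL pool (the table and its columns are hypothetical):

-- Rows are sharded across the pool's distributions by CustomerId
CREATE TABLE dbo.FactSales
(
    SaleId INT NOT NULL,
    CustomerId INT NOT NULL,
    Amount DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX  -- default storage for large fact tables
);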