Thursday, 15 December 2022

Cannot bulk load because the file could not be opened. Operating System Error Code 5

Azure SQL bulk insert error

Issue - 

Cannot bulk load because the file could not be opened. Operating System Error Code 5


Analysis
I followed the steps below to do the bulk load but was getting the above error:

-- Create a database master key to protect the scoped credential
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'test1223$';

-- Database scoped credential holding the SAS token
CREATE DATABASE SCOPED CREDENTIAL blobcred1
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2022-12-28T18:26:28Z&st=2022-12-15T10:26:28Z&spr=https&sig=PLZSqBdl%2BHAu5eu0xMSKsy8sc3cQ%2FRKVh3JAX3y8amQ%3D';

-- External data source pointing at the blob container
CREATE EXTERNAL DATA SOURCE CustomerDS4
    WITH (
        TYPE = BLOB_STORAGE,
        LOCATION = 'https://loyaltyhyperscale.blob.core.windows.net/loyalty',
        CREDENTIAL = blobcred1
    );

-- Load the CSV from the container into the target table
BULK INSERT [dbo].[CustomerTable] FROM
'Customer.csv'
WITH (DATA_SOURCE = 'CustomerDS4', FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');


Resolution - To resolve the issue, remove the leading ? from the SAS token and Bingo... issue resolved.

i.e. use sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2022-12-28T18:26:28Z&st=2022-12-15T10:26:28Z&spr=https&sig=PLZSqBdl%2BHAu5eu0xMSKsy8sc3cQ%2FRKVh3JAX3y8amQ%3D

instead of 
?sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2022-12-28T18:26:28Z&st=2022-12-15T10:26:28Z&spr=https&sig=PLZSqBdl%2BHAu5eu0xMSKsy8sc3cQ%2FRKVh3JAX3y8amQ%3D
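
Note that to change the secret of an existing credential, one option is to drop and recreate it; a minimal sketch reusing the names from the example above (the external data source referencing the credential has to be dropped first):

DROP EXTERNAL DATA SOURCE CustomerDS4;
DROP DATABASE SCOPED CREDENTIAL blobcred1;

CREATE DATABASE SCOPED CREDENTIAL blobcred1
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
-- same SAS token as above, starting at 'sv=' rather than '?sv='
SECRET = 'sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2022-12-28T18:26:28Z&st=2022-12-15T10:26:28Z&spr=https&sig=PLZSqBdl%2BHAu5eu0xMSKsy8sc3cQ%2FRKVh3JAX3y8amQ%3D';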

Azure Storage Gen2 error - You do not have permissions to list the data using your user account with Azure AD. Click to learn more about authenticating with Azure AD.

 Issue - You do not have permissions to list the data using your user account with Azure AD. Click to learn more about authenticating with Azure AD.

Resolution

If you are using an Azure AD token, this error is telling you that a role assignment needs to be added for the user. Go to Storage account -> Access Control (IAM) -> Add -> Add role assignment, then assign the Storage Blob Data Owner role to your login account.


Add the user name under the Storage Blob Data Owner role.

Friday, 25 November 2022

SSIS - The package file is missing from the project directory

 Hi friends,

I got the above error when the system restarted; upon reopening the SSIS project in Visual Studio, my saved package was not found.

To resolve this issue:

1. Open the file location by right-clicking on the project solution.

2. Right-click on the package, click Add Package, and choose your missing package.


This error occurs if the name of the SSIS package was recently changed.


Bingo! Issue fixed

Thursday, 24 November 2022

SSIS - conditional split output evaluated to NULL and it should return Boolean results

 Hi All,

Encountered the above error when a conditional split was used to implement SCD Type 2 in SSIS using a Lookup.


To resolve this, use the code below:


(ISNULL(email) ? "1" : email) != (ISNULL(L_email) ? "1" : L_email) || (ISNULL(phone) ? "1" : phone) != (ISNULL(L_phone) ? "1" : L_phone)



i.e. apply ISNULL to both the source and lookup (destination) columns so that the expression never returns NULL and can always evaluate to a Boolean result.
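
The same NULL-safe change detection can be written in T-SQL for comparison; a hedged sketch against hypothetical staging and dimension tables:

-- Rows whose email or phone changed, with NULL on either side replaced by '1'
SELECT s.customer_id
FROM dbo.StagingCustomer AS s
JOIN dbo.DimCustomer AS d
  ON d.customer_id = s.customer_id
WHERE ISNULL(s.email, '1') <> ISNULL(d.email, '1')
   OR ISNULL(s.phone, '1') <> ISNULL(d.phone, '1');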

Friday, 18 November 2022

SSIS - Exception deserializing the package

 Hi Folks!

We get this error often with the latest version of SSIS. To resolve it, follow the steps below:


1. Go to Task Manager.
2. Open the Details tab.
3. Locate the process "DtsDebugHost.exe".
4. Kill this process. There might be multiple instances of it; kill all of them.
5. Re-execute the SSIS package.


Ah ha...issue resolved

Wednesday, 14 September 2022

Power BI - Month is not getting sorted even when Sort by Month Number is applied

 Hi Guys,

My team encountered the issue below:

Power BI - Month is not getting sorted even when Sort by Month Number is applied


Resolution - Check the data type of Month Number. In our case it was of "Text" type; when we changed it to "Whole Number", the sort worked.
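
If the month number comes from a SQL source, one way to avoid the "Text" type altogether is to produce it as an integer in the source query; a hedged sketch against a hypothetical Orders table:

-- MONTH() returns an integer, so the column arrives as Whole Number in Power BI
SELECT
    DATENAME(MONTH, OrderDate) AS MonthName,
    MONTH(OrderDate) AS MonthNumber
FROM dbo.Orders;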


Bingo!

Power BI - the column " " of the table wasn't found... while importing data from Azure Data Lake

 Hi Folks,

I got the below error while trying to import data from Azure Blob Storage:


the column " " of the table wasn't found


Resolution - Check if there are any "Removed Columns" steps in your Power Query steps. If there are, try removing them and reloading the data; it should work.

2nd step - If that does not fix it, try refreshing the table you are trying to load.

Thursday, 2 June 2022

You can't schedule refresh for this dataset because one or more sources currently don't support refresh

 

Issue –

You can't schedule refresh for this dataset because one or more sources currently don't support refresh

 

 

Resolution –

Downloaded the report, opened it in Power BI Desktop, and clicked Refresh - the refresh succeeded.

Published this report and replaced the old one.

Difference between Azure Analysis Services and Azure Synapse

Upgraded version
- Azure Analysis Services: an upgraded version of SQL Server Analysis Services in the cloud.
- Azure Synapse: Azure Synapse Analytics is an upgraded version of SQL DW + ML + Power BI that integrates with ADLS and Databricks.

Concurrent users
- Azure Analysis Services: more than 1,000
- Azure Synapse: 128

DAX query support
- Azure Analysis Services: Yes
- Azure Synapse: No

Semantic model
- Azure Analysis Services: Yes
- Azure Synapse: Yes, through the Power BI model in Synapse Analytics

Dashboard development
- Azure Analysis Services: Power BI
- Azure Synapse: Power BI dataset

Data mining
- Azure Analysis Services: No
- Azure Synapse: Yes

Power BI connection type
- Azure Analysis Services: Live connection
- Azure Synapse: Direct/Import mode

 

Difference between Azure Data Lake and Delta Lake

Definition
- Azure Data Lake: Data Lake Storage Gen2 allows you to easily manage massive amounts of data. A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage, which organizes objects/files into a hierarchy of directories for efficient data access.
- Delta Lake: an open-source storage layer for Spark which runs on top of an Azure Data Lake. Its core functionality brings reliability to big data lakes by ensuring data integrity with ACID transactions.

Data format
- Azure Data Lake: all the raw data coming from different sources can be stored in an Azure Data Lake without pre-defining a schema for it. It can contain all types of data from many different sources without any need to process it first.

Data integrity
- Azure Data Lake: usually has multiple data pipelines reading and writing data concurrently, and it is hard to keep data integrity due to how big data pipelines work.

ACID transactions
- Azure Data Lake: if a pipeline fails while writing to a data lake, it causes the data to be partially written or corrupted, which badly affects data quality. Writes are append-only.
- Delta Lake: Delta is ACID compliant, which means we can guarantee that a write operation either finishes completely or not at all, so corrupted data is never written.

Unified batch and stream sources and sinks
- Azure Data Lake: there is no possibility of having concurrent jobs reading and writing from/to the same data.
- Delta Lake: with Delta, the same functions can be applied to both batch and streaming data, and with any change in the business logic we can guarantee that the data is consistent in both sinks.

Schema enforcement & schema evolution
- Azure Data Lake: the incoming data can change over time; in a data lake this can result in data type compatibility issues, corrupted data entering the lake, etc.
- Delta Lake: incoming data with a different schema can be prevented from entering the table, avoiding corruption. If enforcement isn't needed, users can easily change the schema of the data to intentionally adapt to data changing over time.

Time travel
- Azure Data Lake: data is constantly modified, so if a data scientist wants to reproduce an experiment with the same parameters from a week ago, it is not possible unless the data is copied multiple times.
- Delta Lake: users can go back to an older version of the data for experiment reproduction, for fixing wrong updates/deletes or other transformations that resulted in bad data, for auditing, etc.

Prevent data corruption
- Delta Lake: a Delta Lake table may include NOT NULL constraints on columns, which cannot be enforced on a regular Parquet table; this prevents records from being loaded with NULL values in columns that require data. By using a MERGE statement, a pipeline can be configured to INSERT new records or ignore records that are already present in the Delta table (see the sketch after this table).

Query execution
- Azure Data Lake: an expensive LIST operation on the blob storage for each query.
- Delta Lake: the Delta transaction log serves as the manifest. The transaction log not only keeps track of the Parquet file names but also centralizes their statistics: the min and max values of each column, found in the Parquet file footers. This allows Delta Lake to skip the ingestion of files if it can determine that they do not match the query predicate.

Processing engine
- Azure Data Lake: Apache Spark.
- Delta Lake: Delta engine. It has an intelligent caching layer that caches data on the SSD/NVMe drives of a cluster as it is ingested, making subsequent queries on the same data faster, and an enhanced query optimizer to speed up common query patterns. It also includes Photon, a native vectorization engine written in C++. Because it is optimized with performance features like indexing, Delta Lake customers have seen ETL workloads execute up to 48% faster.

Achieve compliance
- Delta Lake: new laws such as GDPR and CCPA require that companies be able to purge data pertaining to a customer should the individual request it. Delta Lake includes DELETE and UPDATE actions for the easy manipulation of data in a table.
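
To illustrate the MERGE and time travel rows above, a sketch in Spark SQL against a hypothetical Delta table customers, with customer_updates as an assumed staging table:

-- Insert only the records that are not already present in the Delta table
MERGE INTO customers AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: read the table as it was at an older version
SELECT * FROM customers VERSION AS OF 5;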

 

Difference between Dedicated SQL Pool, Serverless SQL Pool and Apache Spark Pool

Architecture
- Dedicated SQL pool: uses a node-based architecture where applications connect and issue T-SQL to a control node, which is the single point of entry in Synapse SQL. It is built on an MPP (Massively Parallel Processing) engine; compute power and storage are decoupled, so they can be scaled independently. The control node on top is the brain of the MPP engine and coordinates all compute activity.
- Serverless SQL pool: a node-based design built on a DDP (Distributed Data Processing) system. Queries are split into smaller queries and executed on the compute nodes, and the control node uses the distributed query engine to optimize the queries for parallel processing. Each small query is called a "task" and represents a distributed execution unit.
- Apache Spark pool: Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. Spark provides primitives for in-memory cluster computing: a Spark job can load and cache data into memory and query it repeatedly, which is much faster than disk-based processing.

Single source of truth
- Dedicated SQL pool: Yes
- Serverless SQL pool: No

Scale up and down / compute
- Dedicated SQL pool: manual scaling - yes; no autoscale.
- Serverless SQL pool: no autoscale, as compute is managed by Microsoft.
- Apache Spark pool: autoscale.

Use cases
- Dedicated SQL pool: Power BI DirectQuery; when you need a consistent level of performance and cost.
- Serverless SQL pool: exploring data in ad-hoc mode - you can analyze CSV, Parquet, Delta and other file formats stored in Azure Storage using simple T-SQL queries (see the sketch after this table) - and building a logical data warehouse. "Pay-per-query": you only pay for the amount of data that is processed.

Payment model
- Dedicated SQL pool: cost per hour (DWU); charges are made for the resources reserved.
- Serverless SQL pool: cost per TB of data processed; charges are made for the data processed by each query.

Data retrieval
- Dedicated SQL pool: fast - distribution strategies can be implemented (round-robin, hash, …) and query caching is available.
- Serverless SQL pool: mediocre - results cannot be cached, and data needs to be fetched from storage.

Ingest data back to the data lake
- Dedicated SQL pool: Yes
- Serverless SQL pool: No - you can only query the data lake.

Data storage
- Dedicated SQL pool: relational database
- Serverless SQL pool: data lake
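
To illustrate the serverless ad-hoc pattern mentioned in the use cases above, a minimal T-SQL sketch (the storage account, container and folder names are hypothetical):

-- Query Parquet files in the data lake directly, paying only for the data processed
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/mycontainer/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS result;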

 

Tuesday, 10 May 2022

Difference between Azure Synapse and Azure SQL DB

Use case
- Synapse: the Azure dedicated SQL pool (formerly SQL DW) is the best fit for an online database query answering system (OLAP and data warehousing) and for big data solutions, while Azure Synapse Analytics as a whole is a better fit for data mining.
- Azure SQL DB: optimized for the CRUD operations (Create, Read, Update and Delete) that you typically perform from an application; basically a solution for an online database modifying system (OLTP), and the best fit for transactional processing.

Architecture
- Synapse: uses massively parallel processing (MPP); a query is processed by dedicated compute nodes, each with its own CPU and memory. A dedicated SQL pool can be spread across multiple machines, which serves scale demands better.
- Azure SQL DB: uses symmetric multiprocessing (SMP); the database is stored on a single machine.

Active geo-replication
- Synapse: No
- Azure SQL DB: Yes

Dynamic Data Masking
- Synapse: No
- Azure SQL DB: Yes

Data encryption at rest
- Synapse: Yes
- Azure SQL DB: Yes

PolyBase T-SQL queries
- Synapse: Yes
- Azure SQL DB: Yes

Automatic tuning
- Synapse: No
- Azure SQL DB: Yes

Ability to pause and resume
- Synapse: Yes
- Azure SQL DB: No

Max size per database
- Synapse: up to 1 PB with compression (240 TB on disk)
- Azure SQL DB: 4 TB

Max concurrent open sessions
- Synapse: 1,024
- Azure SQL DB: 30,000

Max concurrent queries
- Synapse: 32
- Azure SQL DB: 6,400

Use for an application database
- Synapse: No
- Azure SQL DB: Yes

Use for a data warehouse with large amounts of data and a small number of users
- Synapse: Yes
- Azure SQL DB: No

Use for a data warehouse with at most 4 TB of data and a large number of users
- Synapse: No
- Azure SQL DB: Yes

Triggers
- Synapse: No
- Azure SQL DB: Yes

Partitioning methods (see the sketch after this table)
- Synapse: sharding, horizontal partitioning
- Azure SQL DB: tables can be distributed across several files (horizontal partitioning); sharding through federation

Foreign keys
- Synapse: No
- Azure SQL DB: Yes

Scalability
- Synapse: scale out (horizontal)
- Azure SQL DB: scale up (vertical)

Units of performance
- Synapse: Data Warehouse Units (DWU) and Compute DWU (cDWU)
- Azure SQL DB: Database Transaction Units (DTU)

Pricing range per month (approx.)
- Synapse: £900 - £130k, but you are able to pause it to reduce cost
- Azure SQL DB: £4 - £15k

Backup & recovery
- Synapse: geo-redundant backup every 24 hours
- Azure SQL DB: geo-redundant backup every 5-10 minutes, point-in-time recovery, readable replicas

JSON, XML and graph support
- Synapse: No
- Azure SQL DB: Yes

Encryption
- Synapse: TDE
- Azure SQL DB: Always Encrypted and TDE

In-Memory OLTP
- Synapse: No
- Azure SQL DB: Yes

Cross-database queries
- Synapse: No
- Azure SQL DB: Yes
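
To illustrate the sharding/horizontal partitioning row above, a hedged sketch of a hash-distributed table in a dedicated SQL pool (the table and its columns are hypothetical):

-- Rows are sharded across the pool's distributions by CustomerId
CREATE TABLE dbo.FactSales
(
    SaleId INT NOT NULL,
    CustomerId INT NOT NULL,
    Amount DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX  -- default storage for large fact tables
);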