Thursday, 2 June 2022

You can't schedule refresh for this dataset because one or more sources currently don't support refresh

 

Issue –

You can't schedule refresh for this dataset because one or more sources currently don't support refresh

 

 

Resolution –

Downloaded the report, opened it in Power BI Desktop, and clicked Refresh – the refresh completed successfully.

Published the refreshed report back to the service, replacing the old report.
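
Once the offending source is fixed and the report is republished, it can also help to confirm which data sources the published dataset actually uses, or to trigger an on-demand refresh without opening Desktop again. Below is a minimal sketch using the Power BI REST API; the workspace ID, dataset ID, and the way the Azure AD access token is obtained are placeholders you would replace with your own.

    import requests

    # Placeholder values - replace with your own workspace/dataset IDs and a valid
    # Azure AD access token for the Power BI service (e.g. acquired via MSAL).
    WORKSPACE_ID = "<workspace-guid>"
    DATASET_ID = "<dataset-guid>"
    ACCESS_TOKEN = "<azure-ad-access-token>"

    BASE = f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/datasets/{DATASET_ID}"
    HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

    # List the data sources behind the dataset - useful for spotting the source
    # that does not support scheduled refresh.
    datasources = requests.get(f"{BASE}/datasources", headers=HEADERS)
    datasources.raise_for_status()
    for ds in datasources.json()["value"]:
        print(ds["datasourceType"], ds.get("connectionDetails"))

    # Kick off an on-demand refresh once the sources are fixed.
    refresh = requests.post(f"{BASE}/refreshes", headers=HEADERS)
    refresh.raise_for_status()
    print("Refresh requested, status code:", refresh.status_code)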

Difference between Azure Analysis Services and Azure Synapse

 


Upgraded version
- Azure Analysis Services: the upgraded, cloud-hosted version of SQL Server Analysis Services.
- Azure Synapse: Azure Synapse Analytics is the upgraded version of Azure SQL Data Warehouse plus ML and Power BI, and it integrates with ADLS and Databricks.

Concurrent users
- Azure Analysis Services: more than 1,000
- Azure Synapse: 128 (the maximum number of concurrent queries on a dedicated SQL pool)

DAX query support
- Azure Analysis Services: Yes
- Azure Synapse: No

Semantic model
- Azure Analysis Services: Yes
- Azure Synapse: Yes – through a Power BI model in Synapse Analytics

Dashboard development
- Azure Analysis Services: Power BI
- Azure Synapse: Power BI dataset

Data mining
- Azure Analysis Services: No
- Azure Synapse: Yes

Power BI connection type
- Azure Analysis Services: Live connection
- Azure Synapse: DirectQuery/Import mode

 

Difference between Azure Data Lake and Delta Lake

 


Definition
- Azure Data Lake: Data Lake Storage Gen2 lets you easily manage massive amounts of data. A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage, which organizes objects/files into a hierarchy of directories for efficient data access.
- Delta Lake: an open-source storage layer from the Spark ecosystem that runs on top of an Azure Data Lake. Its core functionality brings reliability to big data lakes by ensuring data integrity with ACID transactions.

Data format
- Azure Data Lake: all the raw data coming from different sources can be stored without pre-defining a schema; the lake can hold any type of data without processing it first.
- Delta Lake: data is stored as Parquet files together with a JSON transaction log (_delta_log) that records every change to the table.

Data integrity / ACID transactions
- Azure Data Lake: multiple data pipelines usually read and write concurrently, and writes are append-only, so it is hard to guarantee integrity; if a pipeline fails mid-write, the data is left partially written or corrupted, which badly affects data quality.
- Delta Lake: ACID compliant – a write operation either finishes completely or not at all, so partially written or corrupted data is never committed (see the first PySpark sketch after this comparison).

Unified batch and stream sources and sinks
- Azure Data Lake: there is no way to have concurrent jobs reading from and writing to the same data.
- Delta Lake: the same functions can be applied to both batch and streaming data, and any change in business logic yields consistent data in both sinks (see the streaming sketch after this comparison).

Schema enforcement & schema evolution
- Azure Data Lake: incoming data can change over time, which can lead to data-type compatibility issues and corrupted data entering the lake.
- Delta Lake: incoming data with a different schema can be blocked from entering the table to avoid corruption; if enforcement isn't needed, users can let the schema evolve intentionally as the data changes over time (see the first PySpark sketch after this comparison).

Time travel
- Azure Data Lake: data is constantly modified, so reproducing an experiment with the same data as a week ago is impossible unless the data is copied multiple times.
- Delta Lake: users can go back to an older version of the data to reproduce experiments, fix bad updates/deletes or other transformations, or audit the data (see the time-travel sketch after this comparison).

Prevent data corruption
- Delta Lake: a Delta table can include NOT NULL constraints on columns, which cannot be enforced on a regular Parquet table; this prevents records from being loaded with NULL values in required columns. Using a MERGE statement, a pipeline can be configured to INSERT new records and ignore records that are already present in the Delta table (see the MERGE/DELETE sketch after this comparison).

Query execution
- Azure Data Lake: each query requires an expensive LIST operation on the blob storage.
- Delta Lake: the Delta transaction log serves as the manifest. It not only keeps track of the Parquet file names but also centralizes their statistics – the min and max values of each column, taken from the Parquet file footers – which lets Delta Lake skip files that cannot match the query predicate.

Processing engine
- Azure Data Lake: Apache Spark.
- Delta Lake: the Delta engine. It has an intelligent caching layer that caches data on the SSD/NVMe drives of a cluster as it is ingested, making subsequent queries on the same data faster, and an enhanced query optimizer that speeds up common query patterns. It also includes Photon, a native vectorization engine written in C++. Because it is optimized with performance features like indexing, Delta Lake customers have seen ETL workloads execute up to 48% faster.

Achieve compliance (GDPR/CCPA)
New laws such as GDPR and CCPA require that companies be able to purge data pertaining to a customer should the individual request it.
- Delta Lake: includes DELETE and UPDATE actions for easy manipulation of the data in a table (see the MERGE/DELETE sketch after this comparison).
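To make the ACID and schema-enforcement points concrete, here is a minimal PySpark sketch. It assumes a Synapse or Databricks notebook where a Delta-enabled `spark` session already exists; the storage path and column names are made-up examples.

    from pyspark.sql import Row

    # Hypothetical Delta table location on ADLS Gen2.
    path = "abfss://lake@mystorageaccount.dfs.core.windows.net/delta/sales"

    # Initial batch - the write is atomic: it either commits fully or not at all.
    batch1 = spark.createDataFrame([Row(id=1, amount=100.0), Row(id=2, amount=250.0)])
    batch1.write.format("delta").mode("append").save(path)

    # Schema enforcement: a batch with an incompatible schema is rejected
    # instead of silently corrupting the table.
    bad_batch = spark.createDataFrame([Row(id=3, amount="not-a-number", extra="oops")])
    try:
        bad_batch.write.format("delta").mode("append").save(path)
    except Exception as e:
        print("Write rejected by schema enforcement:", type(e).__name__)

    # Schema evolution: opt in explicitly when a new column is intentional.
    batch2 = spark.createDataFrame([Row(id=3, amount=75.0, region="EU")])
    batch2.write.format("delta").mode("append").option("mergeSchema", "true").save(path)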

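The MERGE and DELETE behaviour described above can be sketched like this, again assuming an existing Delta table and a Delta-enabled `spark` session; the table path, key column and customer id are illustrative assumptions.

    from delta.tables import DeltaTable

    path = "abfss://lake@mystorageaccount.dfs.core.windows.net/delta/customers"
    target = DeltaTable.forPath(spark, path)

    new_records = spark.createDataFrame(
        [(1, "alice@example.com"), (42, "bob@example.com")],
        ["customer_id", "email"],
    )

    # Insert-only merge: add records that are not yet in the table and ignore
    # records that are already present (the de-duplication pattern mentioned above).
    (target.alias("t")
        .merge(new_records.alias("s"), "t.customer_id = s.customer_id")
        .whenNotMatchedInsertAll()
        .execute())

    # GDPR/CCPA-style purge: delete every record belonging to one customer.
    target.delete("customer_id = 42")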
 
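Time travel from the comparison above looks roughly like this in PySpark (same assumptions: an existing `spark` session and an illustrative table path).

    from delta.tables import DeltaTable

    path = "abfss://lake@mystorageaccount.dfs.core.windows.net/delta/sales"

    # Inspect the table history to find the version or timestamp you need.
    DeltaTable.forPath(spark, path).history().select("version", "timestamp", "operation").show()

    # Read the table exactly as it was at an earlier version...
    v3 = spark.read.format("delta").option("versionAsOf", 3).load(path)

    # ...or as it was at a point in time, e.g. to rerun last week's experiment.
    last_week = (spark.read.format("delta")
                 .option("timestampAsOf", "2022-05-26 00:00:00")
                 .load(path))
    print(v3.count(), last_week.count())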

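And for the unified batch/streaming point, a sketch of one streaming job and one batch job writing to the same Delta table; the paths, checkpoint location and the event_type column are assumptions for illustration.

    path = "abfss://lake@mystorageaccount.dfs.core.windows.net/delta/events"
    checkpoint = "abfss://lake@mystorageaccount.dfs.core.windows.net/checkpoints/events"

    # Streaming sink: continuously append incoming events to the Delta table.
    events_stream = (spark.readStream
                     .format("delta")
                     .load("abfss://lake@mystorageaccount.dfs.core.windows.net/delta/raw_events"))
    stream_query = (events_stream.writeStream
                    .format("delta")
                    .outputMode("append")
                    .option("checkpointLocation", checkpoint)
                    .start(path))

    # Batch sink: a backfill job can append to the very same table;
    # Delta's transaction log keeps both writers consistent.
    backfill = spark.read.format("parquet").load(
        "abfss://lake@mystorageaccount.dfs.core.windows.net/raw/events_2022_05")
    backfill.write.format("delta").mode("append").save(path)

    # Batch readers and streaming readers see the same, consistent table.
    spark.read.format("delta").load(path).groupBy("event_type").count().show()
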
Difference between Dedicated SQL Pool, Serverless SQL Pool and Apache Spark Pool

 

 

Architecture
- Dedicated SQL Pool: a node-based architecture in which applications connect and issue T-SQL to a control node, the single point of entry into Synapse SQL. It is built on an MPP (Massively Parallel Processing) engine, and because compute and storage are decoupled they can be scaled independently. The control node is the brain of the MPP engine and coordinates all compute activity.
- Serverless SQL Pool: a node-based design built on a DDP (Distributed Data Processing) system. Queries are split into smaller queries that are executed on the compute nodes; the control node uses the distributed query engine to optimize them for parallel processing. Each small query is called a "task" and represents a distributed execution unit.
- Apache Spark Pool: Apache Spark is a parallel processing framework that uses in-memory processing to boost the performance of big-data analytics applications. Spark provides primitives for in-memory cluster computing: a Spark job can load and cache data in memory and query it repeatedly, which is much faster than disk-based processing.

Single source of truth
- Dedicated SQL Pool: Yes
- Serverless SQL Pool: No

Scale up and down / compute
- Dedicated SQL Pool: manual scaling – yes
- Serverless SQL Pool: no scaling needed – compute is managed by Microsoft
- Apache Spark Pool: autoscale

Use cases
- Dedicated SQL Pool: Power BI DirectQuery; when you need a consistent level of performance and cost.
- Serverless SQL Pool: explore data ad hoc; analyze CSV, Parquet, Delta and other file formats stored in Azure Storage using simple T-SQL queries (see the sketch at the end of this post); build a logical data warehouse; "pay-per-query" – you only pay for the amount of data that is processed.

Payment model
- Dedicated SQL Pool: cost per hour (DWU) – you are charged for the resources reserved.
- Serverless SQL Pool: cost per TB of data processed – you are charged for the data processed by each query.

Data retrieval
- Dedicated SQL Pool: fast – distribution strategies (round-robin, hash, …) and query caching can be used.
- Serverless SQL Pool: mediocre – results cannot be cached and data has to be fetched from storage.

Ingest data back to the data lake
- Dedicated SQL Pool: Yes
- Serverless SQL Pool: No – you can only query the data lake

Data storage
- Dedicated SQL Pool: relational database
- Serverless SQL Pool: data lake
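
To make the serverless "query files with T-SQL" use case concrete, here is a minimal sketch that runs an OPENROWSET query against the serverless SQL endpoint from Python. The workspace name, storage account, container and folder are placeholder assumptions, and it assumes the ODBC Driver 18 for SQL Server and the pyodbc package are installed; swap in whatever authentication method your environment uses.

    import pyodbc

    # Placeholder connection details - replace with your own workspace and credentials.
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<workspace-name>-ondemand.sql.azuresynapse.net;"
        "Database=master;"
        "Encrypt=yes;"
        "Authentication=ActiveDirectoryInteractive;"
    )

    # The serverless SQL pool reads the Parquet files directly from the data lake;
    # you are billed only for the data this query processes.
    sql = """
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales;
    """

    cursor = conn.cursor()
    for row in cursor.execute(sql):
        print(row)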