Thursday, 2 June 2022

Difference between Azure Data Lake and Delta Lake

 

 

 

Azure Data Lake

Delta Lake

Definition

Data Lake Storage Gen2 allows you to easily manage massive amounts of data.

A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage. The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access.

Delta Lake is an open-source storage layer that works with Apache Spark and runs on top of an existing data lake such as Azure Data Lake. Its core functionality brings reliability to big data lakes by ensuring data integrity with ACID transactions.

Data format

All the raw data coming from different sources can be stored in an Azure Data Lake without pre-defining a schema for it.

Azure Data Lake can contain all types of data from many different sources without any need to process it first.

 

Data integrity

Azure Data Lake usually has multiple data pipelines reading and writing data concurrently. Because of how big data pipelines work, it is hard to maintain data integrity.

 

ACID transactions

If a pipeline fails while writing to a data lake, the data can be left partially written or corrupted, which severely affects data quality.
Append only writes

Delta is ACID compliant, which means that a write operation is guaranteed either to finish completely or not to happen at all. This prevents corrupted data from being written.
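The all-or-nothing behaviour can be illustrated with a small toy model. This is only a sketch of the general idea (data files become visible only once a log entry is committed), not Delta Lake's actual implementation or API. The names `write_batch`, `committed_rows` and the `_log` folder are hypothetical and chosen for illustration.

```python
import json
import os
import tempfile
import uuid


def write_batch(table_dir, rows, fail_before_commit=False):
    """Write a data file, then publish it with a single log entry."""
    log_dir = os.path.join(table_dir, "_log")
    os.makedirs(log_dir, exist_ok=True)
    version = len(os.listdir(log_dir))

    # Step 1: write the data file. Readers do not see it yet.
    data_file = f"part-{uuid.uuid4().hex}.json"
    with open(os.path.join(table_dir, data_file), "w") as f:
        json.dump(rows, f)

    if fail_before_commit:
        raise RuntimeError("simulated pipeline failure")

    # Step 2: commit. The rename is the only step that makes
    # the new file visible to readers.
    tmp = os.path.join(table_dir, f"{version:020d}.json.tmp")
    with open(tmp, "w") as f:
        json.dump({"add": data_file}, f)
    os.replace(tmp, os.path.join(log_dir, f"{version:020d}.json"))


def committed_rows(table_dir):
    """Read only the data files that a committed log entry refers to."""
    log_dir = os.path.join(table_dir, "_log")
    rows = []
    for commit in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, commit)) as f:
            data_file = json.load(f)["add"]
        with open(os.path.join(table_dir, data_file)) as f:
            rows.extend(json.load(f))
    return rows


table = tempfile.mkdtemp()
write_batch(table, [{"id": 1}, {"id": 2}])
try:
    write_batch(table, [{"id": 3}], fail_before_commit=True)
except RuntimeError:
    pass

visible = committed_rows(table)
print(visible)
```

The failed second write leaves an orphaned data file on disk, but because no log entry points to it, readers never see it.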

 

Unified batch and stream sources and sinks

There is no built-in way for concurrent jobs to safely read from and write to the same data.

With Delta, the same functions can be applied to both batch and streaming data. When the business logic changes, the data is guaranteed to stay consistent in both sinks.

Schema enforcement & Schema evolution

Incoming data can change over time. In a Data Lake this can result in data type compatibility issues, corrupted data entering your data lake, etc.

With Delta, incoming data whose schema differs from the table's schema can be rejected before it enters the table, which avoids corrupting the data.

 

If enforcement isn’t needed, users can easily change the schema of the table to intentionally adapt to data that changes over time.
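The difference between enforcement and evolution can be sketched with a toy in-memory table. This is an illustration of the concept only, not Delta Lake code. The function `append_rows` and its `merge_schema` flag are hypothetical names used for this example.

```python
def append_rows(table, rows, merge_schema=False):
    """Append rows to a toy table: {"schema": {column: type}, "rows": [...]}.

    By default, rows with unknown columns or wrong types are rejected
    (enforcement). With merge_schema=True, unknown columns are added
    to the schema instead (evolution).
    """
    schema = dict(table["schema"])  # work on a copy until all rows pass
    for row in rows:
        for col, value in row.items():
            if col not in schema:
                if not merge_schema:
                    raise ValueError(f"unexpected column: {col}")
                schema[col] = type(value)
            elif not isinstance(value, schema[col]):
                raise TypeError(f"column {col} expects {schema[col].__name__}")
    # Every row passed validation, so the append is applied as a whole.
    table["schema"] = schema
    table["rows"].extend(rows)


events = {"schema": {"id": int, "name": str}, "rows": []}
append_rows(events, [{"id": 1, "name": "a"}])

# Enforcement: "id" arrives as a string, so the write is rejected.
try:
    append_rows(events, [{"id": "2", "name": "b"}])
    rejected = False
except TypeError:
    rejected = True

# Evolution: the new "country" column is accepted on purpose.
append_rows(events, [{"id": 3, "name": "c", "country": "NL"}], merge_schema=True)
print(rejected, sorted(events["schema"]))
```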

 

 

Time Travel

In a Data Lake, data is constantly modified. If a data scientist wants to reproduce an experiment with the same parameters from a week ago, it is not possible unless the data has been copied multiple times.

With Delta, users can go back to an older version of the data to reproduce experiments, to fix wrong updates/deletes or other transformations that resulted in bad data, to audit data, etc.
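One way to picture time travel is as replaying a log of changes up to a chosen version. The sketch below is a toy model under that assumption. The class `VersionedTable` and its methods are hypothetical and are not part of the Delta Lake API.

```python
class VersionedTable:
    """Toy table that keeps every change as an entry in an ordered log."""

    def __init__(self):
        self._log = []  # entry i is the change that produced version i

    def append(self, rows):
        self._log.append(("append", list(rows)))

    def delete_where(self, predicate):
        self._log.append(("delete", predicate))

    def read(self, version=None):
        """Replay the log up to and including `version` (latest if None)."""
        last = len(self._log) - 1 if version is None else version
        rows = []
        for op, payload in self._log[: last + 1]:
            if op == "append":
                rows.extend(payload)
            else:
                rows = [r for r in rows if not payload(r)]
        return rows


t = VersionedTable()
t.append([1, 2, 3])              # version 0
t.delete_where(lambda r: r > 1)  # version 1: an update we later regret

latest = t.read()
as_of_v0 = t.read(version=0)
print(latest, as_of_v0)
```

Reading version 0 returns the data as it was before the delete, without keeping a separate full copy of the table for each version.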

Prevent Data corruption

 

A Delta Lake table may include NOT NULL constraints on columns, which cannot be enforced on a regular Parquet table. This prevents records from being loaded with NULL values in columns that require data.

 

By using a MERGE statement, a pipeline can be configured to INSERT a new record or ignore records that are already present in the Delta Table.
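The insert-or-ignore behaviour described above can be sketched in a few lines. This is a toy illustration of the logic on in-memory lists, not the MERGE statement itself. The function `merge_insert_only` and the sample data are hypothetical.

```python
def merge_insert_only(target, source, key):
    """Insert source rows whose key is not yet in target; ignore the rest.

    Returns the number of rows inserted.
    """
    existing = {row[key] for row in target}
    inserted = 0
    for row in source:
        if row[key] not in existing:
            target.append(row)
            existing.add(row[key])  # also guards against duplicates in source
            inserted += 1
    return inserted


customers = [{"id": 1, "name": "Ann"}]
incoming = [{"id": 1, "name": "Ann (resent)"}, {"id": 2, "name": "Bob"}]

n = merge_insert_only(customers, incoming, key="id")
print(n, customers)
```

Running the same load twice inserts nothing the second time, which is what makes a pipeline built this way safe to retry.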

 

Query execution

Each query requires an expensive LIST operation on the blob storage.

The Delta transaction log serves as the manifest.
The transaction log not only keeps track of the Parquet filenames but also centralizes their statistics: the min and max values of each column, as found in the Parquet file footers. This allows Delta Lake to skip reading a file when it can determine that the file cannot match the query predicate.
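File skipping based on min/max statistics comes down to a range-overlap check. Below is a minimal sketch, assuming the statistics are already available as a dictionary. The function `files_to_scan`, the file names and the `price` column are made up for this example.

```python
def files_to_scan(file_stats, column, lo, hi):
    """Return the files whose [min, max] range for `column` overlaps [lo, hi].

    Files whose range lies entirely outside the predicate cannot contain
    matching rows, so they are skipped without being opened.
    """
    return [
        name
        for name, stats in file_stats.items()
        if stats[column]["max"] >= lo and stats[column]["min"] <= hi
    ]


# Per-file column statistics, as a transaction log might record them.
stats = {
    "part-0.parquet": {"price": {"min": 1, "max": 10}},
    "part-1.parquet": {"price": {"min": 11, "max": 20}},
    "part-2.parquet": {"price": {"min": 21, "max": 30}},
}

# Query predicate: price BETWEEN 12 AND 18
selected = files_to_scan(stats, "price", 12, 18)
print(selected)
```

Only one of the three files has to be read for this predicate. The other two are ruled out from the statistics alone.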

 

Processing engine

Apache Spark

Delta Engine:
It has an intelligent caching layer that caches data on the SSD/NVMe drives of a cluster as it is ingested, making subsequent queries on the same data faster.

It has an enhanced query optimizer to speed up common query patterns. 

 

It includes Photon, a native vectorized execution engine written in C++.

 

Because it is optimized with performance features like indexing, Delta Lake customers have seen ETL workloads execute up to 48% faster.

Achieve Compliance - New laws such as GDPR and CCPA require that companies be able to purge data pertaining to a customer should a request by the individual be made.

 

Delta Lake includes DELETE and UPDATE actions for the easy manipulation of data in a table.
