Difference between Azure Data Lake and Delta Lake
| | Azure Data Lake | Delta Lake |
| --- | --- | --- |
| Definition | Data Lake Storage Gen2 makes it easy to manage massive amounts of data. A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage, which organizes objects/files into a hierarchy of directories for efficient data access. | Delta Lake is an open-source storage layer for Apache Spark that runs on top of an Azure Data Lake. Its core functionality brings reliability to big data lakes by ensuring data integrity with ACID transactions. |
| Data format | All the raw data coming from different sources can be stored in an Azure Data Lake without pre-defining a schema for it. An Azure Data Lake can contain all types of data from many different sources without any need to process it first. | |
| Data integrity | An Azure Data Lake usually has multiple data pipelines reading and writing data concurrently. Maintaining data integrity is hard because of how big data pipelines work. | |
| ACID transactions | If a pipeline fails while writing to a data lake, the data is left partially written or corrupted, which severely affects data quality. | Delta is ACID compliant: a write operation is guaranteed to either finish completely or not at all, so corrupted data is never committed (see the write sketch below). |
| Unified batch and stream sources and sinks | Concurrent jobs cannot read and write from/to the same data. | With Delta, the same functions can be applied to both batch and streaming data, and when the business logic changes, the data is guaranteed to stay consistent in both sinks (see the batch/streaming sketch below). |
| Schema enforcement & schema evolution | Incoming data can change over time. In a data lake this can cause data type compatibility issues, corrupted data entering the lake, and so on. | With Delta, incoming data whose schema differs from the table's can be blocked from entering it, avoiding corruption. When enforcement isn't needed, users can easily change the schema of the table to adapt intentionally to data changing over time (see the schema sketch below). |
| Time Travel | In a data lake, data is constantly modified, so a data scientist who wants to reproduce an experiment with the same parameters from a week ago cannot do so unless the data is copied multiple times. | With Delta, users can go back to an older version of the data to reproduce experiments, fix wrong updates/deletes or other transformations that produced bad data, audit data, etc. (see the time-travel sketch below). |
| Prevent data corruption | | A Delta Lake table may include NOT NULL constraints on columns, which cannot be enforced on a regular Parquet table; this prevents records from being loaded with NULL values in columns that require data. Using a MERGE statement, a pipeline can be configured to INSERT a new record or ignore records that are already present in the Delta table (see the constraint and MERGE sketch below). |
| Query execution | Each query needs an expensive LIST operation on the blob storage to discover the files to read. | The Delta transaction log serves as the manifest, so queries find the relevant files without listing the storage. |
| Processing engine | Apache Spark | The Delta engine with Photon, a native vectorized execution engine written in C++. It is optimized with performance features like indexing; Delta Lake customers have seen ETL workloads execute up to 48% faster. |
| Achieve compliance | New laws such as GDPR and CCPA require that companies be able to purge data pertaining to a customer should the individual request it. | Delta Lake includes DELETE and UPDATE actions for easy manipulation of data in a table (see the DELETE/UPDATE sketch below). |
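First, a minimal PySpark sketch of what the ACID write guarantee looks like in practice. The path and sample rows are made up for illustration, and a Spark session with Delta Lake already configured is assumed (for example a Databricks cluster, or a local session set up with the delta-spark pip package):

```python
from pyspark.sql import SparkSession

# Assumes a Delta-enabled Spark session (e.g. Databricks, or a local
# session configured with the delta-spark package).
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# The append is transactional: it commits atomically to the Delta
# transaction log, so a failure mid-write leaves no partially
# written data visible to readers.
df.write.format("delta").mode("append").save("/tmp/delta/customers")
```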
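Next, a sketch of the unified batch/streaming behavior: the same hypothetical Delta path is read once as a batch source and once as a streaming source, and both views are backed by the same transaction log:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed

# Batch read of a Delta table...
batch_df = spark.read.format("delta").load("/tmp/delta/events")

# ...and a streaming read of the very same table, mirrored into a
# second Delta table. Both reads share one transaction log, so they
# always see a consistent view of the data.
stream = (
    spark.readStream.format("delta")
    .load("/tmp/delta/events")
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events-mirror")
    .start("/tmp/delta/events-mirror")
)
```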
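A sketch of schema enforcement and schema evolution, again with a made-up table path: the mismatched append is rejected, and the mergeSchema option opts in to evolving the table schema intentionally:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed
path = "/tmp/delta/customers"               # hypothetical table path

spark.createDataFrame([(1, "alice")], ["id", "name"]) \
    .write.format("delta").mode("overwrite").save(path)

# Enforcement: this append carries an extra column, so Delta rejects
# it with an AnalysisException instead of silently corrupting the table.
evolved = spark.createDataFrame([(2, "bob", "NL")], ["id", "name", "country"])
try:
    evolved.write.format("delta").mode("append").save(path)
except Exception as e:
    print("rejected:", type(e).__name__)

# Evolution: opt in explicitly when the schema change is intentional.
evolved.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)
```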
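Time travel is exposed through the versionAsOf and timestampAsOf read options. A sketch, with a hypothetical path and timestamp:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed
path = "/tmp/delta/customers"               # hypothetical table path

# Read the table exactly as it was at an earlier version...
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# ...or as of a timestamp, e.g. to rerun last week's experiment
# against last week's data.
lastweek = (
    spark.read.format("delta")
    .option("timestampAsOf", "2021-05-01 00:00:00")
    .load(path)
)
```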
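A sketch of the NOT NULL constraint and the insert-if-absent MERGE pattern in Spark SQL; the customers table and the inline source row are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed

# NOT NULL constraint: any write that supplies a NULL id is rejected,
# which a plain Parquet table cannot enforce.
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers (
        id   BIGINT NOT NULL,
        name STRING
    ) USING DELTA
""")

# MERGE as an idempotent insert: new ids are inserted, while ids
# already present in the Delta table are left untouched.
spark.sql("""
    MERGE INTO customers t
    USING (SELECT 1 AS id, 'alice' AS name) s
    ON t.id = s.id
    WHEN NOT MATCHED THEN INSERT *
""")
```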
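Finally, the DELETE and UPDATE actions that make GDPR/CCPA purge requests practical, assuming the hypothetical customers table from the previous sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed

# Purge one customer's records to honor an erasure request.
spark.sql("DELETE FROM customers WHERE id = 42")

# Or correct a field in place with an UPDATE.
spark.sql("UPDATE customers SET name = NULL WHERE id = 43")
```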