Thursday, 2 June 2022

 

Difference between Dedicated SQL Pool, Serverless Pool and Apache Spark Pool

 

 

Dedicated SQL Pool

Serverless SQL Pool

Apache Spark pool

Architecture

It uses a node-based architecture where applications can connect and issue T-SQL to a control node. That control node is the single point of entry in Synapse SQL.   Architecture


It is build on a MPP engine, which stands for Massive Parallel Processing engine. Scaling can be done independently because the compute power and storage are decoupled. 

The control node on top is the brain of the MPP engine. It coordinates all compute activity.

It’s a node-based design based on the DDP system, or Distributed Data Processing system. In this system query’s are split into smaller query’s, and executed on the compute nodes. The control node on top utilized the distributed query engine to optimize the query’s for parallel processing. Each small query is called a “task” and represents a distributed execution unit.

 

Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications.


Apache Spark provides primitives for in-memory cluster computing. A Spark job can load and cache data into memory and query it repeatedly. In-memory computing is much faster than disk-based applications.

Single source of truth

Yes

No

 

Scale up and down/Compute

Manual -Yes

Autoscale - No as it is managed by microsoft

Autoscale

Use cases

Power BI Direct query

When you need a consistent level of performance and cost

To explore the data in ad-hoc modus

You can analyze CSV, Parquet, Delta and other file formats stored in Azure Storage by using simple T-SQL query’s.

 

Build a Logical Data Warehouse

 

 

 

 

 

“Pay-per-query” or in other words you only pay for the amount of data that is processed. 

 

Payment Model 

Cost per hour (DWU) 

Charges are made for the resources reserved

Cost per TB of data processed 

Charges are made for the data processed on each query.

 

Data Retrieval 

Fast: Distribution Strategy’s can be implemented (Round-robin, Hash, …), Query Caching.  

Mediocre: Results cannot be cached, data needs to be fetched from storage. 

 

Ingest data back to Datelake

Yes

No – you can only query data lake

 

Data storage

Relational database

Data lake

 

No comments:

Post a Comment