Azure Data Factory is a powerful tool with a vast range of possibilities, though its pricing and architecture can be intricate. It is also very dear to me: I appreciate the many improvements and conveniences it offers over its ancestor, SQL Server Integration Services.
However, years of experience with both tools and with clients compel me to write this article, whose purpose is to highlight what I consider the most significant weaknesses of the ADFv2 service. They lead to widespread confusion, misunderstandings, and, at times, unjust disappointment.
Variable Performance in Azure Data Factory Pipelines
Quite recently, I wrote an article highlighting that in the SLA the Azure Data Factory developers allow as much as 4 minutes for the launch of each and every activity in a pipeline. That is a lot, especially since these times must be summed across all activities in a pipeline to obtain a composite SLA.
Testing Scenarios: Assessing Pipeline Performance Variability Across Different Regions
Nevertheless, I decided to set up three pipelines that transfer no data and run extremely trivial logic, independent of any sources or destinations:
- PL_SimpleWait – Just a Wait activity that delays for one second, nothing more.
- PL_SimpleParallel500SetVar – Just one ForEach activity that runs 500 times and sets a particular pipeline variable to zero. Iteration is done with a simple expression: @range(1,500). All activity settings are left at their defaults (including the empty ForEach batch count).
- PL_SimpleParallelSeq500SetVar – Exactly the same logic as above, but with “Sequential” set to true.
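For reference, the parallel ForEach pipeline described above can be sketched in ADF’s JSON authoring format roughly as follows. This is a minimal sketch based only on the description above; the activity and variable names are illustrative, not the ones actually used in the experiment:

```json
{
  "name": "PL_SimpleParallel500SetVar",
  "properties": {
    "activities": [
      {
        "name": "ForEach500",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@range(1,500)", "type": "Expression" },
          "activities": [
            {
              "name": "SetVarToZero",
              "type": "SetVariable",
              "typeProperties": { "variableName": "counter", "value": "0" }
            }
          ]
        }
      }
    ],
    "variables": { "counter": { "type": "String" } }
  }
}
```

The sequential variant differs only by adding `"isSequential": true` to the ForEach `typeProperties`.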
And I wanted to test their running times in four different regions:
- West Europe (Amsterdam, Netherlands) – which is one of the most heavily loaded regions in all of Europe.
- East US (Richmond, Virginia) – similarly to WE, one of the most heavily loaded in the USA.
- Central India (Pune) – a region I had no prior experience with; I assumed it would carry an average load.
- Poland Central (Sękocin, Warsaw) – a brand new region (opened in 2023), where ADF was launched just recently (at the end of October 2023)
I connected them all to the same Git repository and published identical pipeline code, including the triggers that start them, one after another, three times a day (UTC): 7 am, 5 pm, and 11 pm, for five days, Thursday through Monday (2–6 November 2023), on my MSDN subscription (OfferID MS-AZR-0029P).
All Data Factories were configured to send their logs to a single, central Log Analytics workspace.
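The three daily runs can be wired up with an ordinary schedule trigger. A minimal sketch in ADF’s trigger JSON, with the trigger name and start time being illustrative assumptions:

```json
{
  "name": "TR_ThreeTimesDaily",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2023-11-02T00:00:00Z",
        "timeZone": "UTC",
        "schedule": { "hours": [7, 17, 23], "minutes": [0] }
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "PL_SimpleWait",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```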
Here is the aggregated result of all runs:
```kusto
// Assumes diagnostic logs land in the resource-specific ADFPipelineRun table
ADFPipelineRun
| where Status == "Succeeded"
| where PipelineName startswith "PL_Simple"
| extend Duration = End - Start // Calculate duration
| summarize CountRuns = count(), // Number of runs
    AverageDuration = format_timespan(avg(Duration), "hh:mm:ss"), // Average duration
    MedianDuration = percentile(Duration, 50), // Median duration
    MinDuration = min(Duration), // Minimum duration
    MaxDuration = max(Duration) // Maximum duration
    by PipelineName, Location
| project PipelineName, Location, CountRuns, AverageDuration, MedianDuration, MinDuration, MaxDuration
| order by PipelineName desc, Location asc
```
Analysis and Conclusions
The data shows significant variability in execution times, even for pipelines that run trivial logic without data transfers. Regions with higher workloads, like West Europe and East US, exhibit longer and more variable processing times than newer or less burdened regions such as Poland Central. This suggests that selecting a region with lower demand could lead to more consistent and faster pipeline execution times. In some cases, the difference between the shortest and the longest run time of the same pipeline within the same data factory was more than threefold.
Other interesting conclusions:
- It seems that pipeline start-up is directly related to the search for resources to execute all the activities contained within it.
- You can easily spot delays in pipeline start-up by analyzing the time difference reported by the engine in the service logs. Simply parse the JSON in the SystemParameters column and compute the difference between PipelineRunRequestTime and ExecutionStart. The query is provided later in this article.
- Unfortunately, the same cannot be checked for individual activities. You can only analyze the intervals between the launches of consecutive activities, for example inside a ForEach; these gaps may amount to seconds, which is not a significant amount of time.
- Ultimately, once the ADF engine starts a pipeline, it usually executes all activities without significant delays, although some passes through ForEach loops took longer than others.
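The activity-level intervals mentioned above can be approximated from the ADFActivityRun log table. This is a sketch under two assumptions: logs go to the resource-specific ADFActivityRun table (whose Start/End columns mirror ADFPipelineRun), and the run-id filter is a placeholder you replace with an actual PipelineRunId:

```kusto
ADFActivityRun
| where Status == "Succeeded"
| where PipelineName == "PL_SimpleParallelSeq500SetVar"
| where PipelineRunId == "<your-run-id>" // placeholder: restrict to one run so gaps are meaningful
| order by Start asc // sorting serializes the rows, so prev() can be applied
| extend GapFromPrevious = Start - prev(End) // idle time between consecutive activity runs
| summarize AvgGap = avg(GapFromPrevious), MaxGap = max(GapFromPrevious)
```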
Therefore, one should consider an increased risk of delays associated with the launching of many pipelines, including those that are nested. These will be the subject of my further research, and I will describe them in the next article.
However, based on my preliminary findings and experience with Azure Data Factory projects, it appears that the more pipelines you have, the greater the delays you should expect. This implies that the technique of modularizing tasks by abstracting repeated logic into separate pipelines and then invoking them through an “execute pipeline” activity can result in the loss of valuable seconds. Aiming to consolidate activities into a single pipeline can mitigate this risk, though one must consider the limit of 40 activities per pipeline.
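For clarity, the modularization pattern in question invokes a child pipeline through the Execute Pipeline activity, and each such invocation pays the pipeline start-up cost again. A minimal sketch of the activity, with the child pipeline name being an illustrative assumption:

```json
{
  "name": "RunChildPipeline",
  "type": "ExecutePipeline",
  "typeProperties": {
    "pipeline": {
      "referenceName": "PL_ChildLogic",
      "type": "PipelineReference"
    },
    "waitOnCompletion": true
  }
}
```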
Of course, this does not change the fact that the shortest pipeline run with a single ForEach activity was 16 seconds for Poland Central, and exactly twice as much, 32 seconds, for West Europe. Consequently, the execution of activities itself is also subject to delays, not just the waiting time for the pipeline to start. Load aside, it is also worth noting that iterating over 500 elements and setting the value of one variable should not require much computing power; in other programming environments, even when performed sequentially, it takes only two to three seconds (in SSIS it takes just three, I checked).
Meanwhile, in ADF, parallelized across 20 threads (the ForEach default; the maximum is 50), it takes dozens of seconds, and when run sequentially it takes as long as 5–6 minutes; only in Poland Central was it possible to finish in 3 minutes 36 seconds, which is still a questionable result.
By the way, I would like to emphasize a very important piece of information here:
Azure Data Factory (ADF) may not be the ideal choice for tasks requiring the execution of logic with a very frequent cadence, such as micro-batching. The platform’s architecture, which excels at managing ETL processes and orchestrating complex data workflows, may face challenges in consistently meeting the tight scheduling demands these tasks require. However, for scenarios demanding more regular pipeline execution without the risk of missing time-based triggers due to unpredictable duration, ADF’s tumbling window triggers offer a robust solution. They allow for back-to-back processing windows, maintaining a continuous flow of execution.
For use cases that demand real-time or near-real-time data processing, streaming technologies in Databricks, Microsoft Fabric, or other Spark-based solutions are often more suitable. These technologies provide a streaming data framework that can process high volumes of data with low latency, supporting scenarios where immediate data ingestion and processing are crucial. They are better aligned with real-time analytics needs and can handle the demands of streaming logic with greater efficiency than batch-oriented tools like ADF.
Analyzing Pipeline Start Delays: PipelineRunRequestTime vs. ExecutionStart
Start delays can be a critical factor in the efficiency and timing of data processing workflows. To provide clarity on these delays, we compare two pivotal timestamps: PipelineRunRequestTime and ExecutionStart. PipelineRunRequestTime is the timestamp when the run request is received by the ADF engine, while ExecutionStart marks the actual commencement of execution.
By parsing the SystemParameters JSON column within ADF logs, we can extract these times and calculate the latency that occurs before the pipeline activities kick off. This analysis not only aids in pinpointing the bottlenecks that may affect overall performance but also serves as a benchmark for optimizing the scheduling and resource allocation for high-frequency pipeline runs.
The code below calculates three distinct time measurements in Azure Data Factory (ADF) pipeline runs:
- RequestToStartDuration – This column measures the delay between the pipeline run request and the start of execution. It captures the duration ADF was in a waiting state, not executing any pipeline activities and possibly queuing resources. This delay is a significant metric for understanding the responsiveness of ADF scheduling.
- RealExecutionTime – This column reflects the actual time ADF takes to execute the pipeline once it has started. It is computed by deducting the RequestToStartDuration from the TotalPipelineDuration, providing insight into the time spent on the active pipeline processing phases.
- TotalPipelineDuration – This column is the total elapsed time for the pipeline run, from request to completion. It encompasses all periods, including any initial delays and the actual execution time, giving a full scope of the time investment for a pipeline run.
```kusto
// Assumes diagnostic logs land in the resource-specific ADFPipelineRun table
ADFPipelineRun
| where Status == "Succeeded" // Filter to only include successful pipeline runs
| where PipelineName startswith "PL_Simple" // Filter pipelines starting with "PL_Simple"
| extend SystemParametersParsed = parse_json(SystemParameters) // Parse the SystemParameters JSON column
| extend TotalPipelineDuration = End - Start // Calculate the total duration of the pipeline
| extend ExecutionStart = todatetime(SystemParametersParsed.ExecutionStart),
         PipelineRunRequestTime = todatetime(SystemParametersParsed.PipelineRunRequestTime) // Convert both timestamps to datetime
| extend RequestToStartDuration = ExecutionStart - PipelineRunRequestTime // Delay from request to start
| extend RealExecutionTime = TotalPipelineDuration - RequestToStartDuration // Subtract the initial delay from the total duration
| project PipelineName, Location, RequestToStartDuration, RealExecutionTime, TotalPipelineDuration // Select columns for output
| order by RequestToStartDuration desc // Longest start delays first
| take 5 // Limit the results to the top 5 records
```
This article was created with the invaluable assistance of ChatGPT 4, which was particularly helpful with grammar, paraphrasing, and KQL queries. 🙂
The logo was also created with the help of DALL-E from OpenAI.