Data Factory Pipeline Failure Triage
When an Azure Data Factory pipeline run fails, the flow pulls the failed activity, error message, and inputs, classifies the failure type, opens a Dataverse incident, alerts the data team in Teams with a rerun button, and auto-retries transient failures. Speeds recovery from ADF pipeline failures.
Provided as-is, without warranty of any kind. Review and test each pattern in a non-production environment before deploying it to live automations. See our Terms.
Overview
This flow speeds recovery from Azure Data Factory (ADF) pipeline failures. When a pipeline run fails, an HTTP failure callback triggers the flow. It reads the run metadata, pulls the failed activity and its error from the ADF REST queryActivityruns endpoint, classifies the failure as Transient or Fatal, opens a Dataverse incident, auto-retries transient failures by starting a fresh pipeline run, and alerts the data team in Teams with the full context and a "Rerun in ADF Studio" link.
Why it matters: ADF failures often go unnoticed until downstream data is late. Automated triage with root-cause context, a logged incident, auto-retry for transient errors, and a one-click rerun link shortens time-to-recovery and removes manual log-digging.
The flow ships turned off (demo).
Use Case
A data engineering team wants immediate, actionable handling of ADF pipeline failures: a durable incident record, automatic retry of known-transient errors, and a Teams alert that carries the error and a rerun shortcut - without anyone watching the ADF monitoring blade.
Flow Architecture
When an HTTP Request Is Received
Request (HTTP): Receives the failure callback (runId, pipelineName, dataFactoryName, status, message) from an Azure Monitor webhook or a tail Web activity.
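As an illustration of the callback contract, the sketch below validates a payload carrying the five fields named above and maps it to a typed record. It is written in Python rather than as a flow definition; `FailureCallback` and `parse_callback` are hypothetical names, and treating `message` as optional is an assumption of this sketch.

```python
from dataclasses import dataclass

# Field names as listed for the failure callback; "message" is allowed to be empty here.
REQUIRED_FIELDS = ("runId", "pipelineName", "dataFactoryName", "status")


@dataclass(frozen=True)
class FailureCallback:
    run_id: str
    pipeline_name: str
    data_factory_name: str
    status: str
    message: str


def parse_callback(payload: dict) -> FailureCallback:
    """Validate the callback body and map it to a typed record (hypothetical helper)."""
    missing = [name for name in REQUIRED_FIELDS if not payload.get(name)]
    if missing:
        raise ValueError("callback is missing: " + ", ".join(missing))
    return FailureCallback(
        run_id=str(payload["runId"]),
        pipeline_name=str(payload["pipelineName"]),
        data_factory_name=str(payload["dataFactoryName"]),
        status=str(payload["status"]),
        message=str(payload.get("message") or ""),
    )
```

Rejecting incomplete payloads at the trigger keeps later steps from calling ADF with an empty run id.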
Initialize Trace & Config
Initialize variable: Mints a correlation id and captures trigger inputs plus env vars: subscription/resource group/factory, Teams ids, transient-code list, and working vars.
Get Pipeline Run Details
Azure Data Factory - GetPipelineRun: Reads run metadata (status, message, pipeline name, start/end, duration).
Query Activity Runs
HTTP - ADF REST queryActivityruns: Lists the run's activities with per-activity status, error code, and message via Entra-authenticated built-in HTTP.
Extract & Classify Failure
Filter array + Compose: Keeps failed activities, takes the first as root cause, extracts error code/message, and classifies Transient or Fatal against configured substrings.
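The extraction half of this step can be sketched in Python as follows. It assumes each activity run carries `activityName`, `status`, and an `error` object with `errorCode` and `message`; check those field names against the actual queryActivityruns response in your tenant. `find_root_cause` is a hypothetical helper, not part of the flow.

```python
from typing import Dict, List, Optional


def find_root_cause(activity_runs: List[dict]) -> Optional[Dict[str, str]]:
    """Keep failed activities and return the first one as the root cause."""
    failed = [run for run in activity_runs if run.get("status") == "Failed"]
    if not failed:
        return None  # nothing failed at the activity level
    first = failed[0]
    error = first.get("error") or {}  # tolerate a missing or null error object
    return {
        "activityName": str(first.get("activityName", "")),
        "errorCode": str(error.get("errorCode", "")),
        "message": str(error.get("message", "")),
    }
```

"First" here means first in the order the list is passed in, so sort the activity runs by start time beforehand if the earliest failure should count as the root cause.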
Auto-Retry If Transient
Condition + Azure Data Factory - CreatePipelineRun: On a transient classification, starts a fresh pipeline run and records the rerun id; otherwise notes manual triage.
Open Incident
Microsoft Dataverse - CreateRecord: Opens an incident with the full failure context, classification, action taken, rerun id, and correlation id; status Open.
Post Teams Alert
Microsoft Teams - PostMessageToConversation: Posts the failure summary, classification, action taken, and a Rerun in ADF Studio link to the data team channel.
Environment Variables
| Schema name | Type | Default | Description |
|---|---|---|---|
| flowlibs_DataFactoryName | String | adf-enterprise-prod | ADF factory name. |
| flowlibs_AzureSubscriptionId | String | <your-subscription-id> | Subscription that owns the factory. |
| flowlibs_ResourceGroupName | String | rg-data-prod | Factory's resource group. |
| flowlibs_AzureTenantId | String | <your-tenant-id> | Entra tenant id for the REST OAuth. |
| flowlibs_AzureClientId | String | <your-client-id> | Entra app (client) id - needs Data Factory Reader/Contributor. |
| flowlibs_AzureClientSecret | String | <configure> | Entra app secret. |
| flowlibs_TeamsGroupId | String | <your-team-id> | Teams team (group) id for alerts. |
| flowlibs_TeamsChannelId | String | <your-channel-id> | Teams channel id for alerts. |
| flowlibs_TransientErrorCodes | String | ["Timeout","ThrottlingException","ServiceBusy","TransientFailure","429","503","504","ConnectionReset"] | JSON array of substrings that mark a failure transient/auto-retryable. |
Connectors & Connections
| Connector | API name | Actions used |
|---|---|---|
| Azure Data Factory | shared_azuredatafactory | GetPipelineRun, CreatePipelineRun |
| HTTP | http | queryActivityruns (ADF REST) |
| Microsoft Dataverse | shared_commondataserviceforapps | CreateRecord |
| Microsoft Teams | shared_teams | PostMessageToConversation |
Note — All connections are referenced as solution connection references; the flow is portable between environments as long as a connection is mapped at import time.
Customization Guide
Almost every realistic variant of this flow can be implemented by changing environment variable values. A few cases require small edits inside the flow definition — those are called out explicitly below.
- Transient list: Tune the transient error codes to your environment's retryable codes/messages.
- Backoff: Add a Delay before the rerun, or a retry counter on the incident to cap auto-retries and back off exponentially.
- Root-cause hints: Map known error codes to runbook links and include them in the Teams alert.
- SLA escalation: Add a branch that escalates (for example @-mentions on-call) when the same pipeline fails repeatedly within a window.
- Rerun parameters: Pass the original run's parameters into CreatePipelineRun if your pipelines are parameterized.
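The Backoff customization can be sketched as a capped exponential delay driven by a retry counter. The Python below is illustrative only: `retry_delay_seconds` and its defaults (60-second base, 15-minute cap, three attempts) are assumptions, not values taken from the flow.

```python
from typing import Optional


def retry_delay_seconds(
    attempt: int,
    base_seconds: int = 60,
    cap_seconds: int = 900,
    max_attempts: int = 3,
) -> Optional[int]:
    """Return the wait before auto-retry number `attempt` (0-based), or None to stop.

    The delay doubles with each attempt and never exceeds cap_seconds.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if attempt >= max_attempts:
        return None  # retry budget exhausted: leave the incident for manual triage
    return min(cap_seconds, base_seconds * 2 ** attempt)
```

In the flow, the attempt number would come from a counter stored on the incident, and the returned value would feed a Delay action placed before CreatePipelineRun.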
Key Expressions
The flow is intentionally light on Power Fx / WDL gymnastics; the heaviest expressions are the failure classification, the transient-substring match, and the queryActivityruns URI. They are listed below.
EXPR.01 Classification
Transient when any configured substring matched the error text, else Fatal.
EXPR.02 Transient match
Filter test over the configured transient substrings.
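The logic behind the classification and the transient match can be sketched in Python. This is not the flow's actual expression text: the helper names are hypothetical, and the case-insensitive comparison is a choice of this sketch that may differ from how the flow compares strings.

```python
import json
from typing import List


def load_transient_codes(raw: str) -> List[str]:
    """Parse the JSON array stored in flowlibs_TransientErrorCodes."""
    codes = json.loads(raw)
    if not isinstance(codes, list):
        raise ValueError("transient code list must be a JSON array")
    return [str(code) for code in codes if str(code).strip()]


def matched_codes(error_text: str, codes: List[str]) -> List[str]:
    """Filter test: keep the configured substrings that occur in the error text."""
    haystack = error_text.lower()
    return [code for code in codes if code.lower() in haystack]


def classify(error_text: str, codes: List[str]) -> str:
    """Transient when any configured substring matched, else Fatal."""
    return "Transient" if matched_codes(error_text, codes) else "Fatal"
```

Because the match is a plain substring test, short numeric entries such as "429" can also match unrelated digits in a message; longer, more specific substrings reduce false positives.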
EXPR.03 queryActivityruns URI
ADF REST endpoint for per-activity errors.
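For reference, the Python sketch below assembles the endpoint from the subscription, resource group, factory, and run id, together with the time-window body that the ADF REST API requires on this POST call. The helper names and the one-day lookback are assumptions; the flow itself builds the URI from its environment variables.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote

API_VERSION = "2018-06-01"  # ADF REST API version


def query_activity_runs_uri(subscription_id: str, resource_group: str,
                            factory_name: str, run_id: str) -> str:
    """Build the queryActivityruns URI for one pipeline run."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{quote(subscription_id, safe='')}"
        f"/resourceGroups/{quote(resource_group, safe='')}"
        "/providers/Microsoft.DataFactory"
        f"/factories/{quote(factory_name, safe='')}"
        f"/pipelineruns/{quote(run_id, safe='')}"
        f"/queryActivityruns?api-version={API_VERSION}"
    )


def query_window(now: datetime, lookback_days: int = 1) -> dict:
    """Request body: the window of activity updates to search (lookback is an assumption)."""
    return {
        "lastUpdatedAfter": (now - timedelta(days=lookback_days)).isoformat(),
        "lastUpdatedBefore": now.isoformat(),
    }


if __name__ == "__main__":
    print(query_activity_runs_uri("<your-subscription-id>", "rg-data-prod",
                                  "adf-enterprise-prod", "<run-id>"))
    print(query_window(datetime.now(timezone.utc)))
```

Widen the lookback for long-running pipelines, since a window that is too narrow returns no activity runs.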