Cosmos DB Backup Verification and Restore Drill
On a schedule, the flow verifies that Cosmos DB continuous backup is enabled, performs a point-in-time restore of a sample to a scratch account, validates document integrity, tears the scratch account down, and reports recoverability to Teams. It proves that Cosmos backups actually restore.
Provided as-is, without warranty of any kind. Review and test each pattern in a non-production environment before deploying it to live automations. See our Terms.
Overview
This flow proves that Azure Cosmos DB continuous backups can actually be recovered. On a weekly schedule it (1) confirms the source account has continuous (point-in-time) backup enabled, (2) performs a point-in-time restore into a disposable scratch account, (3) polls the restore to completion, (4) validates document integrity on the restored copy, (5) tears the scratch account down to avoid ongoing cost, and (6) reports the recoverability outcome to a Microsoft Teams channel. Every report is stamped with a correlation id.
Why it matters: Untested backups give a false sense of safety. A periodic restore drill is the only way to confirm real recoverability (RPO/RTO) and surface configuration gaps before a real disaster.
The flow ships Off. Going live requires only authorizing the three connections (ARM, Cosmos DB, Teams) and setting the environment-variable values.
Use Case
A platform or SRE team that runs business-critical data on Cosmos DB needs recurring, auditable proof that the data can be recovered to a recent point in time. Rather than trust that "continuous backup is on," this drill exercises the full restore path on a schedule and posts a pass/fail recoverability report to the team's channel.
Flow Architecture
Weekly Drill Window
Recurrence: Runs the restore drill weekly during a quiet maintenance window (Sat 02:00 ET).
Initialize CorrelationId and config
Initialize Variable (x13): Mints a guid() for traceability and loads all 12 config values from their environment variables (subscription, resource group, source/scratch account, database, container, restore minutes, location, ARM API version, restorable source id, Teams group/channel).
Compute RestoreTimestamp
Initialize Variable: Computes the UTC point-in-time target (now minus the configured restore-point minutes) and seeds the poll and report working variables.
Get Source Cosmos Account
ARM Resources_GetById: Reads the source account and composes properties.backupPolicy.type to detect Continuous vs Periodic backup.
Condition Backup Is Continuous
Condition (If): Proceeds with the restore drill only when continuous point-in-time backup is enabled; otherwise marks FAIL with an explanation.
Start PITR Restore
ARM Resources_CreateOrUpdateById: Provisions a disposable scratch account via point-in-time restore (createMode: Restore) and captures the initial provisioning state.
Until Restore Complete
Do Until (Delay + ARM Resources_GetById): Polls the scratch account status until provisioning reaches Succeeded or Failed.
Validate Restored Documents
Cosmos QueryDocuments_V5: On success, runs a COUNT integrity query on the restored copy, records the document count, and sets status PASS (else FAIL).
Environment Variables
| Schema name | Type | Default | Description |
|---|---|---|---|
| flowlibs_AzureSubscriptionId | String | <your-subscription-id> | Azure subscription GUID hosting the Cosmos accounts (reused) |
| flowlibs_CosmosResourceGroup | String | flowlibs-data-rg | Resource group of the Cosmos accounts |
| flowlibs_CosmosSourceAccount | String | flowlibs-cosmos-prod | Source account whose backups are verified |
| flowlibs_CosmosScratchAccount | String | flowlibs-cosmos-restoretest | Disposable account created and torn down by the drill |
| flowlibs_CosmosDatabaseId | String | ReferenceDb | Database id for the integrity query (reused) |
| flowlibs_CosmosContainerId | String | ReferenceData | Container id for the integrity query (reused) |
| flowlibs_RestorePointMinutes | String | 60 | Minutes before now to target for the restore |
| flowlibs_CosmosLocation | String | eastus | Azure region for the restored scratch account |
| flowlibs_CosmosArmApiVersion | String | 2024-11-15 | ARM API version for Cosmos databaseAccounts calls |
Connectors & Connections
| Connector | API name | Actions used |
|---|---|---|
| Azure Resource Manager | shared_arm | Resources_GetById Resources_CreateOrUpdateById Resources_DeleteById |
| Azure Cosmos DB | shared_documentdb | QueryDocuments_V5 |
| Microsoft Teams | shared_teams | PostMessageToConversation |
Note: All connections are referenced as solution connection references; the flow is portable between environments as long as a connection is mapped at import time.
Customization Guide
Almost every realistic variant of this flow can be implemented by changing environment variable values. A few cases require small edits inside the flow definition; those are called out explicitly below.
- Schedule: Change the Weekly Drill Window recurrence (frequency, day, hour, time zone) to match your maintenance window or RPO targets.
- Restore point: Set flowlibs_RestorePointMinutes to drill a different point in time (e.g. 1440 for ~24h ago).
- Auto-discover the restorable account: Instead of the flowlibs_RestorableSourceAccountId env var, add an HTTP GET (ARM OAuth) to the restorableDatabaseAccounts endpoint, filter by properties.accountName, and feed the matched id into restoreSource.
- Scope the validation: Point flowlibs_CosmosDatabaseId and flowlibs_CosmosContainerId at a representative sample, or add more QueryDocuments_V5 checks (per-partition counts, checksums) for deeper integrity assurance.
- Guard validate on success: Validation already runs only inside the Restore Succeeded branch; teardown runs regardless as a cost guard. Add a failure path (e.g. retry or alert) if teardown itself fails.
- Report destination: Swap the Teams action for email, an incident ticket, or a Dataverse log table to retain a recoverability audit trail.
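The auto-discovery variant comes down to filtering the restorableDatabaseAccounts list response by properties.accountName. A minimal Python sketch of that filter, assuming the ARM response has already been parsed into a dict with a `value` array. The helper name `find_restorable_account_id` and the choice to prefer entries without a `deletionTime` are illustrative assumptions, not part of the flow.

```python
from typing import Optional


def find_restorable_account_id(response: dict, account_name: str) -> Optional[str]:
    """Return the id of the restorable account matching account_name, or None."""
    wanted = account_name.lower()
    matches = [
        item
        for item in response.get("value", [])
        if ((item.get("properties") or {}).get("accountName") or "").lower() == wanted
    ]
    if not matches:
        return None
    # Assumption: a live account (no deletionTime) is preferred over a deleted one.
    live = [m for m in matches if not (m.get("properties") or {}).get("deletionTime")]
    return (live or matches)[0].get("id")


# Hypothetical, trimmed response used only to exercise the helper.
sample_response = {
    "value": [
        {"id": "/restorable/old", "properties": {"accountName": "flowlibs-cosmos-prod",
                                                 "deletionTime": "2024-01-01T00:00:00Z"}},
        {"id": "/restorable/live", "properties": {"accountName": "flowlibs-cosmos-prod"}},
        {"id": "/restorable/other", "properties": {"accountName": "another-account"}},
    ]
}
matched_id = find_restorable_account_id(sample_response, "flowlibs-cosmos-prod")
```

In the flow itself the same selection would be done with a Filter array action over the HTTP response; the matched id then replaces the environment-variable value in restoreSource.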
Key Expressions
The flow is intentionally light on Power Fx / WDL gymnastics; the heaviest expressions are the restore-point timestamp and the restore properties body. They are listed below in the order they appear in the flow.
EXPR.01 Restore point (UTC)
Targets a point-in-time N minutes before now for the PITR restore
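As a rough illustration of this calculation outside the flow, here is a minimal Python sketch. It assumes the restore-point minutes arrive as a string (the environment variable is typed String) and that an ISO 8601 UTC timestamp is wanted. The function name `restore_timestamp` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def restore_timestamp(restore_point_minutes: str, now: Optional[datetime] = None) -> str:
    """Return the UTC restore target (now minus N minutes) as an ISO 8601 string."""
    # Normalize to UTC; a naive datetime would be interpreted as local time here.
    current = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    target = current - timedelta(minutes=int(restore_point_minutes))
    return target.strftime("%Y-%m-%dT%H:%M:%SZ")


# Fixed reference time so the example is deterministic.
reference = datetime(2025, 1, 4, 7, 0, 0, tzinfo=timezone.utc)
one_hour_back = restore_timestamp("60", reference)
one_day_back = restore_timestamp("1440", reference)
```

The string-to-integer conversion matters: because the value is stored as a String, the subtraction only works after an explicit cast.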
EXPR.02 Backup type
Extracts Continuous vs Periodic from the source account
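The check reads properties.backupPolicy.type from the account returned by ARM. A small Python sketch of the same null-safe lookup, assuming the account has been parsed into a dict. The helper names and the "Unknown" fallback are illustrative assumptions.

```python
def backup_type(account: dict) -> str:
    """Return properties.backupPolicy.type, or 'Unknown' when it is absent."""
    properties = account.get("properties") or {}
    policy = properties.get("backupPolicy") or {}
    return policy.get("type") or "Unknown"


def is_continuous(account: dict) -> bool:
    """True only when the account reports continuous (point-in-time) backup."""
    return backup_type(account).lower() == "continuous"


# Hypothetical, trimmed ARM payloads.
continuous_account = {"properties": {"backupPolicy": {"type": "Continuous"}}}
periodic_account = {"properties": {"backupPolicy": {"type": "Periodic"}}}
```

Treating a missing policy as "Unknown" rather than raising keeps the drill on its FAIL-with-explanation path instead of failing the run.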
EXPR.03 Restore body (properties)
Builds the ARM restore properties object passed to Resources_CreateOrUpdateById
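A sketch of what such a properties object might look like, written in Python for readability. The field names (createMode, restoreParameters, restoreSource, restoreTimestampInUtc, and so on) follow the ARM databaseAccounts schema as commonly documented, but the exact body the flow sends is an assumption here and should be checked against the API version configured in flowlibs_CosmosArmApiVersion. The function name is hypothetical.

```python
def build_restore_properties(location: str, restorable_source_id: str,
                             restore_timestamp_utc: str) -> dict:
    """Build an ARM 'properties' object for a point-in-time restore (sketch)."""
    return {
        "createMode": "Restore",
        "databaseAccountOfferType": "Standard",
        # Single-region scratch account in the configured location.
        "locations": [{"locationName": location, "failoverPriority": 0}],
        "restoreParameters": {
            "restoreMode": "PointInTime",
            # Id of the restorable database account, not the live account id.
            "restoreSource": restorable_source_id,
            "restoreTimestampInUtc": restore_timestamp_utc,
        },
    }


# Placeholder values; the restorable id below is not a real resource.
restore_properties = build_restore_properties(
    location="eastus",
    restorable_source_id="<restorable-source-account-id>",
    restore_timestamp_utc="2025-01-04T06:00:00Z",
)
```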
EXPR.04 Provisioning state (poll)
Reads provisioning state each poll iteration
EXPR.05 Restored document count
Pulls the COUNT result from the integrity query
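How the count is pulled out depends on the query shape. The Python sketch below assumes the query rows are available as a list and handles two common shapes: `SELECT VALUE COUNT(1) FROM c`, which yields a bare number, and `SELECT COUNT(1) FROM c`, which yields an object with a single aggregate field. Which query the flow actually issues is not stated, so both the shapes and the helper name are assumptions.

```python
def restored_document_count(rows: list) -> int:
    """Extract a single COUNT value from the first row of a query result."""
    if not rows:
        raise ValueError("integrity query returned no rows")
    first = rows[0]
    if isinstance(first, dict):
        # Non-VALUE aggregate: expect exactly one field holding the count.
        values = list(first.values())
        if len(values) != 1:
            raise ValueError("expected a single aggregate field")
        return int(values[0])
    # SELECT VALUE form: the row is the number itself.
    return int(first)


count_from_value_query = restored_document_count([42])
count_from_object_query = restored_document_count([{"$1": 7}])
```

Raising on an empty result keeps a broken query from being recorded as a zero-document restore.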
EXPR.06 Loop exit
Exits the Do Until when restore reaches a terminal state
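The provisioning-state read (EXPR.04) and the loop exit (EXPR.06) work as a pair: read the state, stop on Succeeded or Failed, otherwise wait and try again. A minimal Python sketch of that pattern, assuming a caller-supplied `get_account` function that returns the scratch account as a dict. The names, the delay, the poll cap and the "TimedOut" result are illustrative assumptions rather than the flow's settings.

```python
import time
from typing import Callable

TERMINAL_STATES = {"Succeeded", "Failed"}


def provisioning_state(account: dict) -> str:
    """Return properties.provisioningState, or 'Unknown' when it is absent."""
    return (account.get("properties") or {}).get("provisioningState") or "Unknown"


def poll_until_terminal(get_account: Callable[[], dict], delay_seconds: float = 60,
                        max_polls: int = 60,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until a terminal state is seen; return 'TimedOut' after max_polls."""
    for _ in range(max_polls):
        state = provisioning_state(get_account())
        if state in TERMINAL_STATES:
            return state
        sleep(delay_seconds)
    return "TimedOut"


# Simulated sequence of ARM responses; no real waiting in this example.
responses = iter([
    {"properties": {"provisioningState": "Creating"}},
    {"properties": {"provisioningState": "Creating"}},
    {"properties": {"provisioningState": "Succeeded"}},
])
final_state = poll_until_terminal(lambda: next(responses), sleep=lambda _: None)
```

A poll cap is worth keeping in any variant: Do Until loops have their own count and timeout limits, and a restore that never reaches a terminal state should surface as a FAIL rather than a hung run.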