Data Lake Schema Drift Detection
When new files arrive in a dataset path, the flow infers their schema, compares it to the registered schema, and on drift (new/removed/changed columns) blocks promotion, opens a ticket, and alerts data engineering with the diff. Prevents schema drift from breaking downstream consumers.
Provided as-is, without warranty of any kind. Review and test each pattern in a non-production environment before deploying it to live automations. See our Terms.
Overview
This flow guards Azure Data Lake datasets against silent schema drift. When a new file arrives in a monitored dataset path, the flow infers its schema from the CSV header, compares it to the registered baseline schema stored in Dataverse, and classifies any change as additive (new columns) or breaking (removed columns). On drift it sets the dataset's promotion status to Blocked, opens a drift ticket in Dataverse, and alerts the data-engineering channel in Teams with the exact column diff. On first sight of a dataset it registers the baseline automatically. Why it matters: silent schema changes break downstream pipelines and reports. Detecting drift at ingestion stops bad data from propagating and gives data engineering a precise, actionable diff.
Use Case
A data engineering team ingests curated CSV extracts into a Data Lake zone. They want automatic protection against unexpected schema changes from upstream producers: new columns should be allowed (configurable), but removed or renamed columns must block promotion and raise an alert before consumers break.
Flow Architecture
When a File Is Added or Modified in the Data Lake
Azure Data Lake - Trigger_ListFiles (metadata polling, splitOn)Fires per new or modified file in the monitored dataset path; splitOn fans each file into its own run.
Initialize Trace & Config
Initialize variableMints a correlation guid for traceability, captures the file name (pathSuffix) and full read path, loads the additive policy flag and registry table name, and sets up working state for the comparison.
Check If File Is CSV
ConditionOnly .csv files of type FILE are processed.
Read & Infer Schema
Azure Data Lake - ReadFile + ComposeReads the file content and extracts/trims the CSV header into the current column array.
Get Registered Schema
Microsoft Dataverse - ListRecordsLooks up the baseline schema for this dataset path and branches on whether a baseline exists.
Classify Drift
Filter Array + Set VariableFilters added columns (in new, not baseline) and removed columns (in baseline, not new), then sets severity (None / First Registration / Additive / Breaking) and promotion status (Blocked on breaking or disallowed additive).
Gate & Alert
Microsoft Dataverse - CreateRecord + Microsoft Teams - PostMessageToConversationOn drift, opens a Dataverse drift ticket with the diff and posts the column diff to the data-engineering Teams channel; on first sight, registers the baseline schema.
Environment Variables
| Schema name | Type | Default | Description |
|---|---|---|---|
| flowlibs_DataLakeAccount | String | REPLACE_WITH_DATALAKE_ACCOUNT | Data Lake storage account name. |
| flowlibs_DataLakeDatasetPath | String | curated/sales | Folder path (no leading slash) monitored for arrivals. |
| flowlibs_SchemaRegistryTable | String | flowlibs_schemaregistries | Dataverse entity-set holding baseline schemas. |
| flowlibs_AllowAdditive | String | true | When true, new columns are allowed (additive). |
| flowlibs_TeamsGroupId | String | <your-team-id> | Teams group/team id for the alert channel. |
| flowlibs_TeamsChannelId | String | <your-channel-id> | Teams channel id for the alert. |
Connectors & Connections
| Connector | API name | Actions used |
|---|---|---|
| Azure Data Lake | shared_azuredatalake | Trigger_ListFiles (trigger) ReadFile |
| Microsoft Dataverse | shared_commondataserviceforapps | ListRecords CreateRecord |
| Microsoft Teams | shared_teams | PostMessageToConversation |
Note — All connections are referenced as solution connection references; the flow is portable between environments as long as a connection is mapped at import time.
Customization Guide
Almost every realistic variant of this flow can be implemented by changing environment variable values. A few cases require small edits inside the flow definition — those are called out explicitly below.
- Event-driven vs scheduled
- Trigger_ListFiles polls metadata hourly by default. Lower the recurrence interval for faster detection, or pair with an Event Grid blob-created subscription for true push.
- Allow / block additive
- Flip flowlibs_AllowAdditive to false to enforce a strict data contract where any new column also blocks promotion.
- File formats
- The header-inference logic targets CSV. For Parquet/JSON, replace Infer Header Row with a footer/metadata parse.
- Auto-evolve
- Extend the additive-allowed branch to update the registry (increment flowlibs_schemaversion) so the baseline tracks approved additions automatically.
- Ticketing
- Swap or supplement Create Drift Ticket with an Azure DevOps / Jira work item if drift should open a tracked engineering ticket.
Key Expressions
The flow is intentionally light on Power Fx / WDL gymnastics — the heaviest expressions are the branch-name concatenation and the approval outcome check. They are listed below in the order they appear in the flow.
EXPR.01Infer current schema (CSV header)
Extracts and trims the CSV header row as the current schema.
EXPR.02Added columns (Filter Array where)
Columns present in the new file but not in the baseline.
EXPR.03Removed columns (Filter Array where)
Columns present in the baseline but not in the new file (breaking).
EXPR.04Drift detected
Baseline exists AND columns were added or removed.
EXPR.05Registry lookup filter
OData filter to find the baseline for this dataset path.
Comments
Sign in to join the conversation.
Sign inNo comments yet. Be the first to share your experience with this flow.