Data Lake Raw-Zone Ingestion and Cataloging
When files land in the Data Lake raw zone, the flow validates and registers them in a Dataverse catalog (schema, partition, source, row count), promotes valid files to a date-partitioned curated zone, quarantines bad ones, and notifies data engineering. Operationalizes raw-zone intake for a lakehouse.
Provided as-is, without warranty of any kind. Review and test each pattern in a non-production environment before deploying it to live automations. See our Terms.
Overview
This flow operationalizes raw-zone intake for a lakehouse on Azure Data Lake (Gen1). On a schedule it scans the raw/landing zone, reads each file, validates it, registers it in a Dataverse data catalog (file name, extension/schema, date partition, source system, row count, status), then promotes valid files to a date-partitioned curated zone and moves invalid ones to a quarantine zone. It finishes by posting a run summary to the data-engineering Teams channel.
Why it matters: ungoverned raw-zone dumps become data swamps. Cataloging every file and enforcing zone promotion keeps the lake organized, auditable, and trustworthy, with a single correlation id tracing each ingestion batch end to end.
Ships turned off (demo). All-connector reference implementation with no HTTP fallbacks.
Use Case
A data-engineering team lands files into a Data Lake raw zone from upstream systems (ERP exports, partner feeds, IoT batches) and needs them cataloged, validated, and promoted reliably without manual babysitting - a Dataverse catalog of everything that arrived, automatic promotion of conformant files, automatic quarantine of bad ones, and a per-run Teams summary.
Flow Architecture
Scan Raw Zone (Recurrence)
Recurrence: Polls the raw zone on a schedule (default every 15 min); swap for the Data Lake list-files trigger for event-style firing.
Initialize Config & Counters
Initialize variable: Mints a correlation id; binds raw/curated/quarantine paths, allowed extensions, source system, account name, Teams ids; seeds cataloged/quarantined counters.
List Raw Zone Files
Azure Data Lake - ListFiles: Lists the raw-zone path (the real data source).
For Each File
Apply to each (concurrency 1): For each file (directories are skipped), reads the content, computes the extension and the CSV row count, and validates (non-empty AND extension on the allow-list).
Promote or Quarantine
Dataverse CreateRecord + ADL UploadFile/DeleteFile: Valid files are cataloged as Cataloged, uploaded to the date-partitioned curated path, and deleted from raw (a move). Invalid files are cataloged as Quarantined, uploaded to quarantine, and deleted from raw. Each branch increments its counter.
Notify Data Engineering
Compose + Teams: Builds an HTML run summary (correlation id, source, counts) and posts it to the data-engineering channel.
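The per-file decision described above (non-empty AND extension on the allow-list, one counter per branch) can be sketched outside the flow as follows. This is an illustrative Python sketch, not the flow's own definition: `RunCounters` and `route_file` are hypothetical names, and the status strings simply mirror the Cataloged/Quarantined labels used in the catalog.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class RunCounters:
    """Hypothetical stand-in for the flow's cataloged/quarantined counter variables."""
    cataloged: int = 0
    quarantined: int = 0


def route_file(name: str, content: bytes, allowed: Set[str], counters: RunCounters) -> str:
    """Return the catalog status for one raw-zone file and bump the matching counter."""
    # Extension = text after the last dot, lowercased; a name with no dot has none.
    ext = name.rsplit(".", 1)[1].lower() if "." in name else ""
    # Valid means: file is non-empty AND its extension is on the allow-list.
    if len(content) > 0 and ext in allowed:
        counters.cataloged += 1
        return "Cataloged"
    counters.quarantined += 1
    return "Quarantined"


counters = RunCounters()
allowed = {"csv", "json", "tsv", "xml"}
sample = [("orders.csv", b"id\n1\n"), ("empty.csv", b""), ("notes.txt", b"hi")]
statuses = [route_file(name, body, allowed, counters) for name, body in sample]
```

In the flow itself, the Cataloged branch would go on to upload to the curated path and the Quarantined branch to the quarantine path; both delete the raw copy afterwards.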
Environment Variables
| Schema name | Type | Default | Description |
|---|---|---|---|
| flowlibs_RawZonePath | String | /raw/inbound | Raw/landing zone folder scanned each run. |
| flowlibs_CuratedZonePath | String | /curated | Curated zone root (date-partitioned subfolders appended). |
| flowlibs_QuarantineZonePath | String | /quarantine | Destination for files that fail validation. |
| flowlibs_DataLakeAllowedExtensions | String | csv,json,tsv,xml | Comma-separated allow-list of accepted extensions. |
| flowlibs_DataLakeSourceSystem | String | ERP-Export | Source-system label stamped on each catalog row. |
| flowlibs_DataLakeAccountName | String | your-adls-gen1-account | ADLS Gen1 account name targeted by every file op. |
| flowlibs_TeamsGroupId | String | <your-team-id> | Teams team (group) id for the notification. |
| flowlibs_TeamsChannelId | String | <your-channel-id> | Teams channel id for the notification. |
Connectors & Connections
| Connector | API name | Actions used |
|---|---|---|
| Azure Data Lake | shared_azuredatalake | ListFiles, ReadFile, UploadFile, DeleteFile |
| Microsoft Dataverse | shared_commondataserviceforapps | CreateRecord |
| Microsoft Teams | shared_teams | PostMessageToConversation |
Note — All connections are referenced as solution connection references; the flow is portable between environments as long as a connection is mapped at import time.
Customization Guide
Almost every realistic variant of this flow can be implemented by changing environment variable values. A few cases require small edits inside the flow definition — those are called out explicitly below.
- Trigger style: Replace the Recurrence with the Data Lake list-files trigger (or Event Grid blob-created) for near-real-time intake.
- Validation depth: Extend the valid check with schema-conformance (header match, column count, JSON/XML parse) beyond extension + non-empty.
- Row counting: Compose Row Count handles CSV; add branches for JSON/TSV/XML to populate the row count per format.
- Partitioning: The curated path uses yyyy/MM/dd; switch to source- or entity-based partition folders by editing Compose Curated Path.
- Heavy transforms: For large files, trigger Azure Data Factory/Databricks from the valid branch instead of processing content in the flow.
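As a rough illustration of the row-counting extension mentioned above, the sketch below dispatches on extension. It is written in Python rather than as flow actions, and every rule in it is an assumption: the hypothetical `row_count` helper treats a JSON array as one row per element, any other JSON value as a single row, and an XML document as one row per direct child of the root.

```python
import json
import xml.etree.ElementTree as ET
from typing import Optional


def row_count(ext: str, text: str) -> Optional[int]:
    """Best-effort row count per format; None when the format is not handled."""
    if ext in ("csv", "tsv"):
        # Delimited text: non-blank lines minus one header line, never negative.
        lines = [ln for ln in text.splitlines() if ln.strip()]
        return max(len(lines) - 1, 0)
    if ext == "json":
        # Assumption: a top-level array holds one row per element.
        data = json.loads(text)
        return len(data) if isinstance(data, list) else 1
    if ext == "xml":
        # Assumption: each direct child of the root element is one row.
        return len(ET.fromstring(text))
    return None
```

`json.loads` and `ET.fromstring` raise on malformed input, so a caller could treat that exception as a failed JSON/XML parse check and send the file to quarantine.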
Key Expressions
The flow is intentionally light on Power Fx / WDL gymnastics; the only expressions of note are the file-listing iteration, the extension extraction, the CSV row count, and the curated path. They are listed below in the order they appear in the flow.
EXPR.01 Iterate listing
The ADLS Gen1 file listing.
EXPR.02 Extension
Lowercase extension for the allow-list check.
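The flow's own expression is not reproduced here. As a minimal Python sketch of the same idea, the hypothetical helpers below take the text after the last dot, lowercase it, and compare it against an allow-list parsed from a comma-separated string such as the `flowlibs_DataLakeAllowedExtensions` default.

```python
from typing import Set


def file_extension(name: str) -> str:
    """Lowercased text after the last dot, or '' when the name has no dot."""
    _, dot, ext = name.rpartition(".")
    return ext.lower() if dot else ""


def parse_allow_list(raw: str) -> Set[str]:
    """Split a comma-separated allow-list, trimming blanks and normalizing case."""
    return {item.strip().lower() for item in raw.split(",") if item.strip()}


allowed = parse_allow_list("csv,json,tsv,xml")
is_allowed = file_extension("Orders.CSV") in allowed  # True: "csv" is on the list
```

Lowercasing both sides keeps the check case-insensitive, so `Orders.CSV` and `orders.csv` are treated alike.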
EXPR.03 CSV row count
Data lines minus header for CSV.
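A minimal Python sketch of "data lines minus header", given here in place of the flow's expression. It assumes blank lines are ignored, the first non-blank line is the header, and no quoted field contains an embedded line break; `csv_row_count` is a hypothetical name.

```python
def csv_row_count(text: str) -> int:
    """Count non-blank lines and subtract one for the header; never negative."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)
```

For example, `csv_row_count("id,name\n1,a\n2,b\n")` gives 2, while an empty file or a header-only file gives 0.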
EXPR.04 Curated path
Date-partitioned destination.
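The yyyy/MM/dd layout can be sketched in Python as below. Two points are assumptions rather than facts from the flow: that the partition date is the UTC run time, and that the file name is appended unchanged after the date folders. `curated_path` is a hypothetical helper.

```python
from datetime import datetime, timezone


def curated_path(root: str, file_name: str, when: datetime) -> str:
    """Build <root>/yyyy/MM/dd/<file_name>, tolerating a trailing slash on root."""
    return f"{root.rstrip('/')}/{when:%Y/%m/%d}/{file_name}"


run_time = datetime(2024, 3, 7, 9, 30, tzinfo=timezone.utc)
path = curated_path("/curated", "orders.csv", run_time)
# path == "/curated/2024/03/07/orders.csv"
```

Switching to source- or entity-based partitions would mean replacing the date segment with the relevant label while keeping the same root and file name.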