# ELT Pipeline with Medallion Architecture

**Value statement:** Architected a production ELT pipeline demonstrating modern data stack best practices: medallion architecture, dimensional modeling, and comprehensive data quality testing.
## Overview

Built a production-grade ELT pipeline implementing medallion architecture (Bronze/Silver/Gold) with dbt transformations orchestrated by Apache Airflow. Designed dimensional models (fact and dimension tables in a star schema) with comprehensive data quality tests ensuring 99%+ accuracy. Demonstrates modern analytics engineering patterns including incremental models for efficiency, dbt tests for validation, and an Airflow DAG with SLA monitoring.

The pipeline serves as a reference implementation for scalable, maintainable data transformation workflows following industry best practices from dbt Labs and Snowflake.
## Goals
- Implement medallion architecture following lakehouse patterns
- Design dimensional models (fact tables + star schema)
- Build incremental dbt models for efficient processing
- Achieve 99%+ data quality through automated testing
- Orchestrate transformations with Airflow SLA monitoring
- Establish data contracts and documentation standards
- Enable self-service analytics for business users
## Architecture
    ┌─────────────────────────────────────────────────────────────┐
    │ Source Systems                                              │
    │ Operational DBs | APIs | File Systems                       │
    └──────────────────────────┬──────────────────────────────────┘
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ ELT Ingestion Layer                                         │
    │ Fivetran | Airbyte | Custom Python Scripts                  │
    └──────────────────────────┬──────────────────────────────────┘
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ BRONZE: Raw Landing Zone (Snowflake)                        │
    │ Immutable source data with audit columns                    │
    └──────────────────────────┬──────────────────────────────────┘
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ SILVER: Cleansed & Conformed (dbt models)                   │
    │ Type casting | Deduplication | Business rules               │
    │ dbt tests: uniqueness, not_null, relationships              │
    └──────────────────────────┬──────────────────────────────────┘
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ GOLD: Analytics Layer (dimensional models)                  │
    │ Fact tables (transactions) + Dimensions (star schema)       │
    │ Incremental materialization | SCD Type 2 tracking           │
    └──────────────────────────┬──────────────────────────────────┘
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ Consumption Layer                                           │
    │ BI Tools | Data Science | APIs | Reporting                  │
    └─────────────────────────────────────────────────────────────┘
Orchestrated end-to-end by an Airflow DAG (SLA monitoring + alerting).

## Technology Stack
| Layer | Technologies |
|---|---|
| Transformation | dbt Core, dbt Cloud, Jinja templating |
| Data Warehouse | Snowflake (virtual warehouses + clustering) |
| Orchestration | Apache Airflow (DAGs with SLA monitoring) |
| Version Control | Git, GitHub (CI/CD workflows) |
| Testing | dbt tests (schema + data quality) |
| Documentation | dbt docs (auto-generated lineage) |
| Modeling | SQL, dimensional modeling (Kimball) |
## Implementation Details
**Bronze/Silver/Gold Layer Transformations:**

- **Bronze:** Raw data persisted as-is with `_loaded_at` and `_source_system` audit columns. Immutable layer for reprocessing and auditing.
- **Silver:** Cleansing models handle type casting, null handling, deduplication via `row_number()`, and business rule validation. Implemented as dbt incremental models with a merge strategy.
- **Gold:** Dimensional models following Kimball methodology with fact tables (e.g., `fact_transactions`) and dimension tables in a star schema. SCD Type 2 tracking preserves historical changes.
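The Silver-layer pattern above can be sketched as a dbt incremental model. This is illustrative only: the `raw_orders` source, `order_id` key, and column names are assumptions, not the project's actual schema.

```sql
-- models/silver/stg_orders.sql (sketch; source, key, and column names are assumptions)
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

with source as (

    select
        cast(order_id as integer)        as order_id,
        cast(order_date as date)         as order_date,
        nullif(trim(customer_email), '') as customer_email,
        _loaded_at,
        _source_system
    from {{ source('bronze', 'raw_orders') }}
    {% if is_incremental() %}
    -- only pull rows landed since the last run
    where _loaded_at > (select max(_loaded_at) from {{ this }})
    {% endif %}

),

deduplicated as (

    -- keep the most recently loaded version of each order
    select
        *,
        row_number() over (
            partition by order_id
            order by _loaded_at desc
        ) as rn
    from source

)

select order_id, order_date, customer_email, _loaded_at, _source_system
from deduplicated
where rn = 1
```

The `unique_key` lets the merge strategy upsert rather than append, which keeps re-runs idempotent.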
**Fact and Dimension Tables (Star Schema):**

- **Fact tables:** `fact_sales`, `fact_orders`, `fact_payments` with foreign keys to dimensions and additive measures
- **Dimensions:** `dim_customer`, `dim_product`, `dim_date`, `dim_location` with surrogate keys and slowly changing dimension support
- **Bridge tables:** Handle many-to-many relationships where needed
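One common way to get the SCD Type 2 tracking mentioned above is a dbt snapshot (dbt Core 1.x syntax). A minimal sketch, assuming a hypothetical `stg_customers` Silver model and illustrative tracked columns:

```sql
-- snapshots/dim_customer_snapshot.sql (sketch; model and column names are assumptions)
{% snapshot dim_customer_snapshot %}

{{ config(
    target_schema='gold',
    unique_key='customer_id',
    strategy='check',
    check_cols=['customer_name', 'customer_tier', 'region']
) }}

select * from {{ ref('stg_customers') }}

{% endsnapshot %}
```

On each run dbt compares the `check_cols`, closes out changed rows, and inserts new versions, adding `dbt_valid_from`/`dbt_valid_to` columns that downstream dimension models can expose as effective-date ranges.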
**dbt Tests for Data Quality (99%+ Accuracy):** Comprehensive test suite ensuring:
- Uniqueness tests: Primary keys, composite keys
- Not-null tests: Required columns
- Relationship tests: Foreign key integrity
- Accepted values tests: Enum validation
- Custom tests: Business logic validation (e.g., `order_date < ship_date`)
- Freshness tests: SLA monitoring for data staleness
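The generic tests in this list are declared in `schema.yml` files. A small sketch with illustrative model, column, and status values:

```yaml
# models/gold/schema.yml (sketch; names and accepted values are assumptions)
version: 2

models:
  - name: fact_orders
    columns:
      - name: order_key
        tests:
          - unique
          - not_null
      - name: customer_key
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_key
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'delivered', 'returned']
```

Custom business-logic checks are written as singular tests: a SQL file under `tests/` that returns the violating rows (dbt fails the test if any rows come back), e.g. `select * from {{ ref('fact_orders') }} where ship_date < order_date`.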
**Incremental Models for Efficiency:** Used dbt's incremental materialization strategy with:

- `is_incremental()` macro for conditional logic
- Merge strategy for upserts (update + insert)
- Delete propagation for hard deletes
- Processing time reduced from hours to minutes for large tables
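A sketch of the Gold-layer incremental pattern, combining the merge strategy with the date-based filter that the Lessons Learned section credits for the speedup. Model names, join keys, and the 3-day lookback window are assumptions; `dateadd` is Snowflake syntax:

```sql
-- models/gold/fact_sales.sql (sketch; names, keys, and lookback window are assumptions)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='sale_id',
    on_schema_change='append_new_columns'
) }}

select
    s.sale_id,
    s.sale_date,
    c.customer_key,
    p.product_key,
    s.quantity,
    s.net_amount
from {{ ref('stg_sales') }} s
left join {{ ref('dim_customer') }} c on s.customer_id = c.customer_id
left join {{ ref('dim_product') }} p on s.product_id = p.product_id
{% if is_incremental() %}
-- date-based filter keeps the merge scan small on large tables,
-- with a short lookback window to absorb late-arriving rows
where s.sale_date >= dateadd(day, -3, (select max(sale_date) from {{ this }}))
{% endif %}
```

Filtering on the date column rather than relying on `unique_key` alone lets the warehouse prune partitions/micro-partitions instead of scanning the full target table on every merge.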
**Airflow DAG with SLA Monitoring:** The orchestration DAG includes:

- `dbt run` → `dbt test` → `dbt snapshot` sequence
- SLA definitions for each transformation stage
- Alerting via Slack/email on test failures or SLA breaches
- Retry logic with exponential backoff
- DAG-level and task-level monitoring
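The DAG features above might look roughly like this in Airflow 2.x, with dbt invoked via `BashOperator`. The schedule, project path, and callback wiring are placeholders, not the project's actual configuration:

```python
# dags/elt_medallion_dag.py — sketch assuming Airflow 2.x; paths and schedule are placeholders
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=2),
    "retry_exponential_backoff": True,   # exponential backoff between retries
    "sla": timedelta(minutes=30),        # per-task SLA; breaches trigger the SLA miss callback
    # "on_failure_callback": notify_slack,  # hypothetical Slack/email alerting hook
}

with DAG(
    dag_id="elt_medallion_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",
    catchup=False,
    default_args=default_args,
) as dag:
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")
    dbt_test = BashOperator(task_id="dbt_test", bash_command="dbt test --project-dir /opt/dbt")
    dbt_snapshot = BashOperator(task_id="dbt_snapshot", bash_command="dbt snapshot --project-dir /opt/dbt")

    # a dbt test failure halts the pipeline: downstream tasks only run
    # when upstream succeeds (the default trigger rule)
    dbt_run >> dbt_test >> dbt_snapshot
```

Placing `dbt test` between `run` and `snapshot` is what makes "test failures halt the pipeline" work without extra configuration.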
## Data Characteristics
| Metric | Value |
|---|---|
| Silver Layer Models | 30+ staging models |
| Gold Layer Models | 15+ fact/dimension tables |
| dbt Tests | 150+ automated tests |
| Test Success Rate | 99%+ data quality accuracy |
| Processing Time | 45 min (full refresh) |
| Data Volume | 50M+ rows in fact tables |
## Reliability & Edge Cases

- **Schema evolution:** dbt's `on_schema_change='append_new_columns'` handles additive schema changes gracefully
- **Late-arriving dimensions:** SCD Type 2 logic handles dimension updates retroactively
- **Test failures halt the pipeline:** Airflow is configured to stop downstream tasks on dbt test failures
- **Idempotency:** All models support re-runs without duplication via unique key constraints
- **Backfill support:** Date-range macros enable historical data reprocessing
- **Snowflake optimization:** Clustering keys and automatic clustering for query performance
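The date-range backfill macro mentioned above could be sketched as a Jinja macro driven by `--vars`. The macro and variable names are hypothetical:

```sql
-- macros/incremental_date_filter.sql (hypothetical macro; var names are assumptions)
{% macro incremental_date_filter(date_column) %}
  {% if var('backfill_start', none) is not none %}
    -- backfill mode:
    -- dbt run --vars '{backfill_start: "2024-01-01", backfill_end: "2024-01-31"}'
    {{ date_column }} between '{{ var("backfill_start") }}' and '{{ var("backfill_end") }}'
  {% elif is_incremental() %}
    -- normal incremental run: only rows newer than what's already loaded
    {{ date_column }} > (select max({{ date_column }}) from {{ this }})
  {% else %}
    -- full refresh: no filter
    true
  {% endif %}
{% endmacro %}
```

A model then uses it as `where {{ incremental_date_filter('order_date') }}`, so the same model serves incremental runs, full refreshes, and targeted historical reprocessing.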
## Lessons Learned

**Incremental model complexity:** The initial `unique_key`-only approach caused performance issues on large tables. Refactoring to a partitioned merge with date-based filtering reduced processing time by 70%.

**dbt test coverage:** Starting with only basic schema tests meant data quality issues surfaced late in the pipeline. Implementing custom tests for business logic validation upfront caught issues at the Silver layer.

**Documentation as code:** Auto-generated dbt docs with lineage DAGs became invaluable for stakeholder communication and onboarding. Model descriptions were treated as first-class documentation requirements.

**Star schema design:** Over-normalized dimension tables initially caused complex joins. Simplifying to a pure star schema (no snowflaking) improved query performance for BI tools by 3x.
## Future Improvements
- Add Great Expectations for advanced statistical data quality checks
- Implement dbt metrics layer for semantic consistency
- Add dbt exposures to track downstream BI dashboard dependencies
- Integrate with dbt Cloud for scheduling and observability
- Implement data contracts with schema enforcement at ingestion
- Add row-level security (RLS) for multi-tenant data access
- Build dbt packages for reusable macros and models
- Add data mesh patterns with domain-oriented ownership