# ELT Pipeline with Medallion Architecture

**Value statement:** Architected a production ELT pipeline demonstrating modern data stack best practices: medallion architecture, dimensional modeling, and comprehensive data quality testing.
## Overview

Built a production-grade ELT pipeline implementing medallion architecture (Bronze/Silver/Gold) with dbt transformations orchestrated by Apache Airflow. Designed dimensional models (fact and dimension tables in a star schema) with comprehensive data quality tests ensuring 99%+ accuracy. Demonstrates modern analytics engineering patterns including incremental models for efficiency, dbt tests for validation, and an Airflow DAG with SLA monitoring.

The pipeline serves as a reference implementation for scalable, maintainable data transformation workflows following industry best practices from dbt Labs and Snowflake.
## Goals
- Implement medallion architecture following lakehouse patterns
- Design dimensional models (fact tables + star schema)
- Build incremental dbt models for efficient processing
- Achieve 99%+ data quality through automated testing
- Orchestrate transformations with Airflow SLA monitoring
- Establish data contracts and documentation standards
- Enable self-service analytics for business users
## Architecture
    ┌─────────────────────────────────────────────────────────────┐
    │ Source Systems                                              │
    │ Operational DBs | APIs | File Systems                       │
    └──────────────────────────┬──────────────────────────────────┘
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ ELT Ingestion Layer                                         │
    │ Fivetran | Airbyte | Custom Python Scripts                  │
    └──────────────────────────┬──────────────────────────────────┘
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ BRONZE: Raw Landing Zone (Snowflake)                        │
    │ Immutable source data with audit columns                    │
    └──────────────────────────┬──────────────────────────────────┘
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ SILVER: Cleansed & Conformed (dbt models)                   │
    │ Type casting | Deduplication | Business rules               │
    │ dbt tests: uniqueness, not_null, relationships              │
    └──────────────────────────┬──────────────────────────────────┘
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ GOLD: Analytics Layer (dimensional models)                  │
    │ Fact tables (transactions) + Dimensions (star schema)       │
    │ Incremental materialization | SCD Type 2 tracking           │
    └──────────────────────────┬──────────────────────────────────┘
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ Consumption Layer                                           │
    │ BI Tools | Data Science | APIs | Reporting                  │
    └─────────────────────────────────────────────────────────────┘
Orchestrated end-to-end by an Airflow DAG (SLA monitoring + alerting).

## Technology Stack
| Layer | Technologies |
|---|---|
| Transformation | dbt Core, dbt Cloud, Jinja templating |
| Data Warehouse | Snowflake (virtual warehouses + clustering) |
| Orchestration | Apache Airflow (DAGs with SLA monitoring) |
| Version Control | Git, GitHub (CI/CD workflows) |
| Testing | dbt tests (schema + data quality) |
| Documentation | dbt docs (auto-generated lineage) |
| Modeling | SQL, dimensional modeling (Kimball) |
## Implementation Details
**Bronze/Silver/Gold Layer Transformations:**

- **Bronze:** Raw data persisted as-is with `_loaded_at` and `_source_system` audit columns. Immutable layer for reprocessing and auditing.
- **Silver:** Cleansing models handle type casting, null handling, deduplication via `row_number()`, and business rule validation. Implemented as dbt incremental models with a merge strategy.
- **Gold:** Dimensional models following Kimball methodology with fact tables (e.g., `fact_transactions`) and dimension tables in a star schema. SCD Type 2 tracking preserves historical changes.
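The Silver-layer pattern above can be sketched as a dbt incremental model. This is illustrative only: the `raw_orders` source, `order_id` key, and column names are assumptions, not the project's actual schema.

```sql
-- models/silver/stg_orders.sql (sketch; source, key, and column names are assumptions)
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

with source as (

    select
        cast(order_id as integer)        as order_id,
        cast(order_date as date)         as order_date,
        nullif(trim(customer_email), '') as customer_email,
        _loaded_at,
        _source_system
    from {{ source('bronze', 'raw_orders') }}
    {% if is_incremental() %}
    -- only pull rows landed since the last run
    where _loaded_at > (select max(_loaded_at) from {{ this }})
    {% endif %}

),

deduplicated as (

    -- keep the most recently loaded version of each order
    select
        *,
        row_number() over (
            partition by order_id
            order by _loaded_at desc
        ) as rn
    from source

)

select order_id, order_date, customer_email, _loaded_at, _source_system
from deduplicated
where rn = 1
```

The `unique_key` lets the merge strategy upsert rather than append, which keeps re-runs idempotent.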
**Fact and Dimension Tables (Star Schema):**

- **Fact tables:** `fact_sales`, `fact_orders`, `fact_payments` with foreign keys to dimensions and additive measures
- **Dimensions:** `dim_customer`, `dim_product`, `dim_date`, `dim_location` with surrogate keys and slowly changing dimension support
- **Bridge tables:** Handle many-to-many relationships where needed
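One common way to get the SCD Type 2 tracking mentioned above is a dbt snapshot (dbt Core 1.x syntax). A minimal sketch, assuming a hypothetical `stg_customers` Silver model and illustrative tracked columns:

```sql
-- snapshots/dim_customer_snapshot.sql (sketch; model and column names are assumptions)
{% snapshot dim_customer_snapshot %}

{{ config(
    target_schema='gold',
    unique_key='customer_id',
    strategy='check',
    check_cols=['customer_name', 'customer_tier', 'region']
) }}

select * from {{ ref('stg_customers') }}

{% endsnapshot %}
```

On each run dbt compares the `check_cols`, closes out changed rows, and inserts new versions, adding `dbt_valid_from`/`dbt_valid_to` columns that downstream dimension models can expose as effective-date ranges.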
**dbt Tests for Data Quality (99%+ Accuracy):** Comprehensive test suite ensuring:
- Uniqueness tests: Primary keys, composite keys
- Not-null tests: Required columns
- Relationship tests: Foreign key integrity
- Accepted values tests: Enum validation
- Custom tests: Business logic validation (e.g., `order_date < ship_date`)
- Freshness tests: SLA monitoring for data staleness
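The generic tests in this list are declared in `schema.yml` files. A small sketch with illustrative model, column, and status values:

```yaml
# models/gold/schema.yml (sketch; names and accepted values are assumptions)
version: 2

models:
  - name: fact_orders
    columns:
      - name: order_key
        tests:
          - unique
          - not_null
      - name: customer_key
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_key
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'delivered', 'returned']
```

Custom business-logic checks are written as singular tests: a SQL file under `tests/` that returns the violating rows (dbt fails the test if any rows come back), e.g. `select * from {{ ref('fact_orders') }} where ship_date < order_date`.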
**Incremental Models for Efficiency:** Used dbt's incremental materialization strategy with:

- `is_incremental()` macro for conditional logic
- Merge strategy for upserts (update + insert)
- Delete propagation for hard deletes
- Processing time reduced from hours to minutes for large tables
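A sketch of the Gold-layer incremental pattern, combining the merge strategy with the date-based filter that the Lessons Learned section credits for the speedup. Model names, join keys, and the 3-day lookback window are assumptions; `dateadd` is Snowflake syntax:

```sql
-- models/gold/fact_sales.sql (sketch; names, keys, and lookback window are assumptions)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='sale_id',
    on_schema_change='append_new_columns'
) }}

select
    s.sale_id,
    s.sale_date,
    c.customer_key,
    p.product_key,
    s.quantity,
    s.net_amount
from {{ ref('stg_sales') }} s
left join {{ ref('dim_customer') }} c on s.customer_id = c.customer_id
left join {{ ref('dim_product') }} p on s.product_id = p.product_id
{% if is_incremental() %}
-- date-based filter keeps the merge scan small on large tables,
-- with a short lookback window to absorb late-arriving rows
where s.sale_date >= dateadd(day, -3, (select max(sale_date) from {{ this }}))
{% endif %}
```

Filtering on the date column rather than relying on `unique_key` alone lets the warehouse prune partitions/micro-partitions instead of scanning the full target table on every merge.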
**Airflow DAG with SLA Monitoring:** The orchestration DAG includes:

- `dbt run` → `dbt test` → `dbt snapshot` sequence
- SLA definitions for each transformation stage
- Alerting via Slack/email on test failures or SLA breaches
- Retry logic with exponential backoff
- DAG-level and task-level monitoring
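The DAG features above might look roughly like this in Airflow 2.x, with dbt invoked via `BashOperator`. The schedule, project path, and callback wiring are placeholders, not the project's actual configuration:

```python
# dags/elt_medallion_dag.py — sketch assuming Airflow 2.x; paths and schedule are placeholders
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=2),
    "retry_exponential_backoff": True,   # exponential backoff between retries
    "sla": timedelta(minutes=30),        # per-task SLA; breaches trigger the SLA miss callback
    # "on_failure_callback": notify_slack,  # hypothetical Slack/email alerting hook
}

with DAG(
    dag_id="elt_medallion_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",
    catchup=False,
    default_args=default_args,
) as dag:
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")
    dbt_test = BashOperator(task_id="dbt_test", bash_command="dbt test --project-dir /opt/dbt")
    dbt_snapshot = BashOperator(task_id="dbt_snapshot", bash_command="dbt snapshot --project-dir /opt/dbt")

    # a dbt test failure halts the pipeline: downstream tasks only run
    # when upstream succeeds (the default trigger rule)
    dbt_run >> dbt_test >> dbt_snapshot
```

Placing `dbt test` between `run` and `snapshot` is what makes "test failures halt the pipeline" work without extra configuration.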
## Data Characteristics
| Metric | Value |
|---|---|
| Silver Layer Models | 30+ staging models |
| Gold Layer Models | 15+ fact/dimension tables |
| dbt Tests | 150+ automated tests |
| Test Success Rate | 99%+ data quality accuracy |
| Processing Time | 45 min (full refresh) |
| Data Volume | 50M+ rows in fact tables |
## Reliability & Edge Cases

- **Schema evolution:** dbt's `on_schema_change='append_new_columns'` handles additive schema changes gracefully
- **Late-arriving dimensions:** SCD Type 2 logic handles dimension updates retroactively
- **Test failures halt the pipeline:** Airflow is configured to stop downstream tasks on dbt test failures
- **Idempotency:** All models support re-runs without duplication via unique key constraints
- **Backfill support:** Date-range macros enable historical data reprocessing
- **Snowflake optimization:** Clustering keys and automatic clustering for query performance
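The date-range backfill macro mentioned above could be sketched as a Jinja macro driven by `--vars`. The macro and variable names are hypothetical:

```sql
-- macros/incremental_date_filter.sql (hypothetical macro; var names are assumptions)
{% macro incremental_date_filter(date_column) %}
  {% if var('backfill_start', none) is not none %}
    -- backfill mode:
    -- dbt run --vars '{backfill_start: "2024-01-01", backfill_end: "2024-01-31"}'
    {{ date_column }} between '{{ var("backfill_start") }}' and '{{ var("backfill_end") }}'
  {% elif is_incremental() %}
    -- normal incremental run: only rows newer than what's already loaded
    {{ date_column }} > (select max({{ date_column }}) from {{ this }})
  {% else %}
    -- full refresh: no filter
    true
  {% endif %}
{% endmacro %}
```

A model then uses it as `where {{ incremental_date_filter('order_date') }}`, so the same model serves incremental runs, full refreshes, and targeted historical reprocessing.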
## Lessons Learned

**Incremental model complexity:** The initial `unique_key`-only approach caused performance issues on large tables. Refactoring to a partitioned merge with date-based filtering reduced processing time by 70%.

**dbt test coverage:** Starting with only basic schema tests meant data quality issues surfaced late in the pipeline. Implementing custom tests for business logic validation upfront caught issues at the Silver layer.

**Documentation as code:** Auto-generated dbt docs with lineage DAGs became invaluable for stakeholder communication and onboarding. Model descriptions were treated as first-class documentation requirements.

**Star schema design:** Over-normalized dimension tables initially caused complex joins. Simplifying to a pure star schema (no snowflaking) improved query performance for BI tools by 3x.
## Future Improvements
- Add Great Expectations for advanced statistical data quality checks
- Implement dbt metrics layer for semantic consistency
- Add dbt exposures to track downstream BI dashboard dependencies
- Integrate with dbt Cloud for scheduling and observability
- Implement data contracts with schema enforcement at ingestion
- Add row-level security (RLS) for multi-tenant data access
- Build dbt packages for reusable macros and models
- Add data mesh patterns with domain-oriented ownership