# Real-Time Stock Market Data Pipeline
Value statement: Stream live stock market data through Kafka-based medallion architecture for real-time financial analytics and Power BI dashboards.
## Overview
Built a distributed streaming data pipeline that captures real-time stock market data from the Finnhub API. The system ingests live market events via Kafka producers, processes them through a medallion architecture (Bronze/Silver/Gold) using dbt transformations in Snowflake, and serves analytics dashboards through Power BI. Orchestration runs on Airflow DAGs with automated ingestion and monitoring.
The pipeline handles high-throughput financial data streams, implements schema enforcement and data quality checks, and provides sub-minute latency for trading analytics and market research use cases.
## Goals
- Stream real-time stock data from Finnhub API at scale
- Implement event-driven architecture with Kafka producers/consumers
- Build medallion data lakehouse with MinIO (S3-compatible storage)
- Transform raw events into analytics-ready datasets with dbt
- Orchestrate ingestion and monitoring via Airflow DAGs
- Serve dashboards for real-time market analysis
- Ensure fault tolerance and exactly-once delivery semantics
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                         Finnhub API                         │
│              Live Stock Market WebSocket/REST               │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                  Kafka Producers (Python)                   │
│        Real-time event capture with schema validation       │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                    Apache Kafka Cluster                     │
│         Distributed event streaming with partitioning       │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                BRONZE: Raw Events (MinIO S3)                │
│          Immutable event store with time partitioning       │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│           SILVER: Cleansed Data (dbt + Snowflake)           │
│        Deduplication | Schema enforcement | Validation      │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│             GOLD: Analytics Tables (Snowflake)              │
│        Aggregations | Time-series | Market indicators       │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                     Power BI Dashboards                     │
│         Real-time market analytics and trading insights     │
└─────────────────────────────────────────────────────────────┘
```

## Technology Stack
| Layer | Technologies |
|---|---|
| Data Source | Finnhub API (WebSocket + REST) |
| Ingestion | Python Kafka Producers, Confluent Kafka |
| Streaming | Apache Kafka (multi-broker cluster) |
| Storage | MinIO (S3-compatible object storage) |
| Processing | dbt (data transformations), Snowflake |
| Orchestration | Apache Airflow (scheduled DAGs) |
| Visualization | Power BI (real-time dashboards) |
| Infrastructure | Docker, Docker Compose |
## Implementation Details
Kafka Producer Architecture:
Built Python-based Kafka producers using the confluent-kafka library to ingest live stock data from the Finnhub WebSocket API. Implemented error handling, retry logic, and schema validation before publishing to Kafka topics. Each producer maintains connection pooling and handles graceful shutdown.
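A minimal sketch of the validate-then-publish path. The field names follow Finnhub's trade-event shape (`s` symbol, `p` price, `v` volume, `t` timestamp), but `validate_event` and `publish_with_retry` are illustrative names; the real producer would pass `confluent_kafka.Producer(...).produce` as the `produce` callable:

```python
import json
import time

# Fields a Finnhub trade event is expected to carry (illustrative schema).
REQUIRED_FIELDS = {"s": str, "p": (int, float), "v": (int, float), "t": int}

def validate_event(event: dict) -> bool:
    """Check types and basic business rules before publishing."""
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(event.get(field), expected):
            return False
    return event["p"] >= 0 and event["t"] > 0  # non-negative price, positive epoch ms

def publish_with_retry(produce, topic: str, event: dict,
                       retries: int = 3, backoff_s: float = 0.5) -> bool:
    """Serialize and publish via an injected `produce` callable, retrying on failure."""
    if not validate_event(event):
        return False  # drop invalid events instead of polluting the topic
    payload = json.dumps(event).encode("utf-8")
    for attempt in range(retries):
        try:
            produce(topic, key=event["s"].encode(), value=payload)
            return True
        except BufferError:  # local producer queue full: back off and retry
            time.sleep(backoff_s * 2 ** attempt)
    return False
```

Injecting `produce` keeps the validation and retry logic unit-testable without a running broker.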
Event Streaming with Kafka: Configured a multi-broker Kafka cluster with topic partitioning keyed on stock symbol for parallel processing. Implemented exactly-once semantics using idempotent producers and transactional consumers. Kafka Connect handles continuous consumption into the MinIO Bronze layer.
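The settings that enable the exactly-once path look roughly like this (a sketch using librdkafka/confluent-kafka configuration keys; broker addresses and the transactional id are placeholders):

```python
# Producer side: idempotence + transactions so retries never duplicate events.
producer_conf = {
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",
    "enable.idempotence": True,                 # broker de-dupes producer retries
    "acks": "all",                              # wait for all in-sync replicas
    "transactional.id": "finnhub-producer-1",   # enables transactional writes
}

# Consumer side: only read committed transactions, commit offsets atomically.
consumer_conf = {
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",
    "group.id": "bronze-sink",
    "isolation.level": "read_committed",  # skip aborted transactions
    "enable.auto.commit": False,          # commit offsets with the transaction
}
```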
Medallion Architecture:
- Bronze Layer: Raw JSON events stored in MinIO with time-based partitioning (YYYY/MM/DD/HH). Immutable event store preserving original API responses.
- Silver Layer: dbt models cleanse data by removing duplicates, enforcing schemas, and validating business rules (e.g., non-negative prices, valid timestamps).
- Gold Layer: Aggregated analytics tables including OHLC (Open-High-Low-Close) candles, moving averages, volume indicators, and correlation matrices.
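As an illustration of the Gold-layer OHLC aggregation, the candle logic the dbt models express in SQL can be sketched in pure Python (`ohlc_candles` is a hypothetical helper; ticks are assumed time-ordered within each bucket):

```python
from collections import defaultdict

def ohlc_candles(ticks, bucket_ms=60_000):
    """Aggregate (symbol, price, volume, epoch_ms) ticks into per-minute candles."""
    buckets = defaultdict(list)  # (symbol, bucket_start_ms) -> ticks in order
    for symbol, price, volume, ts in ticks:
        buckets[(symbol, ts - ts % bucket_ms)].append((price, volume))
    candles = {}
    for key, rows in buckets.items():
        prices = [p for p, _ in rows]
        candles[key] = {
            "open": prices[0], "high": max(prices),
            "low": min(prices), "close": prices[-1],
            "volume": sum(v for _, v in rows),
        }
    return candles
```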
Airflow Orchestration: Built Airflow DAGs for:
- System health monitoring and alerting
- Backfill jobs for missed events
- Automated Bronze → Silver → Gold promotions
- Data quality checks and SLA monitoring
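For the backfill DAG, one task needs the set of Bronze hour-partitions covering a missed window. A helper like this (hypothetical name, matching the YYYY/MM/DD/HH layout described above) enumerates the object-key prefixes to rescan:

```python
from datetime import datetime, timedelta

def hour_partitions(start: datetime, end: datetime):
    """Yield Bronze object-key prefixes (YYYY/MM/DD/HH) covering [start, end)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        yield current.strftime("%Y/%m/%d/%H")
        current += timedelta(hours=1)
```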
Power BI Integration: Connected Power BI to the Snowflake Gold layer using DirectQuery for near-real-time dashboard updates. Dashboards display market trends, portfolio analytics, and trading signals.
## Data Characteristics
| Metric | Value |
|---|---|
| Event Volume | 10K+ events/minute during market hours |
| Latency | Sub-minute end-to-end (API → Dashboard) |
| Symbols Tracked | 100+ stocks and indices |
| Data Format | JSON events → Parquet/Snowflake tables |
| Retention | Bronze: 2 years |
## Reliability & Edge Cases
- Kafka replication factor 3: Tolerates individual broker failures without data loss
- Idempotent producers: Prevents duplicate events during retries
- Schema evolution: Avro/JSON schema registry for backward compatibility
- API rate limiting: Exponential backoff and request queuing
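The rate-limiting behavior above can be sketched as exponential backoff with full jitter (illustrative function; the retry count, base, and cap are tuning assumptions, not Finnhub-documented values):

```python
import random

def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 30.0):
    """Yield sleep durations for retrying rate-limited (HTTP 429) requests.

    Full jitter: each delay is uniform in [0, min(cap, base * 2**attempt)],
    which spreads retries out and avoids thundering-herd bursts.
    """
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))
```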
- Connection resilience: Automatic reconnection to Finnhub WebSocket
- Monitoring: Prometheus metrics for lag, throughput, and error rates
- Alerting: PagerDuty integration for pipeline failures and SLA breaches
## Lessons Learned
Kafka partitioning strategy: Initial round-robin partitioning caused uneven load distribution. Switched to stock symbol-based partitioning which improved parallel processing and reduced consumer lag.
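The effect of keying on symbol can be sketched with any stable hash: every event for a given symbol lands on the same partition, so per-symbol ordering is preserved and consumers scale by partition (real clients use murmur2 or CRC32-based partitioners; `partition_for` here is illustrative):

```python
import zlib

def partition_for(symbol: str, num_partitions: int) -> int:
    """Stable symbol -> partition mapping (sketch of key-based partitioning)."""
    return zlib.crc32(symbol.encode("utf-8")) % num_partitions
```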
MinIO vs S3: Chose MinIO for local development and cost reduction while maintaining S3 API compatibility. A production deployment can migrate to AWS S3 with minimal changes.
dbt incremental models: Implemented incremental models with merge strategies to handle late-arriving events and out-of-order timestamps common in financial data streams.
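The merge behavior can be sketched in Python: each batch upserts by (symbol, event timestamp), so a late-arriving or corrected event replaces the earlier row instead of duplicating it, mirroring dbt's incremental `merge` strategy with a unique key (`merge_incremental` is an illustrative name):

```python
def merge_incremental(target: dict, batch: list) -> dict:
    """Upsert rows keyed by (symbol, event_ts); later ingests win on conflict."""
    for row in batch:
        target[(row["symbol"], row["event_ts"])] = row
    return target
```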
Power BI DirectQuery limitations: DirectQuery mode has query timeout constraints. Added Snowflake materialized views to pre-aggregate heavy computations for dashboard performance.
## Future Improvements
- Add Apache Flink for complex event processing (CEP) and pattern detection
- Implement machine learning models for price prediction and anomaly detection
- Expand to cryptocurrency markets and alternative data sources
- Add real-time alerting for trading signals and portfolio thresholds
- Migrate to managed Kafka (Confluent Cloud/MSK) for production scale
- Implement data lineage tracking with Apache Atlas or OpenMetadata
- Add A/B testing framework for trading strategy backtesting