Kubernetes Fundamentals for Data Engineers — What You Actually Need to Know
The essential Kubernetes concepts for data engineers - pods, deployments, jobs, and resource management for Spark and Airflow workloads.
You don’t need to be a Kubernetes administrator to work effectively as a data engineer on a platform that runs on K8s. But you do need a working mental model of how it operates and enough fluency to debug, deploy, and reason about your workloads. Here’s the essential subset.
The Core Mental Model
Kubernetes is a container orchestrator. You describe the desired state of your workload (how many replicas, how much CPU/memory, which image to run), and Kubernetes continuously works to make reality match that description. If a container crashes, K8s restarts it. If a node dies, K8s reschedules your workload onto a healthy node.
Everything in Kubernetes is a resource described in YAML. You apply YAML manifests, and the system converges toward what you described.
The Resources That Matter for Data Workloads
Pod: The smallest deployable unit. A pod runs one or more containers. For most data engineering work, think of a pod as “one running instance of your pipeline container.”
Deployment: Manages a set of identical pods. You specify a replica count and an image, and the Deployment ensures that many pods are always running. Use Deployments for long-running services like an API that serves model predictions.
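As a minimal sketch, a Deployment for such a service might look like the following (the name, labels, and image are placeholders for illustration):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prediction-api          # hypothetical service name
spec:
  replicas: 3                   # Kubernetes keeps three identical pods running
  selector:
    matchLabels:
      app: prediction-api
  template:
    metadata:
      labels:
        app: prediction-api
    spec:
      containers:
        - name: api
          image: us-central1-docker.pkg.dev/my-project/pipelines/api:v1.0.0  # placeholder image
          ports:
            - containerPort: 8080
```

If a pod crashes or its node disappears, the Deployment controller replaces it to get back to three replicas.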
Job: Runs a pod to completion and then stops. This is what you want for batch data pipelines — run the ETL, exit successfully, done. If it fails, Kubernetes can retry it based on your backoffLimit.
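A bare-bones Job for a one-off backfill could be sketched like this (the name, image, and arguments are illustrative, not from a real pipeline):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backfill-orders         # hypothetical job name
spec:
  backoffLimit: 3               # retry the pod up to 3 times on failure
  template:
    spec:
      restartPolicy: Never      # let the Job controller handle retries
      containers:
        - name: etl
          image: us-central1-docker.pkg.dev/my-project/pipelines/etl:v1.0.0  # placeholder
          args: ["--mode", "backfill"]
```

Once the container exits successfully, the Job is marked Complete and nothing is restarted.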
CronJob: A Job on a schedule. Think of it as a lightweight alternative to Airflow for simple scheduled tasks.
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-etl
spec:
  schedule: "0 6 * * *"  # 6am UTC daily
  jobTemplate:
    spec:
      backoffLimit: 3
      template:
        spec:
          containers:
            - name: etl
              image: us-central1-docker.pkg.dev/my-project/pipelines/etl:v1.0.0
              resources:
                requests:
                  memory: "2Gi"
                  cpu: "500m"
                limits:
                  memory: "4Gi"
                  cpu: "1000m"
              env:
                - name: BQ_DATASET
                  value: "curated"
          restartPolicy: OnFailure
```

Service: Exposes pods on a stable network endpoint. If your pipeline includes an API component, a Service gives it a consistent DNS name and load balances across pod replicas.
ConfigMap and Secret: Externalize configuration and sensitive values (API keys, database passwords) from your container image. Mount them as environment variables or files.
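As a sketch of the pattern (the names and keys below are made up for illustration), a ConfigMap and the pod-spec fragment that consumes it alongside a Secret might look like:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: etl-config              # hypothetical name
data:
  BQ_DATASET: "curated"
  BATCH_SIZE: "5000"
---
# Fragment of a pod spec consuming the ConfigMap and a Secret:
#
#   containers:
#     - name: etl
#       envFrom:
#         - configMapRef:
#             name: etl-config       # all keys become env vars
#       env:
#         - name: DB_PASSWORD        # one key pulled from a Secret
#           valueFrom:
#             secretKeyRef:
#               name: etl-secrets    # hypothetical Secret name
#               key: db-password
```

Because the values live outside the image, you can promote the same container through dev and prod with different configuration.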
Resource Requests and Limits — Get This Right
This is where most data engineering issues on Kubernetes start. Every container should declare:
- Requests: The minimum resources Kubernetes guarantees. The scheduler uses this to decide which node can run your pod.
- Limits: The maximum resources your container can consume. If your container exceeds the memory limit, Kubernetes kills it (OOMKilled).
```yaml
resources:
  requests:
    memory: "2Gi"
    cpu: "500m"   # 0.5 cores
  limits:
    memory: "4Gi"
    cpu: "1000m"  # 1 core
```

For data pipelines, set memory limits generously. A Spark executor or Pandas job that gets OOMKilled at 90% completion is worse than slightly over-provisioning.
Essential kubectl Commands
```bash
# See what's running
kubectl get pods -n data-pipelines
kubectl get jobs -n data-pipelines

# Check why a pod failed
kubectl describe pod <pod-name> -n data-pipelines
kubectl logs <pod-name> -n data-pipelines

# See previous container logs (after a crash)
kubectl logs <pod-name> -n data-pipelines --previous

# Port-forward to a service for local debugging
kubectl port-forward svc/my-api 8080:80 -n data-pipelines

# Apply a manifest
kubectl apply -f my-job.yaml
```

GKE vs. EKS — What Differs
If you’re on GCP, you’ll use GKE (Google Kubernetes Engine). On AWS, EKS (Elastic Kubernetes Service). The Kubernetes API is identical — your YAML manifests are portable. The differences are in networking, IAM integration, and managed add-ons.
On GKE, Workload Identity lets pods authenticate to GCP services (BigQuery, GCS) using a Kubernetes service account mapped to a Google service account — no key files needed. On EKS, the equivalent is IAM Roles for Service Accounts (IRSA).
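On the GKE side, that mapping is expressed as an annotation on the Kubernetes service account; a sketch with placeholder names and project IDs:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: etl-runner              # hypothetical KSA used by the pipeline pods
  namespace: data-pipelines
  annotations:
    # Maps this KSA to a Google service account with BigQuery/GCS permissions
    iam.gke.io/gcp-service-account: etl-runner@my-project.iam.gserviceaccount.com
```

Pods that specify `serviceAccountName: etl-runner` then obtain GCP credentials automatically, with no key files baked into the image.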
When Kubernetes is Overkill
For simple scheduled SQL transformations, Airflow + BigQuery is simpler. For PySpark batch jobs, Dataproc Serverless avoids K8s entirely. Kubernetes shines when you need custom containers, long-running services, or fine-grained resource control that managed services don’t offer.
Takeaway: Kubernetes isn’t magic — it’s a declarative system for running containers reliably. Know Pods, Jobs, Deployments, resource requests/limits, and basic kubectl, and you’ll be effective as a data engineer on any K8s-based platform.