PySpark SQL Tutorial


Purpose

A hands-on tutorial covering PySpark SQL fundamentals, written while learning distributed data processing and kept as a reference for common patterns.
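
All snippets below assume an active SparkSession bound to the name spark, as the pyspark shell provides. A minimal standalone setup might look like this (the app name is arbitrary):

from pyspark.sql import SparkSession

# Create (or reuse) a local session; "local[*]" uses all available cores
spark = SparkSession.builder \
    .appName("pyspark-sql-tutorial") \
    .master("local[*]") \
    .getOrCreate()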

Topics Covered

DataFrame Operations

# Load and inspect (inferSchema makes an extra pass over the file to guess types)
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()
df.describe().show()
 
# Transformations (lazy -- each call builds a plan, not results)
transformed = df.select("col1", "col2") \
    .filter(df.col1 > 100) \
    .withColumn("new_col", df.col1 * 2)
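
None of the transformations above touch the data yet; Spark only computes when an action runs. For example:

transformed.show(5)    # action: executes the plan and prints the first 5 rows
transformed.count()    # action: executes the plan and returns the row count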

SQL Queries

df.createOrReplaceTempView("my_table")

result = spark.sql("""
    SELECT dept,
           AVG(salary) AS avg_salary,
           COUNT(*) AS count
    FROM my_table
    GROUP BY dept
    HAVING COUNT(*) > 10
    ORDER BY avg_salary DESC
""")
result.show()

Window Functions

from pyspark.sql.window import Window
from pyspark.sql import functions as F
 
# Rank rows within each department, highest salary first
window = Window.partitionBy("dept").orderBy(F.desc("salary"))

# With orderBy set, the default frame runs from the start of the
# partition to the current row, so sum() becomes a running total
ranked = df.withColumn("rank", F.rank().over(window)) \
    .withColumn("running_total", F.sum("salary").over(window))
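
A common follow-up is top-N per group, for example keeping the three highest-paid rows in each department (using the ranked DataFrame from the block above):

# Keep only the top 3 earners per department
ranked.filter(F.col("rank") <= 3).show()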

Joins and Aggregations

# Different join types
# Expression joins keep both id columns in the result
df1.join(df2, df1.id == df2.id, "left")
# Column-name-list joins deduplicate the join key
df1.join(df2, ["common_key"], "inner")
 
# Complex aggregations
df.groupBy("category").agg(
    F.count("*").alias("count"),
    F.avg("value").alias("avg"),
    F.collect_list("item").alias("items")  # array of values per group
).show()
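
When one side of a join is small, a broadcast hint ships it to every executor and avoids a shuffle; F.broadcast is the standard hint function:

# Broadcast the smaller DataFrame to every executor
df1.join(F.broadcast(df2), ["common_key"], "inner")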

Repository Structure

pyspark-tutorial/
├── notebooks/
│   ├── 01-basics.ipynb
│   ├── 02-dataframes.ipynb
│   ├── 03-sql-queries.ipynb
│   └── 04-window-functions.ipynb
├── data/
│   └── sample_data.csv
└── README.md

Key Technologies

Apache Spark, PySpark, SQL, Python, Jupyter