
PySpark vs. Spark SQL: A Love-Hate Relationship

Nnaemezue Obi-Eyisi
3 min read · Jan 31, 2025


I don’t know who this might offend, but I have to say it — I hate PySpark.

Before you come at me with pitchforks, let me explain.

PySpark feels clumsy and overly complex, with far too many functions and methods to import. It doesn’t quite know what it wants to be — is it Python? Is it SQL? The learning curve is frustrating, even though it’s undeniably powerful and highly extensible.

But here’s the thing: I love Spark SQL — and that’s where my struggle begins.

Why PySpark Feels Like a Mess

If you’re coming from a Python background, you expect things to behave in a certain way. But PySpark is not pure Python — it’s a distributed computing framework with quirks that can drive you crazy.

For example, something as simple as getting the last element of an array should be straightforward, right?

Well, not in PySpark. Unlike Python lists, negative indexing does not work.

from pyspark.sql.functions import col, element_at

df = spark.createDataFrame([(1, ["apple", "banana", "cherry"])], ["id", "items"])

# Try the Python reflex: negative indexing to grab the last element
df.withColumn("last_item", col("items")[-1]).show()

🔥 Boom! No exception, no warning — just a column full of NULLs. Bracket indexing in PySpark only supports non-negative, zero-based positions, so the -1 that works on every Python list silently gets you nothing here.
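For what it's worth, here is a sketch of the workaround that unused element_at import was presumably heading toward. On Spark 2.4+, element_at accepts a negative index counted from the end of the array, and the very same function exists in Spark SQL — which is exactly why I keep reaching for the SQL side. The temp view name "fruit" below is just for illustration.

from pyspark.sql.functions import col, element_at

# element_at is 1-based and accepts negative indexes,
# so -1 reaches the last element of the array
df.withColumn("last_item", element_at(col("items"), -1)).show()

# The Spark SQL flavor of the same thing — one readable line
df.createOrReplaceTempView("fruit")
spark.sql("SELECT id, element_at(items, -1) AS last_item FROM fruit").show()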

