
PySpark vs. Spark SQL: A Love-Hate Relationship

Nnaemezue Obi-Eyisi
3 min read · Jan 31, 2025


I don’t know who this might offend, but I have to say it — I hate PySpark.

Before you come at me with pitchforks, let me explain.

PySpark feels clumsy and overly complex, with far too many functions and methods to import. It doesn’t quite know what it wants to be — is it Python? Is it SQL? The learning curve is frustrating, even though it’s undeniably powerful and highly extensible.

But here’s the thing: I love Spark SQL — and that’s where my struggle begins.

Why PySpark Feels Like a Mess

If you’re coming from a Python background, you expect things to behave in a certain way. But PySpark is not pure Python — it’s a distributed computing framework with quirks that can drive you crazy.

For example, something as simple as getting the last element of an array should be straightforward, right?

Well, not in PySpark. Unlike Python lists, negative indexing does not work.

from pyspark.sql.functions import col, element_at

df = spark.createDataFrame([(1, ["apple", "banana", "cherry"])], ["id", "items"])

# Try the Python reflex: negative indexing to grab the last element
df.withColumn("last_item", col("items")[-1]).show()

🔥 Boom! No exception, no warning — just a column full of NULLs. Bracket indexing in PySpark only supports non-negative, zero-based positions, so the -1 that works on every Python list silently gets you nothing here.
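For what it's worth, here is a sketch of the workaround that unused element_at import was presumably heading toward. On Spark 2.4+, element_at accepts a negative index counted from the end of the array, and the very same function exists in Spark SQL — which is exactly why I keep reaching for the SQL side. The temp view name "fruit" below is just for illustration.

from pyspark.sql.functions import col, element_at

# element_at is 1-based and accepts negative indexes,
# so -1 reaches the last element of the array
df.withColumn("last_item", element_at(col("items"), -1)).show()

# The Spark SQL flavor of the same thing — one readable line
df.createOrReplaceTempView("fruit")
spark.sql("SELECT id, element_at(items, -1) AS last_item FROM fruit").show()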

