Navigating Row_Number() Non-Idempotency

Nnaemezue Obi-Eyisi
2 min read · Aug 16, 2023

In the realm of data engineering, challenges often present themselves in the most unexpected ways. Recently, I embarked on a journey with Databricks that led me to an intricate puzzle — one that took days of exploration and experimentation to crack: the enigmatic non-idempotency of the Row_Number() window function.

My goal seemed deceptively straightforward: process data in manageable 100-record chunks and dispatch each chunk as a batch of messages to a designated message queue. This operation was driven by the memory limitations of the message queue system, which necessitated careful segmentation of the data. Armed with my trusty knowledge of the Row_Number() function, I began crafting a solution on my DataFrame: number every record, then filter the numbered data into 100-record slices.
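
In rough terms, that approach looked something like the sketch below. It is a minimal, hypothetical reconstruction: the stand-in dataset, the ordering column, and the send_to_queue() helper are all placeholders, not the actual pipeline.

```python
# A minimal sketch of the batching approach, with stand-in data and a
# placeholder queue client (send_to_queue is hypothetical).
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

def send_to_queue(rows):
    # Placeholder for the real message-queue publish call.
    print(f"dispatched {len(rows)} records")

# Stand-in dataset: 1,000 records with a string payload.
df = spark.range(1000).withColumn("payload", F.col("id").cast("string"))

# Assign a sequence number to every record with Row_Number().
numbered = df.withColumn("seq", F.row_number().over(Window.orderBy("id")))

batch_size = 100
total = numbered.count()

# Walk the sequence numbers in 100-record slices and dispatch each slice.
# Note that every filter + collect re-evaluates the lazy plan, including
# the window that assigns "seq".
for start in range(1, total + 1, batch_size):
    batch = (numbered
             .filter(F.col("seq").between(start, start + batch_size - 1))
             .collect())
    send_to_queue(batch)
```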

However, as I initiated a loop to process each batch of 100 records, an unexpected pattern emerged: duplicate entries were making their way into the message queue, confounding my meticulous approach. Intriguingly, the root cause lay in how the sequence numbers were carved up at runtime. The assignment depended on the partition count allocated to the Spark job, and because the DataFrame is evaluated lazily, each pass over the data could hand out different Row_Number() values to the same records.
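
The instability is easiest to see when the ordering column has ties. The sketch below, built on purely synthetic data, evaluates the same numbered DataFrame twice and compares the sequence number each record received; with tied ordering keys, Spark makes no guarantee that the two runs agree.

```python
# A diagnostic sketch: evaluate the same lazy plan twice and check whether
# Row_Number() handed out the same sequence numbers both times.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Synthetic stand-in: "order_col" repeats, so the ordering has ties and the
# tie-break depends on how the rows happen to arrive at the window.
df = spark.range(1000).withColumn("order_col", F.col("id") % 10)

numbered = df.withColumn("seq", F.row_number().over(Window.orderBy("order_col")))

# Each collect() re-runs the full plan; the record-to-seq mapping is not
# guaranteed to be identical across runs.
run_1 = {row["id"]: row["seq"] for row in numbered.collect()}
run_2 = {row["id"]: row["seq"] for row in numbered.collect()}

changed = sum(1 for record_id in run_1 if run_1[record_id] != run_2[record_id])
print(f"{changed} records changed their row number between evaluations")
```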

To illustrate, consider a dataset of 1,000 records. The loop was designed to iterate over 100-record batches, yet after it had worked through to the 1,000th record it effectively circled back over the first 500, dispatching that same set of 500 messages to the message queue a second time.

This perplexing behavior led me to an intriguing discovery: the distributed nature of Spark was influencing the behavior of my code. The…
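
A common mitigation for this class of problem, offered here as a general Spark pattern rather than the only possible resolution, is to materialize the row-number assignment once, for example by writing the numbered DataFrame to storage and batching from that frozen copy, so the window is computed exactly one time. A hedged sketch with an illustrative path:

```python
# A hedged sketch of one common mitigation: compute the numbering once,
# persist it, and batch from the persisted copy. The path is illustrative.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000).withColumn("payload", F.col("id").cast("string"))

numbered = df.withColumn("seq", F.row_number().over(Window.orderBy("id")))

# Freeze the row-number assignment by writing it out; every later read sees
# the same seq values instead of re-running the window.
snapshot_path = "/tmp/numbered_snapshot"  # illustrative location
numbered.write.mode("overwrite").parquet(snapshot_path)
frozen = spark.read.parquet(snapshot_path)

batch_size = 100
total = frozen.count()

for start in range(1, total + 1, batch_size):
    batch = (frozen
             .filter(F.col("seq").between(start, start + batch_size - 1))
             .collect())
    # send_to_queue(batch)  # hypothetical queue client, as in the earlier sketch
    print(f"batch starting at seq {start}: {len(batch)} records")
```

Writing to Parquet (or Delta on Databricks) pins the assignment more firmly than cache(), since cached partitions can be evicted and recomputed, which would re-run the window.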
