
The RAG Pipeline From Hell: How I Survived My Worst Data Engineering Week

4 min read · Jun 4, 2025



The Costly Mistakes I Made Parsing Unstructured Data for a RAG Project — and How You Can Avoid Them

When I was asked to build a data pipeline for a Retrieval-Augmented Generation (RAG) use case, I was excited.
Confident. Maybe even a little cocky.

“It’s just documents, right? A few PDFs, some Word files — how hard can it be?”

Fast forward two weeks: the system was hallucinating, the chatbot was spitting nonsense, and leadership was asking,
“Why is this taking so long? Can we trust this output?”

It was one of those moments where you feel your career flash before your eyes.

Here’s how I almost lost my job on that project — and what I learned that saved it.

📖 Act 1: The Document Dump

Our RAG use case was to build an internal chatbot for a consulting firm that could answer policy and legal questions using thousands of internal reports and client deliverables.

We got access to the data: ~3TB of unstructured documents from SharePoint, Teams, and email attachments.

There were:

  • PDFs with no text layer
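Image-only PDFs are worth flagging before they ever reach a chunking step, because a parser will happily return empty strings for scanned pages and your index fills up with nothing. A minimal sketch of that triage, assuming you already have per-page extracted text (e.g. from a library like pypdf's `page.extract_text()`); the function names and the 20% threshold here are illustrative choices, not part of the original pipeline:

```python
def text_layer_coverage(page_texts):
    """Return the fraction of pages that carry a usable text layer.

    page_texts: list of strings, one per page, as produced by a PDF
    text extractor. Empty or whitespace-only strings indicate a page
    with no text layer (likely a scan).
    """
    if not page_texts:
        return 0.0
    pages_with_text = sum(1 for t in page_texts if t and t.strip())
    return pages_with_text / len(page_texts)


def needs_ocr(page_texts, threshold=0.2):
    """Route a document to OCR when too few pages have extractable text."""
    return text_layer_coverage(page_texts) < threshold
```

A document that comes back as mostly blank strings gets routed to an OCR step instead of being chunked as-is; tuning the threshold lets you tolerate the occasional cover page or diagram-only page.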


Written by Nnaemezue Obi-Eyisi

I am passionate about empowering, educating, and encouraging individuals pursuing a career in data engineering. Currently a Senior Data Engineer at Capgemini.
