In the post on search I gave you the clean version: type a word, get every place it was said, tap to jump to that moment. That’s true now. But I made it sound smooth, and for a while it wasn’t. The search had a bug that wore me down — and tracking it down taught me one of the more useful lessons in this series.

The symptom was maddening. You’d search a word, see a result, tap “play from here,” and the audio would start well before the word you searched for. The search found the right thing. It just sent you to the wrong place in the recording: sometimes off by seconds, sometimes by twenty whole minutes.

For a library meant for study, that’s not a small thing. Here’s what was going on.

Two Tools, Two Shapes of Data

The root cause goes back to the two transcription tools I used. They don’t just differ in price and quality. They hand back data in completely different shapes.

When a recording is transcribed, the text comes broken into chunks called segments, and each segment has a timestamp. The trouble is that “segment” means something different depending on the tool:

  • Whisper breaks a recording into hundreds of small segments — roughly sentence by sentence — each with its own timestamp. Fine-grained. Search a word, land within a few seconds of it.
  • ElevenLabs (which I used for its speaker separation) breaks the same recording into just three to five huge segments — basically one per speaker turn. A single segment might be twenty minutes long.

Now you can see the bug. My search jumped to the start of whichever segment held your word. For a Whisper file, a segment is one sentence, so that’s accurate. For an ElevenLabs file, the segment holding your word might be a twenty-minute block — so “jump to the start of the segment” drops you twenty minutes early. Same code, very different result, depending on which tool had transcribed that file.
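To make that concrete, here is a minimal Python sketch of the bug. It is not the app's actual code: the Segment class, the jump_time_for function, and the sample sentences are all made up for illustration, and the only thing taken from the story is the behavior "jump to the start of the segment that holds the word."

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


def jump_time_for(word, segments):
    """Buggy behavior: return the start of whichever segment contains the word."""
    needle = word.lower()
    for seg in segments:
        if needle in seg.text.lower():
            return seg.start
    return None


# Whisper-style shape: many short segments, roughly one per sentence.
whisper_style = [
    Segment(0.0, 4.0, "Welcome back everyone."),
    Segment(4.0, 9.0, "Today we talk about patience."),
    Segment(1195.0, 1200.0, "And that is what gratitude means."),
]

# ElevenLabs-style shape: one huge speaker turn covering the same twenty minutes.
# (A real turn would hold far more text; three sentences keep the example short.)
elevenlabs_style = [
    Segment(
        0.0,
        1200.0,
        "Welcome back everyone. Today we talk about patience. "
        "And that is what gratitude means.",
    ),
]

whisper_jump = jump_time_for("gratitude", whisper_style)
eleven_jump = jump_time_for("gratitude", elevenlabs_style)

print(whisper_jump)  # 1195.0, right at the sentence
print(eleven_jump)   # 0.0, twenty minutes early
```

The function is identical in both calls. Only the shape of the data changes, and that alone moves the landing point from "right on the sentence" to "the top of the recording."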

The Deeper Trap: One Thing Doing Two Jobs

If that were the whole problem, it’d be a quick fix. It wasn’t, because those segments were quietly doing two jobs at once, and that’s where the real pain was.

The segments were used:

  1. For speaker separation — labeling who’s talking (the thing ElevenLabs is good at), and
  2. As the thing that search reads from.

One piece of data, two jobs. And the two jobs wanted different things. Speaker separation wants big chunks (one per speaker). Search wants tiny chunks (one per sentence). You can’t do both well with the same data.

It got worse. Every time I corrected the transcript text — fixing a misheard word — the segments (which search actually read from) stayed old. So a word I’d carefully fixed in the transcript would still show up wrong in search, because search wasn’t reading the corrected text. It was reading the old segments. Keeping the transcript and the segments in step, file after file, became a grind — and one that burned through API credits fast. I once watched a prepaid balance disappear in a few hours, mostly fighting this.
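The stale-data problem can be sketched the same way. This is an illustration under assumptions, not the real storage layout: the record dictionary, its field names, and both functions are hypothetical. The point it shows is that the text lives in two places, and a correction only reaches one of them.

```python
# One record holds the transcript text AND a separate copy inside the segments.
record = {
    "text": "We talk about patients.",
    "segments": [{"start": 0.0, "text": "We talk about patients."}],
}


def correct_transcript(rec, wrong, right):
    """Fix a misheard word in the transcript text only; segments are untouched."""
    rec["text"] = rec["text"].replace(wrong, right)


def search_segments(rec, word):
    """Search reads from the segments, not from the corrected transcript."""
    needle = word.lower()
    return [seg for seg in rec["segments"] if needle in seg["text"].lower()]


correct_transcript(record, "patients", "patience")

hits_for_fixed = search_segments(record, "patience")
hits_for_old = search_segments(record, "patients")

print(record["text"])       # We talk about patience.
print(len(hits_for_fixed))  # 0, search cannot see the correction
print(len(hits_for_old))    # 1, search still finds the misheard word
```

Keeping the two copies in step means rebuilding the segments after every correction, which is the grind described above.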

The Fix, in Two Moves

The fix came in two pieces, and the second is the one I’d most want a beginner to hear.

Move 1 — Estimate the missing timestamps. ElevenLabs gave me a big segment with a start time and an end time, but no per-sentence timing inside it. So I had the AI estimate them: spread timestamps across the segment based on how far into the text each sentence falls (a linear interpolation). It’s not perfect. Speech doesn’t run at an even pace, so an estimate inside a long segment can still be off by around thirty seconds. But thirty seconds is a long way from twenty minutes. Good-enough today beat perfect someday, and I was fine with that trade.
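Here is a rough Python sketch of that kind of estimate. It assumes character position is a fair stand-in for time and splits sentences with a simple regular expression; the function name estimate_sentence_times and the sample text are hypothetical, and the real code may measure position differently.

```python
import re


def estimate_sentence_times(text, start, end):
    """Linear interpolation: place each sentence in time by its position in the text."""
    total_chars = len(text)
    duration = end - start
    estimates = []
    # Each match is a run of non-punctuation followed by its closing punctuation.
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        sentence = match.group().strip()
        if not sentence:
            continue
        # A sentence that begins 40% of the way through the text
        # is assumed to begin 40% of the way through the segment.
        estimated = start + duration * match.start() / total_chars
        estimates.append((round(estimated, 1), sentence))
    return estimates


# One twenty-minute segment (0 to 1200 seconds) holding three short sentences.
estimates = estimate_sentence_times("One two. Three four. Five six.", 0.0, 1200.0)
for seconds, sentence in estimates:
    print(seconds, sentence)
# 0.0 One two.
# 320.0 Three four.
# 800.0 Five six.
```

The estimate treats every character as taking the same amount of time, so pauses and fast talkers throw it off. That is where the leftover error comes from.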

Move 2 — Split the two jobs. This was the real cure. I stopped making one piece of data do two jobs:

  • Segments are now used only for speaker separation (and only for the ElevenLabs files that need it).
  • Search reads the corrected, sentence-level text directly — not the segments.

The moment I split those two jobs, the whole class of bugs went away. Correcting the transcript no longer left search reading old data, because search now read the same corrected text I was editing. The sync problem was gone. The API bleed stopped. And search results finally pointed where they should.
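In code terms, the split might look something like the sketch below. The class names SpeakerTurn and Sentence, the two functions, and the sample data are assumptions for illustration, not the app's real structure. What matters is that speaker labels and searchable text now live in separate places, and corrections are made to the same text that search reads.

```python
from dataclasses import dataclass


@dataclass
class SpeakerTurn:
    """Job 1: who is talking, and when. Search never reads this."""
    speaker: str
    start: float
    end: float


@dataclass
class Sentence:
    """Job 2: the corrected, sentence-level text that search reads."""
    time: float
    text: str


def correct(sentences, wrong, right):
    """Fix a misheard word in place, in the same text search uses."""
    for sentence in sentences:
        sentence.text = sentence.text.replace(wrong, right)


def search(sentences, word):
    """Return the jump time of every sentence containing the word."""
    needle = word.lower()
    return [s.time for s in sentences if needle in s.text.lower()]


turns = [SpeakerTurn("Speaker 1", 0.0, 1200.0)]
sentences = [
    Sentence(0.0, "Welcome back."),
    Sentence(320.0, "Today we talk about patients."),
]

correct(sentences, "patients", "patience")
hits = search(sentences, "patience")

print(hits)                           # [320.0]
print(search(sentences, "patients"))  # []
```

There is no second copy to keep in step, so a correction shows up in search right away, and the speaker turns can stay as big as they like.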

The Lesson: One Job Per Thing

Strip away the audio details and here’s the idea. It’s one of the more useful things I’ve picked up building software with AI. When one thing is quietly doing two jobs, your bugs multiply, because a change that helps one job quietly breaks the other. The fix is usually to split the jobs — give each one its own thing.

I didn’t know the textbook name for this while I was living it (engineers call it “separation of concerns”). I just felt the pain of one thing being pulled in two directions, and worked my way to splitting it. That’s the part worth telling a fellow non-coder: you don’t need the words to feel the problem. I understood my data — what segments were, why two tools shaped them differently, why correcting text didn’t reach search. The AI wrote the actual code. My job was to understand the shape of the problem well enough to tell it what to split.

That split of labor — I understand the problem, the AI handles the building — is most of how someone like me builds real software. The bugs are where I learn it.

If you like the under-the-hood version of this stuff, please subscribe. The next one is about a bug that wasted hours of my life and hides in almost every web app: caching.


Key Takeaways

  • A search can find the right text but still send you to the wrong place. Here it was sometimes twenty minutes off.
  • The cause: two transcription tools produce different “segment” shapes — Whisper many tiny ones, ElevenLabs a few huge ones — and search jumped to the segment’s start.
  • The deeper bug was overlap: segments did both speaker separation and search, so correcting text left search reading old data (and burned API credits).
  • Fix 1: estimate per-sentence timestamps — good-enough beat perfect someday.
  • Fix 2 (the real cure): split the jobs — segments for speaker labels only; search reads the corrected text directly.
  • When one thing does two jobs, bugs multiply. Split the jobs. You don’t need the jargon to feel the problem.