Side-by-side YouTube player frames showing auto-translated vs. native English captions over the same Croatian podcast moment.
Comparisons · 4 min read

Why YouTube auto-translate fails on Croatian (and what that means if you license content from the region)

Three concrete failure modes, with measured WER from a 3-channel Croatian benchmark. If you're evaluating Croatian podcasts on the strength of YouTube's auto-translated captions, you are not seeing the asset.

If you work at a podcast network, content licensing agency, or distribution platform and you're scouting Croatian-language content, you've probably evaluated it via YouTube's auto-translated English captions. That's a reasonable shortcut. It's also wrong often enough to change your decision.

This is what the auto-translate path actually produces on Croatian, where it fails, and what those failures cost you if you're using it as a quality signal.

The pipeline you're seeing

YouTube auto-translate is two AI systems chained:

  1. ASR (speech-to-text) in the source language. For Croatian, this is YouTube's internal Croatian model, not the English-centric Whisper models you may have evaluated elsewhere.
  2. MT (machine translation) from Croatian into your target language, using Google's own MT system. Croatian is supported, but the NLP literature classifies it as a lower-resource language (Joshi et al., 2020, "The State and Fate of Linguistic Diversity and Inclusion in the NLP World" puts BCS at Class 4 of 5, with 5 being the highest-resource tier), and MT quality reflects that.

Errors compound. ASR mistakes propagate into MT. The English caption you read on YouTube Studio is the output of two error stages on a lower-resource language pair, not a single pass on a well-resourced one.
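To make the compounding concrete, here is a back-of-the-envelope sketch. This is our simplification, not YouTube's published numbers, and it assumes the two stages fail independently, which real pipelines don't quite do:

```python
def chained_accuracy(p_asr: float, p_mt: float) -> float:
    """Per-item accuracy of two chained stages, assuming independent errors.

    A deliberate simplification: real ASR and MT errors correlate
    (garbled input tends to produce garbled translation), so the true
    end-to-end figure is usually worse than this product.
    """
    return p_asr * p_mt

# Illustrative numbers only, not measured values: two stages that each
# look "90% accurate" in isolation leave roughly 81% end to end.
print(round(chained_accuracy(0.90, 0.90), 2))  # 0.81
```

The point of the toy model: even optimistic per-stage numbers multiply down, and the caption you read is always the product, never the better of the two stages.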

Failure mode 1: proper nouns

Croatian conversation is dense with proper nouns that don't transcribe cleanly: people, brands, and places, often carrying diacritics (ć, č, š, ž, đ) that consumer ASR models silently drop. In our Phase-0 baseline (3-channel benchmark, run 2026-05-10; full methodology in our Why Whisper struggles on Croatian note, currently Croatian-only), named-entity accuracy ranges from 75% to 100% depending on content type. The 75% case is a tech podcast heavy on local names and English-language brands: one in four proper nouns transcribed incorrectly before any translation even begins.

What you see in YouTube auto-translate: the host name reads as nonsense English, the brand mentioned in the conversation is unrecognisable, the guest's company appears as a homophone. A listener can usually infer; an evaluator reading captions cannot.
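A small illustration of why dropped diacritics break name handling downstream. The helper below is ours, not part of any YouTube pipeline, and the names are hypothetical examples; it mimics what diacritic-blind ASR output looks like next to the canonical spelling:

```python
import unicodedata

def strip_diacritics(s: str) -> str:
    """Simulate diacritic loss: Šimić -> Simic, Đorđe -> Dorde."""
    # đ/Đ are standalone letters, not base + combining mark,
    # so NFD decomposition won't touch them; map them explicitly.
    s = s.replace("đ", "d").replace("Đ", "D")
    # Decompose the rest and drop the combining marks.
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

print(strip_diacritics("Šimić"))  # Simic
print(strip_diacritics("Đorđe"))  # Dorde
```

Once "Šimić" has become "Simic", no exact-match lookup (a search index, a rights database, show notes) will reconnect it to the original name, and the MT stage treats it as an unknown token.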

Failure mode 2: code-switching

Croatian tech and business conversations switch in and out of English constantly: "submita PR i deploya na staging", "hands-on cloud code". The ASR layer expects monolingual Croatian; when it hears English, it tries to spell it phonetically in Croatian (or treats it as noise). The MT layer then translates that broken Croatian word back to "English" by best-fit dictionary lookup.

Net result: terms that the original speakers actually said in plain English come out as wrong English on the other side. We measure this directly: the channel with the heaviest code-switching in our benchmark set hits 12.48% WER vs 1.85% for clean Croatian narration. That gap is almost entirely about code-switching, not about Croatian per se.
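WER here is the standard metric: word-level edit distance divided by reference length. A minimal reference implementation, ours, for readers who want to reproduce the comparison on their own transcripts:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

# One substituted word out of four -> 25% WER.
print(wer("danas pričamo o oblaku", "danas pričamo u oblaku"))  # 0.25
```

For production use, a maintained library such as jiwer applies the same definition with normalization options; the hand-rolled version above is just to make the arithmetic behind figures like 12.48% inspectable.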

For a content evaluator, technical Croatian content (arguably one of the more licensable segments for an English audience because the subject matter travels well) is also the segment most damaged by the auto-translate path. The two facts compound.

Failure mode 3: register, idiom, irony

Croatian uses irony, dialectal markers, sarcasm, and elaborate constructions that don't translate. MT models trained on general text don't know that a particular phrase is a regional joke or a TV-show reference. The auto-translate output is grammatical English that says the wrong thing.

This is where a competent native-English reader notices something is off but can't articulate what. The captions read smoothly. They just don't carry the original.

What this means for your evaluation workflow

If you license content based on listening to a clip with auto-translate on:

  • You're systematically underweighting technical and business shows (their best moments are in the code-switching and proper-noun-heavy segments your captions are mangling).
  • You're overweighting narrative and interview shows with one speaker and a measured pace (where the failures are smaller).
  • You miss the irony and personality of any conversational show. That's most podcasts.

The asset you're evaluating is not the asset you're licensing.

A practical workflow that costs nothing

Ask the rights-holder for the original Croatian transcript (an SRT file, or even rough text). Run it through DeepL, or through a translation pass with a model that has seen more Croatian than YouTube's MT (Claude, GPT-4, or similar). You'll see a different show.
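If what you get back is an SRT, flattening it to plain text before the translation pass keeps cue numbers and timestamps out of the model's input. A minimal sketch (the function and sample cues are ours):

```python
import re

def srt_to_text(srt: str) -> str:
    """Flatten an SRT file to plain text for a translation pass."""
    lines = []
    # SRT cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        for row in block.splitlines():
            # Drop the cue-index line and the timestamp line.
            if row.strip().isdigit() or "-->" in row:
                continue
            lines.append(row.strip())
    return " ".join(line for line in lines if line)

sample = (
    "1\n00:00:01,000 --> 00:00:03,000\nDobar dan.\n\n"
    "2\n00:00:03,500 --> 00:00:05,000\nKako ste?\n"
)
print(srt_to_text(sample))  # Dobar dan. Kako ste?
```

This loses timing, which is fine for an evaluation read; if you later want translated captions back in sync, keep the original SRT around and translate cue-by-cue instead.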

If you're evaluating at scale, asking for native-language captions in your licensing intake form is reasonable. The Croatian podcasts that already publish English captions are typically the ones taking international distribution seriously, which is also a useful filter for sourcing.

Where Titlomat sits in this

We're a Croatian transcription + translation pipeline that produces native English captions for Croatian podcasts. We're not an evaluator or licensing platform. But the gap above is the reason our buyers exist: it's the difference between "Croatian content readable internationally" and "Croatian content invisible internationally."

If you're a network or agency working with Croatian-region creators and want native English subtitles on episodes you're evaluating, reach out at info@lumiverse.hr. We'll process a sample so you can see the same episode through both pipelines.