Short answer. For YouTube uploads, use SRT. For your own web player, use VTT. If you download from YouTube Studio you'll get .sbv. Convert that to SRT.
YouTube accepts a number of caption formats. Three are in widespread use: SRT, VTT, and .sbv. The differences are mostly syntactic; the text the viewer sees is the same. But when you're writing your own pipeline or debugging weird rendering, the small details start to hurt.
SRT (SubRip)
The oldest and most widespread. Made by the SubRip program for ripping captions off DVDs.
1
00:00:01,200 --> 00:00:04,500
Good day and welcome to the podcast.

2
00:00:04,500 --> 00:00:07,800
Today we're talking about a single topic.
Syntax:
- Numeric index starting at 1.
- Time: HH:MM:SS,mmm. Milliseconds are separated by a comma, not a period.
- A --> arrow between start and end.
- Text on one or more lines.
- Blank line between segments.
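The syntax rules above can be sketched as a small parser. This is a minimal illustration, not Titlomat's implementation: the names `Cue`, `to_ms`, and `parse_srt` are hypothetical, and the sketch assumes well-formed input with two-digit hours and no BOM.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    index: int
    start_ms: int
    end_ms: int
    text: str


# Strict SRT timing line: comma before the milliseconds on both sides.
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(h: str, m: str, s: str, ms: str) -> int:
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(text: str) -> List[Cue]:
    """Parse SRT text into cues; raise ValueError on a malformed segment."""
    text = text.replace("\r\n", "\n").strip()
    if not text:
        return []
    cues = []
    # Segments are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text):
        lines = block.splitlines()
        if len(lines) < 3:
            raise ValueError(f"incomplete segment: {block!r}")
        match = TIMING.fullmatch(lines[1].strip())
        if match is None:
            raise ValueError(f"bad timing line: {lines[1]!r}")
        fields = match.groups()
        cues.append(
            Cue(
                index=int(lines[0]),
                start_ms=to_ms(*fields[:4]),
                end_ms=to_ms(*fields[4:]),
                text="\n".join(lines[2:]),
            )
        )
    return cues
```

Working in integer milliseconds keeps later checks (ordering, overlap, duration) free of string comparisons.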
What it supports:
- Plain text and basic HTML tags: <b>, <i>, <u>.
- Colors via <font color="..."> (an outdated style, but functional).
- UTF-8 encoding (mandatory for Croatian diacritics; without it, č becomes ?).
What it does NOT support:
- On-screen position.
- CSS styling.
- Multiple speakers in the same segment with different formatting.
YouTube accepts it. Most players (VLC, mpv, web players) accept it. Titlomat exports it.
VTT (WebVTT)
Developed by the W3C as a web standard. It was meant to replace SRT on the web, but in practice the two coexist.
WEBVTT

00:00:01.200 --> 00:00:04.500
Good day and welcome to the podcast.

00:00:04.500 --> 00:00:07.800
Today we're talking about a single topic.
Differences from SRT:
- Mandatory WEBVTT header at the top.
- Time: HH:MM:SS.mmm. Milliseconds with a period, not a comma.
- Segment indexing is optional (and typically omitted).
- Supports NOTE comments that don't render.
- CSS-style positioning and classes (align:left, line:80%).
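Because the differences are this mechanical, an SRT file can be turned into VTT with a few lines of code. The sketch below is one possible approach under simple assumptions: `srt_to_vtt` is a hypothetical helper, it assumes well-formed SRT input, and it leaves any <font> tags in the text untouched even though VTT does not define them.

```python
import re

# An SRT timestamp; the comma is swapped for a period in the timing line only.
SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add the header, drop indexes, fix separators."""
    out = ["WEBVTT", ""]
    normalized = srt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        # Drop the numeric index; it is optional in VTT.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # Rewrite only the timing line so commas in the caption text survive.
        lines[0] = SRT_TIMESTAMP.sub(r"\1.\2", lines[0])
        out.extend(lines)
        out.append("")
    return "\n".join(out)
```

Touching only the timing line matters: a blanket replace of commas with periods would also rewrite punctuation in the captions themselves.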
When we prefer VTT:
- In practice, browsers only accept VTT in the HTML5 <track> element.
- Streaming protocols (HLS, DASH) carry VTT subtitle tracks.
- Text that needs to be positioned outside the bottom-center area (e.g. dialogue at upper-left).
YouTube accepts it. Our pipeline exports it on request.
.sbv (YouTube SubViewer)
YouTube's internal format. The simplest syntax of the three:
00:00:01.200,00:00:04.500
Good day and welcome to the podcast.

00:00:04.500,00:00:07.800
Today we're talking about a single topic.
Differences:
- Start and end time on the same line, separated by a comma.
- No index, no header.
- No support for formatting, positioning, or tags.
When it makes sense: downloading from YouTube Studio (YouTube's default export is .sbv), quick text-only edits. Most other players don't read it.
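Converting .sbv to SRT mostly means adding an index and rewriting the timing line. The following is a rough sketch, not an official converter: `sbv_to_srt` is a hypothetical name, and accepting one-digit hours (for example `0:00:01.200`) alongside the two-digit form shown above is an assumption about what exports may contain.

```python
import re

# SBV timing line: start and end on one line, separated by a comma.
# Assumption: hours may be written with one or two digits.
SBV_TIMING = re.compile(
    r"^(\d{1,2}:\d{2}:\d{2})\.(\d{3}),(\d{1,2}:\d{2}:\d{2})\.(\d{3})$"
)


def sbv_to_srt(sbv_text: str) -> str:
    """Convert SBV text to SRT: number the segments and rewrite the timings."""
    out = []
    index = 0
    normalized = sbv_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        if not lines:
            continue
        match = SBV_TIMING.match(lines[0].strip())
        if match is None:
            raise ValueError(f"bad SBV timing line: {lines[0]!r}")
        start, start_ms, end, end_ms = match.groups()
        index += 1
        out.append(str(index))
        # zfill(8) pads a one-digit hour: "0:00:01" becomes "00:00:01".
        out.append(f"{start.zfill(8)},{start_ms} --> {end.zfill(8)},{end_ms}")
        out.extend(lines[1:])
        out.append("")
    return "\n".join(out)
```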
Practical pitfalls
Croatian diacritics
Encoding must be UTF-8 without BOM. Excel-exported CSVs often flip to Windows-1250 or UTF-8 with a BOM. YouTube then renders č as Ä?. Check in an editor that shows encoding (VS Code, bottom-right corner).
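If you would rather fix this in code than by hand, something like the sketch below can normalize a caption file before upload. It is a heuristic, not a guarantee: `normalize_to_utf8` is a hypothetical helper, and treating anything that fails to decode as UTF-8 as Windows-1250 is an assumption that suits Croatian text but not every source.

```python
import codecs


def normalize_to_utf8(raw: bytes, fallback: str = "cp1250") -> bytes:
    """Return caption bytes as UTF-8 without a BOM."""
    # Strip a leading UTF-8 BOM if present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 input is Windows-1250 (common for Croatian).
        text = raw.decode(fallback)
    return text.encode("utf-8")
```

Read the file in binary mode, pass the bytes through the function, and write the result back in binary mode so nothing re-encodes it on the way.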
Periods and commas in time
SRT uses a comma for milliseconds, VTT uses a period. Many parsers are tolerant, but stricter pipelines fail on the wrong separator. If you're building your own SRT and it won't render, check whether it's a comma.
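A quick way to check is to scan the timing lines and report any that use the wrong separator for the target format. The sketch below uses hypothetical names (`wrong_separator_lines`, `EXPECTED`) and assumes timestamps always include two-digit hours, which is not strictly required in VTT.

```python
import re
from typing import List

# A timing line in either dialect; the groups capture the two ms separators.
TIMING_ANY = re.compile(
    r"^\d{2}:\d{2}:\d{2}([.,])\d{3} --> \d{2}:\d{2}:\d{2}([.,])\d{3}"
)

# Expected millisecond separator per format.
EXPECTED = {"srt": ",", "vtt": "."}


def wrong_separator_lines(text: str, fmt: str) -> List[int]:
    """Return 1-based line numbers of timing lines with the wrong separator."""
    expected = EXPECTED[fmt]
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = TIMING_ANY.match(line.strip())
        if match and (match.group(1) != expected or match.group(2) != expected):
            bad.append(number)
    return bad
```

An empty result means every timing line the pattern recognized uses the right separator; lines it does not recognize at all are not reported.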
Overlapping segments
The start time of segment N+1 must not be earlier than the end time of segment N. Otherwise YouTube shows both in parallel (rarely the desired effect) or rejects the upload. Our pipeline has a sanity check; a hand-edited export usually doesn't.
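For hand-edited files, a check like this is easy to add. The sketch below is not the sanity check from our pipeline, just an illustration of the rule; `find_overlaps` and `clamp_overlaps` are hypothetical names, and cues are assumed to be (start_ms, end_ms) pairs in file order.

```python
from typing import List, Tuple

Timing = Tuple[int, int]  # (start_ms, end_ms)


def find_overlaps(cues: List[Timing]) -> List[int]:
    """Return 0-based positions of cues that start before the previous one ends."""
    overlaps = []
    for i in range(1, len(cues)):
        previous_end = cues[i - 1][1]
        start = cues[i][0]
        if start < previous_end:
            overlaps.append(i)
    return overlaps


def clamp_overlaps(cues: List[Timing]) -> List[Timing]:
    """Return a copy where each cue ends no later than the next cue starts."""
    fixed = list(cues)
    for i in range(len(fixed) - 1):
        start, end = fixed[i]
        next_start = fixed[i + 1][0]
        if end > next_start:
            fixed[i] = (start, next_start)
    return fixed
```

Clamping the earlier cue's end time is only one possible repair; for a file you care about, reporting the positions and fixing them by hand is the safer choice.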
Maximum segment length
Industry norm: ≤ 42 characters per line, ≤ 2 lines per segment. YouTube doesn't reject longer ones, but the rendering looks ugly on mobile. Our writer currently wraps at 40 characters before adding the SPEAKER_NN: prefix, which sometimes pushes a wrap over 42. Tracked in the backlog.
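One way to avoid that kind of overflow is to attach the speaker prefix before wrapping, so the prefix counts against the line budget. This is an illustrative sketch using Python's standard `textwrap` module, not the fix tracked in our backlog; `wrap_caption` and `MAX_CHARS_PER_LINE` are hypothetical names.

```python
import textwrap
from typing import List, Optional

MAX_CHARS_PER_LINE = 42


def wrap_caption(text: str, speaker: Optional[str] = None) -> List[str]:
    """Wrap caption text so every line, prefix included, fits in 42 characters."""
    # Attach the prefix first so its length counts against the budget.
    prefixed = f"{speaker}: {text}" if speaker else text
    return textwrap.wrap(prefixed, width=MAX_CHARS_PER_LINE)
```

If the result has more than two lines, the segment itself needs to be split into several cues, which also means splitting its time range; that part is left out here.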
What to pick in practice
- YouTube upload: SRT. The safest default.
- Your own web player: VTT (because of the <track> tag).
- Download from YouTube for editing: .sbv (easier to parse), but convert it to SRT before the next stage.
Titlomat exports SRT by default and VTT on demand, so you don't have to do the conversion yourself. When a viewer clicks CC, the captions look the same whichever format you uploaded.
Got your own pipeline?
If you're writing a pipeline that produces SRT from audio, a custom dictionary post-processor is almost always a win, especially in Croatian. See the detailed guide on writing a dictionary. Or connect Titlomat and skip the whole pipeline-engineering branch. Try it free.



