Which Transcription Model Should I Pick?

SnipSound's free transcription tools let you choose between five different AI models. They all run in your browser — none of your audio is uploaded — but they differ in size, speed, language support, and accuracy. Here's the no-nonsense guide.

Quick recommendation

SnipSound auto-recommends the right model when you pick a source language. The picker switches automatically; you don't have to think about it. This page exists so you understand why.

The five models compared

| Model | Size | Speed | Languages | Best for |
|---|---|---|---|---|
| Whisper Tiny (default) | 40 MB | Fastest | 99 (English-biased) | Quick first try on English speech. Garbled on non-English. |
| Moonshine Tiny (experimental) | 28 MB | Fastest | English only | Newer architecture (Useful Sensors, 2024). Slightly smaller than Whisper Tiny. Worth testing against Tiny on your own English audio. |
| Whisper Base | 75 MB | Fast | 99 (English-biased) | Small upgrade from Tiny; still doesn't handle non-English well. Skip in most cases. |
| Whisper Small | 244 MB | Medium | 99 (good multilingual) | The quality cliff for non-English. Spanish, French, Mandarin, Arabic etc. work well here. |
| Whisper Medium | 769 MB | Slow load | 99 (excellent) | Best in-browser accuracy. Heavy download, but a one-time cost: cached after. |
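
To get a feel for what the Size column means in practice, the rough arithmetic below estimates how long the first download of each model might take. This is only a sketch: the 50 Mbit/s connection speed is an assumed example value, the function name is hypothetical, and real downloads also depend on CDN throughput and protocol overhead.

```python
# Model sizes in megabytes, taken from the comparison table.
MODEL_SIZES_MB = {
    "Whisper Tiny": 40,
    "Moonshine Tiny": 28,
    "Whisper Base": 75,
    "Whisper Small": 244,
    "Whisper Medium": 769,
}


def estimate_download_seconds(size_mb: float, speed_mbit_per_s: float) -> float:
    """Rough download time: megabytes -> megabits (x8), divided by link speed."""
    if speed_mbit_per_s <= 0:
        raise ValueError("speed must be positive")
    return size_mb * 8 / speed_mbit_per_s


if __name__ == "__main__":
    speed = 50.0  # assumed example connection speed in Mbit/s
    for name, size in MODEL_SIZES_MB.items():
        print(f"{name}: ~{estimate_download_seconds(size, speed):.0f} s")
```

At that assumed speed, Whisper Tiny arrives in a few seconds while Whisper Medium takes roughly two minutes, which is why the heavier models are described as a one-time cost.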

Plain-English decision tree

Is your audio clear English speech? Like a podcast, voice memo, or interview in English, recorded in a quiet space.
→ Whisper Tiny
Is your audio in any other language? Spanish, French, Mandarin, German, Japanese, Hindi, Arabic, Portuguese, etc.
→ Whisper Small
Is there background noise, music, or accents? YouTube clips with music behind voice, accented speech, slightly noisy room.
→ Whisper Small or Medium
Do you need professional-grade accuracy? Legal transcripts, medical notes, research interviews where errors cost time.
→ Whisper Medium (or a paid service)
Just want to test how good our transcription is? First-time visitor, curious about the tool.
→ Whisper Tiny
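
The decision tree above can be written down as a small function. This is an illustrative sketch in Python, not SnipSound's actual picker code: the `AudioInfo` fields and the `recommend_model` name are hypothetical, and the order in which the questions are checked (most demanding requirement first) is an assumption, since the list above does not say which answer wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class AudioInfo:
    language: str = "en"              # language code of the speech, e.g. "en", "es"
    noisy_or_accented: bool = False   # background noise, music, or accents
    needs_pro_accuracy: bool = False  # legal, medical, or research transcripts


def recommend_model(audio: AudioInfo) -> str:
    """Return a model name following the decision tree above."""
    # Assumed priority: check the most demanding requirement first.
    if audio.needs_pro_accuracy:
        return "Whisper Medium"
    if audio.noisy_or_accented:
        return "Whisper Small"  # step up to Medium if Small is not enough
    if audio.language != "en":
        return "Whisper Small"
    return "Whisper Tiny"       # clear English speech, or a first test
```

For example, `recommend_model(AudioInfo(language="es"))` returns "Whisper Small", while the defaults (clear English) return "Whisper Tiny".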

How it works in your browser

SnipSound's transcription is built on Whisper — an open-source speech-recognition model from OpenAI — and its newer cousin Moonshine from Useful Sensors. Both are deep neural networks. Normally these run on powerful server GPUs. We run them on your computer's CPU via WebAssembly, which lets compiled C/Rust/C++ code execute inside your browser at near-native speed.

The first time you click Transcribe with a given model, your browser downloads the model file (40 MB to 769 MB depending on which one) from Hugging Face's public CDN. The download happens once per model per browser, then the model is cached in your browser's IndexedDB storage. Subsequent transcriptions reuse the cached model — no re-download, no network needed for inference.
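
The download-once behavior described above follows a common caching pattern. The sketch below illustrates it in Python under stated assumptions: a plain dictionary stands in for the browser's IndexedDB storage, `fake_fetch` stands in for the CDN download, and all names are hypothetical. The real implementation runs in the browser and will differ.

```python
from typing import Callable, Dict


class ModelCache:
    """Download-once cache; a dict stands in for the browser's IndexedDB."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch              # hypothetical network download function
        self._store: Dict[str, bytes] = {}
        self.downloads = 0               # counts real network fetches

    def get(self, model_name: str) -> bytes:
        # Cache hit: reuse the stored weights, no network access.
        if model_name in self._store:
            return self._store[model_name]
        # Cache miss: download once, then keep the bytes for next time.
        data = self._fetch(model_name)
        self.downloads += 1
        self._store[model_name] = data
        return data


def fake_fetch(model_name: str) -> bytes:
    """Stand-in for the CDN download; returns placeholder bytes."""
    return f"weights-for-{model_name}".encode()


cache = ModelCache(fake_fetch)
cache.get("whisper-tiny")    # first use: downloads
cache.get("whisper-tiny")    # second use: served from the cache
cache.get("whisper-small")   # a different model: downloads once
print(cache.downloads)
```

Three requests trigger only two downloads, one per model, which mirrors the "once per model per browser" behavior.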

When you actually transcribe an audio file:

  1. Your browser resamples the audio to 16 kHz mono (the format Whisper expects).
  2. The model processes the audio in 30-second sliding windows.
  3. For each window, it outputs a sequence of word tokens with timestamps.
  4. The result is stitched together and rendered as the transcript you see.
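
Steps 1 and 2 can be sketched in a few lines. This is a simplified illustration, not SnipSound's code: the function names are hypothetical, the resampler uses naive linear interpolation with no anti-aliasing filter (a production resampler would filter first), and the windows here are cut back to back rather than overlapping.

```python
from typing import List, Sequence

TARGET_RATE = 16_000   # Hz, the sample rate Whisper expects
WINDOW_SECONDS = 30


def to_mono(left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Average two channels into one."""
    return [(l + r) / 2 for l, r in zip(left, right)]


def resample_linear(samples: Sequence[float], src_rate: int,
                    dst_rate: int = TARGET_RATE) -> List[float]:
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def split_windows(samples: Sequence[float], rate: int = TARGET_RATE,
                  seconds: int = WINDOW_SECONDS) -> List[Sequence[float]]:
    """Cut the signal into consecutive windows; the last one may be shorter."""
    size = rate * seconds
    return [samples[i:i + size] for i in range(0, len(samples), size)]
```

With these definitions, one second of 48 kHz audio becomes 16,000 samples, and a 70-second clip at 16 kHz splits into three windows: two full 30-second windows and a 10-second remainder.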

None of this involves a server call. The audio never leaves your computer. We have no way to access it.

Privacy guarantee

All transcription happens locally. The only network traffic is the one-time model download from Hugging Face's public CDN. After that, you can disconnect from the internet and transcription still works.

Why is this free?

The big paid transcription services (Otter, Rev, Happy Scribe, Trint) run AI models on their own servers. They have to pay for GPU time, so every minute of audio you transcribe costs them real money. That's why they charge users $10–30 per month or bill per minute.

SnipSound takes the opposite approach. The model runs on your computer, so we have zero per-user compute cost. We don't need to pay for GPUs because we don't have any. The trade-off: we can only use models small enough to download and run in a browser. Whisper Medium (769 MB) is the upper limit; the much better Whisper Large (1.5 GB) is too big for most users to tolerate downloading.

If you need higher accuracy than Whisper Medium provides, that's when a paid service is genuinely worth it. We'd rather be transparent about the ceiling than oversell what we can do.

When you need a paid service instead

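SnipSound's free transcription is great for everyday audio such as podcasts, voice memos, and interviews.

It's not the right choice when you need professional-grade accuracy: legal transcripts, medical notes, or research interviews where errors cost time.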

For those cases: Otter, Rev, Trint, Happy Scribe, Sonix. Or run Whisper Large locally with MacWhisper (a Mac app) or, if you're comfortable with the command line, whisper.cpp (macOS, Linux, Windows).

What about Moonshine?

Moonshine is a newer speech-recognition model from Useful Sensors, released in 2024. It's designed specifically for on-device transcription — smaller and faster than Whisper, with a focus on incremental (live) speech rather than 30-second chunks.

Useful Sensors' own benchmarks claim Moonshine Tiny hits 12.00% word-error-rate vs Whisper Tiny's 12.81% on their test set, with materially lower latency on short clips. We added it as an option so you can compare for yourself on your own audio.
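
If you want to compare models on your own audio, word error rate (WER) is the usual yardstick: the number of word substitutions, deletions, and insertions needed to turn the model's output into a reference transcript, divided by the number of reference words. The sketch below is a minimal version with a hypothetical function name. It only lowercases the text, whereas published benchmarks typically also normalize punctuation and number formatting, so its results will not match published figures exactly.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

For instance, if the reference is "the quick brown fox" and a model outputs "the quick brown box", one of four words is wrong, so the WER is 0.25 (25%).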

The catch: Moonshine is English-only at the moment. Whisper is the multilingual choice. For any non-English audio, stick with Whisper Small or Medium.

A note on Whisper-tiny.en versus multilingual

OpenAI released two flavors of each Whisper size: an English-only variant (suffix .en) and a multilingual one. The English-only variants are slightly more accurate on English speech because they don't waste capacity on other languages.

We currently ship the multilingual variants of Tiny / Base / Small / Medium because most SnipSound users record in multiple languages, and the accuracy gap on pure English is marginal (around 1 percentage point of WER between Whisper Tiny and Whisper Tiny.en). If you'd specifically benefit from an English-only mode, let us know.

Ready to try?

Transcribe Audio →
Translate Audio to English →
Record + Transcribe Voice →

Have a question or suggestion about which model fits your case? Email us. The SnipSound tools all run 100% in your browser, so we don't see your audio and we don't collect analytics. That also means we only find out that something doesn't work if you tell us.