Which Transcription Model Should I Pick?
SnipSound's free transcription tools let you choose between five different AI models. They all run in your browser — none of your audio is uploaded — but they differ in size, speed, language support, and accuracy. Here's the no-nonsense guide.
Quick recommendation
- Just speaking English? Use Whisper Tiny (40 MB). Fast, downloads quickly, good enough for most clear speech.
- Speaking another language (Spanish, French, Mandarin, etc.)? Use Whisper Small (244 MB). Multilingual accuracy is dramatically better; the Tiny model produces garbled output on non-English speech.
- Need maximum accuracy? Use Whisper Medium (769 MB). Best in-browser quality, but the download is heavy (5–10 min on slow connections).
SnipSound auto-recommends the right model when you pick a source language. The picker switches automatically; you don't have to think about it. This page exists so you understand why.
The five models compared
| Model | Size | Speed | Languages | Best for |
|---|---|---|---|---|
| Whisper Tiny (default) | 40 MB | Fastest | 99 (English-biased) | Quick first try on English speech. Garbled on non-English. |
| Moonshine Tiny (experimental) | 28 MB | Fastest | English only | Newer architecture (Useful Sensors, 2024). Slightly smaller than Whisper Tiny. Worth testing against Tiny on your own English audio. |
| Whisper Base | 75 MB | Fast | 99 (English-biased) | Small upgrade from Tiny — still doesn't handle non-English well. Skip in most cases. |
| Whisper Small | 244 MB | Medium | 99 (good multilingual) | The quality cliff for non-English. Spanish, French, Mandarin, Arabic etc. work well here. |
| Whisper Medium | 769 MB | Slow load | 99 (excellent) | Best in-browser accuracy. Heavy download but a one-time cost — cached after. |
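To put the sizes in the table into perspective, here is a rough back-of-the-envelope estimate of download time. It is only a sketch: it assumes a steady connection, treats 1 MB as 8 megabits, and ignores latency and protocol overhead. The function name and the 10 Mbps figure are illustrative choices, not measurements.

```python
# Model sizes in megabytes, copied from the comparison table above.
MODEL_SIZES_MB = {
    "Whisper Tiny": 40,
    "Moonshine Tiny": 28,
    "Whisper Base": 75,
    "Whisper Small": 244,
    "Whisper Medium": 769,
}


def download_seconds(size_mb: float, speed_mbps: float) -> float:
    """Estimate download time in seconds.

    Assumes 1 MB = 8 megabits and a steady connection speed;
    latency and protocol overhead are ignored.
    """
    if speed_mbps <= 0:
        raise ValueError("speed_mbps must be positive")
    return size_mb * 8 / speed_mbps


if __name__ == "__main__":
    for name, size in MODEL_SIZES_MB.items():
        minutes = download_seconds(size, 10) / 60
        print(f"{name}: about {minutes:.1f} min at 10 Mbps")
```

Under these assumptions, Whisper Medium takes roughly 10 minutes at 10 Mbps and about 5 minutes at 20 Mbps, which matches the 5–10 minute range quoted for slow connections.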
Plain-English decision tree
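The quick recommendation at the top of this page can be written as a small function. This is an illustrative sketch, not SnipSound's actual picker code: the function name, the `max_accuracy` flag, and the use of two-letter language codes such as "en" are all assumptions.

```python
def recommend_model(language: str, max_accuracy: bool = False) -> str:
    """Return a model name following the quick recommendation on this page.

    `language` is assumed to be a two-letter code such as "en" or "es".
    """
    if max_accuracy:
        # Best in-browser quality, at the cost of the largest download.
        return "Whisper Medium"
    if language.strip().lower() == "en":
        # Clear English speech: the smallest Whisper model is usually enough.
        return "Whisper Tiny"
    # Any other language: Small is where multilingual quality becomes usable.
    return "Whisper Small"


if __name__ == "__main__":
    for lang in ("en", "es", "zh"):
        print(lang, "->", recommend_model(lang))
```

Whisper Base and Moonshine Tiny are left out on purpose: the table describes Base as a small upgrade that is skippable in most cases, and Moonshine as an experimental option for comparison on English audio.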
How it works in your browser
SnipSound's transcription is built on Whisper — an open-source speech-recognition model from OpenAI — and its newer cousin Moonshine from Useful Sensors. Both are deep neural networks. Normally these run on powerful server GPUs. We run them on your computer's CPU via WebAssembly, which lets compiled C/Rust/C++ code execute inside your browser at near-native speed.
The first time you click Transcribe with a given model, your browser downloads the model file (40 MB to 769 MB depending on which one) from Hugging Face's public CDN. The download happens once per model per browser, then the model is cached in your browser's IndexedDB storage. Subsequent transcriptions reuse the cached model — no re-download, no network needed for inference.
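The download-once behaviour described above is a cache-first lookup. The sketch below shows the idea in Python, with a plain dictionary standing in for the browser's IndexedDB storage. `ModelCache` and `fake_fetch` are hypothetical names for illustration and do not correspond to SnipSound's code.

```python
from typing import Callable, Dict


class ModelCache:
    """Cache-first loader; a dict stands in for the browser's IndexedDB."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                 # hypothetical download function
        self._store: Dict[str, bytes] = {}  # model name -> model bytes
        self.downloads = 0                  # counts real network fetches

    def load(self, model_name: str) -> bytes:
        """Return the model, downloading it only if it is not cached yet."""
        if model_name not in self._store:
            self._store[model_name] = self._fetch(model_name)
            self.downloads += 1
        return self._store[model_name]


def fake_fetch(model_name: str) -> bytes:
    """Placeholder for the real CDN download; returns dummy bytes."""
    return f"weights for {model_name}".encode()


if __name__ == "__main__":
    cache = ModelCache(fake_fetch)
    cache.load("whisper-tiny")   # first use: downloads
    cache.load("whisper-tiny")   # second use: served from the cache
    cache.load("whisper-small")  # a different model triggers a new download
    print("downloads:", cache.downloads)
```

The cache is keyed by model name, which is why switching to a different model triggers one more download while repeat transcriptions with the same model do not.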
When you actually transcribe an audio file:
- Your browser resamples the audio to 16 kHz mono (the format Whisper expects).
- The model processes the audio in 30-second sliding windows.
- For each window, it outputs a sequence of word tokens with timestamps.
- The result is stitched together and rendered as the transcript you see.
None of this involves a server call. The audio never leaves your computer. We have no way to access it.
All transcription happens locally. The only network traffic is the one-time model download from Hugging Face's public CDN. After that, you can disconnect from the internet and transcription still works.
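The transcription steps listed above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: resampling is omitted (the input is assumed to be at 16 kHz already), the windows are cut back to back with no overlap, and the model itself is not shown. All function names are hypothetical.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # Hz, the rate stated in the steps above
WINDOW_SECONDS = 30    # window length stated in the steps above

Segment = Tuple[float, float, str]  # (start_s, end_s, text)


def to_mono(left: List[float], right: List[float]) -> List[float]:
    """Downmix stereo to mono by averaging the two channels."""
    return [(l + r) / 2 for l, r in zip(left, right)]


def split_windows(samples: List[float]) -> List[Tuple[float, List[float]]]:
    """Cut audio into consecutive 30-second windows.

    Returns (start_time_in_seconds, window_samples) pairs. The last
    window may be shorter than 30 seconds.
    """
    size = SAMPLE_RATE * WINDOW_SECONDS
    return [
        (start / SAMPLE_RATE, samples[start:start + size])
        for start in range(0, len(samples), size)
    ]


def stitch(per_window: List[Tuple[float, List[Segment]]]) -> List[Segment]:
    """Shift window-relative timestamps to absolute time and concatenate."""
    transcript: List[Segment] = []
    for offset, segments in per_window:
        for start, end, text in segments:
            transcript.append((start + offset, end + offset, text))
    return transcript
```

For example, a 70-second recording yields three windows starting at 0, 30, and 60 seconds, and a phrase the model places 0.5 seconds into the second window ends up at 30.5 seconds in the final transcript.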
Why is this free?
The big paid transcription services (Otter, Rev, Happy Scribe, Trint) run AI models on their own servers. They have to pay for GPU time — every minute of audio you transcribe costs them real money. So they charge users $10–30/month or per-minute pricing.
SnipSound takes the opposite approach. The model runs on your computer, so we have zero per-user cost. We don't need to pay for GPUs because we don't have any. The trade-off: we can only use models small enough to fit in a browser. Whisper Medium (769 MB) is the upper limit; the much-better Whisper Large (1.5 GB) is too big for most users to tolerate downloading.
If you need higher accuracy than Whisper Medium provides, that's when a paid service is genuinely worth it. We'd rather be transparent about the ceiling than oversell what we can do.
When you need a paid service instead
SnipSound's free transcription is great for:
- Casual transcripts of podcasts, voice memos, lectures, voiceovers, YouTube clips
- Quick rough drafts you'll edit afterward
- Privacy-sensitive recordings (therapy sessions, confidential interviews, internal meetings)
- Anyone who doesn't want to pay $10–30/month
It's not the right choice for:
- Multi-speaker labels. "Speaker 1: Hi. Speaker 2: Hi." None of our models do speaker diarization.
- Files longer than 60 minutes. Browser RAM is limited; we cap input at 60 minutes.
- Legal/medical/research-grade accuracy. Whisper Medium is good but the gap to professional services is still meaningful.
- Live real-time transcription. Our tools work on recorded files that you open in the browser, not on a live mic stream as you speak.
- Specialized vocabulary. Medical, legal, and technical jargon often gets mangled. Paid services train custom vocabularies.
For those cases, consider Otter, Rev, Trint, Happy Scribe, or Sonix. Alternatively, run Whisper Large locally with MacWhisper (a Mac desktop app) or, if you're comfortable with the command line, whisper.cpp (macOS, Linux, and Windows).
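The 60-minute cap mentioned above is easier to understand with some quick arithmetic on how much memory the decoded audio needs. This is a rough sketch: it assumes samples are stored as 32-bit floats at 16 kHz mono, which is a common choice but not something this page confirms, and the function names are illustrative.

```python
SAMPLE_RATE = 16_000   # Hz, mono
BYTES_PER_SAMPLE = 4   # assumption: 32-bit float samples
MAX_MINUTES = 60       # the cap stated above


def decoded_megabytes(minutes: float) -> float:
    """Approximate size of decoded 16 kHz mono audio, in MB (10^6 bytes)."""
    return minutes * 60 * SAMPLE_RATE * BYTES_PER_SAMPLE / 1_000_000


def within_cap(duration_seconds: float) -> bool:
    """True if a recording is short enough to be accepted."""
    return duration_seconds <= MAX_MINUTES * 60


if __name__ == "__main__":
    print(f"60 minutes of audio: about {decoded_megabytes(60):.1f} MB")
    print("61-minute file accepted?", within_cap(61 * 60))
```

Under these assumptions a 60-minute file occupies about 230 MB once decoded, and that sits in memory alongside the model itself, which is up to 769 MB for Whisper Medium.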
What about Moonshine?
Moonshine is a newer speech-recognition model from Useful Sensors, released in 2024. It's designed specifically for on-device transcription — smaller and faster than Whisper, with a focus on incremental (live) speech rather than 30-second chunks.
Useful Sensors' own benchmarks claim Moonshine Tiny hits 12.00% word-error-rate vs Whisper Tiny's 12.81% on their test set, with materially lower latency on short clips. We added it as an option so you can compare for yourself on your own audio.
The catch: Moonshine is English-only at the moment. Whisper is the multilingual choice. For any non-English audio, stick with Whisper Small or Medium.
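If you want to compare models on your own audio as suggested above, word error rate (WER) is the usual yardstick: the number of word substitutions, deletions, and insertions needed to turn the model's output into a correct reference transcript, divided by the number of words in the reference. The sketch below is a minimal implementation. It only lowercases the text before comparing, whereas published benchmarks typically apply more thorough normalization (punctuation, numbers), so treat your results as indicative only.

```python
from typing import List


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Word-level Levenshtein distance between two word lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion (reference word missing)
                current[j - 1] + 1,      # insertion (extra word in output)
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word errors divided by reference word count."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return word_errors(ref_words, hypothesis.lower().split()) / len(ref_words)


if __name__ == "__main__":
    truth = "the cat sat on the mat"
    output = "the cat sat on a mat"
    print(f"WER: {wer(truth, output):.1%}")
```

To use it, transcribe the same clip with two models, type out a careful reference transcript by hand, and compare the two scores. For scale, the gap between the 12.00% and 12.81% figures quoted above is about 6% in relative terms.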
A note on Whisper-tiny.en versus multilingual
OpenAI released two flavors of the Tiny, Base, Small, and Medium Whisper sizes: an English-only variant (suffix .en) and a multilingual one. The English-only variants are slightly more accurate on English speech because they don't waste capacity on other languages.
We currently ship the multilingual variants of Tiny, Base, Small, and Medium because most SnipSound users record in multiple languages, and the accuracy gap on pure English is marginal (Whisper Tiny and Whisper Tiny.en differ by around 1 percentage point of WER). If you'd specifically benefit from English-only mode, let us know.
Ready to try?
Have a question or suggestion about which model fits your case? The SnipSound tools all run 100% in your browser, so we never see your audio and have no analytics to consult. Email us if something doesn't work.