Which Transcription Model Should I Pick?

SnipSound's free transcription tools let you choose between five different AI models. They all run in your browser — none of your audio is uploaded — but they differ in size, speed, language support, and accuracy. Here's the no-nonsense guide.

Quick recommendation

Just speaking English? Use Whisper Tiny (40 MB). Fast, downloads quickly, good enough for most clear speech.
Speaking another language (Spanish, French, Mandarin, etc.)? Use Whisper Small (244 MB). Multilingual accuracy is dramatically better — the Tiny model produces garbled output on non-English speech.
Need maximum accuracy? Use Whisper Medium (769 MB). Best in-browser quality but the download is heavy (5–10 min on slow connections).

SnipSound auto-recommends the right model when you pick a source language. The picker switches automatically; you don't have to think about it. This page exists so you understand why.

The five models compared

Model	Size	Speed	Languages	Best for
Whisper Tiny (default)	40 MB	Fastest	99 (English-biased)	Quick first try on English speech. Garbled on non-English.
Moonshine Tiny (experimental)	28 MB	Fastest	English only	Newer architecture (Useful Sensors, 2024). Slightly smaller than Whisper Tiny. Worth testing against Tiny on your own English audio.
Whisper Base	75 MB	Fast	99 (English-biased)	Small upgrade from Tiny — still doesn't handle non-English well. Skip in most cases.
Whisper Small	244 MB	Medium	99 (good multilingual)	The quality threshold for non-English: Spanish, French, Mandarin, Arabic, etc. work well from this size up.
Whisper Medium	769 MB	Slow load	99 (excellent)	Best in-browser accuracy. Heavy download but a one-time cost — cached after.
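
To get a feel for what these sizes mean in waiting time, here is a rough back-of-the-envelope sketch. The sizes come from the table above; the 10 Mbit/s connection speed and the function name are illustrative assumptions, and real downloads vary with network overhead and congestion.

```typescript
// Model sizes as listed in the table above.
interface ModelInfo {
  name: string;
  sizeMB: number;
}

const MODELS: ModelInfo[] = [
  { name: "Moonshine Tiny", sizeMB: 28 },
  { name: "Whisper Tiny", sizeMB: 40 },
  { name: "Whisper Base", sizeMB: 75 },
  { name: "Whisper Small", sizeMB: 244 },
  { name: "Whisper Medium", sizeMB: 769 },
];

// Seconds needed to download `sizeMB` megabytes at `mbps` megabits per second.
// 1 byte = 8 bits, so megabytes * 8 = megabits. Protocol overhead is ignored.
function downloadSeconds(sizeMB: number, mbps: number): number {
  if (mbps <= 0) throw new RangeError("mbps must be positive");
  return (sizeMB * 8) / mbps;
}

// Print an estimate for each model on a hypothetical 10 Mbit/s connection.
for (const model of MODELS) {
  const minutes = downloadSeconds(model.sizeMB, 10) / 60;
  console.log(`${model.name}: about ${minutes.toFixed(1)} min at 10 Mbit/s`);
}
```

Under these assumptions, a 5 to 10 minute wait for Whisper Medium corresponds to a connection of roughly 10 to 20 Mbit/s.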

Plain-English decision tree

Is your audio clear English speech? Like a podcast, voice memo, or interview in English, recorded in a quiet space.

→ Whisper Tiny

Is your audio in any other language? Spanish, French, Mandarin, German, Japanese, Hindi, Arabic, Portuguese, etc.

→ Whisper Small

Is there background noise, music, or accents? YouTube clips with music behind voice, accented speech, slightly noisy room.

→ Whisper Small or Medium

Do you need professional-grade accuracy? Legal transcripts, medical notes, research interviews where errors cost time.

→ Whisper Medium (or a paid service)

Just want to test how good our transcription is? First-time visitor, curious about the tool.

→ Whisper Tiny
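
If you think in code, the decision tree above boils down to a few branches. This is a minimal sketch with made-up names (`AudioProfile`, `pickModel`), not SnipSound's actual picker logic. Where the tree says "Small or Medium", the sketch assumes you start with the lighter download.

```typescript
type ModelName = "Whisper Tiny" | "Whisper Small" | "Whisper Medium";

// Hypothetical description of a recording; field names are illustrative only.
interface AudioProfile {
  language: string;            // language code such as "en", "en-US", "es"
  noisyOrAccented?: boolean;   // background noise, music, or accented speech
  needsTopAccuracy?: boolean;  // legal, medical, or research-grade needs
}

function pickModel(profile: AudioProfile): ModelName {
  // Professional-grade accuracy: go straight to the largest in-browser model.
  if (profile.needsTopAccuracy) return "Whisper Medium";

  // Any non-English language: Small is the smallest model that handles it well.
  const primaryLanguage = profile.language.toLowerCase().split("-")[0];
  if (primaryLanguage !== "en") return "Whisper Small";

  // English with noise, music, or accents: start with Small, move up if needed.
  if (profile.noisyOrAccented) return "Whisper Small";

  // Clear English speech, or a first-time test.
  return "Whisper Tiny";
}

console.log(pickModel({ language: "en" }));
console.log(pickModel({ language: "es" }));
console.log(pickModel({ language: "en-US", noisyOrAccented: true }));
```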

How it works in your browser

SnipSound's transcription is built on Whisper — an open-source speech-recognition model from OpenAI — and its newer cousin Moonshine from Useful Sensors. Both are deep neural networks. Normally these run on powerful server GPUs. We run them on your computer's CPU via WebAssembly, which lets compiled C/Rust/C++ code execute inside your browser at near-native speed.

The first time you click Transcribe with a given model, your browser downloads the model file (40 MB to 769 MB depending on which one) from Hugging Face's public CDN. The download happens once per model per browser, then the model is cached in your browser's IndexedDB storage. Subsequent transcriptions reuse the cached model — no re-download, no network needed for inference.
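
The "download once, then reuse" behavior can be sketched as follows. This is an illustration only: the class and function names are hypothetical, and an in-memory Map stands in for the browser's persistent storage so the example runs anywhere.

```typescript
// Hypothetical names throughout; a Map stands in for persistent browser storage.
type Downloader = (modelId: string) => Promise<Uint8Array>;

class ModelCache {
  private readonly store = new Map<string, Uint8Array>();
  private readonly download: Downloader;

  constructor(download: Downloader) {
    this.download = download;
  }

  // Returns cached bytes when present; otherwise downloads once and stores them.
  async get(modelId: string): Promise<Uint8Array> {
    const cached = this.store.get(modelId);
    if (cached !== undefined) return cached; // cache hit: no network needed
    const bytes = await this.download(modelId);
    this.store.set(modelId, bytes);
    return bytes;
  }
}

// A fake downloader that counts how often the "network" is hit.
let networkCalls = 0;
const cache = new ModelCache(async (modelId) => {
  networkCalls += 1;
  return new Uint8Array(modelId.length); // placeholder bytes, not a real model
});

async function demo(): Promise<number> {
  await cache.get("whisper-tiny");
  await cache.get("whisper-tiny");  // served from the cache
  await cache.get("whisper-small"); // different model, so a second download
  return networkCalls;
}
```

Calling `demo()` triggers two downloads for three requests, one per distinct model. The sketch does not deduplicate two requests that arrive while a download is still in progress.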

When you actually transcribe an audio file:

Your browser resamples the audio to 16 kHz mono (the format Whisper expects).
The model processes the audio in 30-second sliding windows.
For each window, it outputs a sequence of text tokens with timestamps.
The result is stitched together and rendered as the transcript you see.
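
The first two steps above can be sketched in plain TypeScript. This is a simplified illustration, not SnipSound's actual code. The function names are made up, the resampler uses plain linear interpolation with no anti-aliasing filter (production resamplers do filter), and the windows are cut back to back with no overlap.

```typescript
const TARGET_RATE = 16000;  // Hz, the sample rate Whisper expects
const WINDOW_SECONDS = 30;  // window length described in the text

// Average all channels into a single mono channel.
function toMono(channels: Float32Array[]): Float32Array {
  const length = channels.reduce((min, ch) => Math.min(min, ch.length), Infinity);
  if (!Number.isFinite(length)) return new Float32Array(0); // no channels given
  const mono = new Float32Array(length);
  for (const ch of channels) {
    for (let i = 0; i < length; i++) mono[i] += ch[i] / channels.length;
  }
  return mono;
}

// Resample by linear interpolation between neighboring input samples.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

// Cut the signal into consecutive windows; the last one may be shorter.
function splitIntoWindows(
  samples: Float32Array,
  rate: number = TARGET_RATE,
  seconds: number = WINDOW_SECONDS,
): Float32Array[] {
  const size = rate * seconds;
  const windows: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += size) {
    windows.push(samples.subarray(start, start + size)); // a view, not a copy
  }
  return windows;
}

// Example: 2 seconds of 48 kHz stereo becomes 32,000 mono samples at 16 kHz.
const left = new Float32Array(96000).fill(1);
const right = new Float32Array(96000).fill(0);
const prepared = resampleLinear(toMono([left, right]), 48000, TARGET_RATE);
console.log(prepared.length, splitIntoWindows(prepared).length);
```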

None of this involves a server call. The audio never leaves your computer. We have no way to access it.

Privacy guarantee

All transcription happens locally. The only network traffic is the one-time model download from Hugging Face's public CDN. After that, you can disconnect from the internet and transcription still works.

Why is this free?

The big paid transcription services (Otter, Rev, Happy Scribe, Trint) run AI models on their own servers. They have to pay for GPU time, so every minute of audio you transcribe costs them real money. That's why they charge $10–30/month or bill per minute.

SnipSound takes the opposite approach. The model runs on your computer, so we have zero per-user cost. We don't need to pay for GPUs because we don't have any. The trade-off: we can only use models small enough to fit in a browser. Whisper Medium (769 MB) is the upper limit; the much better Whisper Large (1.5 GB) is too big for most users to tolerate downloading.

If you need higher accuracy than Whisper Medium provides, that's when a paid service is genuinely worth it. We'd rather be transparent about the ceiling than oversell what we can do.

When you need a paid service instead

SnipSound's free transcription is great for:

Casual transcripts of podcasts, voice memos, lectures, voiceovers, YouTube clips
Quick rough drafts you'll edit afterward
Privacy-sensitive recordings (therapy sessions, confidential interviews, internal meetings)
Anyone who doesn't want to pay $10–30/month

It's not the right choice for:

Multi-speaker labels. "Speaker 1: Hi. Speaker 2: Hi." None of our models do speaker diarization.
Files longer than 60 minutes. Browser RAM is limited; we cap input at 60 minutes.
Legal/medical/research-grade accuracy. Whisper Medium is good but the gap to professional services is still meaningful.
Live (real-time) transcription. Our tools work on uploaded or recorded files, not on a live mic stream as you speak.
Specialized vocabulary. Medical, legal, and technical jargon often gets mangled. Paid services train custom vocabularies.

For those cases: Otter, Rev, Trint, Happy Scribe, Sonix. Or run Whisper Large locally with MacWhisper (a Mac app) or, if you're comfortable with the command line, whisper.cpp (macOS, Linux, Windows).
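
On the 60-minute cap in the list above: a quick memory estimate shows why long files strain a browser tab. This sketch assumes decoded audio is held in memory as 32-bit float samples, which is common in browsers but is an assumption about SnipSound's internals. The function names are illustrative.

```typescript
const MAX_MINUTES = 60; // the input cap stated in the text

// Bytes needed to hold uncompressed audio in memory.
// Assumes 4 bytes per sample (32-bit float) unless told otherwise.
function pcmBytes(
  minutes: number,
  sampleRate: number,
  channels: number,
  bytesPerSample: number = 4,
): number {
  return minutes * 60 * sampleRate * channels * bytesPerSample;
}

// True when a recording of this duration fits under the cap.
function isWithinCap(durationSeconds: number): boolean {
  return durationSeconds <= MAX_MINUTES * 60;
}

// One hour at 16 kHz mono (the format the model consumes).
const modelInputMB = pcmBytes(60, 16000, 1) / 1e6;
// One hour at 48 kHz stereo (a typical source file once decoded).
const decodedSourceMB = pcmBytes(60, 48000, 2) / 1e6;

console.log(`16 kHz mono: ${modelInputMB.toFixed(1)} MB`);
console.log(`48 kHz stereo: ${decodedSourceMB.toFixed(1)} MB`);
console.log(isWithinCap(45 * 60), isWithinCap(90 * 60));
```

Under these assumptions, an hour of audio takes about 230 MB once resampled and about 1.4 GB if a 48 kHz stereo source is fully decoded first, on top of the model itself.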

What about Moonshine?

Moonshine is a newer speech-recognition model from Useful Sensors, released in 2024. It's designed specifically for on-device transcription — smaller and faster than Whisper, with a focus on incremental (live) speech rather than 30-second chunks.

Useful Sensors' own benchmarks claim Moonshine Tiny hits a 12.00% word error rate (WER) vs Whisper Tiny's 12.81% on their test set, with materially lower latency on short clips. We added it as an option so you can compare for yourself on your own audio.
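
If you want to run that comparison yourself, word error rate is straightforward to compute: count the word-level substitutions, insertions, and deletions needed to turn the model's output into a correct reference transcript, then divide by the number of words in the reference. Below is a minimal sketch. The function names are ours, and the normalization (lowercasing and splitting on whitespace) is a simplifying assumption; published benchmarks usually normalize punctuation and numbers more carefully.

```typescript
// Lowercase and split on whitespace. Punctuation is left attached to words.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((word) => word.length > 0);
}

// WER = word-level edit distance / number of words in the reference.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) throw new RangeError("reference must contain at least one word");

  // Levenshtein distance over words, keeping only the previous row.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i]; // deleting the first i reference words
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;      // reference word missing from output
      const insertion = curr[j - 1] + 1; // extra word in the output
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One wrong word out of five gives a WER of 0.2 (20%).
console.log(wordErrorRate("the quick brown fox jumps", "the quick brown box jumps"));
```

Transcribe the same clip with two models, compare each output against a transcript you trust, and the lower number wins for your audio.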

The catch: Moonshine is English-only at the moment. Whisper is the multilingual choice. For any non-English audio, stick with Whisper Small or Medium.

A note on Whisper Tiny.en versus multilingual

For the Tiny through Medium sizes, OpenAI released two flavors: an English-only variant (suffix .en) and a multilingual one. The English-only variants are slightly more accurate on English speech because they don't spend capacity on other languages.

We currently ship the multilingual variants of Tiny, Base, Small, and Medium because most SnipSound users record in multiple languages, and the accuracy gap on pure English is marginal: Whisper Tiny and Whisper Tiny.en differ by around 1 percentage point of word error rate (WER). If you'd specifically benefit from an English-only mode, let us know.

Ready to try?

Transcribe Audio →
Translate Audio to English →
Record + Transcribe Voice →

Have a question or suggestion about which model fits your case? The SnipSound tools all run 100% in your browser, so we see neither your audio nor any analytics. Email us if something doesn't work.