What the family is
The Whisper family is a set of open speech-recognition models that turn audio into text. OpenAI released the first generation in 2022 and several improved generations since. The models are open-weight: anyone can download them, run them, and build on them. Modern Mac dictation apps either use these models directly or use derivatives optimised for Apple Silicon.
The reason the family matters for a dictation tool is that it makes on-device transcription practical on a laptop. Before Whisper, the best transcription models were either proprietary cloud services or research artefacts too heavy to run locally. Whisper closed that gap.
The sizes
The models come in a handful of sizes. Each is a complete, working speech recogniser; the difference is the number of parameters, which trades latency and memory against accuracy on the hard cases.
The smaller variants — typically called “tiny” and “base” — are fast, load in well under a second, and produce text that is good enough for clear speech in a quiet room. They miss more on accents and on technical jargon. They are perfect for a free-tier dictation tool.
The medium variants run a few times slower and use a few times more memory. They are noticeably better on accents and in noisier rooms.
The large variants are the most accurate; they also use the most memory and the most battery. On Apple Silicon they still run in real time on modern hardware, but the difference is felt — a noticeable delay between the key release and the words appearing.
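To make the "few times larger" trade-off concrete, here is a minimal sketch of the size ladder. The parameter counts are the approximate published figures for the original Whisper checkpoints; treat them as ballpark numbers, and note that `relative_size` is an illustrative helper, not part of any Whisper API.

```python
# Approximate parameter counts for the original Whisper checkpoints
# (ballpark figures from the published model card).
WHISPER_SIZES = {
    "tiny":   39_000_000,
    "base":   74_000_000,
    "small":  244_000_000,
    "medium": 769_000_000,
    "large":  1_550_000_000,
}

def relative_size(name: str) -> float:
    """How many times more parameters a variant has than 'tiny'."""
    return WHISPER_SIZES[name] / WHISPER_SIZES["tiny"]
```

The ratio is the thing to notice: the large checkpoint carries roughly forty times the parameters of tiny, which is where the extra memory, battery draw, and key-release latency come from.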
Which one should you pick
For most dictation cases the small or base size is enough. A native English speaker at a quiet desk, dictating into commit messages and Slack replies, will not see a quality difference that justifies the extra battery and latency cost of a larger model.
There are three honest cases for the larger models:
- A second-language English speaker dictating in English, where the model has to handle accent variation that a small model trips on.
- A noisy environment — a coffee shop, an open office, a busy household.
- Long-form dictation where every percent of accuracy compounds across several thousand words.
A practical default: ship the small model, let the user opt into the larger one when they actually need it. That is what Voiacast does — Free is the small model, Pro adds the larger ones as a setting.
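The default-plus-opt-in policy above can be written down as a tiny decision rule. This is a sketch of the logic, not Voiacast's actual settings code; `pick_model` and its flag names are made up for illustration.

```python
def pick_model(second_language: bool = False,
               noisy: bool = False,
               long_form: bool = False) -> str:
    """Default to the small model; upgrade only when one of the
    three honest cases for a larger model actually applies."""
    if second_language or noisy or long_form:
        return "large"
    return "small"
```

The point of spelling it out: the upgrade condition is an *or* over the three cases, so a quiet-desk native speaker never pays the large model's battery and latency cost by default.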
Languages
The Whisper family is multilingual. The same model can transcribe English, Spanish, French, German, Dutch, Italian, Portuguese, and Mandarin — and dozens of others — without switching models. The language is detected automatically from the audio. There is no “set language” step; you just talk.
The smaller models are slightly less accurate on languages other than English because their training data was English-dominant. The larger models close that gap. For a multilingual workflow — a Dutch speaker who code-switches into English several times a paragraph — the larger model on Pro is a more honest pick.
What “running on Apple Silicon” actually means
Modern Macs ship a Neural Engine — a dedicated block on the chip designed for neural-network inference. Whisper models compiled for that target run faster and use less energy than the same models on the CPU. Modern Mac dictation apps use a runtime that ships the Apple-optimised version of the model and falls back to the CPU on older hardware. The performance ceiling is high enough that a small model produces text faster than a human can read it.
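The ship-the-optimised-build-and-fall-back pattern looks roughly like this. The backend names (`"coreml"`, `"cpu"`) and the `preferred_backend` helper are illustrative, not a real runtime's API; the only grounded part is the platform check, which uses Python's standard `platform` module.

```python
import platform

def preferred_backend() -> str:
    """Pick the Apple-optimised path on Apple Silicon, CPU otherwise.
    Backend names are illustrative, not a real runtime API."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "coreml"  # Neural Engine / Apple-optimised build
    return "cpu"         # fallback for Intel Macs and other hosts
```

On an Apple Silicon Mac this returns the optimised path; everywhere else it quietly degrades to the CPU build, which is the behaviour the article describes.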
See also
- On-device dictation — why running these models locally matters.
- Custom dictation dictionary — the post-processing pass that closes most of the accuracy gap on technical vocabulary.
- Bring-your-own-key cloud transcription — when reaching for a frontier model in the cloud is the right call.