Why on-device still matters
On-device speech recognition is no longer the only path to good accuracy. It is still the right default. A working note on why.
Jamie van der Pijll
- privacy
- engineering
A reasonable objection to an on-device dictation tool in 2026 sounds like this: “The frontier models in the cloud are good enough that you do not need a local fallback. Just route the audio through OpenAI or Groq and let the model do the work.”
The objection is half right. The accuracy gap between the best on-device model and the best cloud model has narrowed enough that, on clean speech in a quiet room, you would struggle to pick the winner from a transcript alone. The other half of the objection — that the gap is the thing that matters — is, I think, the wrong reading of the last three years.
What changed and what did not
What changed: a frontier-scale speech model can now transcribe almost anything in almost any language at near-human accuracy. The remaining mistakes are dominated by edge cases: strong accents, heavy ambient noise, multi-speaker overlap. The model itself, like the bandwidth before it, is no longer the bottleneck.
What did not change: where the audio goes. A cloud transcription service has to receive the audio, transcribe it, and return the text. The audio sits — for some interval — on infrastructure outside the user’s control. The transcript, in most pricing models, sits there too, for accuracy research and abuse detection. Every cloud provider publishes a retention policy; every retention policy is longer than “none”.
That the retention policy says “30 days” or “for research” or “only when flagged” is fine as far as it goes. That a policy exists at all is the fact that matters. An on-device tool needs no retention policy because there is no audio to retain.
What that means in practice
I am not arguing that nobody should ever send audio to a cloud transcription service. Some workflows fit that shape. A multilingual support agent dealing with the hardest five percent of calls is right to reach for a frontier model. A researcher transcribing field recordings in a language a small local model handles poorly is right to reach for cloud transcription. The cloud is the right answer for some of the work.
The on-device argument is that the cloud should not be the default shape for the rest of the work. The dictation I do in the average working day — a commit message, an email, a Slack reply, a paragraph of a design doc — is precisely the kind of task where the marginal accuracy of a frontier model does not change my outcomes, while the privacy footprint of routing the audio through a vendor’s server does.
A useful question: of the dictation you do today, what fraction is in the five-percent tail where you actually need a frontier model? For most people I have talked to, the honest answer is small. For everything else, on-device is not a compromise; it is the better default.
The economics
A point that does not get made enough. Cloud transcription has a real per-minute cost the provider has to recoup. That cost shows up either as a per-seat subscription or as a usage quota. The economics push the provider towards a charging model where the user pays for every minute they dictate.
On-device transcription has no per-minute cost. The compute is on the user’s laptop, the laptop was already paid for, and the marginal cost of one more minute of dictation is the battery it consumes. Charging per minute makes no economic sense; charging once for the software that runs locally does.
This is why the tools that ship cloud-first end up with subscription pricing, and the tools that ship on-device-first can ship one-time pricing. It is a consequence of the shape, not a marketing choice.
Where bring-your-own-key lands
The middle option that I think is honest is bring-your-own-key cloud transcription. The dictation tool stays local by default. When the user wants to reach for a frontier model — for the hard cases — they plug in their own API key for OpenAI or Groq. The audio goes from their Mac to the provider they already pay; the dictation tool is the local glue, not a billed intermediary.
This is the shape I picked for Voiacast Pro. The user gets the frontier-model option when they need it, the provider relationship is direct, and the dictation tool stays out of the audio path for the ninety-five percent of cases where on-device is already enough.
The trust argument
There is one more reason on-device matters that nobody quite says out loud. A small, on-device, open-source-friendly dictation app can be audited by the user. The license-key validation request can be inspected. The custom dictionary file can be read with cat. The audio path can be traced through lsof and tcpdump. A user who is worried about a specific question — “is this dictation tool sending my audio anywhere?” — can answer that question themselves.
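A rough sketch of what that self-audit looks like in practice. The process name, file path, and hostname below are illustrative assumptions, not the app’s documented interface; substitute whatever app you actually run.

```shell
# 1. Which network connections does the app hold open right now?
#    (-i: internet sockets, -a: AND the filters, -p: this process only.
#     "Voiacast" is an assumed process name.)
lsof -i -a -p "$(pgrep -x Voiacast)"

# 2. Read the custom dictionary directly — it is just a file.
#    (Path is illustrative.)
cat ~/Library/Application\ Support/Voiacast/dictionary.txt

# 3. Watch the wire while dictating; any audio upload would show here.
#    (en0 is the typical macOS Wi-Fi interface; the host is one example
#     cloud endpoint — adjust both for your setup.)
sudo tcpdump -i en0 -n host api.openai.com
```

If step 1 shows no sockets and step 3 stays silent while you dictate, you have verified the audio path yourself — no privacy policy required.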
A cloud dictation service answers the same question with a privacy policy. Privacy policies are written by lawyers and updated by product managers. They are not a substitute for a verifiable audio path.
I think both can coexist. A cloud service for the workflows that need the frontier model and accept the policy. An on-device tool for the workflows that do not. The error is to treat the cloud as the only sensible default in 2026. The error is the default, not the cloud.
That is the argument behind why Voiacast’s default is local and stays local.