Is my audio uploaded to a server?

No. The transcription runs entirely in your browser using an on-device Whisper AI model. Your audio never leaves your computer. The only thing downloaded from the internet is the AI model itself, fetched once from a CDN and then cached.

What audio formats can I transcribe?

Any format your browser can decode. MP3, WAV, M4A/AAC, OGG, FLAC, and WebM work in most current browsers, though codec support varies from browser to browser. The file is decoded and down-sampled to 16 kHz mono in your browser before it is fed to the model.
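As a rough illustration of what "down-sampled to 16 kHz mono" means for the samples, the sketch below averages the channels into one and resamples with linear interpolation. The function names are hypothetical, and the tool itself may well rely on the browser's Web Audio API for decoding and resampling rather than hand-written code like this.

```typescript
// Hypothetical helpers for illustration; not the tool's actual code.
const TARGET_RATE = 16_000; // 16 kHz, the rate the text says the model is fed

/** Average all channels into a single mono channel. */
function mixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i] / channels.length;
  }
  return mono;
}

/** Resample by linear interpolation (simple; no anti-aliasing filter). */
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number = TARGET_RATE,
): Float32Array {
  if (fromRate === toRate || input.length === 0) return input.slice();
  const outLength = Math.max(1, Math.round((input.length * toRate) / fromRate));
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}
```

For example, one second of 48 kHz stereo audio becomes a single array of 16,000 samples after `resampleLinear(mixToMono([left, right]), 48_000)`.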

Which model should I pick?

Balanced (Whisper base, multilingual) is the default and handles Japanese and many other languages. Fast (Whisper tiny.en) is the smallest and quickest Whisper model but is English only. Accurate (Whisper large-v3-turbo) gives the best quality, including for Japanese, at the cost of a larger first download; it runs best on a WebGPU-capable browser. There are also two lightweight Moonshine models (MIT-licensed) for short English clips: they are quick, and Moonshine tiny is the smallest download of all, but they return plain text only (no timestamps or subtitles). For Japanese, or when you need subtitles, pick a Whisper model.

Can it output subtitles (SRT / VTT)?

Yes. After transcribing you can switch between plain text, timestamped lines, SRT, and WebVTT, and download or copy any of them. SRT and VTT are ready to load into video editors and players as subtitle tracks.

Why is the first run slow?

The first time you use a model, the browser downloads its weights (roughly 120 MB for Fast, 200 MB for Balanced, 760 MB for Accurate) and caches them. After that, runs are fast and work even offline. Long audio is processed in 30-second chunks, so processing time grows roughly in proportion to the length of the file. A WebGPU browser (recent Chrome or Edge) is much faster than the CPU fallback.
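To make the chunking concrete, here is a minimal sketch that splits 16 kHz audio into 30-second pieces and counts them. It assumes plain, non-overlapping chunks; the actual pipeline may overlap neighbouring chunks, and the function names are hypothetical.

```typescript
// Hypothetical helpers; assumes non-overlapping chunks of 16 kHz mono audio.
const SAMPLE_RATE = 16_000;
const CHUNK_SECONDS = 30;

/** Split samples into consecutive chunks; the last chunk may be shorter. */
function splitIntoChunks(
  samples: Float32Array,
  chunkSeconds: number = CHUNK_SECONDS,
  sampleRate: number = SAMPLE_RATE,
): Float32Array[] {
  const chunkSize = chunkSeconds * sampleRate;
  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    // subarray creates a view, so no audio data is copied
    chunks.push(samples.subarray(start, Math.min(start + chunkSize, samples.length)));
  }
  return chunks;
}

/** Number of chunks needed for a recording of the given duration. */
function chunkCount(durationSeconds: number, chunkSeconds: number = CHUNK_SECONDS): number {
  return Math.ceil(durationSeconds / chunkSeconds);
}
```

Under these assumptions a 10-minute file needs `chunkCount(600)`, that is 20 chunks, and a one-hour file needs 120, which is why run time scales with length.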

Is it free, and can I use the results commercially?

Yes. The tool is free and runs locally. Whisper is released by OpenAI under the MIT license and Transformers.js under Apache-2.0, both of which permit commercial use, so the transcript is yours to use.

How accurate is it?

Accuracy depends on the model and the audio. Clear speech transcribes well; heavy noise, overlapping speakers, or strong accents lower accuracy. For important Japanese audio, prefer the Accurate model and review the output.

Transcribe audio to text, free & private.

Transcribe audio to text in your browser

This tool turns an audio file into text — plain transcript, timestamped lines, or ready-to-use subtitles (SRT / VTT) — using OpenAI's Whisper model running directly on your device. Drop in an MP3, WAV, M4A, OGG, FLAC, or WebM file and get the text back without uploading anything. Your audio never leaves your browser; only the AI model is downloaded (once) from a CDN, then everything runs locally.

How it works

The tool runs an open-source speech-recognition model — Whisper (OpenAI) or the lightweight Moonshine (Useful Sensors), both MIT-licensed — in your browser through Transformers.js, inside a Web Worker so the page never freezes. Your file is decoded and down-sampled to 16 kHz mono audio, split into 30-second chunks, and transcribed chunk by chunk. You pick the model that matches your language and quality needs:

Model	Languages	First download	Subtitles	Best for
Fast (`whisper-tiny.en`)	English only	~120 MB	Yes	Quick English drafts, low-power devices
Balanced (`whisper-base`)	Multilingual, incl. Japanese	~200 MB	Yes	Everyday default
Accurate (`whisper-large-v3-turbo`)	Multilingual, incl. Japanese	~760 MB	Yes	Highest quality; WebGPU recommended
Ultra-light (`moonshine-tiny`)	English only	~75 MB	No	Short English clips, fastest, plain text only
Light (`moonshine-base`)	English only	~155 MB	No	Short English clips, a bit more accurate

The two Moonshine models are lightweight options built for on-device English speech. They return plain text only (no timestamps, so no SRT/VTT) and are meant for short clips rather than long recordings. For Japanese, or when you need subtitles or long-form audio, use a Whisper model.
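The trade-offs in the table can be expressed as a small lookup, sketched below. The `ModelInfo` shape and `smallestSuitableModel` are hypothetical, the sizes are the approximate figures from the table, and "smallest download that fits" is only one possible heuristic: it ignores quality, which is the reason to choose Accurate.

```typescript
// Hypothetical lookup built from the model table; sizes are approximate.
interface ModelInfo {
  id: string;
  label: string;
  englishOnly: boolean;
  downloadMb: number;
  subtitles: boolean; // true when the model returns timestamps (needed for SRT/VTT)
}

const MODELS: ModelInfo[] = [
  { id: "whisper-tiny.en", label: "Fast", englishOnly: true, downloadMb: 120, subtitles: true },
  { id: "whisper-base", label: "Balanced", englishOnly: false, downloadMb: 200, subtitles: true },
  { id: "whisper-large-v3-turbo", label: "Accurate", englishOnly: false, downloadMb: 760, subtitles: true },
  { id: "moonshine-tiny", label: "Ultra-light", englishOnly: true, downloadMb: 75, subtitles: false },
  { id: "moonshine-base", label: "Light", englishOnly: true, downloadMb: 155, subtitles: false },
];

interface Needs {
  language: string; // language code such as "en" or "ja"
  needsSubtitles: boolean;
}

/** Return the smallest-download model that supports the language and output. */
function smallestSuitableModel(needs: Needs): ModelInfo | undefined {
  const suitable = MODELS.filter(
    (m) =>
      (needs.language === "en" || !m.englishOnly) &&
      (!needs.needsSubtitles || m.subtitles),
  );
  // filter() returned a new array, so sorting does not reorder MODELS
  return suitable.sort((a, b) => a.downloadMb - b.downloadMb)[0];
}
```

With this heuristic, Japanese audio with subtitles maps to Balanced, English plain text maps to Ultra-light, and English subtitles map to Fast.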

Because the model executes locally:

Your audio never leaves your computer — nothing is sent to a server.
After the first download, the model is cached and works offline.
WebGPU browsers (recent Chrome, Edge) run much faster than the CPU (WebAssembly) fallback.
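A page can decide between WebGPU and the WebAssembly fallback with a check along these lines. This is a sketch, not the tool's code: `pickDevice` and the `NavigatorLike` type are hypothetical, while `navigator.gpu.requestAdapter()` is the standard WebGPU entry point. The returned strings are meant to match the device names Transformers.js accepts, which is an assumption to verify against its documentation.

```typescript
// Hypothetical helper; takes a navigator-like object so it can run outside a browser.
type Device = "webgpu" | "wasm";

interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

interface NavigatorLike {
  gpu?: GpuLike;
}

/** Prefer WebGPU when an adapter is available; otherwise fall back to WASM. */
async function pickDevice(nav: NavigatorLike): Promise<Device> {
  if (!nav.gpu) return "wasm"; // browser does not expose WebGPU at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null means no usable GPU adapter
  } catch {
    return "wasm"; // treat any failure as "WebGPU unavailable"
  }
}
```

In a browser this would be called as `pickDevice(navigator as NavigatorLike)` before the model is loaded.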

Steps

Drop an audio file onto the upload area (or click to choose one).
Pick a model — Balanced is a good multilingual default; use Accurate for the best Japanese quality, or Fast for quick English.
For multilingual models, choose the language (or leave it on Auto-detect).
Click Transcribe. On the first run of each model, the browser downloads it — you'll see a progress percentage.
When it finishes, switch between Text, Timestamped, SRT, and VTT.
Copy or Download the format you need.

Example: upload a 10-minute interview recording (interview.m4a) → download interview.srt, a subtitle file you can load straight into a video editor.
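The file naming in that example (interview.m4a becoming interview.srt) can be sketched as a small helper. The function name is hypothetical, and the tool may name its downloads differently.

```typescript
// Hypothetical helper: derive the download name from the uploaded file's name.
type OutputFormat = "txt" | "srt" | "vtt";

function outputFileName(inputName: string, format: OutputFormat): string {
  const dot = inputName.lastIndexOf(".");
  // dot > 0 keeps names like ".hidden" intact instead of producing an empty base
  const base = dot > 0 ? inputName.slice(0, dot) : inputName;
  return `${base}.${format}`;
}
```

Only the last extension is replaced, so `outputFileName("team.call.mp3", "vtt")` gives `team.call.vtt`.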

Output formats

Format	Contains	Best for
Text	Plain transcript, no timings	Notes, articles, copy-paste
Timestamped	`[start → end] text` per segment	Skimming, meeting minutes, quoting
SRT	Numbered subtitle cues with `,` millisecond separator	Video editors, most players
VTT	WebVTT cues with `.` millisecond separator	HTML5 `<track>`, web video
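The difference between the two subtitle formats is small, as the sketch below shows: SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header and uses a period. The `Segment` shape and function names are assumptions for illustration, with times given in seconds.

```typescript
// Hypothetical segment shape: start and end in seconds, plus the spoken text.
interface Segment {
  start: number;
  end: number;
  text: string;
}

/** Format seconds as HH:MM:SS<sep>mmm, with "," for SRT or "." for WebVTT. */
function formatTime(seconds: number, msSeparator: "," | "."): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width: number = 2) => String(n).padStart(width, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)}${msSeparator}${pad(ms, 3)}`;
}

/** SRT: numbered cues separated by blank lines. */
function toSrt(segments: Segment[]): string {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${formatTime(seg.start, ",")} --> ${formatTime(seg.end, ",")}\n${seg.text.trim()}\n`,
    )
    .join("\n");
}

/** WebVTT: a "WEBVTT" header line, then cues separated by blank lines. */
function toVtt(segments: Segment[]): string {
  const cues = segments.map(
    (seg) =>
      `${formatTime(seg.start, ".")} --> ${formatTime(seg.end, ".")}\n${seg.text.trim()}\n`,
  );
  return `WEBVTT\n\n${cues.join("\n")}`;
}
```

Both functions take the same segments, so switching formats after transcription only changes the rendering, not the recognized text.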

When to use this vs. a server tool

Situation	Best choice
Sensitive or private recordings	This tool — the audio never leaves your browser
No account / no upload wanted	This tool — fully client-side, free
Subtitles for a video	This tool — export SRT or VTT directly
Hundreds of hours, automated pipeline	A server/API tool — batch throughput beyond one browser

Tips for the best transcript

Clear speech and low background noise transcribe most accurately.
For Japanese or mixed-language audio, prefer the Accurate model and set the language explicitly.
If the first run feels slow, that's the one-time model download; the next file is much faster.
Long files take longer because audio is processed in 30-second chunks — a WebGPU browser helps a lot here.

Everything here runs in your browser. Your audio is never uploaded — that's the whole point.
