Transcribe audio to text in your browser
This tool turns an audio file into text (plain transcript, timestamped lines, or ready-to-use SRT / VTT subtitles) using an open-source speech-recognition model, OpenAI's Whisper or the lighter Moonshine, running directly on your device. Drop in an MP3, WAV, M4A, OGG, FLAC, or WebM file and get the text back without uploading anything. Only the AI model is downloaded (once) from a CDN; after that, everything runs locally and your audio never leaves your browser.
How it works
The tool runs an open-source speech-recognition model — Whisper (OpenAI) or the lightweight Moonshine (Useful Sensors), both MIT-licensed — in your browser through Transformers.js, inside a Web Worker so the page never freezes. Your file is decoded and down-sampled to 16 kHz mono audio, split into 30-second chunks, and transcribed chunk by chunk. You pick the model that matches your language and quality needs:
| Model | Languages | First download | Subtitles | Best for |
|---|---|---|---|---|
| Fast (whisper-tiny.en) | English only | ~120 MB | Yes | Quick English drafts, low-power devices |
| Balanced (whisper-base) | Multilingual, incl. Japanese | ~200 MB | Yes | Everyday default |
| Accurate (whisper-large-v3-turbo) | Multilingual, incl. Japanese | ~760 MB | Yes | Highest quality; WebGPU recommended |
| Ultra-light (moonshine-tiny) | English only | ~75 MB | No | Short English clips, fastest, plain text only |
| Light (moonshine-base) | English only | ~155 MB | No | Short English clips, a bit more accurate |
The two Moonshine models are an ultra-light option built for on-device English speech. They return plain text only (no timestamps, so no SRT/VTT) and are meant for short clips rather than long recordings. For Japanese, or when you need subtitles or long-form audio, use a Whisper model.
Because the model executes locally:
- Your audio never leaves your computer — nothing is sent to a server.
- After the first download, the model is cached and works offline.
- In browsers with WebGPU (recent Chrome, Edge), transcription runs much faster than on the CPU (WebAssembly) fallback.
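The preprocessing described above (mono down-mix, then 30-second chunks at 16 kHz) can be sketched roughly as follows. This is a simplified illustration, not the tool's actual code: the names `downmixToMono` and `splitIntoChunks` are hypothetical, the chunks do not overlap, and resampling to 16 kHz is assumed to have already happened during decoding.

```typescript
// Values taken from the description above: 16 kHz mono, 30-second chunks.
const SAMPLE_RATE = 16_000;
const CHUNK_SECONDS = 30;

/** Average all channels into one mono channel (channels assumed equal length). */
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const mono = new Float32Array(channels[0].length);
  for (const channel of channels) {
    for (let i = 0; i < mono.length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

/** Split mono samples into consecutive chunks; the last chunk may be shorter. */
function splitIntoChunks(
  samples: Float32Array,
  sampleRate: number = SAMPLE_RATE,
  chunkSeconds: number = CHUNK_SECONDS,
): Float32Array[] {
  const chunkSize = sampleRate * chunkSeconds;
  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    // subarray creates a view, so no audio data is copied here.
    chunks.push(samples.subarray(start, start + chunkSize));
  }
  return chunks;
}
```

Under these assumptions, a 10-minute recording (600 seconds) produces 20 chunks, which is why long files take proportionally longer.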
Steps
1. Drop an audio file onto the upload area (or click to choose one).
2. Pick a model: Balanced is a good multilingual default; use Accurate for the best Japanese quality, or Fast for quick English.
3. For multilingual models, choose the language (or leave it on Auto-detect).
4. Click Transcribe. On the first run of each model, the browser downloads it, and you'll see a progress percentage.
5. When it finishes, switch between Text, Timestamped, SRT, and VTT.
6. Copy or Download the format you need.
Example: upload a 10-minute interview recording (interview.m4a) → download interview.srt, a subtitle file you can load straight into a video editor.
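The model-picking advice above can be written down as a small decision function. This is only a sketch of the guidance in the model table: `suggestModel` and the `Needs` fields are hypothetical names, and the model identifiers are the short labels from the table rather than exact download IDs.

```typescript
type ModelId =
  | "whisper-tiny.en"
  | "whisper-base"
  | "whisper-large-v3-turbo"
  | "moonshine-tiny"
  | "moonshine-base";

interface Needs {
  english: boolean;         // audio is English only
  needsTimestamps: boolean; // Timestamped, SRT, or VTT output wanted
  shortClip: boolean;       // short clip rather than a long recording
  preferQuality: boolean;   // accept a larger download for better accuracy
}

function suggestModel(needs: Needs): ModelId {
  // Non-English audio needs a multilingual Whisper model.
  if (!needs.english) {
    return needs.preferQuality ? "whisper-large-v3-turbo" : "whisper-base";
  }
  // Moonshine returns plain text only and targets short clips,
  // so timestamps or long recordings call for Whisper.
  if (needs.needsTimestamps || !needs.shortClip) {
    return needs.preferQuality ? "whisper-large-v3-turbo" : "whisper-tiny.en";
  }
  // Short English clip, plain text only: the Moonshine models fit.
  return needs.preferQuality ? "moonshine-base" : "moonshine-tiny";
}
```

For the interview example, a long recording that needs subtitles, the function returns a Whisper model in every case.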
Output formats
| Format | Contains | Best for |
|---|---|---|
| Text | Plain transcript, no timings | Notes, articles, copy-paste |
| Timestamped | [start → end] text per segment | Skimming, meeting minutes, quoting |
| SRT | Numbered subtitle cues; comma before the milliseconds (00:00:01,500) | Video editors, most players |
| VTT | WebVTT cues; period before the milliseconds (00:00:01.500) | HTML5 <track>, web video |
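To make the difference between the two subtitle formats concrete, here is a minimal sketch of how timed segments could be serialized to SRT and VTT. The `Segment` shape and the function names are assumptions for illustration; the tool's own exporter may differ in details.

```typescript
// Hypothetical segment shape: start and end times in seconds.
interface Segment {
  start: number;
  end: number;
  text: string;
}

/** Format seconds as HH:MM:SS<sep>mmm, with "," for SRT and "." for VTT. */
function formatTimestamp(seconds: number, msSeparator: "," | "."): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${msSeparator}${pad(ms, 3)}`;
}

/** SRT: numbered cues, comma before the milliseconds. */
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${formatTimestamp(seg.start, ",")} --> ${formatTimestamp(seg.end, ",")}\n${seg.text.trim()}\n`)
    .join("\n");
}

/** VTT: "WEBVTT" header, unnumbered cues, period before the milliseconds. */
function toVtt(segments: Segment[]): string {
  const cues = segments.map((seg) =>
    `${formatTimestamp(seg.start, ".")} --> ${formatTimestamp(seg.end, ".")}\n${seg.text.trim()}\n`);
  return `WEBVTT\n\n${cues.join("\n")}`;
}
```

Apart from the header and the cue numbers, the only difference in this sketch is the millisecond separator, which is why converting between the two formats is straightforward.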
When to use this vs. a server tool
| Situation | Best choice |
|---|---|
| Sensitive or private recordings | This tool — the audio never leaves your browser |
| No account / no upload wanted | This tool — fully client-side, free |
| Subtitles for a video | This tool — export SRT or VTT directly |
| Hundreds of hours, automated pipeline | A server/API tool — batch throughput beyond one browser |
Tips for the best transcript
- Clear speech and low background noise transcribe most accurately.
- For Japanese or mixed-language audio, prefer the Accurate model and set the language explicitly.
- If the first run feels slow, that's the one-time model download; the next file is much faster.
- Long files take longer because audio is processed in 30-second chunks — a WebGPU browser helps a lot here.
Everything here runs in your browser. Your audio is never uploaded — that's the whole point.
