Audio Transcription — Speech to Text

What This Tool Does

This tool converts the spoken content of an audio file into a plain-text transcript — without uploading anything to a server. It uses a compact speech-recognition model compiled to WebAssembly, which runs directly inside your browser tab. You get the full transcript in a scrollable, editable text panel that you can copy or download.

Supported input formats include MP3, WAV, M4A (AAC), OGG Vorbis, and WebM Opus — the most common formats produced by phones, voice recorders, video editors, and meeting apps.

How Does It Work?

The engine is a modern speech-recognition model distilled for browser inference:

Audio file → decode to PCM → chunk into 30-second windows
  → mel-spectrogram → encoder (transformer) → decoder (beam search)
  → token sequence → detokenize → plain text

Each 30-second window is processed sequentially. Timestamps are computed relative to the start of each window and then shifted by that window's position in the recording, so the output order matches the original recording timeline. No cloud API is involved at any stage.
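The windowing step can be illustrated with a minimal sketch. It assumes decoded PCM samples in a plain list and a 16 kHz sample rate; the sample rate, the function names, and the data layout are illustrative assumptions, not details of the tool's actual engine.

```python
from typing import Iterator, List, Tuple

SAMPLE_RATE = 16_000   # assumed sample rate in Hz (not stated in the text)
WINDOW_SECONDS = 30    # window length described above


def iter_windows(samples: List[float]) -> Iterator[Tuple[float, List[float]]]:
    """Yield (window_start_in_seconds, window_samples) for each 30-second window."""
    window_len = SAMPLE_RATE * WINDOW_SECONDS
    for start in range(0, len(samples), window_len):
        yield start / SAMPLE_RATE, samples[start:start + window_len]


def to_absolute(window_start: float, relative_time: float) -> float:
    """Shift a timestamp measured from the window start onto the recording timeline."""
    return window_start + relative_time


# 70 seconds of silence stands in for decoded PCM audio.
pcm = [0.0] * (70 * SAMPLE_RATE)
window_starts = [start for start, _ in iter_windows(pcm)]
print(window_starts)
```

The final window is shorter than 30 seconds when the recording length is not a multiple of the window size; a real engine would typically pad it before computing the spectrogram.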

How Does Language Detection Work?

The model auto-detects the spoken language from the first 30 seconds of audio. If detection is wrong — common with short clips or heavy accents — use the language dropdown to force a specific language before transcribing.

Setting	When to Use
Auto-detect	Monolingual recordings ≥ 30 sec
Force language	Short clips, strong regional accents
English (US)	Podcasts, meetings, dictation
English (UK/AU)	British/Australian accented content

What Are Common Use Cases?

Meeting notes. Drop in a recorded Zoom or Teams call and get a rough transcript to clean up into meeting minutes. Most 1-hour meetings produce a 5,000–8,000 word transcript.

Podcast show notes. Transcribe an episode to pull quotes, create timestamps, or generate an SEO-friendly episode description.

Video captions. Extract dialogue from a video’s audio track, then format it as SRT subtitle entries. Combine with a free video editor to add closed captions.
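Formatting transcript text as SRT entries can be scripted. The sketch below assumes you already have start and end times in seconds for each caption (for example, noted by hand while reviewing the video); the segment tuples and function names are hypothetical and not something the tool exports.

```python
from typing import List, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp layout used by SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, caption_text) tuples."""
    entries = []
    for index, (start, end, text) in enumerate(segments, start=1):
        entries.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Entries are separated by a blank line.
    return "\n".join(entries)


print(to_srt([(0.0, 2.5, "Hello."), (2.5, 5.0, "Welcome back.")]))
```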

Dictation clean-up. Voice memos recorded on an iPhone or Android phone can be transcribed and then edited without retyping.

Academic research. Qualitative researchers can transcribe interview recordings without sending sensitive data to a third-party transcription service.

Frequently Asked Questions

Does my audio get sent to a server? No. The transcription model runs entirely inside your browser using WebAssembly. Your audio file never leaves your device. There is no backend, no API key, and no logging. This makes it suitable for confidential recordings — medical, legal, or personal.

Which audio formats are supported? MP3, WAV, M4A, OGG, and WebM. These cover the output of virtually every phone voice recorder, desktop DAW, and video conferencing app. If your file is in a format not listed here (such as FLAC, AIFF, or WMA), convert it to MP3 at 128 kbps first. Bitrates above 192 kbps don’t improve transcription accuracy.

How long does transcription take? Transcription speed depends on your device’s CPU (and GPU if WebGPU is available). On a modern laptop, a 5-minute recording typically finishes in 1–2 minutes. On a phone or older hardware, expect 4–6 minutes for the same file.
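To get a rough idea for longer files, the figures above can be scaled by recording length. This sketch assumes processing time grows linearly with duration, which is plausible because windows are processed one after another but is not a measured guarantee; the device labels and function name are illustrative.

```python
from typing import Tuple

# Figures quoted above for a 5-minute recording, in minutes of processing time.
REFERENCE_AUDIO_MINUTES = 5.0
PROCESSING_MINUTES = {
    "laptop": (1.0, 2.0),   # modern laptop
    "phone": (4.0, 6.0),    # phone or older hardware
}


def estimate_processing_minutes(audio_minutes: float, device: str) -> Tuple[float, float]:
    """Return a (low, high) estimate, assuming time scales linearly with length."""
    low, high = PROCESSING_MINUTES[device]
    scale = audio_minutes / REFERENCE_AUDIO_MINUTES
    return low * scale, high * scale


print(estimate_processing_minutes(60, "laptop"))  # one-hour meeting on a laptop
print(estimate_processing_minutes(60, "phone"))   # same file on a phone
```

Under that assumption, a one-hour recording would take roughly 12 to 24 minutes on a modern laptop and 48 to 72 minutes on a phone or older hardware.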

How accurate is the transcription? For clear, close-mic English speech in a quiet environment, word error rates typically fall between 5% and 10%. Background music, strong accents, technical jargon, and overlapping speakers all increase error rates. Always proofread before publishing.
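Word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into a correct reference, divided by the number of words in the reference. If you want to check accuracy on your own recordings, a minimal sketch looks like the following; the simple lowercasing and whitespace splitting are simplifying assumptions, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog today"
transcript = "the quick brown fox jumped over the lazy dog today"
print(word_error_rate(reference, transcript))  # one wrong word out of ten
```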

Can I transcribe multiple speakers? The current engine outputs a single continuous transcript without speaker labels. After copying the text, you can manually insert speaker names based on context or use a separate diarization tool.

Is there a file size limit? There’s no hard cap, but in-browser processing of very large files (over 200 MB) can exhaust browser memory on older devices. For recordings over 2 hours, splitting them into 30-minute segments before loading them into the tool is strongly recommended.
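When planning a split, it helps to list the cut points first. The sketch below only computes the segment boundaries in minutes; the actual cutting would be done in an audio editor of your choice, and the function name is illustrative.

```python
from typing import List, Tuple


def segment_boundaries(total_minutes: float,
                       segment_minutes: float = 30.0) -> List[Tuple[float, float]]:
    """Return (start, end) pairs in minutes covering the whole recording."""
    if segment_minutes <= 0:
        raise ValueError("segment_minutes must be positive")
    boundaries = []
    start = 0.0
    while start < total_minutes:
        end = min(start + segment_minutes, total_minutes)
        boundaries.append((start, end))
        start = end
    return boundaries


# A recording of 2 hours 15 minutes splits into five segments.
print(segment_boundaries(135))
```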
