Runs local · no upload

Audio Transcription — Speech to Text

Turn spoken words into editable text — entirely in your browser, no server uploads, no account required.

Privacy

All calculations run directly in your browser. No data is sent to any server.

Paste or drop an audio file and get a text transcript in seconds. The transcription engine runs on-device using WebAssembly-powered ML, so your audio never leaves your computer.

How do you use this tool?

  1. Click the upload area or drag and drop an audio file (MP3, WAV, M4A, OGG, WebM supported).
  2. Select the spoken language if auto-detection isn't accurate enough.
  3. Click Transcribe and wait while the model processes your file locally.
  4. Review the transcript in the output panel — edit inline if needed.
  5. Copy the text or download it as a .txt file.

What This Tool Does

This tool converts the spoken content of an audio file into a plain-text transcript — without uploading anything to a server. It uses a compact speech-recognition model compiled to WebAssembly, which runs directly inside your browser tab. You get the full transcript in a scrollable, editable text panel that you can copy or download.

Supported input formats include MP3, WAV, M4A (AAC), OGG Vorbis, and WebM Opus — the most common formats produced by phones, voice recorders, video editors, and meeting apps.

How Does It Work?

The engine is a modern speech-recognition model distilled for browser inference:

Audio file → decode to PCM → chunk into 30-second windows
  → mel-spectrogram → encoder (transformer) → decoder (beam search)
  → token sequence → detokenize → plain text

The 30-second windows are processed one after another. Timestamps inside a window are relative to that window's start, so adding the window's start offset gives the position in the original recording, and the output order matches the recording timeline. No cloud API is involved at any stage.
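As a rough illustration of the chunking step and the timestamp offset, here is a minimal Python sketch. It is not the tool's actual code: the 16 kHz sample rate and the function names `chunk_pcm` and `to_absolute` are assumptions made for the example.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # assumed sample rate in Hz (not stated in the text above)
WINDOW_SECONDS = 30    # window length described in the pipeline


def chunk_pcm(samples: List[float],
              sample_rate: int = SAMPLE_RATE,
              window_seconds: int = WINDOW_SECONDS) -> List[Tuple[float, List[float]]]:
    """Split decoded PCM samples into consecutive windows.

    Returns (start_offset_seconds, window_samples) pairs in timeline order.
    The final window may be shorter than window_seconds.
    """
    window_len = sample_rate * window_seconds
    chunks = []
    for start in range(0, len(samples), window_len):
        chunks.append((start / sample_rate, samples[start:start + window_len]))
    return chunks


def to_absolute(chunk_start: float, relative_seconds: float) -> float:
    """Convert a timestamp relative to a window start into recording time."""
    return chunk_start + relative_seconds
```

For example, a phrase detected 4.5 seconds into the second window (which starts at 30 seconds) sits at 34.5 seconds in the original recording.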

How Does Language Detection Work?

The model auto-detects the spoken language from the first 30 seconds of audio. If detection is wrong — common with short clips or heavy accents — use the language dropdown to force a specific language before transcribing.

Setting            When to Use
Auto-detect        Monolingual recordings ≥ 30 sec
Force language     Short clips, strong regional accents
English (US)       Podcasts, meetings, dictation
English (UK/AU)    British/Australian accented content

What Are Common Use Cases?

Meeting notes. Drop in a recorded Zoom or Teams call and get a rough transcript to clean up into meeting minutes. A one-hour meeting typically produces a transcript of roughly 5,000–8,000 words.

Podcast show notes. Transcribe an episode to pull quotes, create timestamps, or generate an SEO-friendly episode description.

Video captions. Extract dialogue from a video’s audio track, then format it as SRT subtitle entries. Combine with a free video editor to add closed captions.
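Formatting transcript lines as SRT entries can be sketched in a few lines of Python. The tool's output is plain text, so the start and end times below are values you would supply yourself; the function names `srt_timestamp` and `srt_entry` are hypothetical and exist only for this example.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_entry(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue: index line, time range line, then the caption text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


# Cues in an .srt file are separated by a blank line.
cues = [
    srt_entry(1, 0.0, 2.5, "Welcome back to the show."),
    srt_entry(2, 2.5, 6.0, "Today we are talking about transcription."),
]
srt_text = "\n".join(cues)
```

Writing `srt_text` to a file with an .srt extension gives something most video editors should be able to import as captions.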

Dictation clean-up. Voice memos recorded on iPhone or Android can be transcribed and then edited without retyping.

Academic research. Qualitative researchers can transcribe interview recordings without sending sensitive data to a third-party transcription service.

Frequently Asked Questions

Does my audio get sent to a server? No. The transcription model runs entirely inside your browser using WebAssembly. Your audio file never leaves your device. There is no backend, no API key, and no logging. This makes it suitable for confidential recordings — medical, legal, or personal.

Which audio formats are supported? MP3, WAV, M4A, OGG, and WebM. These cover the output of virtually every phone voice recorder, desktop DAW, and video conferencing app. If your file is in another format (FLAC, AIFF, WMA), convert it to MP3 at 128 kbps first. Bitrates above 192 kbps don't improve transcription accuracy, so a larger file gains nothing.

How long does transcription take? Transcription speed depends on your device’s CPU (and GPU if WebGPU is available). On a modern laptop, a 5-minute recording typically finishes in 1–2 minutes. On a phone or older hardware, expect 4–6 minutes for the same file.
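The figures above can be turned into a rough estimate for other recording lengths. This Python sketch assumes processing time scales linearly with audio length, which is a simplification; the function names are made up for the example.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimate_processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate processing time, assuming it scales linearly with audio length."""
    return audio_seconds * rtf


# From the figures above: a 5-minute file finishing in 1 to 2 minutes on a laptop.
laptop_fast = real_time_factor(60, 300)
laptop_slow = real_time_factor(120, 300)

# Rough range, in minutes, for a 60-minute recording on the same laptop.
low_minutes = estimate_processing_seconds(3600, laptop_fast) / 60
high_minutes = estimate_processing_seconds(3600, laptop_slow) / 60
```

Under that assumption, a one-hour recording on a modern laptop would take somewhere around 12 to 24 minutes. Real timings will vary with the device and whatever else it is running.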

How accurate is the transcription? For clear, close-mic English speech in a quiet environment, word error rates typically fall between 5% and 10%, meaning roughly 5 to 10 words in every 100 need correcting. Background music, strong accents, technical jargon, and overlapping speakers all increase error rates. Always proofread before publishing.
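If you want to measure the word error rate on your own recordings, one common approach is to compare the transcript against a hand-corrected reference using word-level edit distance. The Python sketch below shows that calculation. It is an illustration only, not part of the tool, and it normalizes nothing beyond lowercasing and whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a five-word reference gives a rate of 0.2, or 20%. Punctuation attached to a word counts as a difference here, so strip it first if you only care about the words themselves.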

Can I transcribe multiple speakers? The current engine outputs a single continuous transcript without speaker labels. After copying the text, you can manually insert speaker names based on context or use a separate diarization tool.

Is there a file size limit? There’s no hard cap, but in-browser processing of very large files (over 200 MB) can exhaust browser memory on older devices. For recordings over 2 hours, splitting into 30-minute segments before uploading is strongly recommended.
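To plan the split for a long recording, the segment boundaries can be worked out ahead of time. This Python sketch only computes the start and end times. Cutting the audio itself is left to whatever editor you use, and the function name `segment_boundaries` is hypothetical.

```python
from typing import List, Tuple


def segment_boundaries(total_seconds: float,
                       segment_seconds: float = 30 * 60) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for consecutive segments.

    The last segment is shorter when the total is not an exact multiple
    of segment_seconds.
    """
    if total_seconds <= 0 or segment_seconds <= 0:
        raise ValueError("durations must be positive")
    total = float(total_seconds)
    bounds = []
    start = 0.0
    while start < total:
        end = min(start + segment_seconds, total)
        bounds.append((start, end))
        start = end
    return bounds
```

A 2.5-hour recording (9,000 seconds) comes out as five 30-minute segments. Cutting at a pause near each boundary, rather than exactly on it, avoids splitting a word in half.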
