Local transcription and transcript formatting tools. Two paths to get timestamped markdown from audio/video:
- yt-dlp path — extract auto-captions from YouTube, format into timestamped markdown
- faster-whisper path — transcribe audio locally using faster-whisper, format into timestamped markdown
Both paths produce the same output format: markdown with `[HH:MM:SS](url?t=...)` linked timestamps every ~40 seconds, plus chapter markers from video metadata.
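The linked-timestamp format can be sketched in Python as follows. This is illustrative only; the exact shape of the `t=` query parameter the gist's scripts emit is an assumption:

```python
def timestamp_link(seconds: float, url: str) -> str:
    """Render a position as a [HH:MM:SS](url?t=Ns) markdown link.

    Assumes a YouTube-style `t=` query parameter; the gist's scripts
    may build the link slightly differently.
    """
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    sep = "&" if "?" in url else "?"  # append to an existing query string if present
    return f"[{h:02d}:{m:02d}:{s:02d}]({url}{sep}t={total}s)"

print(timestamp_link(754, "https://www.youtube.com/watch?v=VIDEO_ID"))
# [00:12:34](https://www.youtube.com/watch?v=VIDEO_ID&t=754s)
```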
Requires Python 3.11+ and ffmpeg. The venv is only needed for the faster-whisper path, which requires Python 3.11 or 3.12.
```bash
# macOS prerequisites
brew install python@3.11 ffmpeg yt-dlp

# For faster-whisper (clone this gist, then):
./setup.sh

# If your default python3 is too new (3.14+), point to 3.11:
PYTHON=python3.11 ./setup.sh
```

| File | Purpose |
|---|---|
| `yt-json3-to-markdown.py` | Convert yt-dlp JSON3 captions + metadata into timestamped markdown |
| `whisper-json-to-markdown.py` | Convert faster-whisper JSON segments + metadata into timestamped markdown |
| `transcribe.py` | Full-featured faster-whisper transcription with `--format json` for timestamps |
| `transcribe_faster.py` | Quick transcription script (plain text only) |
| `setup.sh` | One-command venv creation and dependency install |
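At its core, the json3-to-markdown conversion groups caption events into ~40-second chunks and prefixes each chunk with a linked timestamp. The sketch below is a simplified illustration, not the gist's actual `yt-json3-to-markdown.py`: it assumes the json3 caption layout of `events` entries carrying `tStartMs` and `segs[].utf8`, and it omits the chapter markers the real script takes from metadata:

```python
import json

INTERVAL_S = 40  # start a new timestamped chunk roughly every 40 seconds

def json3_to_markdown(json3_text: str, url: str) -> str:
    """Group json3 caption events into ~40 s chunks of timestamped markdown."""
    events = json.loads(json3_text).get("events", [])
    lines, chunk, chunk_start = [], [], None
    for ev in events:
        # Each event's text is split across "segs"; join and clean it up.
        text = "".join(seg.get("utf8", "") for seg in ev.get("segs", [])).strip()
        if not text:
            continue
        start = ev["tStartMs"] / 1000
        if chunk_start is None:
            chunk_start = start
        chunk.append(text)
        if start - chunk_start >= INTERVAL_S:
            lines.append(_emit(chunk_start, chunk, url))
            chunk, chunk_start = [], None
    if chunk:
        lines.append(_emit(chunk_start, chunk, url))
    return "\n\n".join(lines)

def _emit(start: float, chunk: list[str], url: str) -> str:
    total = int(start)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"[{h:02d}:{m:02d}:{s:02d}]({url}?t={total}s) " + " ".join(chunk)
```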
yt-dlp path:

```bash
# 1. Get metadata
yt-dlp --print-json --skip-download "VIDEO_URL" > metadata.json

# 2. Get captions
yt-dlp --write-auto-sub --sub-lang en --sub-format json3 --skip-download "VIDEO_URL"
# produces VIDEO_ID.en.json3

# 3. Format into markdown
python3 yt-json3-to-markdown.py VIDEO_ID.en.json3 metadata.json transcript.md
```

For the faster-whisper path, activate the venv first:

```bash
source venv/bin/activate
```
```bash
# 1. Transcribe with timestamps (produces whisper-segments.json)
python transcribe.py audio.opus --model medium --language en --format json --output whisper-segments.json

# 2. Format into markdown (needs metadata.json from yt-dlp)
python whisper-json-to-markdown.py whisper-segments.json metadata.json transcript.md
```
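Step 2 consumes the JSON that `transcribe.py` wrote. As a rough sketch, assuming `whisper-segments.json` holds a list of objects with faster-whisper's `start`/`end`/`text` segment fields (the real file layout, and how the script uses metadata.json, are assumptions), the per-segment formatting looks like:

```python
import json

def whisper_segments_to_markdown(path: str, url: str) -> str:
    """Format transcribed segments as timestamped markdown lines.

    Assumes `path` holds a JSON list of {"start": float, "end": float,
    "text": str} objects; the actual whisper-segments.json layout may differ.
    """
    with open(path, encoding="utf-8") as f:
        segments = json.load(f)
    lines = []
    for seg in segments:
        total = int(seg["start"])
        h, rem = divmod(total, 3600)
        m, s = divmod(rem, 60)
        lines.append(f"[{h:02d}:{m:02d}:{s:02d}]({url}?t={total}s) {seg['text'].strip()}")
    return "\n\n".join(lines)
```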
When you're done, deactivate the venv:

```bash
deactivate
```

| Model | Speed | Quality | Use case |
|---|---|---|---|
| `tiny` | Fastest | Lower | Quick preview, check if audio is usable |
| `base` | Fast | Good | Default — clear audio, native speakers |
| `small` | Medium | Better | Accented speech, some background noise |
| `medium` | Slow | High | Recommended for accuracy — poor audio, multiple speakers |
| `large-v3` | Slowest | Best | Critical transcriptions, heavy accents, noisy environments |
Tip: Start with base to preview, then re-run with medium or large-v3 if quality is insufficient. Each step up roughly doubles processing time.
"ffmpeg not found" — brew install ffmpeg
"No module named 'faster_whisper'" — Activate the venv first: source venv/bin/activate
Poor quality — Try a larger model: --model medium or --model large-v3
Python too new — faster-whisper needs 3.11 or 3.12. Use PYTHON=python3.11 ./setup.sh