@renoirb
Last active February 22, 2026 04:59
Gist: renoirb/9610a93371a294afebc20cb8698825dc
Voice note to text transcription using faster-whisper
.env.example

# Faster-Whisper is currently well supported only on 3.11
PYTHON=python3.11
.gitignore

venv/
__pycache__/
*.pyc
# Machine specific
.env
# Audio files and transcription output
*.m4a
*.wav
*.txt
*.md
!README.md
!CLAUDE.md
# Archives
*.zip

Audio and Video Transcript Tools

Local transcription and transcript formatting tools. Two paths to get timestamped markdown from audio/video:

  1. yt-dlp path — extract auto-captions from YouTube, format into timestamped markdown
  2. faster-whisper path — transcribe audio locally using faster-whisper, format into timestamped markdown

Both paths produce the same output format: markdown with [HH:MM:SS](url?t=...) linked timestamps every ~40 seconds, with chapter markers from video metadata.
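As an illustration, output from either path looks roughly like this (VIDEO_ID and the text are placeholders, not real output; a `?t=` offset is omitted at 00:00:00):

```
## Introduction

[00:00:00](https://youtu.be/VIDEO_ID) Welcome, everyone. Today we are looking at...

[00:00:42](https://youtu.be/VIDEO_ID?t=42s) The first thing to understand is...
```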

Setup

Requires Python 3.11+ and ffmpeg. The venv is only needed for the faster-whisper path.

# macOS prerequisites
brew install python@3.11 ffmpeg yt-dlp

# For faster-whisper (clone this gist, then):
./setup.sh

# If your default python3 is too new (3.14+), point to 3.11:
PYTHON=python3.11 ./setup.sh

Scripts

| File | Purpose |
| --- | --- |
| yt-json3-to-markdown.py | Convert yt-dlp JSON3 captions + metadata into timestamped markdown |
| whisper-json-to-markdown.py | Convert faster-whisper JSON segments + metadata into timestamped markdown |
| transcribe.py | Full-featured faster-whisper transcription with --format json for timestamps |
| transcribe_faster.py | Quick transcription script (plain text only) |
| setup.sh | One-command venv creation and dependency install |

Usage: yt-dlp path (no venv needed)

# 1. Get metadata
yt-dlp --print-json --skip-download "VIDEO_URL" > metadata.json

# 2. Get captions
yt-dlp --write-auto-sub --sub-lang en --sub-format json3 --skip-download "VIDEO_URL"
# produces VIDEO_ID.en.json3

# 3. Format into markdown
python3 yt-json3-to-markdown.py VIDEO_ID.en.json3 metadata.json transcript.md
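For reference, the json3 file is a list of timed caption events; yt-json3-to-markdown.py reads only the fields shown below (values illustrative):

```json
{
  "events": [
    {
      "tStartMs": 1200,
      "segs": [
        { "utf8": "Welcome", "tOffsetMs": 0 },
        { "utf8": " everyone", "tOffsetMs": 480 }
      ]
    }
  ]
}
```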

Usage: faster-whisper path

source venv/bin/activate

# 1. Transcribe with timestamps (produces whisper-segments.json)
python transcribe.py audio.opus --model medium --language en --format json --output whisper-segments.json

# 2. Format into markdown (needs metadata.json from yt-dlp)
python whisper-json-to-markdown.py whisper-segments.json metadata.json transcript.md

deactivate
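The intermediate whisper-segments.json handed from step 1 to step 2 has this shape, matching what transcribe.py writes with --format json (values illustrative):

```json
{
  "language": "en",
  "language_probability": 0.98,
  "segments": [
    { "start": 0.0, "end": 4.2, "text": "Welcome everyone." },
    { "start": 4.2, "end": 9.7, "text": "Today we look at local transcription." }
  ]
}
```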

Whisper models

| Model | Speed | Quality | Use case |
| --- | --- | --- | --- |
| tiny | Fastest | Lower | Quick preview, check if audio is usable |
| base | Fast | Good | Default — clear audio, native speakers |
| small | Medium | Better | Accented speech, some background noise |
| medium | Slow | High | Recommended for accuracy — poor audio, multiple speakers |
| large-v3 | Slowest | Best | Critical transcriptions, heavy accents, noisy environments |

Tip: Start with base to preview, then re-run with medium or large-v3 if quality is insufficient. Each step up roughly doubles processing time.

Troubleshooting

"ffmpeg not found" — Install it: brew install ffmpeg

"No module named 'faster_whisper'" — Activate the venv first: source venv/bin/activate

Poor quality — Try a larger model: --model medium or --model large-v3

Python too new — faster-whisper needs 3.11 or 3.12. Use PYTHON=python3.11 ./setup.sh
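A small guard like the following mirrors that version constraint and fails fast before the heavier imports; the `supported` helper is illustrative, not part of the gist:

```python
import sys

def supported(version):
    """True when the interpreter is in the 3.11-3.12 range faster-whisper currently supports."""
    return (3, 11) <= tuple(version[:2]) <= (3, 12)

# Example: bail out early instead of failing deep inside the faster_whisper import.
# if not supported(sys.version_info):
#     raise SystemExit("Unsupported Python; use 3.11 or 3.12")
```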

Voice Note to Text

Transcribe voice recordings (m4a/wav/mp3) to text locally using faster-whisper on CPU. No cloud APIs needed.

Use Case

Record voice notes on phone, transfer to computer, run transcription to get text for editing in Obsidian or elsewhere.

Stack

  • Python 3.11 (pinned via .env, required for ML package compatibility)
  • faster-whisper — CTranslate2-based whisper, 3-4x faster than OpenAI's original
  • ffmpeg — system dependency for audio decoding

Setup

cp .env.example .env   # set PYTHON version if needed
./setup.sh             # creates venv, installs deps

Usage

source venv/bin/activate
python transcribe_faster.py recording.m4a              # quick, base model
python transcribe.py recording.m4a --model medium       # better quality

Output defaults to <input>.txt next to the input file (e.g. recording.m4a produces recording.txt); transcribe.py writes <input>.json instead when run with --format json. Override with --output path.txt in transcribe.py.
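Both scripts derive that default name with `pathlib`'s `with_suffix`; a minimal sketch of the naming rule (the `default_output` helper is illustrative, not in the gist):

```python
from pathlib import Path

def default_output(input_file: str, fmt: str = "text") -> Path:
    """Mirror the scripts' default naming: swap the audio suffix for .txt, or .json for --format json."""
    suffix = ".json" if fmt == "json" else ".txt"
    return Path(input_file).with_suffix(suffix)
```

So `default_output("recording.m4a")` yields `recording.txt`, and `default_output("audio.opus", "json")` yields `audio.json`.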

Key Files

| File | Role |
| --- | --- |
| setup.sh | Creates venv, installs requirements |
| .env / .env.example | Machine-specific Python version (gitignored / template) |
| requirements.txt | faster-whisper==1.2.1 |
| transcribe.py | Argparse script with --model, --language, --output |
| transcribe_faster.py | Minimal script with positional args |

What's Ignored

venv/, .env, *.m4a, *.wav, *.txt, *.zip, __pycache__/ — see .gitignore.

requirements.txt

faster-whisper==1.2.1
setup.sh

#!/bin/bash
set -e

PYTHON="${PYTHON:-python3}"
VENV_DIR="venv"

# Load machine-specific config (copy .env.example to .env)
if [ -f .env ]; then
  source .env
else
  echo "No .env file to read preferences, copy from .env.example"
  exit 1
fi

# Check prerequisites
if ! command -v "$PYTHON" &>/dev/null; then
  echo "Error: $PYTHON not found. Install Python 3.11+ first."
  echo "  macOS: brew install python@3.11"
  echo "  Linux: sudo apt-get install python3 python3-venv"
  exit 1
fi

if ! command -v ffmpeg &>/dev/null; then
  echo "Error: ffmpeg not found. Required for audio processing."
  echo "  macOS: brew install ffmpeg"
  echo "  Linux: sudo apt-get install ffmpeg"
  exit 1
fi

# Show Python version
echo "Using: $($PYTHON --version)"

# Create venv
if [ -d "$VENV_DIR" ]; then
  echo "venv already exists. Delete it first to recreate:"
  echo "  rm -rf $VENV_DIR && ./setup.sh"
  exit 1
fi

echo "Creating virtual environment..."
"$PYTHON" -m venv "$VENV_DIR"

echo "Installing dependencies..."
"$VENV_DIR/bin/pip" install --upgrade pip --quiet
"$VENV_DIR/bin/pip" install -r requirements.txt

# Make scripts executable
chmod +x transcribe.py transcribe_faster.py

echo ""
echo "Setup complete. Usage:"
echo ""
echo "  source venv/bin/activate"
echo "  python transcribe_faster.py your-audio.m4a"
echo ""
echo "Or specify a model (tiny/base/small/medium/large-v3):"
echo ""
echo "  python transcribe.py your-audio.m4a --model medium"
transcribe.py

#!/usr/bin/env python3
import argparse
import json
from pathlib import Path

from faster_whisper import WhisperModel


def main():
    parser = argparse.ArgumentParser(description='Transcribe audio using Faster-Whisper')
    parser.add_argument(
        'input',
        type=str,
        help='Input audio file',
    )
    parser.add_argument(
        '--model',
        type=str,
        default='base',
        choices=['tiny', 'base', 'small', 'medium', 'large-v2', 'large-v3'],
        help='Model size (default: base)',
    )
    parser.add_argument(
        '--language',
        type=str,
        default='en',
        help='Language code (default: en)',
    )
    parser.add_argument(
        '--format',
        type=str,
        default='text',
        choices=['text', 'json'],
        help='Output format: text (plain transcript) or json (segments with timestamps)',
    )
    parser.add_argument(
        '--output',
        type=str,
        help='Output file (default: input.txt or input.json based on format)',
    )
    args = parser.parse_args()

    # Initialize model
    print(f"Loading {args.model} model...", flush=True)
    model = WhisperModel(
        args.model,
        device="cpu",
        compute_type="int8",
    )

    # Transcribe
    print(f"Transcribing {args.input}...", flush=True)
    segments, info = model.transcribe(
        args.input,
        language=args.language,
    )

    # Determine output file
    if args.output:
        output_path = Path(args.output)
    else:
        suffix = '.json' if args.format == 'json' else '.txt'
        output_path = Path(args.input).with_suffix(suffix)

    # Collect segments (the iterator can only be consumed once)
    segment_list = []
    for segment in segments:
        segment_list.append({
            'start': segment.start,
            'end': segment.end,
            'text': segment.text.strip(),
        })

    # Write output
    with open(output_path, 'w', encoding='utf-8') as f:
        if args.format == 'json':
            json.dump({
                'language': info.language,
                'language_probability': info.language_probability,
                'segments': segment_list,
            }, f, indent=2, ensure_ascii=False)
        else:
            for seg in segment_list:
                f.write(seg['text'] + '\n')

    print(f"\nTranscription saved to: {output_path}")
    print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
    print(f"Segments: {len(segment_list)}")


if __name__ == '__main__':
    main()
transcribe_faster.py

#!/usr/bin/env python3
import sys
from pathlib import Path

from faster_whisper import WhisperModel

if len(sys.argv) < 2:
    print("Usage: python transcribe_faster.py <audio_file> [model] [language]")
    print("\nExamples:")
    print("  python transcribe_faster.py recording.m4a")
    print("  python transcribe_faster.py recording.m4a medium en")
    sys.exit(1)

input_file = sys.argv[1]
model_size = sys.argv[2] if len(sys.argv) > 2 else "base"
language = sys.argv[3] if len(sys.argv) > 3 else "en"

# Initialize model
print(f"Loading {model_size} model...")
model = WhisperModel(
    model_size,
    device="cpu",
    compute_type="int8",
)

# Transcribe
print(f"Transcribing {input_file}...")
segments, info = model.transcribe(
    input_file,
    language=language,
)

# Output file
output_file = Path(input_file).with_suffix('.txt')

# Write transcription
with open(output_file, 'w', encoding='utf-8') as f:
    for segment in segments:
        f.write(segment.text.strip() + '\n')

print(f"\nTranscription saved to: {output_file}")
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
whisper-json-to-markdown.py

#!/usr/bin/env python3
"""Process faster-whisper JSON segments into timestamped markdown transcript.

Usage:
    python3 whisper-json-to-markdown.py <whisper_json> <metadata_json> <output_file>

The whisper JSON is produced by: transcribe.py <audio> --format json
The metadata JSON is produced by: yt-dlp --print-json --skip-download <URL>
"""
import json
import sys

PARAGRAPH_INTERVAL_S = 40  # ~40 seconds between timestamp markers


def s_to_hms(s):
    """Convert seconds to HH:MM:SS string."""
    total_seconds = int(s)
    hours = total_seconds // 3600
    minutes = (total_seconds % 3600) // 60
    seconds = total_seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def s_to_url(base_url, s):
    """Create YouTube URL with timestamp."""
    total_seconds = int(s)
    if total_seconds == 0:
        return base_url
    minutes = total_seconds // 60
    seconds = total_seconds % 60
    if minutes > 0:
        return f"{base_url}?t={minutes}m{seconds}s"
    return f"{base_url}?t={seconds}s"


def get_chapter_at(s, chapters):
    """Return chapter title if this timestamp starts a chapter, else None."""
    for chapter_seconds, title in chapters:
        if abs(s - chapter_seconds) < 3:
            return title
    return None


def process_whisper(whisper_path, metadata_path, output_path):
    # Load metadata
    with open(metadata_path) as f:
        meta = json.load(f)
    video_id = meta.get("id", "unknown")
    title = meta.get("title", "Untitled")
    channel = meta.get("channel", meta.get("uploader", "Unknown"))
    duration_str = meta.get("duration_string", "Unknown")
    base_url = f"https://youtu.be/{video_id}"

    # Extract chapters from metadata
    chapters = []
    for ch in (meta.get("chapters") or []):
        chapters.append((ch["start_time"], ch["title"]))

    # Load whisper segments
    with open(whisper_path) as f:
        data = json.load(f)
    segments = data.get("segments", [])
    if not segments:
        print("ERROR: No segments found", file=sys.stderr)
        sys.exit(1)

    # Build output
    lines = []
    lines.append(f"# {title}")
    lines.append("")
    lines.append(f"**Channel:** {channel}")
    lines.append(f"**Video:** [{base_url}]({base_url})")
    lines.append(f"**Duration:** {duration_str}")
    lines.append(f"**Transcription:** faster-whisper (language: {data.get('language', 'unknown')}, "
                 f"confidence: {data.get('language_probability', 0):.0%})")
    lines.append("")
    lines.append("---")
    lines.append("")

    current_paragraph_texts = []
    paragraph_start_s = 0
    last_timestamp_s = 0
    chapter_inserted = set()

    for seg in segments:
        start_s = seg["start"]
        text = seg["text"]

        # Check for chapter marker
        chapter = get_chapter_at(start_s, chapters)
        if chapter and chapter not in chapter_inserted:
            # Flush current paragraph
            if current_paragraph_texts:
                ts_str = s_to_hms(paragraph_start_s)
                url = s_to_url(base_url, paragraph_start_s)
                para = " ".join(current_paragraph_texts).strip()
                if para:
                    lines.append(f"[{ts_str}]({url}) {para}")
                    lines.append("")
                current_paragraph_texts = []
            chapter_inserted.add(chapter)
            lines.append(f"## {chapter}")
            lines.append("")
            paragraph_start_s = start_s
            last_timestamp_s = start_s

        # Start a new paragraph if enough time has passed
        if start_s - last_timestamp_s >= PARAGRAPH_INTERVAL_S and current_paragraph_texts:
            ts_str = s_to_hms(paragraph_start_s)
            url = s_to_url(base_url, paragraph_start_s)
            para = " ".join(current_paragraph_texts).strip()
            if para:
                lines.append(f"[{ts_str}]({url}) {para}")
                lines.append("")
            current_paragraph_texts = []
            paragraph_start_s = start_s
            last_timestamp_s = start_s

        if not current_paragraph_texts:
            paragraph_start_s = start_s
            last_timestamp_s = start_s

        current_paragraph_texts.append(text)

    # Flush final paragraph
    if current_paragraph_texts:
        ts_str = s_to_hms(paragraph_start_s)
        url = s_to_url(base_url, paragraph_start_s)
        para = " ".join(current_paragraph_texts).strip()
        if para:
            lines.append(f"[{ts_str}]({url}) {para}")
            lines.append("")

    output = "\n".join(lines)
    with open(output_path, "w") as f:
        f.write(output)

    total_paragraphs = sum(1 for l in lines if l.startswith("["))
    total_chapters = sum(1 for l in lines if l.startswith("## "))
    last_s = segments[-1]["end"] if segments else 0
    print(f"Processed {len(segments)} whisper segments")
    print(f"Generated {total_paragraphs} timestamped paragraphs")
    print(f"Inserted {total_chapters} chapter markers")
    print(f"Coverage: 00:00:00 to {s_to_hms(last_s)}")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print(f"Usage: {sys.argv[0]} <whisper_json> <metadata_json> <output_file>")
        sys.exit(1)
    process_whisper(sys.argv[1], sys.argv[2], sys.argv[3])
yt-json3-to-markdown.py

#!/usr/bin/env python3
"""Process YouTube JSON3 captions into timestamped markdown transcript.

Usage:
    python3 yt-json3-to-markdown.py <json3_file> <metadata_json_file> <output_file>

The metadata JSON is produced by: yt-dlp --print-json --skip-download <URL>
The json3 file is produced by: yt-dlp --write-auto-sub --sub-lang en --sub-format json3 --skip-download <URL>
"""
import json
import sys

PARAGRAPH_INTERVAL_MS = 40000  # ~40 seconds between timestamp markers


def ms_to_hms(ms):
    """Convert milliseconds to HH:MM:SS string."""
    total_seconds = int(ms / 1000)
    hours = total_seconds // 3600
    minutes = (total_seconds % 3600) // 60
    seconds = total_seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def ms_to_url(base_url, ms):
    """Create YouTube URL with timestamp."""
    total_seconds = int(ms / 1000)
    if total_seconds == 0:
        return base_url
    minutes = total_seconds // 60
    seconds = total_seconds % 60
    if minutes > 0:
        return f"{base_url}?t={minutes}m{seconds}s"
    return f"{base_url}?t={seconds}s"


def get_chapter_at(ms, chapters):
    """Return chapter title if this timestamp starts a chapter, else None."""
    ts_seconds = ms / 1000
    for chapter_seconds, title in chapters:
        if abs(ts_seconds - chapter_seconds) < 3:
            return title
    return None


def process_json3(json3_path, metadata_path, output_path):
    # Load metadata
    with open(metadata_path) as f:
        meta = json.load(f)
    video_id = meta.get("id", "unknown")
    title = meta.get("title", "Untitled")
    channel = meta.get("channel", meta.get("uploader", "Unknown"))
    duration_str = meta.get("duration_string", "Unknown")
    base_url = f"https://youtu.be/{video_id}"

    # Extract chapters from metadata
    chapters = []
    for ch in (meta.get("chapters") or []):
        chapters.append((ch["start_time"], ch["title"]))

    # Load json3 captions
    with open(json3_path) as f:
        data = json.load(f)
    events = data.get("events", [])

    # Extract text segments with absolute timestamps
    segments = []
    for event in events:
        if "segs" not in event:
            continue
        base_ms = event.get("tStartMs", 0)
        is_first_in_event = True
        for seg in event["segs"]:
            text = seg.get("utf8", "")
            if text == "\n":
                continue
            if is_first_in_event and segments and not text.startswith(" "):
                text = " " + text
            is_first_in_event = False
            offset = seg.get("tOffsetMs", 0)
            abs_ms = base_ms + offset
            segments.append((abs_ms, text))

    if not segments:
        print("ERROR: No text segments found", file=sys.stderr)
        sys.exit(1)

    # Build output
    lines = []
    lines.append(f"# {title}")
    lines.append("")
    lines.append(f"**Channel:** {channel}")
    lines.append(f"**Video:** [{base_url}]({base_url})")
    lines.append(f"**Duration:** {duration_str}")
    lines.append("")
    lines.append("---")
    lines.append("")

    current_paragraph_text = []
    paragraph_start_ms = 0
    last_timestamp_ms = 0
    chapter_inserted = set()

    for abs_ms, text in segments:
        # Check for chapter marker
        chapter = get_chapter_at(abs_ms, chapters)
        if chapter and chapter not in chapter_inserted:
            # Flush current paragraph
            if current_paragraph_text:
                ts_str = ms_to_hms(paragraph_start_ms)
                url = ms_to_url(base_url, paragraph_start_ms)
                para = "".join(current_paragraph_text).strip()
                if para:
                    lines.append(f"[{ts_str}]({url}) {para}")
                    lines.append("")
                current_paragraph_text = []
            chapter_inserted.add(chapter)
            lines.append(f"## {chapter}")
            lines.append("")
            paragraph_start_ms = abs_ms
            last_timestamp_ms = abs_ms

        # Start a new paragraph if enough time has passed
        if abs_ms - last_timestamp_ms >= PARAGRAPH_INTERVAL_MS and current_paragraph_text:
            ts_str = ms_to_hms(paragraph_start_ms)
            url = ms_to_url(base_url, paragraph_start_ms)
            para = "".join(current_paragraph_text).strip()
            if para:
                lines.append(f"[{ts_str}]({url}) {para}")
                lines.append("")
            current_paragraph_text = []
            paragraph_start_ms = abs_ms
            last_timestamp_ms = abs_ms

        if not current_paragraph_text:
            paragraph_start_ms = abs_ms
            last_timestamp_ms = abs_ms

        current_paragraph_text.append(text)

    # Flush final paragraph
    if current_paragraph_text:
        ts_str = ms_to_hms(paragraph_start_ms)
        url = ms_to_url(base_url, paragraph_start_ms)
        para = "".join(current_paragraph_text).strip()
        if para:
            lines.append(f"[{ts_str}]({url}) {para}")
            lines.append("")

    output = "\n".join(lines)
    with open(output_path, "w") as f:
        f.write(output)

    total_paragraphs = sum(1 for l in lines if l.startswith("["))
    total_chapters = sum(1 for l in lines if l.startswith("## "))
    last_ms = segments[-1][0] if segments else 0
    print(f"Processed {len(segments)} text segments")
    print(f"Generated {total_paragraphs} timestamped paragraphs")
    print(f"Inserted {total_chapters} chapter markers")
    print(f"Coverage: 00:00:00 to {ms_to_hms(last_ms)}")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print(f"Usage: {sys.argv[0]} <json3_file> <metadata_json> <output_file>")
        sys.exit(1)
    process_json3(sys.argv[1], sys.argv[2], sys.argv[3])