@romilly · Created March 15, 2026 15:31
Set up an Audio Speaker/Listener communicating via MQTT

Voice Services on Raspberry Pi 5 (Trixie)

Wake word detection, speech-to-text, and text-to-speech as independent MQTT-connected services on a Raspberry Pi 5 (8 GB) running Raspberry Pi OS Trixie (Debian 13, 64-bit).

Architecture

Two independent services communicate via MQTT:

  • STT service: microphone → openwakeword (wake word) → faster-whisper (transcribe) → MQTT publish to voice/transcript
  • TTS service: MQTT subscribe to voice/speak → Piper TTS → speaker

Openwakeword runs in Docker (to avoid tflite-runtime/Python 3.12+ incompatibility). Piper and faster-whisper run as Python libraries in a shared venv. MQTT broker (Mosquitto) runs on a separate machine (edgy-hermes).
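
Any MQTT client can drive either service. Here is a minimal sketch of both ends of the contract (topic names and payload shapes are the ones used by the service code later in this gist):

import json
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # The STT service publishes JSON: {"text": ..., "timestamp": ...}
    print(json.loads(msg.payload.decode()))

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_message = on_message
client.connect("edgy-hermes", 1883)
client.subscribe("voice/transcript")

# The TTS service accepts plain text or JSON with a "text" field
client.publish("voice/speak", json.dumps({"text": "Hello from the contract sketch"}))
client.loop_forever()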

Prerequisites

  • Raspberry Pi 5 (8 GB) running Pi OS Trixie 64-bit
  • USB microphone (tested with USB PnP Sound Device)
  • USB speaker (tested with UACDemoV1.0)
  • Mosquitto MQTT broker accessible on the network
  • Internet connection for initial setup

Installation

1. Install Docker

sudo apt update && sudo apt install -y ca-certificates curl gnupg

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo tee /etc/apt/keyrings/docker.asc > /dev/null
sudo chmod a+r /etc/apt/keyrings/docker.asc

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian trixie stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

sudo usermod -aG docker $USER

Reboot, then verify with docker run hello-world.

2. Start openwakeword

docker run -d --restart unless-stopped \
  --name openwakeword \
  -p 10400:10400 \
  rhasspy/wyoming-openwakeword \
  --preload-model 'ok_nabu'

Verify with docker logs openwakeword.
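
Once the Python environment from step 3 exists, you can also confirm the Wyoming server answers and reports its wake models. A sketch using the wyoming package's Describe/Info events (field names may differ slightly between wyoming versions):

import asyncio
from wyoming.client import AsyncClient
from wyoming.info import Describe, Info

async def check():
    async with AsyncClient.from_uri("tcp://localhost:10400") as client:
        await client.write_event(Describe().event())
        event = await client.read_event()
        if event and Info.is_type(event.type):
            info = Info.from_event(event)
            print("Wake models:", [m.name for w in info.wake for m in w.models])

asyncio.run(check())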

3. Create the Python project

mkdir -p ~/git/active/voice-services
cd ~/git/active/voice-services
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip wheel setuptools
pip install piper-tts faster-whisper wyoming paho-mqtt sounddevice pathvalidate

4. Install system audio library

sudo apt install -y libportaudio2

5. Download the Piper voice model

cd ~/git/active/voice-services
mkdir -p piper-data
cd piper-data
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
cd ..

Verify the model loads:

source venv/bin/activate
python -c "
from piper.voice import PiperVoice
voice = PiperVoice.load('piper-data/en_US-lessac-medium.onnx')
print(f'Sample rate: {voice.config.sample_rate}')
"

6. Download the Whisper model

source venv/bin/activate
python -c "
from faster_whisper import WhisperModel
model = WhisperModel('base.en', device='cpu', compute_type='int8')
print('Whisper loaded OK')
"

This downloads the model on first run. On a 4 GB Pi, use tiny.en for faster but less accurate results.
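
You can also confirm that transcription accepts a raw audio array, which is how stt_service.py feeds it (faster-whisper takes a float32 numpy array at 16 kHz):

import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel('base.en', device='cpu', compute_type='int8')
# One second of 16 kHz silence; expect no segments
segments, info = model.transcribe(np.zeros(16000, dtype=np.float32), language='en')
print('Segments:', [seg.text for seg in segments])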

Calibration

Find your audio devices

python -c "
import sounddevice as sd
print(sd.query_devices())
"

Note the index numbers for your microphone (input) and speaker (output). Update MIC_DEVICE in stt_service.py and OUTPUT_DEVICE in tts_service.py if needed.
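
Device indexes can move around when USB devices are re-plugged. If that bites you, here's a sketch of looking devices up by name substring instead (find_device is a hypothetical helper, not part of the services):

import sounddevice as sd

def find_device(name_fragment, kind='input'):
    """Return the index of the first device whose name contains name_fragment."""
    key = 'max_input_channels' if kind == 'input' else 'max_output_channels'
    for idx, dev in enumerate(sd.query_devices()):
        if name_fragment.lower() in dev['name'].lower() and dev[key] > 0:
            return idx
    raise ValueError(f'No {kind} device matching {name_fragment!r}')

print('MIC_DEVICE =', find_device('USB PnP'))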

Find your mic's native sample rate

python -c "
import sounddevice as sd
info = sd.query_devices(0)  # replace 0 with your mic device index
print(f'Default sample rate: {info[\"default_samplerate\"]}')
"

Update MIC_RATE in stt_service.py if it's not 44100.

Calibrate silence threshold

python -c "
import numpy as np
import sounddevice as sd

print('Recording 3 seconds of silence...')
audio = sd.rec(44100 * 3, samplerate=44100, channels=1, dtype='int16', device=0)
sd.wait()
rms = np.sqrt(np.mean(audio.flatten().astype(np.float32) ** 2))
print(f'Silence RMS: {rms:.0f}')

print('Now speak for 3 seconds...')
audio = sd.rec(44100 * 3, samplerate=44100, channels=1, dtype='int16', device=0)
sd.wait()
rms = np.sqrt(np.mean(audio.flatten().astype(np.float32) ** 2))
print(f'Speech RMS: {rms:.0f}')
"

Set SILENCE_THRESHOLD in stt_service.py to a value between the two RMS readings.
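
For example, if silence measures around 15 and speech around 300, anything in the 40-100 range should work; the geometric mean is a reasonable starting point (a heuristic, not something from the original calibration):

import math
silence_rms, speech_rms = 15, 300  # substitute your measured values
SILENCE_THRESHOLD = round(math.sqrt(silence_rms * speech_rms))  # ~67
print(SILENCE_THRESHOLD)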

The code

stt_service.py

"""STT Service: mic → openwakeword (wake word) → faster-whisper (transcribe) → MQTT publish.

Listens to the microphone continuously, streams audio to openwakeword for wake word detection.
On detection, records speech until silence, transcribes with faster-whisper, and publishes
the transcript to MQTT.

Configuration is via the constants below.
"""

import asyncio
import collections
import json
import time
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
from wyoming.client import AsyncClient
from wyoming.audio import AudioChunk
from wyoming.wake import Detect, Detection
import paho.mqtt.client as mqtt

# --- Configuration ---
WAKEWORD_URI = "tcp://localhost:10400"
WAKEWORD_NAME = "ok_nabu"

WHISPER_MODEL = "base.en"
WHISPER_DEVICE = "cpu"
WHISPER_COMPUTE_TYPE = "int8"

MQTT_BROKER = "edgy-hermes"
MQTT_PORT = 1883
MQTT_TOPIC_TRANSCRIPT = "voice/transcript"

MIC_DEVICE = 0  # USB PnP Sound Device
MIC_RATE = 44100  # native mic sample rate
TARGET_RATE = 16000  # rate needed by openwakeword and whisper
CHANNELS = 1
CHUNK_DURATION_MS = 100  # ms per audio chunk
MIC_CHUNK_SAMPLES = MIC_RATE * CHUNK_DURATION_MS // 1000

# Silence detection for end-of-speech
SILENCE_THRESHOLD = 20  # RMS threshold - adjust for your mic/environment
SILENCE_DURATION_S = 2.0  # seconds of silence to end recording
MAX_RECORDING_S = 10  # maximum recording duration

# Pre-buffer: keep this many seconds of audio before wake word detection
PRE_BUFFER_S = 1.5
PRE_BUFFER_CHUNKS = int(PRE_BUFFER_S * 1000 / CHUNK_DURATION_MS)


def downsample(audio: np.ndarray, from_rate: int, to_rate: int) -> np.ndarray:
    """Downsample audio using linear interpolation."""
    ratio = to_rate / from_rate
    num_samples = int(len(audio) * ratio)
    indices = np.arange(num_samples) / ratio
    return np.interp(indices, np.arange(len(audio)), audio.astype(np.float32)).astype(np.int16)


def record_chunk() -> tuple[np.ndarray, np.ndarray]:
    """Record a chunk from the mic. Returns (original 44100 Hz, downsampled 16000 Hz)."""
    audio = sd.rec(MIC_CHUNK_SAMPLES, samplerate=MIC_RATE, channels=CHANNELS,
                   dtype="int16", device=MIC_DEVICE)
    sd.wait()
    audio = audio.flatten()
    audio_16k = downsample(audio, MIC_RATE, TARGET_RATE)
    return audio, audio_16k


def rms(audio: np.ndarray) -> float:
    """Root mean square of audio samples."""
    return float(np.sqrt(np.mean(audio.astype(np.float32) ** 2)))


def record_until_silence() -> bytes:
    """Record audio from the mic until silence is detected. Returns 16kHz PCM bytes."""
    chunks = []
    silent_chunks = 0
    silence_chunks_needed = int(SILENCE_DURATION_S * 1000 / CHUNK_DURATION_MS)
    max_chunks = int(MAX_RECORDING_S * 1000 / CHUNK_DURATION_MS)

    print("  Recording...")
    for _ in range(max_chunks):
        audio_44k, audio_16k = record_chunk()

        chunks.append(audio_16k.tobytes())

        if rms(audio_44k) < SILENCE_THRESHOLD:
            silent_chunks += 1
            if silent_chunks >= silence_chunks_needed:
                print("  Silence detected, stopping recording.")
                break
        else:
            silent_chunks = 0

    return b"".join(chunks)


async def run():
    # Initialise MQTT
    mqtt_client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    mqtt_client.connect(MQTT_BROKER, MQTT_PORT)
    mqtt_client.loop_start()
    print(f"MQTT connected to {MQTT_BROKER}:{MQTT_PORT}")

    # Load whisper model
    print(f"Loading whisper model '{WHISPER_MODEL}'...")
    model = WhisperModel(WHISPER_MODEL, device=WHISPER_DEVICE, compute_type=WHISPER_COMPUTE_TYPE)
    print("Whisper model loaded.")

    print(f"Connecting to openwakeword at {WAKEWORD_URI}...")
    async with AsyncClient.from_uri(WAKEWORD_URI) as wake_client:
        # Request detection
        await wake_client.write_event(Detect(names=[WAKEWORD_NAME]).event())
        print(f"Listening for wake word '{WAKEWORD_NAME}'...")

        # Rolling buffer of recent 16kHz audio chunks
        pre_buffer = collections.deque(maxlen=PRE_BUFFER_CHUNKS)

        while True:
            # Capture a chunk from the mic and downsample
            _, audio_16k = record_chunk()

            # Keep recent audio in rolling buffer
            pre_buffer.append(audio_16k.tobytes())

            # Send 16kHz audio to openwakeword
            chunk = AudioChunk(
                audio=audio_16k.tobytes(), rate=TARGET_RATE, width=2, channels=CHANNELS
            )
            await wake_client.write_event(chunk.event())

            # Check for detection (non-blocking)
            try:
                event = await asyncio.wait_for(wake_client.read_event(), timeout=0.01)
            except asyncio.TimeoutError:
                event = None

            if event and Detection.is_type(event.type):
                print("Wake word detected!")

                # Grab buffered audio from before detection
                buffered_audio = b"".join(pre_buffer)
                pre_buffer.clear()

                # Record speech until silence (returns 16kHz audio)
                new_audio = record_until_silence()

                # Combine pre-buffer with new recording
                pcm_audio = buffered_audio + new_audio

                # Transcribe
                print("  Transcribing...")
                audio_array = np.frombuffer(pcm_audio, dtype=np.int16).astype(np.float32) / 32768.0
                segments, info = model.transcribe(audio_array, beam_size=1, language="en")
                text = " ".join(seg.text.strip() for seg in segments).strip()

                if text:
                    print(f"  Transcript: {text}")
                    payload = json.dumps({
                        "text": text,
                        "timestamp": time.time(),
                    })
                    mqtt_client.publish(MQTT_TOPIC_TRANSCRIPT, payload)
                    print(f"  Published to {MQTT_TOPIC_TRANSCRIPT}")
                else:
                    print("  No speech detected.")

                # Resume wake word listening
                await wake_client.write_event(Detect(names=[WAKEWORD_NAME]).event())
                print(f"Listening for wake word '{WAKEWORD_NAME}'...")


if __name__ == "__main__":
    try:
        asyncio.run(run())
    except KeyboardInterrupt:
        print("\nStopping STT service.")

tts_service.py

"""TTS Service: MQTT subscribe → Piper TTS → audio playback.

Subscribes to an MQTT topic, synthesizes incoming text with Piper, and plays the audio.
Runs independently of the STT service.

Configuration is via the constants below.
"""

import io
import json
import wave
import numpy as np
import sounddevice as sd
import paho.mqtt.client as mqtt
from piper.voice import PiperVoice

# --- Configuration ---
MQTT_BROKER = "edgy-hermes"
MQTT_PORT = 1883
MQTT_TOPIC_SPEAK = "voice/speak"

PIPER_MODEL_PATH = "piper-data/en_US-lessac-medium.onnx"

# Set to None to use default output device
OUTPUT_DEVICE = None


def synthesize_and_play(voice: PiperVoice, text: str):
    """Synthesize text to audio and play it."""
    print(f"  Synthesizing: {text}")

    wav_buffer = io.BytesIO()
    with wave.open(wav_buffer, "wb") as wav_file:
        voice.synthesize_wav(text, wav_file)

    wav_buffer.seek(0)
    with wave.open(wav_buffer, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        frames = wav_file.readframes(wav_file.getnframes())

    audio = np.frombuffer(frames, dtype=np.int16).astype(np.float32)

    # Resample from model rate (22050) to 48000 for ALSA compatibility
    playback_rate = 48000
    resample_ratio = playback_rate / sample_rate
    num_samples = int(len(audio) * resample_ratio)
    indices = np.arange(num_samples) / resample_ratio
    audio_resampled = np.interp(indices, np.arange(len(audio)), audio).astype(np.int16)

    print(f"  Playing audio ({len(audio) / sample_rate:.1f}s)...")
    sd.play(audio_resampled, samplerate=playback_rate, device=OUTPUT_DEVICE)
    sd.wait()
    print("  Done.")


def run():
    print(f"Loading Piper voice from {PIPER_MODEL_PATH}...")
    voice = PiperVoice.load(PIPER_MODEL_PATH)
    print(f"Piper voice loaded (sample rate: {voice.config.sample_rate})")

    def on_connect(client, userdata, flags, reason_code, properties):
        print(f"MQTT connected to {MQTT_BROKER}:{MQTT_PORT}")
        client.subscribe(MQTT_TOPIC_SPEAK)
        print(f"Subscribed to {MQTT_TOPIC_SPEAK}")

    def on_message(client, userdata, msg):
        try:
            raw = msg.payload.decode()
            # Accept plain text or JSON with a "text" field
            try:
                payload = json.loads(raw)
                text = payload.get("text", "") if isinstance(payload, dict) else raw
            except json.JSONDecodeError:
                text = raw

            text = text.strip()
            if text:
                synthesize_and_play(voice, text)
        except Exception as e:
            print(f"  Error: {e}")

    mqtt_client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    mqtt_client.on_connect = on_connect
    mqtt_client.on_message = on_message
    mqtt_client.connect(MQTT_BROKER, MQTT_PORT)

    print("TTS service running. Waiting for messages...")
    mqtt_client.loop_forever()


if __name__ == "__main__":
    try:
        run()
    except KeyboardInterrupt:
        print("\nStopping TTS service.")

Running the services

In two separate terminals:

# Terminal 1 - STT service
cd ~/git/active/voice-services
source venv/bin/activate
python stt_service.py

# Terminal 2 - TTS service
cd ~/git/active/voice-services
source venv/bin/activate
python tts_service.py

Testing

Test TTS from any machine with mosquitto-clients:

mosquitto_pub -h edgy-hermes -t voice/speak -m "Hello Romilly"

Monitor STT output:

mosquitto_sub -h edgy-hermes -t voice/transcript

Then say "ok nabu" followed by a voice command. The transcript appears as JSON with text and timestamp fields.
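
For a full round-trip test, a small loopback script (a sketch, not part of the services) can republish each transcript to voice/speak so the Pi reads back whatever it hears:

# echo_test.py: republish transcripts to the speak topic (round-trip test)
import json
import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, reason_code, properties):
    client.subscribe('voice/transcript')

def on_message(client, userdata, msg):
    text = json.loads(msg.payload.decode())['text']
    print(f'Heard: {text}')
    client.publish('voice/speak', json.dumps({'text': f'You said {text}'}))

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_connect = on_connect
client.on_message = on_message
client.connect('edgy-hermes', 1883)
client.loop_forever()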

Configuration reference

stt_service.py

Setting                  Default                  Description
WAKEWORD_URI             tcp://localhost:10400    openwakeword Wyoming server
WAKEWORD_NAME            ok_nabu                  Wake word to listen for. Built-in options: ok_nabu, hey_jarvis, alexa, hey_mycroft, hey_rhasspy
WHISPER_MODEL            base.en                  Whisper model. tiny.en is faster, small.en more accurate
MQTT_BROKER              edgy-hermes              MQTT broker hostname
MQTT_TOPIC_TRANSCRIPT    voice/transcript         Topic to publish transcripts
MIC_DEVICE               0                        Microphone device index
MIC_RATE                 44100                    Mic native sample rate
SILENCE_THRESHOLD        20                       RMS below which audio counts as silence
SILENCE_DURATION_S       2.0                      Seconds of silence to stop recording
MAX_RECORDING_S          10                       Maximum recording duration
PRE_BUFFER_S             1.5                      Seconds of audio kept before wake word detection

tts_service.py

Setting             Default                                Description
MQTT_BROKER         edgy-hermes                            MQTT broker hostname
MQTT_TOPIC_SPEAK    voice/speak                            Topic to subscribe to. Accepts plain text or JSON {"text": "..."}
PIPER_MODEL_PATH    piper-data/en_US-lessac-medium.onnx    Path to Piper voice model
OUTPUT_DEVICE       None (system default)                  Speaker device index

Gotchas discovered during setup

  • piper-tts is missing pathvalidate from its dependencies — install it manually.
  • piper-tts CLI model download doesn't work reliably — download .onnx and .onnx.json files directly from Hugging Face.
  • Use PiperVoice.load() and synthesize_wav() — not synthesize() (doesn't set WAV headers) and not synthesize_stream_raw() (doesn't exist in all versions).
  • USB mics often only support 44100 Hz, not 16000 Hz — record at the native rate and downsample.
  • ALSA rejects 22050 Hz playback on many USB speakers — resample to 48000 Hz.
  • openwakeword depends on tflite-runtime which doesn't support Python 3.12+ — run it in Docker.
  • Silence threshold must be calibrated to your mic. Measure RMS of silence and speech, set the threshold between them.