Wake word detection, speech-to-text, and text-to-speech as independent MQTT-connected services on a Raspberry Pi 5 (8 GB) running Raspberry Pi OS Trixie (Debian 13, 64-bit).
Two independent services communicate via MQTT:
- STT service: microphone → openwakeword (wake word) → faster-whisper (transcribe) → MQTT publish to `voice/transcript`
- TTS service: MQTT subscribe to `voice/speak` → Piper TTS → speaker
Openwakeword runs in Docker (to avoid tflite-runtime/Python 3.12+ incompatibility).
Piper and faster-whisper run as Python libraries in a shared venv.
MQTT broker (Mosquitto) runs on a separate machine (edgy-hermes).
- Raspberry Pi 5 (8 GB) running Pi OS Trixie 64-bit
- USB microphone (tested with USB PnP Sound Device)
- USB speaker (tested with UACDemoV1.0)
- Mosquitto MQTT broker accessible on the network
- Internet connection for initial setup
```bash
sudo apt update && sudo apt install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo tee /etc/apt/keyrings/docker.asc > /dev/null
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian trixie stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
```
Reboot, then verify with `docker run hello-world`.
```bash
docker run -d --restart unless-stopped \
  --name openwakeword \
  -p 10400:10400 \
  rhasspy/wyoming-openwakeword \
  --preload-model 'ok_nabu'
```
Verify with `docker logs openwakeword`.
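Beyond checking the container logs, you can confirm the Wyoming server is actually accepting connections with a plain TCP probe. A small sketch (the helper name and defaults are mine):

```python
import socket


def wyoming_reachable(host: str = "localhost", port: int = 10400,
                      timeout: float = 3.0) -> bool:
    """Return True if something is listening on the openwakeword port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("openwakeword reachable:", wyoming_reachable())
```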
```bash
mkdir -p ~/git/active/voice-services
cd ~/git/active/voice-services
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip wheel setuptools
pip install piper-tts faster-whisper wyoming paho-mqtt sounddevice pathvalidate
sudo apt install -y libportaudio2
```
Download the Piper voice model:
```bash
cd ~/git/active/voice-services
mkdir -p piper-data
cd piper-data
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
cd ..
```
Verify the model loads:
```bash
source venv/bin/activate
python -c "
from piper.voice import PiperVoice
voice = PiperVoice.load('piper-data/en_US-lessac-medium.onnx')
print(f'Sample rate: {voice.config.sample_rate}')
"
```
Verify that faster-whisper loads:
```bash
source venv/bin/activate
python -c "
from faster_whisper import WhisperModel
model = WhisperModel('base.en', device='cpu', compute_type='int8')
print('Whisper loaded OK')
"
```
This downloads the model on first run. Use `tiny.en` for faster but less accurate results on a 4 GB Pi.
List the audio devices:
```bash
python -c "
import sounddevice as sd
print(sd.query_devices())
"
```
Note the index numbers for your microphone (input) and speaker (output).
Update `MIC_DEVICE` in `stt_service.py` and `OUTPUT_DEVICE` in `tts_service.py` if needed.
Check the microphone's native sample rate:
```bash
python -c "
import sounddevice as sd
info = sd.query_devices(0)  # replace 0 with your mic device index
print(f'Default sample rate: {info[\"default_samplerate\"]}')
"
```
Update `MIC_RATE` in `stt_service.py` if it's not 44100.
Calibrate the silence threshold:
```bash
python -c "
import numpy as np
import sounddevice as sd
print('Recording 3 seconds of silence...')
audio = sd.rec(44100 * 3, samplerate=44100, channels=1, dtype='int16', device=0)
sd.wait()
rms = np.sqrt(np.mean(audio.flatten().astype(np.float32) ** 2))
print(f'Silence RMS: {rms:.0f}')
print('Now speak for 3 seconds...')
audio = sd.rec(44100 * 3, samplerate=44100, channels=1, dtype='int16', device=0)
sd.wait()
rms = np.sqrt(np.mean(audio.flatten().astype(np.float32) ** 2))
print(f'Speech RMS: {rms:.0f}')
"
```
Set `SILENCE_THRESHOLD` in `stt_service.py` to a value between the two RMS readings.
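If you want to automate that choice, the geometric mean of the two readings is a reasonable midpoint on a log scale. A small sketch (the `suggest_threshold` helper is mine, not part of the services):

```python
import numpy as np


def rms(audio: np.ndarray) -> float:
    """Root mean square of int16 audio samples."""
    return float(np.sqrt(np.mean(audio.astype(np.float32) ** 2)))


def suggest_threshold(silence_rms: float, speech_rms: float) -> float:
    """Geometric mean: sits between readings that may differ by orders of magnitude."""
    return float(np.sqrt(silence_rms * speech_rms))


# Example with the kind of readings the calibration above might print
print(suggest_threshold(10.0, 1000.0))  # 100.0
```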
Save the following as `stt_service.py`:

```python
"""STT Service: mic → openwakeword (wake word) → faster-whisper (transcribe) → MQTT publish.

Listens to the microphone continuously, streams audio to openwakeword for wake word detection.
On detection, records speech until silence, transcribes with faster-whisper, and publishes
the transcript to MQTT.

Configuration is via the constants below.
"""
import asyncio
import collections
import json
import time

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
from wyoming.client import AsyncClient
from wyoming.audio import AudioChunk
from wyoming.wake import Detect, Detection
import paho.mqtt.client as mqtt

# --- Configuration ---
WAKEWORD_URI = "tcp://localhost:10400"
WAKEWORD_NAME = "ok_nabu"

WHISPER_MODEL = "base.en"
WHISPER_DEVICE = "cpu"
WHISPER_COMPUTE_TYPE = "int8"

MQTT_BROKER = "edgy-hermes"
MQTT_PORT = 1883
MQTT_TOPIC_TRANSCRIPT = "voice/transcript"

MIC_DEVICE = 0       # USB PnP Sound Device
MIC_RATE = 44100     # native mic sample rate
TARGET_RATE = 16000  # rate needed by openwakeword and whisper
CHANNELS = 1
CHUNK_DURATION_MS = 100  # ms per audio chunk
MIC_CHUNK_SAMPLES = MIC_RATE * CHUNK_DURATION_MS // 1000

# Silence detection for end-of-speech
SILENCE_THRESHOLD = 20    # RMS threshold - adjust for your mic/environment
SILENCE_DURATION_S = 2.0  # seconds of silence to end recording
MAX_RECORDING_S = 10      # maximum recording duration

# Pre-buffer: keep this many seconds of audio before wake word detection
PRE_BUFFER_S = 1.5
PRE_BUFFER_CHUNKS = int(PRE_BUFFER_S * 1000 / CHUNK_DURATION_MS)


def downsample(audio: np.ndarray, from_rate: int, to_rate: int) -> np.ndarray:
    """Downsample audio using linear interpolation."""
    ratio = to_rate / from_rate
    num_samples = int(len(audio) * ratio)
    indices = np.arange(num_samples) / ratio
    return np.interp(indices, np.arange(len(audio)), audio.astype(np.float32)).astype(np.int16)


def record_chunk() -> tuple[np.ndarray, np.ndarray]:
    """Record a chunk from the mic. Returns (original 44100 Hz, downsampled 16000 Hz)."""
    audio = sd.rec(MIC_CHUNK_SAMPLES, samplerate=MIC_RATE, channels=CHANNELS,
                   dtype="int16", device=MIC_DEVICE)
    sd.wait()
    audio = audio.flatten()
    audio_16k = downsample(audio, MIC_RATE, TARGET_RATE)
    return audio, audio_16k


def rms(audio: np.ndarray) -> float:
    """Root mean square of audio samples."""
    return float(np.sqrt(np.mean(audio.astype(np.float32) ** 2)))


def record_until_silence() -> bytes:
    """Record audio from the mic until silence is detected. Returns 16 kHz PCM bytes."""
    chunks = []
    silent_chunks = 0
    silence_chunks_needed = int(SILENCE_DURATION_S * 1000 / CHUNK_DURATION_MS)
    max_chunks = int(MAX_RECORDING_S * 1000 / CHUNK_DURATION_MS)
    print("  Recording...")
    for _ in range(max_chunks):
        audio_44k, audio_16k = record_chunk()
        chunks.append(audio_16k.tobytes())
        if rms(audio_44k) < SILENCE_THRESHOLD:
            silent_chunks += 1
            if silent_chunks >= silence_chunks_needed:
                print("  Silence detected, stopping recording.")
                break
        else:
            silent_chunks = 0
    return b"".join(chunks)


async def run():
    # Initialise MQTT
    mqtt_client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    mqtt_client.connect(MQTT_BROKER, MQTT_PORT)
    mqtt_client.loop_start()
    print(f"MQTT connected to {MQTT_BROKER}:{MQTT_PORT}")

    # Load whisper model
    print(f"Loading whisper model '{WHISPER_MODEL}'...")
    model = WhisperModel(WHISPER_MODEL, device=WHISPER_DEVICE, compute_type=WHISPER_COMPUTE_TYPE)
    print("Whisper model loaded.")

    print(f"Connecting to openwakeword at {WAKEWORD_URI}...")
    async with AsyncClient.from_uri(WAKEWORD_URI) as wake_client:
        # Request detection
        await wake_client.write_event(Detect(names=[WAKEWORD_NAME]).event())
        print(f"Listening for wake word '{WAKEWORD_NAME}'...")

        # Rolling buffer of recent 16 kHz audio chunks
        pre_buffer = collections.deque(maxlen=PRE_BUFFER_CHUNKS)

        while True:
            # Capture a chunk from the mic and downsample
            _, audio_16k = record_chunk()

            # Keep recent audio in rolling buffer
            pre_buffer.append(audio_16k.tobytes())

            # Send 16 kHz audio to openwakeword
            chunk = AudioChunk(
                audio=audio_16k.tobytes(), rate=TARGET_RATE, width=2, channels=CHANNELS
            )
            await wake_client.write_event(chunk.event())

            # Check for detection (non-blocking)
            try:
                event = await asyncio.wait_for(wake_client.read_event(), timeout=0.01)
            except asyncio.TimeoutError:
                event = None

            if event and Detection.is_type(event.type):
                print("Wake word detected!")

                # Grab buffered audio from before detection
                buffered_audio = b"".join(pre_buffer)
                pre_buffer.clear()

                # Record speech until silence (returns 16 kHz audio)
                new_audio = record_until_silence()

                # Combine pre-buffer with new recording
                pcm_audio = buffered_audio + new_audio

                # Transcribe
                print("  Transcribing...")
                audio_array = np.frombuffer(pcm_audio, dtype=np.int16).astype(np.float32) / 32768.0
                segments, info = model.transcribe(audio_array, beam_size=1, language="en")
                text = " ".join(seg.text.strip() for seg in segments).strip()

                if text:
                    print(f"  Transcript: {text}")
                    payload = json.dumps({
                        "text": text,
                        "timestamp": time.time(),
                    })
                    mqtt_client.publish(MQTT_TOPIC_TRANSCRIPT, payload)
                    print(f"  Published to {MQTT_TOPIC_TRANSCRIPT}")
                else:
                    print("  No speech detected.")

                # Resume wake word listening
                await wake_client.write_event(Detect(names=[WAKEWORD_NAME]).event())
                print(f"Listening for wake word '{WAKEWORD_NAME}'...")


if __name__ == "__main__":
    try:
        asyncio.run(run())
    except KeyboardInterrupt:
        print("\nStopping STT service.")
```

Save the following as `tts_service.py`:

```python
"""TTS Service: MQTT subscribe → Piper TTS → audio playback.

Subscribes to an MQTT topic, synthesizes incoming text with Piper, and plays the audio.
Runs independently of the STT service.

Configuration is via the constants below.
"""
import io
import json
import wave

import numpy as np
import sounddevice as sd
import paho.mqtt.client as mqtt
from piper.voice import PiperVoice

# --- Configuration ---
MQTT_BROKER = "edgy-hermes"
MQTT_PORT = 1883
MQTT_TOPIC_SPEAK = "voice/speak"

PIPER_MODEL_PATH = "piper-data/en_US-lessac-medium.onnx"

# Set to None to use default output device
OUTPUT_DEVICE = None


def synthesize_and_play(voice: PiperVoice, text: str):
    """Synthesize text to audio and play it."""
    print(f"  Synthesizing: {text}")
    wav_buffer = io.BytesIO()
    with wave.open(wav_buffer, "wb") as wav_file:
        voice.synthesize_wav(text, wav_file)

    wav_buffer.seek(0)
    with wave.open(wav_buffer, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        frames = wav_file.readframes(wav_file.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16).astype(np.float32)

    # Resample from model rate (22050) to 48000 for ALSA compatibility
    playback_rate = 48000
    resample_ratio = playback_rate / sample_rate
    num_samples = int(len(audio) * resample_ratio)
    indices = np.arange(num_samples) / resample_ratio
    audio_resampled = np.interp(indices, np.arange(len(audio)), audio).astype(np.int16)

    print(f"  Playing audio ({len(audio) / sample_rate:.1f}s)...")
    sd.play(audio_resampled, samplerate=playback_rate, device=OUTPUT_DEVICE)
    sd.wait()
    print("  Done.")


def run():
    print(f"Loading Piper voice from {PIPER_MODEL_PATH}...")
    voice = PiperVoice.load(PIPER_MODEL_PATH)
    print(f"Piper voice loaded (sample rate: {voice.config.sample_rate})")

    def on_connect(client, userdata, flags, reason_code, properties):
        print(f"MQTT connected to {MQTT_BROKER}:{MQTT_PORT}")
        client.subscribe(MQTT_TOPIC_SPEAK)
        print(f"Subscribed to {MQTT_TOPIC_SPEAK}")

    def on_message(client, userdata, msg):
        try:
            # Accept plain text or JSON with a "text" field
            try:
                payload = json.loads(msg.payload.decode())
                text = payload.get("text", "")
            except (json.JSONDecodeError, UnicodeDecodeError):
                text = msg.payload.decode()
            text = text.strip()
            if text:
                synthesize_and_play(voice, text)
        except Exception as e:
            print(f"  Error: {e}")

    mqtt_client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    mqtt_client.on_connect = on_connect
    mqtt_client.on_message = on_message
    mqtt_client.connect(MQTT_BROKER, MQTT_PORT)
    print("TTS service running. Waiting for messages...")
    mqtt_client.loop_forever()


if __name__ == "__main__":
    try:
        run()
    except KeyboardInterrupt:
        print("\nStopping TTS service.")
```
In two separate terminals:
```bash
# Terminal 1 - STT service
cd ~/git/active/voice-services
source venv/bin/activate
python stt_service.py
```
```bash
# Terminal 2 - TTS service
cd ~/git/active/voice-services
source venv/bin/activate
python tts_service.py
```
Test TTS from any machine with mosquitto-clients:
```bash
mosquitto_pub -h edgy-hermes -t voice/speak -m "Hello Romilly"
```
Monitor STT output:
```bash
mosquitto_sub -h edgy-hermes -t voice/transcript
```
Then say "ok nabu" followed by a voice command. The transcript appears as JSON with `text` and `timestamp` fields.
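Any downstream consumer of `voice/transcript` only needs to parse that JSON. A minimal sketch of the payload handling (the handler name is mine; the field names match what `stt_service.py` publishes):

```python
import json
import time


def handle_transcript(payload: bytes) -> str:
    """Parse a voice/transcript message and return its text."""
    msg = json.loads(payload.decode())
    age_s = time.time() - msg["timestamp"]
    print(f"[{age_s:.1f}s ago] {msg['text']}")
    return msg["text"]


# Example payload shaped like the one stt_service.py publishes
example = json.dumps({"text": "turn on the lights", "timestamp": time.time()}).encode()
handle_transcript(example)
```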
`stt_service.py` settings:

| Setting | Default | Description |
|---|---|---|
| `WAKEWORD_URI` | `tcp://localhost:10400` | openwakeword Wyoming server |
| `WAKEWORD_NAME` | `ok_nabu` | Wake word to listen for. Built-in options: `ok_nabu`, `hey_jarvis`, `alexa`, `hey_mycroft`, `hey_rhasspy` |
| `WHISPER_MODEL` | `base.en` | Whisper model. `tiny.en` is faster, `small.en` more accurate |
| `MQTT_BROKER` | `edgy-hermes` | MQTT broker hostname |
| `MQTT_TOPIC_TRANSCRIPT` | `voice/transcript` | Topic transcripts are published to |
| `MIC_DEVICE` | `0` | Microphone device index |
| `MIC_RATE` | `44100` | Mic native sample rate |
| `SILENCE_THRESHOLD` | `20` | RMS below which audio counts as silence |
| `SILENCE_DURATION_S` | `2.0` | Seconds of silence to stop recording |
| `MAX_RECORDING_S` | `10` | Maximum recording duration |
| `PRE_BUFFER_S` | `1.5` | Seconds of audio kept before wake word detection |
`tts_service.py` settings:

| Setting | Default | Description |
|---|---|---|
| `MQTT_BROKER` | `edgy-hermes` | MQTT broker hostname |
| `MQTT_TOPIC_SPEAK` | `voice/speak` | Topic to subscribe to. Accepts plain text or JSON `{"text": "..."}` |
| `PIPER_MODEL_PATH` | `piper-data/en_US-lessac-medium.onnx` | Path to Piper voice model |
| `OUTPUT_DEVICE` | `None` (system default) | Speaker device index |
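The dual payload handling on `voice/speak` can be seen in isolation as a pure function. A sketch mirroring the `on_message` logic in `tts_service.py` (the function name is mine, and it adds a guard for non-object JSON):

```python
import json


def extract_text(payload: bytes) -> str:
    """Accept plain text or JSON with a "text" field, as tts_service.py does."""
    try:
        return str(json.loads(payload.decode()).get("text", "")).strip()
    except (json.JSONDecodeError, UnicodeDecodeError, AttributeError):
        return payload.decode(errors="replace").strip()


print(extract_text(b'{"text": "Hello Romilly"}'))  # Hello Romilly
print(extract_text(b"Hello Romilly"))              # Hello Romilly
```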
- piper-tts is missing `pathvalidate` from its dependencies — install it manually.
- The piper-tts CLI model download doesn't work reliably — download the `.onnx` and `.onnx.json` files directly from Hugging Face.
- Use `PiperVoice.load()` and `synthesize_wav()` — not `synthesize()` (doesn't set WAV headers) and not `synthesize_stream_raw()` (doesn't exist in all versions).
- USB mics often only support 44100 Hz, not 16000 Hz — record at the native rate and downsample.
- ALSA rejects 22050 Hz playback on many USB speakers — resample to 48000 Hz.
- openwakeword depends on tflite-runtime, which doesn't support Python 3.12+ — run it in Docker.
- The silence threshold must be calibrated to your mic: measure the RMS of silence and of speech, and set the threshold between them.
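The record-at-native-rate-then-downsample point above is easy to sanity-check in isolation. This reproduces the linear-interpolation `downsample` helper from `stt_service.py` with a synthetic signal (the 32000 → 16000 rates are chosen here only to make the arithmetic exact):

```python
import numpy as np


def downsample(audio: np.ndarray, from_rate: int, to_rate: int) -> np.ndarray:
    """Downsample audio using linear interpolation (as in stt_service.py)."""
    ratio = to_rate / from_rate
    num_samples = int(len(audio) * ratio)
    indices = np.arange(num_samples) / ratio
    return np.interp(indices, np.arange(len(audio)), audio.astype(np.float32)).astype(np.int16)


# One second of constant-amplitude audio keeps its duration and level
one_second = np.full(32000, 100, dtype=np.int16)
out = downsample(one_second, 32000, 16000)
print(len(out), int(out[0]))  # 16000 100
```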