Engineering · 12 min read
A deep-dive into the WebSocket lifecycle and audio streaming challenges we encountered deploying Groq's Orpheus TTS, and the architectural patterns that finally made it production-ready.

Ayush Jain & Bijoy Roy
Katonic Engineering Team
TL;DR
Groq's Orpheus TTS is fast. Really fast. Sub-200ms time-to-first-byte, ~100 characters/second, with voices that sound genuinely human. But when we shipped it to production, users started reporting that the second message was silent. The problem wasn't Groq. The problem wasn't Orpheus. The problem was us - specifically, our WebSocket lifecycle management and audio streaming architecture.
We were adding real-time voice capabilities to our AI agent platform. Users could chat with AI agents via text, but we wanted to add voice output - the agent speaks its responses aloud.
For the Arabic-speaking market, we specifically needed Saudi dialect support. That led us to Groq's Orpheus Arabic-Saudi model (canopylabs/orpheus-arabic-saudi).
The API integration was straightforward:

    import os

    from groq import Groq

    client = Groq(api_key=os.environ["GROQ_API_KEY"])

    response = client.audio.speech.create(
        model="canopylabs/orpheus-arabic-saudi",
        voice="noura",
        input="مرحبا بكم في تطبيقنا",
        response_format="wav",
    )

    # Save or stream the audio
    with open("output.wav", "wb") as f:
        f.write(response.content)

Easy, right? It worked perfectly in development.
Our architecture looked like this:
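Browser (Frontend) → Backend Server → Groq API (Orpheus)
Flow: the browser sends a tts-request to the backend, the backend opens a streaming WebSocket to Groq and relays each audio chunk back to the browser, and a stream-end event tells the frontend that the response is complete.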
After deploying, support tickets started rolling in:
| Issue | Frequency | User Report |
|---|---|---|
| Second message silent | ~30% of sessions | “First response plays fine, but when I send another message, no audio” |
| Audio never completes | Intermittent | “The agent just stops mid-sentence” |
| UI stuck on 'speaking' | ~15% of sessions | “It says the agent is speaking but there's no sound” |
| STOP doesn't work | Frequent | “I clicked stop but it kept playing” |
| Memory usage climbing | Over time | (Internal monitoring alert) |
The frustrating part: none of these reproduced locally. Our dev environment worked flawlessly.
After days of debugging, we identified the root cause. It wasn't one bug - it was an architectural flaw: we had no clear ownership of stateful resources.
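To make the ownership problem concrete, here is a minimal sketch (not our production code) of what happens when a handler registers another handler on every request and nothing ever removes it. It uses Node's built-in EventEmitter as a hypothetical stand-in for the client socket; the event names mirror the ones in our backend, and the counter exists only for illustration.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical stand-in for a client socket: any EventEmitter behaves this way.
const socket = new EventEmitter();
let stopHandlerCalls = 0;

socket.on('tts-request', () => {
  // A fresh 'stop' listener is added per request and never removed.
  socket.on('stop', () => {
    stopHandlerCalls += 1;
  });
});

// Three messages in one session.
for (let i = 0; i < 3; i += 1) {
  socket.emit('tts-request');
}

const stopListenerCount = socket.listenerCount('stop');
socket.emit('stop'); // a single click on STOP

// Three listeners are registered, and one STOP runs all three of them.
console.log({ stopListenerCount, stopHandlerCalls });
```

Each stale listener still holds a reference to the connection it closed over, which is consistent with both the unreliable STOP behavior and the memory growth in the table above.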
The original backend handler, with the three problems marked:

    socket.on('tts-request', async (text) => {
      // Problem 1: Creating a new WebSocket every time
      // Old connections were never explicitly closed
      const groqWs = new WebSocket(GROQ_STREAMING_URL);

      groqWs.on('message', (audioChunk) => {
        socket.emit('audio-chunk', audioChunk);
      });

      // Problem 2: Adding listeners without removing old ones
      socket.on('stop', () => {
        groqWs.close(); // Best effort, not guaranteed
      });

      // Problem 3: Stream end was implicit
      groqWs.on('close', () => {
        socket.emit('stream-end');
      });
    });

We assumed groqWs.on('close') would always fire. It didn't.
We rewrote both backend and frontend around three principles:

    // Store the external connection on the socket itself
    function cleanupExternalConnection(socket) {
      // Idempotent - safe to call multiple times
      if (socket.data.groqWs) {
        socket.data.groqWs.removeAllListeners();
        if (socket.data.groqWs.readyState === WebSocket.OPEN) {
          socket.data.groqWs.close();
        }
        socket.data.groqWs = null;
      }

      // Only send end signal once
      if (!socket.data.endSignalSent) {
        socket.data.endSignalSent = true;
        socket.emit('stream-end');
      }
    }

    socket.on('tts-request', async (text) => {
      // ALWAYS clean up previous session first
      cleanupExternalConnection(socket);

      // Reset state for new session
      socket.data.endSignalSent = false;
      socket.removeAllListeners('stop');

      // Create new connection with explicit ownership
      const groqWs = new WebSocket(GROQ_STREAMING_URL);
      socket.data.groqWs = groqWs;

      // ... rest of implementation
    });

On the frontend, a single player object owns every audio source and schedules chunks on the AudioContext timeline:

    class AudioStreamPlayer {
      constructor() {
        this.audioContext = new AudioContext();
        this.nextStartTime = 0;
        this.activeSources = new Set();
        this.streamEnded = false;
      }

      reset() {
        // Stop all active sources
        this.activeSources.forEach(source => {
          try { source.stop(); } catch (e) { /* source already stopped */ }
        });
        this.activeSources.clear();
        this.nextStartTime = this.audioContext.currentTime;
        this.streamEnded = false;
      }

      async playChunk(pcmData) {
        // Assumption: pcmData is a Float32Array of mono samples that already
        // matches the AudioContext sample rate. Decode or resample first if not.
        const audioBuffer = this.audioContext.createBuffer(
          1, pcmData.length, this.audioContext.sampleRate
        );
        audioBuffer.copyToChannel(pcmData, 0);

        const source = this.audioContext.createBufferSource();
        source.buffer = audioBuffer;
        source.connect(this.audioContext.destination);

        // Track the source so reset() can stop it; forget it once it finishes
        this.activeSources.add(source);
        source.onended = () => this.activeSources.delete(source);

        // Schedule on timeline (not "now")
        const startTime = Math.max(
          this.nextStartTime,
          this.audioContext.currentTime
        );
        source.start(startTime);
        this.nextStartTime = startTime + audioBuffer.duration;
      }
    }

Standard unit tests won't catch these issues. Here's what we added:
Send 100 consecutive messages and verify no resource leaks.
Issue STOP commands at random points during streaming.
Specifically test that message 2 plays correctly after message 1.
Introduce random latency and packet loss to verify graceful degradation.
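As a rough illustration of the first three checks, the sketch below drives a cleanup-first handler with 100 back-to-back requests and STOP commands at random points, then asserts that nothing leaked. Everything in it is a hypothetical stand-in: FakeUpstream replaces the real streaming WebSocket, attachTtsHandlers is a stripped-down version of the pattern described earlier, and the run is synchronous. A real test would drive the actual server over real sockets.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical stand-in for the upstream streaming WebSocket.
class FakeUpstream extends EventEmitter {
  constructor() {
    super();
    this.open = true;
    FakeUpstream.openCount += 1;
  }

  close() {
    if (!this.open) return; // idempotent close
    this.open = false;
    FakeUpstream.openCount -= 1;
  }
}
FakeUpstream.openCount = 0;

// Stripped-down handler following the cleanup-first pattern.
function attachTtsHandlers(socket) {
  socket.data = { upstream: null };

  const cleanup = () => {
    if (socket.data.upstream) {
      socket.data.upstream.removeAllListeners();
      socket.data.upstream.close();
      socket.data.upstream = null;
    }
  };

  socket.on('tts-request', () => {
    cleanup();                          // always release the previous session
    socket.removeAllListeners('stop');  // never stack 'stop' listeners
    socket.data.upstream = new FakeUpstream();
    socket.on('stop', cleanup);
  });
}

// Sends messageCount requests, issuing STOP after some of them at random.
function runRapidFireTest(messageCount, stopProbability, random = Math.random) {
  const socket = new EventEmitter();
  attachTtsHandlers(socket);

  let maxOpen = 0;
  let maxStopListeners = 0;
  for (let i = 0; i < messageCount; i += 1) {
    socket.emit('tts-request', `message ${i}`);
    maxOpen = Math.max(maxOpen, FakeUpstream.openCount);
    maxStopListeners = Math.max(maxStopListeners, socket.listenerCount('stop'));
    if (random() < stopProbability) socket.emit('stop');
  }
  socket.emit('stop'); // end of session

  return { maxOpen, maxStopListeners, openAfter: FakeUpstream.openCount };
}

const result = runRapidFireTest(100, 0.3);
console.log(result);
```

Whatever the random STOP pattern turns out to be, a leak-free handler should never hold more than one upstream connection or one 'stop' listener at a time, and should hold none once the session ends. The latency and packet-loss check needs a real network layer and is not covered by this sketch.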
We started this project excited about Groq's speed and Orpheus's voice quality. We ended up learning a hard lesson about real-time systems architecture.
The good news: once we fixed our lifecycle management, everything worked beautifully. Groq Orpheus now powers voice output for thousands of users, in both English and Saudi Arabic, with reliable playback and clean state management.
The lesson: if you can't clearly explain when something starts and when it ends, it will break in production.
