TL;DR
Groq's Orpheus TTS is fast. Really fast. Sub-200ms time-to-first-byte, ~100 characters/second, with voices that sound genuinely human. But when we shipped it to production, users started reporting that the second message was silent. The problem wasn't Groq. The problem wasn't Orpheus. The problem was us - specifically, our WebSocket lifecycle management and audio streaming architecture.
What We Were Building
We were adding real-time voice capabilities to our AI agent platform. Users could chat with AI agents via text, but we wanted to add voice output - the agent speaks its responses aloud.
For the Arabic-speaking market, we specifically needed Saudi dialect support. That led us to Groq's Orpheus Arabic-Saudi model (canopylabs/orpheus-arabic-saudi), which offers:
- Four authentic Saudi dialect voices: Fahad, Sultan, Lulwa, Noura
- Natural pronunciation with regional nuances
- Sub-200ms time-to-first-byte latency via Groq's LPU inference
- ~100 characters/second throughput
- Simple, OpenAI-compatible API
The API integration was straightforward:
    import os

    from groq import Groq

    client = Groq(api_key=os.environ["GROQ_API_KEY"])

    response = client.audio.speech.create(
        model="canopylabs/orpheus-arabic-saudi",
        voice="noura",
        input="مرحبا بكم في تطبيقنا",
        response_format="wav",
    )

    # Save the audio to disk
    response.write_to_file("output.wav")
Easy, right? It worked perfectly in development.
The Setup: Our Architecture
Our architecture followed this flow:
- User sends a message
- AI agent generates a text response
- Backend sends text to Groq's speech endpoint
- Groq returns audio (WAV format)
- Backend streams audio chunks to frontend via WebSocket
- Frontend plays audio using Web Audio API
What Broke in Production
After deploying, support tickets started rolling in:
| Issue | Frequency | User Report |
|---|---|---|
| Second message silent | ~30% of sessions | "First response plays fine, but when I send another message, no audio" |
| Audio never completes | Intermittent | "The agent just stops mid-sentence" |
| UI stuck on 'speaking' | ~15% of sessions | "It says the agent is speaking but there's no sound" |
| STOP doesn't work | Frequent | "I clicked stop but it kept playing" |
| Memory usage climbing | Over time | (Internal monitoring alert) |
The frustrating part: none of these reproduced locally. Our dev environment worked flawlessly.
Root Cause: Unclear State Ownership
After days of debugging, we identified the root cause. It wasn't one bug - it was an architectural flaw: we had no clear ownership of stateful resources.
Backend Problems
    socket.on('tts-request', async (text) => {
      // Problem 1: Creating a new WebSocket every time
      // Old connections were never explicitly closed
      const groqWs = new WebSocket(GROQ_STREAMING_URL);

      groqWs.on('message', (audioChunk) => {
        socket.emit('audio-chunk', audioChunk);
      });

      // Problem 2: Adding listeners without removing old ones
      socket.on('stop', () => {
        groqWs.close(); // Best effort, not guaranteed
      });

      // Problem 3: Stream end was implicit
      groqWs.on('close', () => {
        socket.emit('stream-end');
      });
    });
What went wrong
- Multiple WebSocket connections: Each TTS request created a new connection without closing the previous one.
- Listener accumulation: Every request added new listeners. After 5 messages, STOP would trigger 5 close attempts.
- Implicit stream completion: We assumed groqWs.on('close') would always fire. It didn't.
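The listener accumulation is easy to reproduce in isolation. The following is a minimal sketch that uses Node's built-in EventEmitter as a stand-in for the client socket; `fakeSocket` and `handleTtsRequest` are hypothetical names for illustration, not our production code. It registers a `stop` handler inside the request handler, as the buggy version did, and counts what happens after five messages.

```javascript
const { EventEmitter } = require('events');

// Stand-in for the client socket (assumption: it behaves like an EventEmitter).
const fakeSocket = new EventEmitter();
let closeAttempts = 0;

// Mirrors the buggy pattern: a new 'stop' listener is registered on every request.
function handleTtsRequest() {
  fakeSocket.on('stop', () => {
    closeAttempts += 1; // in the real handler this was groqWs.close()
  });
}

fakeSocket.on('tts-request', handleTtsRequest);

// Simulate a user sending five messages in one session.
for (let i = 0; i < 5; i++) {
  fakeSocket.emit('tts-request', `message ${i}`);
}

const stopListeners = fakeSocket.listenerCount('stop');
fakeSocket.emit('stop'); // a single click on STOP

console.log(stopListeners, closeAttempts); // prints "5 5"
```

One STOP click fans out to every listener that was ever registered, including those bound to connections from earlier messages.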
The Fix: Deterministic Lifecycle Management
We rewrote both backend and frontend around three principles:
The Three Laws of Real-Time State Management
Law 1 - Ownership: Every resource must have exactly one owner at any time.
Law 2 - Cleanup: All cleanup functions must be idempotent.
Law 3 - Completion: Stream completion must be explicit, never inferred.
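To make Law 3 concrete, here is a minimal sketch of a completion tracker for the frontend. The `createPlaybackTracker` helper and its method names are illustrative assumptions, not part of any library. The idea is that the UI leaves the "speaking" state only once an explicit end signal has arrived and every scheduled chunk has finished playing.

```javascript
// Hypothetical helper: tracks whether the agent should still be shown as speaking.
function createPlaybackTracker() {
  let active = false;
  let pendingChunks = 0;
  let endSignalReceived = false;

  return {
    // Call when a new response starts streaming.
    begin() {
      active = true;
      pendingChunks = 0;
      endSignalReceived = false;
    },
    // Call when a chunk is scheduled for playback.
    chunkScheduled() {
      if (active) pendingChunks += 1;
    },
    // Call from the source's onended callback.
    chunkFinished() {
      if (pendingChunks > 0) pendingChunks -= 1;
    },
    // Call when the explicit end signal arrives; safe to call more than once.
    streamEnded() {
      endSignalReceived = true;
    },
    isSpeaking() {
      return active && !(endSignalReceived && pendingChunks === 0);
    },
  };
}

const tracker = createPlaybackTracker();
tracker.begin();
tracker.chunkScheduled();
tracker.chunkScheduled();
tracker.chunkFinished();
tracker.chunkFinished();
const beforeEndSignal = tracker.isSpeaking(); // true: queue is empty, but no explicit end yet

tracker.streamEnded();
tracker.streamEnded(); // a duplicate signal is harmless
const afterEndSignal = tracker.isSpeaking(); // false

console.log(beforeEndSignal, afterEndSignal);
```

An empty playback queue alone never counts as completion here, because a gap between chunks looks exactly the same. This only works if the backend reliably sends the end signal.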
Single WebSocket Per Session
    // Store the external connection on the socket itself
    function cleanupExternalConnection(socket) {
      // Idempotent - safe to call multiple times
      if (socket.data.groqWs) {
        socket.data.groqWs.removeAllListeners();
        if (socket.data.groqWs.readyState === WebSocket.OPEN) {
          socket.data.groqWs.close();
        }
        socket.data.groqWs = null;
      }

      // Only send end signal once
      if (!socket.data.endSignalSent) {
        socket.data.endSignalSent = true;
        socket.emit('stream-end');
      }
    }

    socket.on('tts-request', async (text) => {
      // ALWAYS clean up previous session first
      cleanupExternalConnection(socket);

      // Reset state for new session
      socket.data.endSignalSent = false;
      socket.removeAllListeners('stop');

      // Create new connection with explicit ownership
      const groqWs = new WebSocket(GROQ_STREAMING_URL);
      socket.data.groqWs = groqWs;

      // ... rest of implementation
    });
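One way to sanity-check the idempotency claim is to run the cleanup function against fakes. The sketch below re-declares `cleanupExternalConnection` so that it is self-contained, and uses hypothetical `FakeUpstream` and `fakeSocket` objects built on Node's EventEmitter. It does not open a real WebSocket, and the `OPEN` and `CLOSED` constants are local stand-ins for the WebSocket ready states.

```javascript
const { EventEmitter } = require('events');

const OPEN = 1;   // stand-in for WebSocket.OPEN
const CLOSED = 3; // stand-in for WebSocket.CLOSED

// Hypothetical fake of the upstream connection that counts close() calls.
class FakeUpstream extends EventEmitter {
  constructor() {
    super();
    this.readyState = OPEN;
    this.closeCalls = 0;
  }
  close() {
    this.closeCalls += 1;
    this.readyState = CLOSED;
  }
}

function cleanupExternalConnection(socket) {
  if (socket.data.groqWs) {
    socket.data.groqWs.removeAllListeners();
    if (socket.data.groqWs.readyState === OPEN) {
      socket.data.groqWs.close();
    }
    socket.data.groqWs = null;
  }
  if (!socket.data.endSignalSent) {
    socket.data.endSignalSent = true;
    socket.emit('stream-end');
  }
}

// Fake client socket with an active session.
const fakeSocket = new EventEmitter();
const upstream = new FakeUpstream();
fakeSocket.data = { groqWs: upstream, endSignalSent: false };

let endSignals = 0;
fakeSocket.on('stream-end', () => { endSignals += 1; });

// Calling cleanup repeatedly must have the same effect as calling it once.
cleanupExternalConnection(fakeSocket);
cleanupExternalConnection(fakeSocket);
cleanupExternalConnection(fakeSocket);

console.log(upstream.closeCalls, endSignals); // prints "1 1"
```

Because the function nulls out the connection and flips the flag before returning, it can be called from the STOP handler, the disconnect handler, and the next request without any of them coordinating.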
Timeline-Driven Audio Playback
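    class AudioStreamPlayer {
      // Assumption: sampleRate is the rate of the incoming PCM stream
      constructor(sampleRate) {
        this.audioContext = new AudioContext();
        this.sampleRate = sampleRate;
        this.nextStartTime = 0;
        this.activeSources = new Set();
        this.streamEnded = false;
      }

      reset() {
        // Stop all active sources
        this.activeSources.forEach(source => {
          try { source.stop(); } catch (e) { /* nothing left to stop */ }
        });
        this.activeSources.clear();
        this.nextStartTime = this.audioContext.currentTime;
        this.streamEnded = false;
      }

      async playChunk(pcmData) {
        // Assumption: pcmData is a Float32Array of mono samples
        const audioBuffer = this.audioContext.createBuffer(
          1, pcmData.length, this.sampleRate
        );
        audioBuffer.copyToChannel(pcmData, 0);

        const source = this.audioContext.createBufferSource();
        source.buffer = audioBuffer;
        source.connect(this.audioContext.destination);

        // Track the source so reset() can stop it
        this.activeSources.add(source);
        source.onended = () => this.activeSources.delete(source);

        // Schedule on timeline (not "now")
        const startTime = Math.max(
          this.nextStartTime,
          this.audioContext.currentTime
        );
        source.start(startTime);
        this.nextStartTime = startTime + audioBuffer.duration;
      }
    }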
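The scheduling rule in `playChunk` can be checked without a browser. The sketch below extracts it into a pure function; `scheduleChunks` is a hypothetical helper, and the arrival times and durations are made-up numbers. It shows that chunks are laid end to end on the timeline, and that a chunk arriving after a gap starts at the current clock time instead of in the past.

```javascript
// Returns the start time chosen for each chunk, in seconds.
// arrivalTime plays the role of audioContext.currentTime when the chunk arrives.
function scheduleChunks(chunks) {
  let nextStartTime = 0;
  const startTimes = [];

  for (const { arrivalTime, duration } of chunks) {
    // Same rule as playChunk: never schedule before "now",
    // and never overlap the previously scheduled chunk.
    const startTime = Math.max(nextStartTime, arrivalTime);
    startTimes.push(startTime);
    nextStartTime = startTime + duration;
  }

  return startTimes;
}

const startTimes = scheduleChunks([
  { arrivalTime: 1, duration: 0.5 },    // first chunk starts immediately
  { arrivalTime: 1.25, duration: 0.5 }, // arrives early, queued behind the first
  { arrivalTime: 3, duration: 0.5 },    // arrives after a gap, starts at the clock time
]);

console.log(startTimes); // prints [ 1, 1.5, 3 ]
```

The second chunk waits for the first to finish, so the two play back to back without overlap. The third arrives late, so the timeline jumps forward to the current time.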
Testing for Production Reliability
Standard unit tests won't catch these issues. Here's what we added:
1. Multi-Message Stress Test
Send 100 consecutive messages and verify no resource leaks.
2. Rapid Interrupt Test
Issue STOP commands at random points during streaming.
3. Second Message Test
Specifically test that message 2 plays correctly after message 1.
4. Chaos Test
Introduce random latency and packet loss to verify graceful degradation.
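As a rough illustration of the multi-message stress test, the sketch below drives 100 requests through a handler that follows the cleanup-first pattern and records the peak number of `stop` listeners and open upstream connections. Everything here is a stand-in: `FakeUpstream`, `attachTtsHandler`, and the `openConnections` set are hypothetical names, and a real test would run against the actual backend.

```javascript
const { EventEmitter } = require('events');

// Tracks every fake upstream connection that is currently open.
const openConnections = new Set();

class FakeUpstream extends EventEmitter {
  constructor() {
    super();
    openConnections.add(this);
  }
  close() {
    openConnections.delete(this);
  }
}

// Hypothetical handler following the cleanup-first pattern.
function attachTtsHandler(socket) {
  socket.data = { upstream: null };

  socket.on('tts-request', () => {
    // Release the previous session's resources before starting a new one.
    if (socket.data.upstream) {
      socket.data.upstream.removeAllListeners();
      socket.data.upstream.close();
    }
    socket.removeAllListeners('stop');

    const upstream = new FakeUpstream();
    socket.data.upstream = upstream;
    socket.on('stop', () => upstream.close());
  });
}

const socket = new EventEmitter();
attachTtsHandler(socket);

let maxStopListeners = 0;
let maxOpenConnections = 0;

for (let i = 0; i < 100; i++) {
  socket.emit('tts-request', `message ${i}`);
  maxStopListeners = Math.max(maxStopListeners, socket.listenerCount('stop'));
  maxOpenConnections = Math.max(maxOpenConnections, openConnections.size);
}

console.log(maxStopListeners, maxOpenConnections); // prints "1 1"
```

Asserting on the peak values rather than the final ones catches leaks that a later cleanup would otherwise mask.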
Key Takeaways
5 Lessons Learned
- Groq + Orpheus is production-ready. Your integration might not be. The Groq API did exactly what it was supposed to do. Our bugs were in our own code.
- Demos work because conditions are perfect. In production, users interrupt, retry, and behave unpredictably.
- State ownership is everything. Every resource needs exactly one owner. When you can't answer "who closes this WebSocket?", you have a bug.
- Make completion explicit. Never infer that a stream has ended. Send explicit signals and verify them.
- Test for production, not demos. Multi-message stress tests, interrupt tests, chaos tests.
Conclusion
We started this project excited about Groq's speed and Orpheus's voice quality. We ended up learning a hard lesson about real-time systems architecture.
The good news: once we fixed our lifecycle management, everything worked beautifully. Groq Orpheus now powers voice output for thousands of users, in both English and Saudi Arabic, with reliable playback and clean state management.
The lesson: if you can't clearly explain when something starts and when it ends, it will break in production.