Engineering · 12 min read

We shipped Groq Orpheus TTS to production - here's what broke.

A deep-dive into the WebSocket lifecycle and audio streaming challenges we encountered deploying Groq's Orpheus TTS, and the architectural patterns that finally made it production-ready.

Ayush Jain & Bijoy Roy

Katonic Engineering Team

June 29, 2026

Voice AI latency: <200ms time-to-first-byte via Groq's LPU

The bug: WebSocket lifecycle, fixed

TL;DR

Groq's Orpheus TTS is fast. Really fast. Sub-200ms time-to-first-byte, ~100 characters/second, with voices that sound genuinely human. But when we shipped it to production, users started reporting that the second message was silent. The problem wasn't Groq. The problem wasn't Orpheus. The problem was us - specifically, our WebSocket lifecycle management and audio streaming architecture.

What We Were Building

We were adding real-time voice capabilities to our AI agent platform. Users could chat with AI agents via text, but we wanted to add voice output - the agent speaks its responses aloud.

For the Arabic-speaking market, we specifically needed Saudi dialect support. That led us to Groq's Orpheus Arabic-Saudi model (canopylabs/orpheus-arabic-saudi), which offers:

Four authentic Saudi dialect voices: Fahad, Sultan, Lulwa, Noura
Natural pronunciation with regional nuances
Sub-200ms time-to-first-byte latency via Groq's LPU inference
~100 characters/second throughput
Simple, OpenAI-compatible API

The API integration was straightforward:

Python

import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

response = client.audio.speech.create(
    model="canopylabs/orpheus-arabic-saudi",
    voice="noura",
    input="مرحبا بكم في تطبيقنا",
    response_format="wav"
)

# Save or stream the audio
with open("output.wav", "wb") as f:
    f.write(response.content)

Easy, right? It worked perfectly in development.

The Setup: Our Architecture

Our architecture looked like this:

Browser (Frontend) ⇄ WS ⇄ Backend Server ⇄ HTTPS ⇄ Groq API (Orpheus)

Flow:

1. User sends a message
2. AI agent generates a text response
3. Backend sends text to Groq's speech endpoint
4. Groq returns audio (WAV format)
5. Backend streams audio chunks to frontend via WebSocket
6. Frontend plays audio using Web Audio API
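The step where the backend streams audio chunks to the frontend can be illustrated with a small sketch. It assumes the audio arrives as a single Node.js Buffer and is split into fixed-size pieces before being sent; `chunkBuffer` and the 4096-byte chunk size are hypothetical choices for illustration, not values from our production code.

```javascript
// Splits a binary payload into fixed-size chunks (the last one may be shorter).
// Hypothetical helper; the chunk size is an arbitrary example value.
function* chunkBuffer(buffer, chunkSize) {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  for (let offset = 0; offset < buffer.length; offset += chunkSize) {
    // subarray returns a view, so no audio bytes are copied here
    yield buffer.subarray(offset, offset + chunkSize);
  }
}

const audio = Buffer.alloc(10000); // stand-in for the audio returned by the API
const sizes = [...chunkBuffer(audio, 4096)].map((chunk) => chunk.length);
console.log(sizes); // [ 4096, 4096, 1808 ]
```

Each yielded chunk would then be passed to the socket's send or emit call, followed by an explicit end-of-stream message once the generator is exhausted.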

What Broke in Production

After deploying, support tickets started rolling in:

Issue	Frequency	User Report
Second message silent	~30% of sessions	“First response plays fine, but when I send another message, no audio”
Audio never completes	Intermittent	“The agent just stops mid-sentence”
UI stuck on 'speaking'	~15% of sessions	“It says the agent is speaking but there's no sound”
STOP doesn't work	Frequent	“I clicked stop but it kept playing”
Memory usage climbing	Over time	(Internal monitoring alert)

The frustrating part: none of these reproduced locally. Our dev environment worked flawlessly.

Root Cause: Unclear State Ownership

After days of debugging, we identified the root cause. It wasn't one bug - it was an architectural flaw: we had no clear ownership of stateful resources.

Backend Problems

JavaScript DON'T DO THIS

socket.on('tts-request', async (text) => {
  // Problem 1: Creating a new WebSocket every time
  // Old connections were never explicitly closed
  const groqWs = new WebSocket(GROQ_STREAMING_URL);
  
  groqWs.on('message', (audioChunk) => {
    socket.emit('audio-chunk', audioChunk);
  });
  
  // Problem 2: Adding listeners without removing old ones
  socket.on('stop', () => {
    groqWs.close(); // Best effort, not guaranteed
  });
  
  // Problem 3: Stream end was implicit
  groqWs.on('close', () => {
    socket.emit('stream-end');
  });
});

What went wrong

Multiple WebSocket connections: Each TTS request created a new connection without closing the previous one.
Listener accumulation: Every request added new listeners. After 5 messages, STOP would trigger 5 close attempts.
Implicit stream completion: We assumed groqWs.on('close') would always fire. It didn't.
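The listener accumulation described above can be reproduced without any network code. The following is a minimal sketch that uses Node's built-in EventEmitter as a stand-in for the client socket; `handleTtsRequest` and `closeAttempts` are hypothetical names for illustration, not our production handler.

```javascript
const { EventEmitter } = require('node:events');

// Stand-in for the per-client socket (hypothetical).
const socket = new EventEmitter();
let closeAttempts = 0;

// Mirrors the anti-pattern: every request registers another 'stop' listener.
function handleTtsRequest() {
  socket.on('stop', () => {
    closeAttempts += 1; // in the real handler this was groqWs.close()
  });
}

// Five messages in one session
for (let i = 0; i < 5; i++) {
  handleTtsRequest();
}

// A single STOP click now runs every accumulated listener
socket.emit('stop');
console.log(socket.listenerCount('stop')); // 5
console.log(closeAttempts); // 5
```

Only the most recent listener refers to a live connection; the other four close connections that are already gone, while their closures keep those objects reachable.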

The Fix: Deterministic Lifecycle Management

We rewrote both backend and frontend around three principles:

The Three Laws of Real-Time State Management

Law 1 - Ownership: Every resource must have exactly one owner at any time.
Law 2 - Cleanup: All cleanup functions must be idempotent.
Law 3 - Completion: Stream completion must be explicit, never inferred.
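Laws 1 and 2 can be sketched independently of WebSockets. The example below is an illustration under stated assumptions rather than our production code: `makeIdempotent` and `SingleOwner` are hypothetical names, and the "resource" is just a counter that records how many times a release function ran.

```javascript
// Hypothetical helper: wraps a release function so it runs at most once.
function makeIdempotent(releaseFn) {
  let released = false;
  return function cleanup() {
    if (released) return false; // already cleaned up, nothing to do
    released = true;
    releaseFn();
    return true;
  };
}

// Hypothetical owner: holds at most one resource at a time.
class SingleOwner {
  constructor() {
    this.cleanup = null;
  }
  take(releaseFn) {
    this.release(); // Law 1: release the previous resource before owning a new one
    this.cleanup = makeIdempotent(releaseFn);
  }
  release() {
    if (this.cleanup) this.cleanup(); // Law 2: safe to call repeatedly
    this.cleanup = null;
  }
}

let closed = 0;
const owner = new SingleOwner();
owner.take(() => { closed += 1; });
owner.take(() => { closed += 1; }); // releases the first resource
owner.release();
owner.release(); // no effect the second time
console.log(closed); // 2
```

Two resources were created and exactly two releases ran, regardless of how many times `release()` was called.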

Single WebSocket Per Session

JavaScript FIXED

// Store the external connection on the socket itself
function cleanupExternalConnection(socket) {
  // Idempotent - safe to call multiple times
  const groqWs = socket.data.groqWs;
  if (groqWs) {
    groqWs.removeAllListeners();
    // Keep a no-op error handler so a late 'error' event cannot crash the process
    groqWs.on('error', () => {});
    // Close connections that are open or still connecting
    if (groqWs.readyState === WebSocket.OPEN ||
        groqWs.readyState === WebSocket.CONNECTING) {
      groqWs.close();
    }
    socket.data.groqWs = null;
  }
  // Only send end signal once, and only if a stream was actually started
  // (endSignalSent stays undefined until the first tts-request)
  if (socket.data.endSignalSent === false) {
    socket.data.endSignalSent = true;
    socket.emit('stream-end');
  }
}

socket.on('tts-request', async (text) => {
  // ALWAYS clean up previous session first
  cleanupExternalConnection(socket);
  // Reset state for new session
  socket.data.endSignalSent = false;
  socket.removeAllListeners('stop');
  // Create new connection with explicit ownership
  const groqWs = new WebSocket(GROQ_STREAMING_URL);
  socket.data.groqWs = groqWs;
  // ... rest of implementation
});
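Law 3 is easier to see in isolation. The sketch below shows one way to route every termination path (normal completion, upstream close, user STOP) through a single explicit signal that reaches the client exactly once. `StreamSession` and the reason strings are hypothetical and only illustrate the idea; they are not part of the Groq API or of the handler above.

```javascript
// Hypothetical per-stream state: every termination path calls end(), and only
// the first call reaches the client.
class StreamSession {
  constructor(notify) {
    this.notify = notify; // function that delivers events to the client
    this.endReason = null;
  }
  end(reason) {
    if (this.endReason !== null) return false; // already ended
    this.endReason = reason;
    this.notify('stream-end', { reason });
    return true;
  }
}

const sent = [];
const session = new StreamSession((event, payload) => {
  sent.push(`${event}:${payload.reason}`);
});
session.end('completed'); // upstream delivered its final chunk
session.end('closed');    // upstream 'close' fires afterwards
session.end('stopped');   // a late STOP from the user
console.log(sent); // [ 'stream-end:completed' ]
```

Recording the reason also helps when debugging: the frontend can distinguish a stream that finished from one that was interrupted.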

Timeline-Driven Audio Playback

JavaScript FIXED

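class AudioStreamPlayer {
  // Assumption: chunks are mono 16-bit PCM; 24000 Hz is a placeholder default,
  // so pass the sample rate your stream actually uses
  constructor(sampleRate = 24000) {
    this.audioContext = new AudioContext();
    this.sampleRate = sampleRate;
    this.nextStartTime = 0;
    this.activeSources = new Set();
    this.streamEnded = false;
  }
  reset() {
    // Stop all active sources
    this.activeSources.forEach(source => {
      try { source.stop(); } catch (e) { /* source already stopped */ }
    });
    this.activeSources.clear();
    this.nextStartTime = this.audioContext.currentTime;
    this.streamEnded = false;
  }
  async playChunk(pcmData) {
    // Assumption: pcmData is an Int16Array; convert it to float samples in [-1, 1)
    const audioBuffer = this.audioContext.createBuffer(
      1, pcmData.length, this.sampleRate
    );
    const channel = audioBuffer.getChannelData(0);
    for (let i = 0; i < pcmData.length; i++) {
      channel[i] = pcmData[i] / 32768;
    }
    const source = this.audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(this.audioContext.destination);
    // Track the source so reset() can stop it; untrack it when it finishes
    this.activeSources.add(source);
    source.onended = () => this.activeSources.delete(source);
    // Schedule on timeline (not "now")
    const startTime = Math.max(
      this.nextStartTime,
      this.audioContext.currentTime
    );
    source.start(startTime);
    this.nextStartTime = startTime + audioBuffer.duration;
  }
}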
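The scheduling rule in `playChunk` is plain arithmetic, so it can be checked without a browser. The sketch below is a pure-function version of that rule; `scheduleChunks`, `arrivedAt`, and `duration` are hypothetical names, with `arrivedAt` standing in for the value of `audioContext.currentTime` at the moment each chunk arrives.

```javascript
// Pure version of the scheduling rule: given the clock reading when each chunk
// arrives and the chunk's duration (both in seconds), return each start time.
function scheduleChunks(chunks) {
  let nextStartTime = 0;
  return chunks.map(({ arrivedAt, duration }) => {
    const startTime = Math.max(nextStartTime, arrivedAt);
    nextStartTime = startTime + duration;
    return startTime;
  });
}

const starts = scheduleChunks([
  { arrivedAt: 0.0, duration: 0.5 },  // starts immediately
  { arrivedAt: 0.25, duration: 0.5 }, // arrives early, queued behind chunk 1
  { arrivedAt: 2.0, duration: 0.5 },  // arrives late, starts on arrival
]);
console.log(starts); // [ 0, 0.5, 2 ]
```

Chunks that arrive early play back to back with no gap or overlap, and a chunk that arrives after the queue has drained starts at the current time instead of at a stale timestamp in the past.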

Testing for Production Reliability

Standard unit tests won't catch these issues. Here's what we added:

1. Multi-Message Stress Test

Send 100 consecutive messages and verify no resource leaks.
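A stress test of this kind might look like the following sketch. It assumes an in-memory stand-in for both the client socket and the upstream connection, so it exercises only the lifecycle logic; `FakeUpstream` and `runStressTest` are hypothetical names, and the handler mirrors the shape of the fixed handler rather than reproducing it.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical fake upstream connection that records whether it was closed.
class FakeUpstream extends EventEmitter {
  constructor() {
    super();
    this.closed = false;
  }
  close() {
    this.closed = true;
  }
}

function runStressTest(messageCount) {
  const socket = new EventEmitter();
  socket.data = {};
  const upstreams = [];

  socket.on('tts-request', () => {
    // Clean up the previous session, then take ownership of the new one
    if (socket.data.upstream) {
      socket.data.upstream.removeAllListeners();
      socket.data.upstream.close();
    }
    socket.removeAllListeners('stop');
    const upstream = new FakeUpstream();
    upstreams.push(upstream);
    socket.data.upstream = upstream;
    socket.on('stop', () => upstream.close());
  });

  for (let i = 0; i < messageCount; i++) {
    socket.emit('tts-request', `message ${i}`);
  }

  return {
    stopListeners: socket.listenerCount('stop'),
    openUpstreams: upstreams.filter((u) => !u.closed).length,
  };
}

const result = runStressTest(100);
console.log(result); // { stopListeners: 1, openUpstreams: 1 }
```

After 100 messages there should be exactly one STOP listener and one open upstream connection; any larger number indicates a leak.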

2. Rapid Interrupt Test

Issue STOP commands at random points during streaming.
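One way to sketch such a test is to interrupt a simulated stream at a pseudo-random chunk index and check the invariants afterwards. Everything here is hypothetical: `streamWithInterrupt` models a streamer in memory, and `makeRandom` is a small seeded generator so that a failing run can be replayed.

```javascript
// Hypothetical streamer: delivers chunks until stopped, then emits one end signal.
function streamWithInterrupt(totalChunks, stopAfter) {
  const events = [];
  let stopped = false;
  let endSent = false;
  const end = () => {
    if (!endSent) {
      endSent = true;
      events.push('stream-end');
    }
  };
  const stop = () => {
    stopped = true;
    end();
  };

  for (let i = 0; i < totalChunks; i++) {
    if (i === stopAfter) {
      stop();
      stop(); // simulate a double click on STOP
    }
    if (stopped) break;
    events.push('audio-chunk');
  }
  end(); // normal completion path; a no-op if STOP already ended the stream
  return events;
}

// Small deterministic pseudo-random generator so failures are reproducible.
function makeRandom(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 2 ** 32;
  };
}

const random = makeRandom(42);
let failures = 0;
for (let run = 0; run < 200; run++) {
  const total = 20;
  const stopAfter = Math.floor(random() * total);
  const events = streamWithInterrupt(total, stopAfter);
  const ends = events.filter((e) => e === 'stream-end').length;
  const lastIsEnd = events[events.length - 1] === 'stream-end';
  const chunks = events.length - ends;
  // Invariants: one end signal, nothing after it, no chunks delivered after STOP
  if (ends !== 1 || !lastIsEnd || chunks !== stopAfter) failures += 1;
}
console.log(failures); // 0
```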

3. Second Message Test

Specifically test that message 2 plays correctly after message 1.

4. Chaos Test

Introduce random latency and packet loss to verify graceful degradation.
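Latency and packet loss can be simulated in virtual time, which keeps the test fast and deterministic. The sketch below is a simplified model, not our test harness: `simulateStream` and all of its parameters are hypothetical, and the idle timeout stands in for whatever watchdog ends a stalled stream.

```javascript
// Simulates one stream over an unreliable link in virtual time (no real timers).
// All parameter names are hypothetical.
function simulateStream({ chunks, lossRate, maxDelayMs, idleTimeoutMs, random }) {
  let clock = 0;
  let lastActivity = 0;
  let delivered = 0;
  for (let i = 0; i < chunks; i++) {
    clock += random() * maxDelayMs; // random latency before each packet
    if (clock - lastActivity > idleTimeoutMs) {
      return { delivered, end: 'timeout' }; // watchdog ends the stream explicitly
    }
    if (random() < lossRate) continue; // packet lost, so no activity is recorded
    delivered += 1;
    lastActivity = clock;
  }
  return { delivered, end: 'completed' };
}

// A constant "random" source makes the two boundary cases easy to verify by hand.
const base = { chunks: 50, maxDelayMs: 100, idleTimeoutMs: 250, random: () => 0.5 };
const healthy = simulateStream({ ...base, lossRate: 0 });
const dead = simulateStream({ ...base, lossRate: 1 });
console.log(healthy); // { delivered: 50, end: 'completed' }
console.log(dead);    // { delivered: 0, end: 'timeout' }
```

Passing `Math.random` or a seeded generator as `random` turns the same function into a randomized test. The property to assert is that every run ends with an explicit reason instead of hanging.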

Key Takeaways

5 Lessons Learned

Groq + Orpheus is production-ready. Your integration might not be. The Groq API did exactly what it was supposed to do. Our bugs were in our own code.
Demos work because conditions are perfect. In production, users interrupt, retry, and behave unpredictably.
State ownership is everything. Every resource needs exactly one owner. When you can't answer “who closes this WebSocket?”, you have a bug.
Make completion explicit. Never infer that a stream has ended. Send explicit signals and verify them.
Test for production, not demos. Run multi-message stress tests, interrupt tests, and chaos tests before you ship.

Conclusion

We started this project excited about Groq's speed and Orpheus's voice quality. We ended up learning a hard lesson about real-time systems architecture.

The good news: once we fixed our lifecycle management, everything worked beautifully. Groq Orpheus now powers voice output for thousands of users, in both English and Saudi Arabic, with reliable playback and clean state management.

The lesson: if you can't clearly explain when something starts and when it ends, it will break in production.
