Architecture Guide · 16 min read
The 5-component blueprint that beat GPT-5 - and keeps your data sovereign. A technical guide to building AI systems that are smarter, not bigger.

Katonic AI
Engineering Team
Self-orchestration: how often GPT-5 calls itself
The alternative: an 8B orchestrator that routes
When researchers gave GPT-5 the ability to choose which model to call for each task - including itself - something revealing happened. It called itself 98% of the time. Even for simple arithmetic. Even for basic web lookups. The most powerful model in the world couldn't resist using itself.
| Cost reduction vs single-model | HLE accuracy (beats GPT-5) | Faster inference |
|---|---|---|
| 70% | 37.1% | 2.5× |
NVIDIA's response? Train an 8-billion-parameter model specifically to make routing decisions. No ego. No self-preference. Just cold optimisation for outcome, efficiency, and cost.
The result: 37.1% accuracy on Humanity's Last Exam vs. GPT-5's 35.1% - at 30% of the cost and 2.5× the speed.
This is tool orchestration. And it's reshaping how enterprises should think about AI architecture.
When a model selects tools (including itself), it faces a conflict of interest. LLMs are trained to be helpful and capable - not to admit "a calculator would handle this better." Self-orchestrating models consistently over-rely on their own reasoning, even when external tools would be superior.
When GPT-5 orchestrates itself, it calls itself 98% of the time - even for tasks where simpler tools would be faster and cheaper.
Enterprises default to "biggest model = safest choice." But this creates a brutal cost structure at scale:
| Query Type | % of Traffic | GPT-5 Cost (per query) | Optimal Tool Cost (per query) |
|---|---|---|---|
| Simple lookups | 35% | $0.03 | $0.001 (search) |
| Calculations | 15% | $0.03 | $0.0001 (calculator) |
| Code execution | 20% | $0.03 | $0.005 (interpreter) |
| Complex reasoning | 30% | $0.03 | $0.03 (LLM needed) |
Result: 70% of queries are over-served. You're paying GPT-5 prices for calculator tasks.
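The arithmetic behind that result can be checked directly from the table. The sketch below is illustrative only: it assumes the table's costs are per query, and it ignores the cost of running the routing layer itself, which the table does not give.

```python
# Traffic mix and per-query costs copied from the table above.
# Each row: (query type, share of traffic, GPT-5 cost, optimal-tool cost)
ROWS = [
    ("Simple lookups",    0.35, 0.03, 0.001),
    ("Calculations",      0.15, 0.03, 0.0001),
    ("Code execution",    0.20, 0.03, 0.005),
    ("Complex reasoning", 0.30, 0.03, 0.03),
]

def blended_costs(rows):
    """Return (single-model cost, routed cost) for an average query."""
    single = sum(share * gpt5 for _, share, gpt5, _ in rows)
    routed = sum(share * best for _, share, _, best in rows)
    return single, routed

def over_served_share(rows):
    """Share of traffic where a tool cheaper than GPT-5 would do."""
    return sum(share for _, share, gpt5, best in rows if best < gpt5)

single, routed = blended_costs(ROWS)
print(f"single-model: ${single:.5f} per query")
print(f"routed:       ${routed:.5f} per query")
print(f"over-served:  {over_served_share(ROWS):.0%} of traffic")
print(f"saving:       {1 - routed / single:.1%}")
```

With these figures the routed cost comes to roughly a third of the single-model cost, a saving of about 65% before any orchestrator overhead is counted.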
Self-orchestration with frontier models means every query - even trivial ones - leaves your infrastructure for external APIs. Tool orchestration lets you keep 70-80% of processing local, with full control over what data touches which system.
Definition: An AI architecture where a lightweight, purpose-trained coordinator model decides which tools, models, or APIs to invoke for each step of a task - rather than routing everything through a single powerful model.
The Key Insight: Separate the "decision-making about what to use" from "actually doing the work." A small model trained only for routing can outperform a giant model trying to do everything.
With orchestration, you control the routing layer. You decide which queries stay on-premise (most of them), which escalate to external APIs (only when necessary), and what data touches which system (full audit trail). This isn't just cost optimisation - it's architectural control over your AI stack.
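One way to picture that control is a policy object that chooses a destination for each query and records why. This is a minimal sketch, not Katonic's or NVIDIA's implementation: the class names, the data labels, and the `needs_frontier` flag are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RoutingDecision:
    query_id: str
    destination: str   # "on_prem" or "external_api"
    reason: str
    timestamp: str

@dataclass
class SovereigntyPolicy:
    # Hypothetical labels; a real deployment would use its own data classification.
    restricted_labels: frozenset = frozenset({"pii", "financial", "health"})
    audit_log: list = field(default_factory=list)

    def decide(self, query_id, data_labels, needs_frontier):
        """Pick a destination and append the decision to the audit trail."""
        blocked = set(data_labels) & self.restricted_labels
        if blocked:
            # Restricted data never leaves, even if a frontier model would help.
            dest, reason = "on_prem", f"restricted data: {sorted(blocked)}"
        elif needs_frontier:
            dest, reason = "external_api", "escalated: frontier reasoning required"
        else:
            dest, reason = "on_prem", "default: local tools suffice"
        decision = RoutingDecision(
            query_id, dest, reason, datetime.now(timezone.utc).isoformat()
        )
        self.audit_log.append(decision)
        return decision

policy = SovereigntyPolicy()
print(policy.decide("q-1", ["pii"], needs_frontier=True).destination)
print(policy.decide("q-2", [], needs_frontier=True).destination)
```

The design choice worth noting is that the data check runs before the escalation check, so sensitivity always wins over capability.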
Blueprint for production orchestration:
The Orchestration Blueprint
Five components that beat GPT-5 at 30% of the cost
1. Classifies queries, routes to tools, and decides escalation. Example: Nemotron-Orchestrator-8B.
2. Catalogue of available tools with capability metadata. Examples: web search, calculators, code interpreters, DB connectors.
3. Specialist models for specific domains. Examples: fine-tuned extractors, coding models, math specialists.
4. Rules for when to call frontier models. Examples: confidence thresholds, task complexity scoring.
5. Monitors outcomes and improves routing over time. Examples: cost tracking, accuracy metrics, latency monitoring.
User Query
    ↓
[Orchestrator] → Classifies intent, complexity, domain
    ↓
┌───────────────────────────────────────────────┐
│ Simple fact?      → Web Search Tool           │
│ Calculation?      → Calculator Tool           │
│ Code task?        → Code Interpreter          │
│ Domain-specific?  → Fine-tuned 7B Model       │
│ Complex reasoning → Escalate to GPT-5/Claude  │
└───────────────────────────────────────────────┘
    ↓
[Response Validation] → Quality check
    ↓
Return result OR retry with different tool
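The flow above can be sketched in a few dozen lines. Everything here is a stand-in: the keyword `classify` function replaces the trained orchestrator, the search, interpreter and frontier tools are stubs that return placeholder strings, the domain-specific model branch is omitted for brevity, and all names are hypothetical.

```python
import operator
import re
from dataclasses import dataclass
from typing import Callable

ARITH = re.compile(r"(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)")
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

@dataclass
class Tool:
    name: str
    run: Callable[[str], str]

def calculator(query: str) -> str:
    """Toy calculator for 'a <op> b'; returns '' when it cannot answer."""
    m = ARITH.search(query)
    if not m or (m.group(2) == "/" and float(m.group(3)) == 0):
        return ""
    return str(OPS[m.group(2)](float(m.group(1)), float(m.group(3))))

# Stub tools: placeholders for real search, interpreter and model calls.
def web_search(query: str) -> str:
    return f"[search results for: {query}]"

def code_interpreter(query: str) -> str:
    return f"[interpreter output for: {query}]"

def frontier_model(query: str) -> str:
    return f"[frontier model answer for: {query}]"

# Tool registry keyed by the label the classifier emits.
REGISTRY = {
    "calculation": Tool("calculator", calculator),
    "lookup": Tool("web_search", web_search),
    "code": Tool("code_interpreter", code_interpreter),
    "complex": Tool("frontier_model", frontier_model),
}

def classify(query: str) -> str:
    """Keyword heuristic standing in for the orchestrator's learned decision."""
    q = query.lower()
    if ARITH.search(q):
        return "calculation"
    if any(word in q for word in ("python", "script", "function", "code")):
        return "code"
    if q.startswith(("who ", "when ", "where ")):
        return "lookup"
    return "complex"

def route(query: str) -> tuple[str, str]:
    """Run the chosen tool; retry with the frontier model if validation fails."""
    tool = REGISTRY[classify(query)]
    answer = tool.run(query)
    if not answer.strip():  # minimal stand-in for response validation
        tool = REGISTRY["complex"]
        answer = tool.run(query)
    return tool.name, answer

print(route("What is 12 * 7?"))
print(route("Who founded NVIDIA?"))
```

Note that the frontier model sits in the registry like any other tool, so escalation is just one more routing outcome rather than a special code path.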
NVIDIA's ToolOrchestra framework proves the architecture works.
| Metric | Orchestrator-8B | GPT-5 (Self-Orchestrating) |
|---|---|---|
| HLE Accuracy | ↑ 37.1% | 35.1% |
| GAIA Benchmark | ↑ #1 Ranked | #3 |
| τ²-Bench | ↑ Outperforms | Baseline |
| Relative Cost | 30% | 100% |
| Inference Speed | 2.5× faster | Baseline |
1. Separation of concerns: The orchestrator is trained ONLY to route, not to solve. No conflict of interest.
2. Multi-signal training: Optimised for three rewards simultaneously: outcome quality, efficiency, and preference alignment.
3. Tool-agnostic design: Treats both external tools AND other LLMs as callable components. GPT-5 becomes just another tool in the registry.
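To make the multi-signal idea concrete, here is a generic weighted-sum reward over the three signals named above. It is an illustration of the concept only, not the reward formulation from the ToolOrchestra paper: the `Trajectory` fields, weights, and budgets are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    correct: bool          # did the final answer solve the task?
    cost_usd: float        # total spend across tool and model calls
    latency_s: float       # wall-clock time for the whole trajectory
    preferred_calls: int   # calls to tools the user said they prefer
    total_calls: int

def reward(t, w_outcome=1.0, w_cost=0.3, w_latency=0.1, w_pref=0.2,
           cost_budget=0.05, latency_budget=10.0):
    """Scalar reward combining outcome, efficiency and preference signals.

    All weights and budgets are hypothetical values chosen for illustration.
    """
    outcome = 1.0 if t.correct else 0.0
    # Efficiency: penalise cost and latency relative to a budget, capped at 1.
    cost_penalty = min(t.cost_usd / cost_budget, 1.0)
    latency_penalty = min(t.latency_s / latency_budget, 1.0)
    # Preference: fraction of calls that went to user-preferred tools.
    preference = t.preferred_calls / t.total_calls if t.total_calls else 0.0
    return (w_outcome * outcome
            - w_cost * cost_penalty
            - w_latency * latency_penalty
            + w_pref * preference)

cheap = Trajectory(correct=True, cost_usd=0.005, latency_s=1.0, preferred_calls=2, total_calls=2)
costly = Trajectory(correct=True, cost_usd=0.05, latency_s=5.0, preferred_calls=0, total_calls=1)
print(reward(cheap) > reward(costly))
```

Under a reward shaped like this, two trajectories that both reach the right answer are not equal: the one that got there with cheaper, faster, preferred tools scores higher, which is the behaviour a routing-only model is meant to learn.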
An 8B model that knows when to call GPT-5 outperforms GPT-5 trying to handle everything alone.
NVIDIA Research
ToolOrchestra Paper, 2025
European Logistics Company
50,000 invoices processed daily
Before (Single-Model Approach): Every invoice routed to GPT-5 API. Cost per invoice: $0.03. Daily cost: $1,500. Monthly: $45,000. All invoice data sent to external servers.
After (Orchestrated Approach): 75% of standard invoices processed by fine-tuned 7B model on-premise ($0.002 each). 15% complex layouts routed to Claude ($0.025). 8% flagged for human review. 2% edge cases escalated to GPT-5.
| Cost Reduction | Annual Savings | Data Stays Local | Accuracy Maintained |
|---|---|---|---|
| 70% | $378K | 75% | 99.2% |
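The case-study figures can be reproduced with a few lines of arithmetic. This sketch rests on stated assumptions: a 30-day month (implied by $1,500/day and $45,000/month), the same $0.03 rate for the GPT-5 escalations, and a zero model cost for human review, whose actual cost the case study does not give.

```python
INVOICES_PER_DAY = 50_000
DAYS_PER_MONTH = 30    # assumption implied by $1,500/day -> $45,000/month
GPT5_COST = 0.03       # per invoice, from the "before" figures

# (share of invoices, model cost per invoice) from the "after" figures.
AFTER_MIX = [
    (0.75, 0.002),   # fine-tuned 7B model on-premise
    (0.15, 0.025),   # complex layouts routed to Claude
    (0.08, 0.0),     # flagged for human review (cost not stated; assumed 0 here)
    (0.02, 0.03),    # edge cases escalated to GPT-5 (assumed $0.03 rate)
]

before_monthly = INVOICES_PER_DAY * GPT5_COST * DAYS_PER_MONTH
after_per_invoice = sum(share * cost for share, cost in AFTER_MIX)
after_monthly = INVOICES_PER_DAY * after_per_invoice * DAYS_PER_MONTH
model_only_saving = 1 - after_monthly / before_monthly

# Annual savings implied by the reported 70% reduction.
reported_annual_savings = 0.70 * before_monthly * 12

print(f"before:  ${before_monthly:,.0f}/month")
print(f"after:   ${after_monthly:,.0f}/month (model fees only)")
print(f"model-only saving: {model_only_saving:.1%}")
print(f"annual savings at 70%: ${reported_annual_savings:,.0f}")
```

On model fees alone the reduction works out to about 80%, higher than the reported 70%. The difference presumably reflects costs the breakdown does not itemise, such as human review and on-premise infrastructure. The $378K annual figure matches 70% of the $540,000 yearly baseline.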
| Failure Mode | Why It Happens | Mitigation |
|---|---|---|
| Ambiguous queries | Unclear which tool/domain applies | Add clarification step, confidence thresholds |
| Deep context tasks | Multi-turn conversations lose state | Maintain shared context layer |
| Real-time latency | Routing adds ~50-100ms overhead | Direct routing for latency-critical paths |
| Novel task types | Orchestrator hasn't seen this pattern | Fallback to frontier model, retrain |
| Cross-domain reasoning | Task spans multiple specialties | Chain multiple tools, or escalate |
When to Skip Orchestration: Prototyping (simplicity matters more than cost), highly novel exploratory tasks (routing patterns unknown), sub-100ms latency requirements (routing overhead hurts), very low volume (<1,000 queries/month).
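Those four conditions translate naturally into a pre-flight check. The function below is a hypothetical helper, with thresholds taken from the paragraph above; parameter names are illustrative.

```python
def reasons_to_skip_orchestration(*, prototyping: bool,
                                  routing_patterns_known: bool,
                                  latency_budget_ms: float,
                                  queries_per_month: int) -> list[str]:
    """Return the reasons to skip orchestration; an empty list means none apply."""
    reasons = []
    if prototyping:
        reasons.append("prototyping: simplicity matters more than cost")
    if not routing_patterns_known:
        reasons.append("exploratory tasks: routing patterns unknown")
    if latency_budget_ms < 100:
        reasons.append("sub-100ms latency: routing overhead hurts")
    if queries_per_month < 1_000:
        reasons.append("very low volume: under 1,000 queries/month")
    return reasons

# A production workload with a relaxed latency budget: no reason to skip.
print(reasons_to_skip_orchestration(
    prototyping=False, routing_patterns_known=True,
    latency_budget_ms=800, queries_per_month=250_000))
```

Returning the reasons rather than a bare boolean keeps the decision explainable when it is reviewed later.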
Framework for chaining LLM calls and tools. Highly flexible, widely adopted.
✗ The LLM still decides routing - same self-selection bias
Single model with autonomous tool use. Impressive demos, challenging in production.
✗ One model does everything - no heterogeneous efficiency
Purpose-trained routing model + heterogeneous pool. Optimised for cost and accuracy.
✓ Routing layer is separate, trained for efficiency
| Approach | Monthly Cost | Annual Cost | Latency | Sovereignty |
|---|---|---|---|---|
| GPT-5 for everything | $300,000 | $3.6M | 2.5s | ✗ None |
| Claude Opus for everything | $360,000 | $4.3M | 3.1s | ✗ None |
| Orchestrated (API-based) | $90,000 | $1.08M | 1.2s | △ Partial |
| Orchestrated (self-hosted) | $50,000* | $600K | 0.9s | ✓ Full |
*Infrastructure costs only - no per-token API fees for majority of queries
Every component of the orchestration blueprint maps to a Katonic product. Deploy Nemotron-Orchestrator-8B, connect 50+ tools, manage your model pool, and monitor everything - all on your infrastructure.
Deploy orchestrator via NVIDIA NIM, vLLM serving
50+ pre-built tool connectors, enterprise systems
Multi-turn context management across tool calls
Cost attribution, routing analytics, latency tracking
| Situation | Recommendation |
|---|---|
| Economics demand orchestration at this scale | ✓ Orchestrate |
| Banking, healthcare - data can't leave | ✓ Self-hosted orchestration |
| Simplicity and speed to learn matter most | → Single frontier model |
| Need to optimise spend at scale | ✓ Aggressive orchestration |
| Routing overhead too high | → Direct model call |
| Cost isn't the constraint | → Frontier model |

The 80/20 Question: "Can 80% of your queries be handled by a model 10× cheaper than your current default?" If yes → orchestration will transform your economics.
1. Sample 1,000 production queries. Classify by actual complexity: what % are simple lookups? Calculations? Genuinely need frontier reasoning? Most enterprises find 60-80% are over-served.
2. Pick your highest-volume, most routine AI workflow: document processing, query routing, or data extraction. Implement orchestration there first. Measure cost, accuracy, latency.
3. Use pilot metrics to build the business case. Expand orchestration to additional workflows. Track ROI monthly. Iterate on routing logic.
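For the audit step, the tally itself is simple once the sample has been labelled. The sketch below assumes labelling has already happened, by hand or with a separate classifier. The category names are hypothetical and mirror the query types in the cost table earlier in this guide.

```python
from collections import Counter

# Categories that, per the earlier cost table, do not need a frontier model.
CHEAPER_TOOL_CATEGORIES = {"simple_lookup", "calculation", "code_execution"}

def audit(labels):
    """Summarise a labelled sample: share per category and over-served share."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty sample")
    shares = {category: n / total for category, n in counts.items()}
    over_served = sum(
        share for category, share in shares.items()
        if category in CHEAPER_TOOL_CATEGORIES
    )
    return shares, over_served

# Synthetic 1,000-query sample matching the traffic mix in the cost table.
sample = (["simple_lookup"] * 350 + ["calculation"] * 150
          + ["code_execution"] * 200 + ["complex_reasoning"] * 300)
shares, over_served = audit(sample)
print(f"over-served share: {over_served:.0%}")
```

Running the same tally on real production logs gives the number to compare against the 60-80% range quoted in the audit step.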

Katonic AI provides enterprise-grade AI platforms that enable organisations to deploy, manage, and scale AI agents on their own infrastructure. Our orchestration-first approach helps enterprises achieve 70% cost savings while maintaining full data sovereignty.
See how Katonic's orchestration stack can reduce your AI costs by 70% while keeping your data sovereign.
