ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

1 NVIDIA, 2 The University of Hong Kong
* Equal Contribution
Final Results
  • 37.1% HLE score (vs. GPT-5: 35.1%)
  • 2.5× more efficient than GPT-5
  • ~30% of GPT-5's cost on FRAMES

Abstract

Large language models are powerful generalists, yet solving deep and complex problems such as those of Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools are able to both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate the use of intelligent tools. ToolOrchestra makes explicit use of reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5× more efficient. On τ²-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to previously unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.

Overview

We introduce ToolOrchestra, a method for training small orchestrators that coordinate the use of intelligent tools. By drawing on both basic tools and specialized models, the resulting Orchestrator surpasses GPT-5 while being far more efficient. Given a task, Orchestrator alternates between reasoning and tool calling over multiple turns to solve it. It interacts with a diverse tool set, including basic tools (e.g., web search, code interpreter), specialized LLMs (e.g., coding models, math models), and generalist LLMs (e.g., GPT-5, Llama-Nemotron-Ultra-253B, Claude Opus 4.1). During training, Orchestrator is jointly optimized with outcome, efficiency, and preference rewards via end-to-end reinforcement learning. To support RL training, we develop an automatic pipeline that synthesizes both environments and tool-call tasks at scale.
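The loop below is a minimal sketch of this reason-and-act cycle. The tool names, the call interface, and the orchestrator's `generate` method and step fields are illustrative assumptions, not the actual API from the paper.

```python
# Minimal sketch of the orchestration loop (hypothetical interfaces).
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str        # e.g. "web_search", "code_interpreter", or an LLM such as "gpt-5"
    arguments: dict  # tool-specific arguments produced by the orchestrator

def orchestrate(task: str, tools: dict, orchestrator, max_turns: int = 20):
    """Alternate between reasoning and tool calling until the task is solved."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        # The small orchestrator reasons over the history and either answers
        # directly or emits a structured tool call (hypothetical `generate` API).
        step = orchestrator.generate(history)
        if step.final_answer is not None:
            return step.final_answer
        call: ToolCall = step.tool_call
        result = tools[call.name](**call.arguments)  # basic tool, specialist, or generalist LLM
        history.append({"role": "assistant", "content": step.text})
        history.append({"role": "tool", "name": call.name, "content": str(result)})
    return None  # turn budget exhausted without a final answer
```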

Main Results

Comparison of Orchestrator-8B with baselines (prompt-based LLMs). Llama-Nemotron-49B denotes Llama-3.3-Nemotron-Super-49B-v1. Cost (US cents) and latency (minutes) are averaged over HLE and FRAMES. Basic tools include domain functions, search, and a code interpreter.

On Humanity's Last Exam, Orchestrator-8B achieves 37.1%, surpassing GPT-5 (35.1%) at only about 30% of the monetary cost while being 2.5× faster. On FRAMES and τ²-Bench, Orchestrator-8B consistently outperforms strong monolithic systems, demonstrating versatile reasoning and robust tool orchestration.

Analysis

Further, Orchestrator-8B invokes a diverse set of tools and does not exhibit strong biases toward any particular one.

The proportion of tool calls routed to each tool when solving a task (averaged across HLE, FRAMES, and τ²-Bench). Qwen-32B refers to Qwen3-32B and Coder-32B refers to Qwen2.5-Coder-32B-Instruct. Compared to other strong foundation models, Orchestrator-8B distributes its tool calls more evenly and does not exhibit strong biases toward a particular tool or model.
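A proportion like this can be computed directly from interaction logs. The sketch below assumes a hypothetical trajectory object exposing its tool calls; it is illustrative rather than the evaluation code used in the paper.

```python
# Illustrative sketch: measuring how evenly tool calls are distributed.
# The `trajectories` objects and their `.tool_calls` field are assumptions.
from collections import Counter

def tool_call_proportions(trajectories):
    """Return the fraction of all tool calls routed to each tool."""
    counts = Counter(call.name for traj in trajectories for call in traj.tool_calls)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}
```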

Key Insights

  • Instead of relying on a single powerful model, intelligence can be elevated by orchestrating multiple specialized models and tools.
  • Orchestrator-8B makes more balanced tool calls and does not exhibit strong biases toward a particular tool or model.
  • Orchestrator-8B generalizes well: when evaluated with tools that were unseen during training, it maintains strong performance.
  • Orchestrator-8B adheres markedly better to user preferences than the baselines, demonstrating the effectiveness of RL training with our preference reward design (a schematic sketch of the combined reward follows this list).
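The sketch below illustrates how outcome, efficiency, and preference signals could be combined into a single scalar reward. The specific reward forms, normalizers, and weights are assumptions for illustration; the paper's exact reward design may differ.

```python
# Illustrative sketch of a combined RL reward (forms and weights are assumptions).
def combined_reward(correct: bool, cost_cents: float, latency_min: float,
                    used_tools: set, preferred_tools: set,
                    w_outcome: float = 1.0, w_eff: float = 0.2, w_pref: float = 0.2) -> float:
    # Outcome reward: 1 if the final answer is correct, 0 otherwise.
    outcome = 1.0 if correct else 0.0
    # Efficiency reward: penalize monetary cost and latency
    # (the normalizing constants here are assumed).
    efficiency = -(cost_cents / 100.0 + latency_min / 10.0)
    # Preference reward: fraction of invoked tools matching the user's preferences.
    preference = (len(used_tools & preferred_tools) / len(used_tools)) if used_tools else 1.0
    return w_outcome * outcome + w_eff * efficiency + w_pref * preference
```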

In addition, Orchestrator-8B makes better use of compute at test time. It achieves better performance at the same cost, and reaches similar performance at lower cost.
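As a rough illustration of how such a performance-versus-cost comparison can be made, the sketch below checks Pareto dominance between systems. The accuracy values are the HLE scores reported above, while the cost values are placeholder relative units based on the ~30% cost figure, not measured dollar amounts.

```python
# Illustrative sketch of the performance-versus-cost comparison behind the plot below.
def pareto_front(points):
    """Keep the systems not dominated by another with >= accuracy and <= cost."""
    front = []
    for name, acc, cost in points:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for n, a, c in points if n != name
        )
        if not dominated:
            front.append(name)
    return front

systems = [("Orchestrator-8B", 0.371, 30.0), ("GPT-5", 0.351, 100.0)]
print(pareto_front(systems))  # -> ['Orchestrator-8B']
```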

The relationship between performance and cost. Compared to strong monolithic LLM systems, Orchestrator (ours) achieves the best cost-effectiveness.

Even with tools unseen during training, Orchestrator-8B consistently delivers significant improvements in both performance and cost.

Generalization performance of Orchestrator-8B on HLE, FRAMES, and τ²-Bench.

Last but not least, we compute preference rewards for Orchestrator-8B and compare them with those of other models. The results show that Orchestrator-8B is superior at following user instructions to use their preferred tools.

Preference performance comparison. The results show that Orchestrator-8B adapts best to user preferences at test time.

Limitations

  • Scalability: We have not yet attempted to scale Orchestrator to larger model sizes (e.g., >8B), so it remains unclear whether its performance and efficiency advantages will persist at that scale.
  • Coverage: We have not evaluated Orchestrator across broader domains (e.g., code generation, web interaction), leaving open the question of how well it generalizes beyond the reasoning tasks studied here.

BibTeX

@misc{toolorchestra,
      title={ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration}, 
      author={Hongjin Su and Shizhe Diao and Ximing Lu and Mingjie Liu and Jiacheng Xu and Xin Dong and Yonggan Fu and Peter Belcak and Hanrong Ye and Hongxu Yin and Yi Dong and Evelina Bakhturina and Tao Yu and Yejin Choi and Jan Kautz and Pavlo Molchanov},
      year={2025},
      eprint={2511.21689},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.21689}, 
}