
Tool-Using Reasoning Models
In April 2025, OpenAI released o3 and o4-mini—reasoning models that can use tools during their thinking process. Unlike previous reasoning models (o1, o3-mini) that could only think internally, o3 and o4-mini can browse the web, execute code, analyze files, and generate images as part of their chain-of-thought reasoning.
This is a fundamental architectural shift. Previous models had to complete their reasoning before using tools. o3 interleaves thinking and tool use, enabling complex multi-step workflows that were previously impossible in a single model call.
How Tool-Integrated Reasoning Works
Traditional reasoning models follow a think-then-act pattern. o3 introduces think-act-think cycles:
```
Traditional (o1):
Think → Think → Think → Answer → [Tools available after]

o3 Architecture:
Think → Use Tool → Think about results → Use another tool →
Think more → Generate code → Analyze output → Answer
```
Practical example:
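The think-act-think cycle can be sketched as a simple loop: the model either requests a tool call (whose result is fed back into its context) or emits a final answer. This is an illustrative stub, not the real API; `model_step` and `run_tool` are stand-ins for the model and tool executor:

```python
def model_step(history):
    """Stand-in for the reasoning model: request a tool or answer."""
    if not any(step.startswith("TOOL_RESULT") for step in history):
        return {"action": "tool", "name": "search", "query": "GDP data"}
    return {"action": "answer", "text": "Final answer based on tool results"}

def run_tool(name, query):
    """Stand-in tool executor."""
    return f"results for {query!r}"

def reasoning_loop(question, max_steps=8):
    history = [f"QUESTION: {question}"]
    for _ in range(max_steps):
        step = model_step(history)
        if step["action"] == "tool":
            # Tool output is appended to the context, so subsequent
            # "thinking" steps can condition on it.
            history.append(f"TOOL_RESULT: {run_tool(step['name'], step['query'])}")
        else:
            history.append(f"ANSWER: {step['text']}")
            return history
    return history

trace = reasoning_loop("Which country has higher GDP growth?")
```

The key design point is that tool results re-enter the reasoning context mid-flight, rather than tools only being available after reasoning completes.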
```
User: "Which country has higher GDP growth - India or Vietnam?
       Show me a chart of the last 5 years."

o3's reasoning chain:
1. [THINK] I need current GDP data for both countries
2. [BROWSE] Search for India GDP growth 2020-2024
3. [THINK] Found India data: 7.2%, -6.6%, 8.7%, 7.2%, 6.8%
4. [BROWSE] Search for Vietnam GDP growth 2020-2024
5. [THINK] Found Vietnam data: 2.9%, 2.6%, 8.0%, 5.1%, 6.5%
6. [CODE] Generate matplotlib chart comparing both
7. [THINK] Vietnam has the higher average; India is more volatile
8. [ANSWER] Comprehensive analysis with chart
```
Benchmark Performance
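The comparison in the chain above can be checked directly. This snippet uses the growth figures quoted in the example (illustrative numbers from the chain, not verified statistics):

```python
from statistics import mean, stdev

# Growth figures quoted in the reasoning chain above (2020-2024, %)
india = [7.2, -6.6, 8.7, 7.2, 6.8]
vietnam = [2.9, 2.6, 8.0, 5.1, 6.5]

print(f"India:   mean {mean(india):.2f}%, stdev {stdev(india):.2f}")
print(f"Vietnam: mean {mean(vietnam):.2f}%, stdev {stdev(vietnam):.2f}")
# On these figures Vietnam averages slightly higher,
# while India's year-to-year swings are much larger.
```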
o3 represents a significant leap in reasoning benchmarks:
| Benchmark | o3 | o1 | GPT-4o | Claude 3.5 |
|---|---|---|---|---|
| AIME 2025 | 88.9% | 79.2% | 26.7% | 32.1% |
| GPQA Diamond | 87.7% | 78.0% | 53.6% | 65.0% |
| SWE-bench Verified | 69.1% | 48.9% | 33.2% | 49.0% |
| Codeforces (Elo) | 2727 | 1891 | 900 | 1200 |
| ARC-AGI (semi-private) | 87.5% | 32.0% | 5.0% | 21.0% |
The SWE-bench Verified score of 69.1% is remarkable—o3 can autonomously resolve nearly 7 out of 10 real GitHub issues, including understanding codebases, writing fixes, and running tests.
o4-mini: Efficient Reasoning
o4-mini offers a compelling efficiency trade-off:
| Aspect | o3 | o4-mini | GPT-4o |
|---|---|---|---|
| Speed | Slow | Fast | Fast |
| Cost | $$$ | $ | $$ |
| AIME 2025 | 88.9% | 92.7% | 26.7% |
| MATH-500 | 98.6% | 98.0% | 94.3% |
| Coding | Excellent | Excellent | Good |
| Tool Use | Yes | Yes | Yes |
Surprisingly, o4-mini outperforms o3 on AIME 2025 (92.7% vs 88.9%), suggesting that for math and competition problems, the smaller model's focused reasoning is more effective.
Codex Integration
Alongside these models, OpenAI launched Codex, a cloud-based software engineering agent powered by a fine-tuned version of o3:
- Autonomous coding: Assign tasks and Codex works independently
- Git-integrated: Creates branches, writes code, runs tests, submits PRs
- Sandboxed execution: Each task runs in an isolated environment
- Multi-file understanding: Navigates entire repositories
- Test-driven: Runs existing test suites to verify changes
```
# Codex task example
# User assigns: "Add pagination to the /api/users endpoint"
#
# Codex autonomously:
# 1. Reads existing API code and tests
# 2. Implements cursor-based pagination
# 3. Adds query parameters (limit, cursor)
# 4. Updates tests
# 5. Runs test suite
# 6. Creates PR with description
```
Pricing and Access
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Reasoning Tokens |
|---|---|---|---|
| o3 | $10.00 | $40.00 | Included in output |
| o4-mini | $1.10 | $4.40 | Included in output |
| GPT-4o | $2.50 | $10.00 | N/A |
o4-mini at $1.10/$4.40 per million tokens offers frontier-level reasoning at a fraction of o3's cost, making it the practical choice for most applications.
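As a rough illustration of the table above, a small helper can estimate per-request cost. Rates are hard-coded from the table; since reasoning tokens are billed as output, the output count should include them and can dwarf the visible answer:

```python
# Per-1M-token rates from the pricing table above: (input $, output $)
RATES = {
    "o3": (10.00, 40.00),
    "o4-mini": (1.10, 4.40),
    "gpt-4o": (2.50, 10.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Estimate USD cost; output_tokens must include reasoning tokens."""
    in_rate, out_rate = RATES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A 5k-token prompt producing 20k output + reasoning tokens:
print(f"o3:      ${request_cost('o3', 5_000, 20_000):.4f}")
print(f"o4-mini: ${request_cost('o4-mini', 5_000, 20_000):.4f}")
```

On this workload o3 costs roughly nine times as much as o4-mini, which is the arithmetic behind treating o4-mini as the default choice.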
Impact on AI Development
The o3/o4-mini release signals several trends:
- Tool use is becoming native: Future models will think and act simultaneously
- Reasoning is the differentiator: Raw knowledge matters less than the ability to reason through problems
- Agentic AI is here: Models that can autonomously navigate codebases, browse the web, and execute multi-step workflows
- Efficiency gains: o4-mini shows smaller models can match or exceed larger ones on reasoning tasks
- Software engineering transformation: a 69.1% SWE-bench Verified score suggests AI can already handle a large share of routine software engineering tasks
Developer Integration
Using o3's tool capabilities via the API:
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="o3",
    input="Analyze the latest Python 3.13 release notes and create a summary",
    tools=[
        {"type": "web_search"},
        {"type": "code_interpreter"},
        {"type": "file_search"}
    ]
)
```
The unified tool interface means developers don't need to orchestrate tool calls manually—the model decides when and how to use tools as part of its reasoning process.
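To see which tools the model actually invoked, the response's output items can be inspected after the call. This sketch assumes each output item carries a `type` field with tool calls labeled by a `_call` suffix (e.g. `web_search_call`); the sample data is mocked rather than a real API response:

```python
from collections import Counter
from types import SimpleNamespace

def summarize_tool_calls(output_items):
    """Count tool-call items (types ending in '_call') in a response's output."""
    return Counter(
        item.type for item in output_items if item.type.endswith("_call")
    )

# Mocked items standing in for `response.output` (assumed shape, not live data):
mock_output = [
    SimpleNamespace(type="reasoning"),
    SimpleNamespace(type="web_search_call"),
    SimpleNamespace(type="web_search_call"),
    SimpleNamespace(type="code_interpreter_call"),
    SimpleNamespace(type="message"),
]
print(summarize_tool_calls(mock_output))
```

Logging a summary like this is useful for cost tracking, since each browse or code-execution step adds billable reasoning and output tokens.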
Sources: OpenAI o3/o4-mini, OpenAI Codex, OpenAI API Docs
Conclusion
The o3 and o4-mini release marks a pivotal moment in AI development: the transition from models that simply generate text to models that actively interact with the world while reasoning. Tool-integrated reasoning is the foundation for truly autonomous AI agents—systems that can research, code, analyze, and create without constant human guidance.
For developers, the practical implication is clear: build applications that leverage tool use, not just text generation. The most valuable AI applications in 2025 and beyond will be those that combine reasoning depth with real-world action capability.
Sources: OpenAI o3/o4-mini, OpenAI API, OpenAI Research


