OpenAI o3 and o4-mini: Tool-Using Reasoning Models


In April 2025, OpenAI released o3 and o4-mini—reasoning models that can use tools during their thinking process. Unlike previous reasoning models (o1, o3-mini) that could only think internally, o3 and o4-mini can browse the web, execute code, analyze files, and generate images as part of their chain-of-thought reasoning.

This is a fundamental architectural shift. Previous models had to complete their reasoning before using tools. o3 interleaves thinking and tool use, enabling complex multi-step workflows that were previously impossible in a single model call.

How Tool-Integrated Reasoning Works

Traditional reasoning models follow a think-then-act pattern. o3 introduces think-act-think cycles:

```text
Traditional (o1):
Think → Think → Think → Answer → [Tools available after]

o3 Architecture:
Think → Use Tool → Think about results → Use another tool →
Think more → Generate code → Analyze output → Answer
```
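
The cycle above can be sketched as a small dispatcher loop. This is a minimal illustration only: the scripted step list, the stub tools, and all function names here are hypothetical stand-ins, not OpenAI's implementation.

```python
# A think-act-think loop in miniature. The scripted steps stand in for the
# model's reasoning; web_search and run_code are hypothetical stub tools.

def web_search(query: str) -> str:
    """Stub tool: pretend to fetch web results."""
    return f"search results for: {query}"

def run_code(source: str) -> str:
    """Stub tool: pretend to execute code and capture its output."""
    return f"output of: {source}"

TOOLS = {"web_search": web_search, "run_code": run_code}

def reasoning_loop(steps):
    """Interleave thoughts with tool calls until an answer step is reached."""
    trace = []
    for step in steps:
        if step["type"] == "think":
            trace.append(("think", step["content"]))
        elif step["type"] == "tool":
            # Tool output lands in the trace so later "thoughts" can
            # condition on it - the key difference from think-then-act.
            result = TOOLS[step["name"]](step["arg"])
            trace.append(("tool", result))
        elif step["type"] == "answer":
            trace.append(("answer", step["content"]))
            break
    return trace

trace = reasoning_loop([
    {"type": "think", "content": "Need GDP data"},
    {"type": "tool", "name": "web_search", "arg": "India GDP growth"},
    {"type": "think", "content": "Now compare the figures"},
    {"type": "answer", "content": "Comparison complete"},
])
```

The point of the loop structure is that tool results become inputs to subsequent reasoning steps, rather than being appended after reasoning is finished.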

Practical example:

```text
User: "Which country has higher GDP growth - India or Vietnam?
       Show me a chart of the last 5 years."

o3's reasoning chain:
1. [THINK] I need current GDP data for both countries
2. [BROWSE] Search for India GDP growth 2020-2024
3. [THINK] Found India data: 7.2%, -6.6%, 8.7%, 7.2%, 6.8%
4. [BROWSE] Search for Vietnam GDP growth 2020-2024
5. [THINK] Found Vietnam data: 2.9%, 2.6%, 8.0%, 5.1%, 6.5%
6. [CODE] Generate matplotlib chart comparing both
7. [THINK] Vietnam has the slightly higher average; India is far more volatile
8. [ANSWER] Comprehensive analysis with chart
```
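
Step 7's comparison can be checked with the standard-library statistics module, using the growth figures from the example chain:

```python
import statistics

# GDP growth figures from the example reasoning chain (percent, 2020-2024)
india = [7.2, -6.6, 8.7, 7.2, 6.8]
vietnam = [2.9, 2.6, 8.0, 5.1, 6.5]

india_mean = statistics.mean(india)      # 4.66
vietnam_mean = statistics.mean(vietnam)  # 5.02
india_sd = statistics.stdev(india)       # ~6.3 - much more volatile
vietnam_sd = statistics.stdev(vietnam)   # ~2.3

print(f"India:   mean {india_mean:.2f}%, stdev {india_sd:.2f}")
print(f"Vietnam: mean {vietnam_mean:.2f}%, stdev {vietnam_sd:.2f}")
```

Vietnam's average edges out India's (5.02% vs 4.66%), but India's 2020 contraction makes its series nearly three times as volatile.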

Benchmark Performance

o3 represents a significant leap in reasoning benchmarks:

Benchmark            o3       o1       GPT-4o   Claude 3.5
AIME 2025            88.9%    79.2%    26.7%    32.1%
GPQA Diamond         87.7%    78.0%    53.6%    65.0%
SWE-bench Verified   69.1%    48.9%    33.2%    49.0%
Codeforces (Elo)     2727     1891     900      1200
ARC-AGI (semi)       87.5%    32.0%    5.0%     21.0%

The SWE-bench Verified score of 69.1% is remarkable—o3 can autonomously resolve nearly 7 out of 10 real GitHub issues, including understanding codebases, writing fixes, and running tests.

o4-mini: Efficient Reasoning

o4-mini offers a compelling efficiency trade-off:

Aspect       o3          o4-mini     GPT-4o
Speed        Slow        Fast        Fast
Cost         $$$         $           $$
AIME 2025    88.9%       92.7%       26.7%
MATH-500     98.6%       98.0%       94.3%
Coding       Excellent   Excellent   Good
Tool Use     Yes         Yes         Yes

Surprisingly, o4-mini outperforms o3 on AIME 2025 (92.7% vs 88.9%), suggesting that for math and competition problems, the smaller model's focused reasoning is more effective.

Codex Integration

OpenAI simultaneously launched Codex—a cloud-based software engineering agent powered by o3:

  • Autonomous coding: Assign tasks and Codex works independently
  • Git-integrated: Creates branches, writes code, runs tests, submits PRs
  • Sandboxed execution: Each task runs in an isolated environment
  • Multi-file understanding: Navigates entire repositories
  • Test-driven: Runs existing test suites to verify changes

```python
# Codex task example
# User assigns: "Add pagination to the /api/users endpoint"
#
# Codex autonomously:
# 1. Reads existing API code and tests
# 2. Implements cursor-based pagination
# 3. Adds query parameters (limit, cursor)
# 4. Updates tests
# 5. Runs test suite
# 6. Creates PR with description
```

Pricing and Access

Model     Input (per 1M tokens)   Output (per 1M tokens)   Reasoning Tokens
o3        $10.00                  $40.00                   Included in output
o4-mini   $1.10                   $4.40                    Included in output
GPT-4o    $2.50                   $10.00                   N/A

o4-mini at $1.10/$4.40 per million tokens offers frontier-level reasoning at a fraction of o3's cost, making it the practical choice for most applications.
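
For a sense of scale, here is the cost arithmetic for a hypothetical call with 10,000 input tokens and 5,000 output tokens, using the table's prices:

```python
# Price per 1M tokens (input, output), from the pricing table above
PRICES = {
    "o3": (10.00, 40.00),
    "o4-mini": (1.10, 4.40),
    "gpt-4o": (2.50, 10.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call; reasoning tokens bill as output tokens."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

for model in PRICES:
    print(f"{model}: ${call_cost(model, 10_000, 5_000):.4f}")
# o3 comes to $0.30 per call, o4-mini to $0.033 - roughly 9x cheaper
```

Note that because reasoning tokens bill as output, real costs for hard problems can be several times higher than the visible answer length suggests.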

Impact on AI Development

The o3/o4-mini release signals several trends:

  1. Tool use is becoming native: Future models will think and act simultaneously
  2. Reasoning is the differentiator: Raw knowledge matters less than the ability to reason through problems
  3. Agentic AI is here: Models that can autonomously navigate codebases, browse the web, and execute multi-step workflows
  4. Efficiency gains: o4-mini shows smaller models can match or exceed larger ones on reasoning tasks
  5. Software engineering transformation: a 69% SWE-bench Verified score suggests AI can already handle a large share of routine software tasks

Developer Integration

Using o3's tool capabilities via the API:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="o3",
    input="Analyze the latest Python 3.13 release notes and create a summary",
    tools=[
        {"type": "web_search"},
        {"type": "code_interpreter"},
        {"type": "file_search"}
    ]
)
```

The unified tool interface means developers don't need to orchestrate tool calls manually—the model decides when and how to use tools as part of its reasoning process.

Sources: OpenAI o3/o4-mini, OpenAI Codex, OpenAI API Docs

Conclusion

The o3 and o4-mini release marks a pivotal moment in AI development: the transition from models that simply generate text to models that actively interact with the world while reasoning. Tool-integrated reasoning is the foundation for truly autonomous AI agents—systems that can research, code, analyze, and create without constant human guidance.

For developers, the practical implication is clear: build applications that leverage tool use, not just text generation. The most valuable AI applications in 2025 and beyond will be those that combine reasoning depth with real-world action capability.

Sources: OpenAI o3/o4-mini, OpenAI API, OpenAI Research