OpenAI o3 and o4-mini: Tool-Using Reasoning Models

In April 2025, OpenAI released o3 and o4-mini—reasoning models that can use tools during their thinking process. Unlike previous reasoning models (o1, o3-mini) that could only think internally, o3 and o4-mini can browse the web, execute code, analyze files, and generate images as part of their chain-of-thought reasoning.

This changes how reasoning and tool use fit together. Earlier reasoning models had to finish reasoning before any tool was used, so multi-step workflows needed orchestration across several calls. o3 interleaves thinking and tool use, so the same workflows can run within a single model call.

How Tool-Integrated Reasoning Works

Traditional reasoning models follow a think-then-act pattern. o3 introduces think-act-think cycles:

Traditional (o1):
Think → Think → Think → Answer → [Tools available after]

o3 Architecture:
Think → Use Tool → Think about results → Use another tool → 
Think more → Generate code → Analyze output → Answer
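
The think-act-think cycle above can be sketched as a small loop. This is a minimal illustration, not how o3 is implemented: `scripted_model`, `browse`, `run_code`, and `run` are hypothetical names, and the "model" is a fixed script rather than a real reasoning model. The point is only the control flow, in which each tool result is fed back before the next thought.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for real tools; each just returns a canned string.
def browse(query: str) -> str:
    return f"search results for {query!r}"

def run_code(source: str) -> str:
    return f"ran {len(source)} characters of code"

TOOLS: Dict[str, Callable[[str], str]] = {"browse": browse, "code": run_code}

Step = Tuple[str, str]  # (kind, text), e.g. ("think", "...")

def scripted_model(history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for the model: returns (thought, action, argument).

    A real model would choose based on the content of `history`; this
    script only looks at how many tool results have come back so far.
    """
    results_seen = sum(1 for kind, _ in history if kind == "observe")
    if results_seen == 0:
        return ("I need data first", "browse", "GDP growth India Vietnam")
    if results_seen == 1:
        return ("I have data, now chart it", "code", "plot(data)")
    return ("I have everything I need", "answer", "analysis with chart")

def run(model, tools, max_steps: int = 10) -> List[Step]:
    """Interleave thinking and tool use until the model answers."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, argument = model(history)
        history.append(("think", thought))
        if action == "answer":
            history.append(("answer", argument))
            break
        history.append(("act", f"{action}: {argument}"))
        # The tool result is fed back so the next thought can use it.
        history.append(("observe", tools[action](argument)))
    return history

history = run(scripted_model, TOOLS)
for kind, text in history:
    print(f"[{kind.upper()}] {text}")
```

The `max_steps` cap is a design choice for the sketch: it keeps a misbehaving policy from looping forever.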

Practical example:

User: "Which country has higher GDP growth - India or Vietnam? 
       Show me a chart of the last 5 years."

o3's reasoning chain:
1. [THINK] I need current GDP data for both countries
2. [BROWSE] Search for India GDP growth 2020-2024
3. [THINK] Found India data: 7.2%, -6.6%, 8.7%, 7.2%, 6.8%
4. [BROWSE] Search for Vietnam GDP growth 2020-2024
5. [THINK] Found Vietnam data: 2.9%, 2.6%, 8.0%, 5.1%, 6.5%
6. [CODE] Generate matplotlib chart comparing both
7. [THINK] Vietnam has the higher average (5.02% vs 4.66%); India is more volatile
8. [ANSWER] Comprehensive analysis with chart
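
Step 7 is the kind of claim worth checking numerically. The sketch below uses only the figures quoted in steps 3 and 5, which are illustrative values from the example and not verified statistics. With these numbers the mean comes out higher for Vietnam, while the spread is larger for India. The helper name `summarize` is an assumption for illustration.

```python
from statistics import mean, pstdev

# Annual GDP growth figures (%) exactly as quoted in steps 3 and 5 above.
# They are illustrative values from the example, not verified statistics.
india = [7.2, -6.6, 8.7, 7.2, 6.8]
vietnam = [2.9, 2.6, 8.0, 5.1, 6.5]

def summarize(name, values):
    """Return (name, mean, population standard deviation), rounded to 2 places."""
    return name, round(mean(values), 2), round(pstdev(values), 2)

for name, avg, spread in (summarize("India", india), summarize("Vietnam", vietnam)):
    print(f"{name}: mean {avg}%, std dev {spread}%")
```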

Benchmark Performance

o3 represents a significant leap in reasoning benchmarks:

Benchmark            o3      o1      GPT-4o   Claude 3.5
AIME 2025            88.9%   79.2%   26.7%    32.1%
GPQA Diamond         87.7%   78.0%   53.6%    65.0%
SWE-bench Verified   69.1%   48.9%   33.2%    49.0%
Codeforces (Elo)     2727    1891    900      1200
ARC-AGI (semi)       87.5%   32.0%   5.0%     21.0%

The SWE-bench Verified score of 69.1% is the standout result: o3 resolved roughly 7 of every 10 tasks in the benchmark, each a real GitHub issue that requires understanding the codebase, writing a fix, and passing the project's tests.

o4-mini: Efficient Reasoning

o4-mini offers a compelling efficiency trade-off:

Aspect                       o3              o4-mini       GPT-4o
Speed                        Slow            Fast          Fast
Cost (input/output per 1M)   $10.00/$40.00   $1.10/$4.40   $2.50/$10.00
AIME 2025                    88.9%           92.7%         26.7%
MATH-500                     98.6%           98.0%         94.3%
Coding                       Excellent       Excellent     Good
Tool Use                     Yes             Yes           Yes

Notably, o4-mini outperforms o3 on AIME 2025 (92.7% vs 88.9%), while o3 keeps a narrow lead on MATH-500 (98.6% vs 98.0%). On competition math, the smaller model is at least on par with the larger one.

Codex Integration

OpenAI followed the release with Codex, a cloud-based software engineering agent launched in May 2025 and powered by codex-1, a version of o3 optimized for software engineering:

  • Autonomous coding: Assign tasks and Codex works independently
  • Git-integrated: Creates branches, writes code, runs tests, submits PRs
  • Sandboxed execution: Each task runs in an isolated environment
  • Multi-file understanding: Navigates entire repositories
  • Test-driven: Runs existing test suites to verify changes

# Codex task example
# User assigns: "Add pagination to the /api/users endpoint"
# 
# Codex autonomously:
# 1. Reads existing API code and tests
# 2. Implements cursor-based pagination
# 3. Adds query parameters (limit, cursor)
# 4. Updates tests
# 5. Runs test suite
# 6. Creates PR with description
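
Steps 2 and 3 of the task above name cursor-based pagination with `limit` and `cursor` parameters. A minimal sketch of that technique follows, assuming an in-memory list in place of a database and an opaque base64 cursor that encodes the last seen id. `USERS`, `list_users`, `encode_cursor`, and `decode_cursor` are hypothetical names, and this is not output produced by Codex.

```python
import base64
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for a users table, ordered by id.
USERS: List[Dict] = [{"id": i, "name": f"user{i}"} for i in range(1, 8)]

def encode_cursor(last_id: int) -> str:
    """Turn the last id of a page into an opaque, URL-safe token."""
    return base64.urlsafe_b64encode(str(last_id).encode()).decode()

def decode_cursor(cursor: str) -> int:
    """Recover the last seen id from a token made by encode_cursor."""
    return int(base64.urlsafe_b64decode(cursor.encode()).decode())

def list_users(limit: int = 3, cursor: Optional[str] = None) -> Dict:
    """Return one page of users plus the cursor for the next page (None at the end)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    after_id = decode_cursor(cursor) if cursor else 0
    # Fetch one extra row to learn whether another page exists.
    rows = [u for u in USERS if u["id"] > after_id][: limit + 1]
    page, has_more = rows[:limit], len(rows) > limit
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else None
    return {"data": page, "next_cursor": next_cursor}

# Walk every page the way a client would.
cursor = None
pages = []
while True:
    result = list_users(limit=3, cursor=cursor)
    pages.append([u["id"] for u in result["data"]])
    cursor = result["next_cursor"]
    if cursor is None:
        break
print(pages)  # [[1, 2, 3], [4, 5, 6], [7]]
```

Unlike offset-based paging, the cursor pins the position to a row id, so rows inserted or deleted earlier in the list do not shift later pages.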

Pricing and Access

Model     Input (per 1M tokens)   Output (per 1M tokens)   Reasoning Tokens
o3        $10.00                  $40.00                   Included in output
o4-mini   $1.10                   $4.40                    Included in output
GPT-4o    $2.50                   $10.00                   N/A

o4-mini at $1.10/$4.40 per million tokens offers frontier-level reasoning at about one-ninth of o3's cost ($10.00/$40.00), making it the practical choice for most applications.
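
To see what these rates mean per request, the arithmetic can be sketched as below. The prices are copied from the pricing table above, and reasoning tokens are billed at the output rate, as the table's last column indicates. The function name `request_cost` and the token counts in the example are assumptions chosen for illustration.

```python
# USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "o3": {"input": 10.00, "output": 40.00},
    "o4-mini": {"input": 1.10, "output": 4.40},
    "GPT-4o": {"input": 2.50, "output": 10.00},
}

def request_cost(model, input_tokens, output_tokens, reasoning_tokens=0):
    """Estimate the USD cost of one request.

    Reasoning tokens are added to the visible output tokens because
    they are billed at the output rate.
    """
    price = PRICES[model]
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * price["input"] + billed_output * price["output"]) / 1_000_000

# Hypothetical request: 2,000 input, 500 visible output, 3,000 reasoning tokens.
for model in ("o3", "o4-mini"):
    print(f"{model}: ${request_cost(model, 2_000, 500, 3_000):.4f}")
```

For this hypothetical request the estimate is $0.1600 on o3 and $0.0176 on o4-mini. Most of the bill comes from reasoning tokens, which do not appear in the visible answer.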

Impact on AI Development

The o3/o4-mini release signals several trends:

  1. Tool use is becoming native: Future models will interleave thinking and acting rather than treat them as separate phases
  2. Reasoning is the differentiator: Raw knowledge matters less than the ability to reason through problems
  3. Agentic AI is here: Models that can autonomously navigate codebases, browse the web, and execute multi-step workflows
  4. Efficiency gains: o4-mini shows smaller models can match or exceed larger ones on reasoning tasks
  5. Software engineering transformation: 69% on SWE-bench Verified suggests AI can resolve a majority of well-scoped, routine software issues

Developer Integration

Using o3's tool capabilities via the API:

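from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

response = client.responses.create(
    model="o3",
    input="Analyze the latest Python 3.13 release notes and create a summary",
    tools=[
        {"type": "web_search"},
        # code_interpreter needs a container; "auto" lets the API create one.
        {"type": "code_interpreter", "container": {"type": "auto"}},
        # file_search needs at least one vector store ID; this one is a placeholder.
        {"type": "file_search", "vector_store_ids": ["vs_YOUR_VECTOR_STORE_ID"]},
    ],
)

# output_text concatenates the text parts of the model's final answer.
print(response.output_text)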

With the built-in tools, developers don't need to orchestrate tool calls manually: the model decides when and how to use each tool as part of its reasoning process.
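
Because the model chooses the tools, it can be useful to see afterwards which ones it used. The sketch below counts tool-call items in a response's output list. It assumes each output item exposes a `type` attribute and that tool invocations have type names ending in `_call` (such as `web_search_call`). The helper `tool_call_counts` is hypothetical, and the sample data is hand-built rather than taken from a real API response.

```python
from collections import Counter
from types import SimpleNamespace

def tool_call_counts(output_items):
    """Count output items whose type ends in '_call' (assumed to mark tool use)."""
    return Counter(
        item.type for item in output_items if item.type.endswith("_call")
    )

# Hand-built stand-in for `response.output`; a real response carries richer objects.
fake_output = [
    SimpleNamespace(type="reasoning"),
    SimpleNamespace(type="web_search_call"),
    SimpleNamespace(type="reasoning"),
    SimpleNamespace(type="web_search_call"),
    SimpleNamespace(type="code_interpreter_call"),
    SimpleNamespace(type="message"),
]

for tool, count in tool_call_counts(fake_output).items():
    print(f"{tool}: {count}")
```

With a real response, the same helper would be called as `tool_call_counts(response.output)`, assuming the item type names match.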

Sources: OpenAI o3/o4-mini, OpenAI Codex, OpenAI API Docs

Conclusion

The o3 and o4-mini release marks a pivotal moment in AI development: the transition from models that simply generate text to models that actively interact with the world while reasoning. Tool-integrated reasoning is the foundation for truly autonomous AI agents—systems that can research, code, analyze, and create without constant human guidance.

For developers, the practical implication is clear: build applications that leverage tool use, not just text generation. The most valuable AI applications in 2025 and beyond will be those that combine reasoning depth with real-world action capability.

Sources: OpenAI o3/o4-mini, OpenAI API, OpenAI Research