May 13, 2024 · 6 min read

      GPT-4o: OpenAI's Omni Model Processes Text, Audio, and Vision Together

Artificial Intelligence · GPT · LLM · OpenAI

      The Omni Model

      On May 13, 2024, OpenAI released GPT-4o ("omni")—a unified model that processes text, vision, and audio natively in a single architecture. Unlike GPT-4V (which added vision to a text model) or the voice mode in ChatGPT (which chained separate speech-to-text and text-to-speech models), GPT-4o is trained end-to-end across all modalities.

      The "omni" means everything happens in one model: you can speak to it, show it an image, get a spoken response with emotion, and have it switch between modalities seamlessly. The demo was jaw-dropping—real-time conversation with emotional inflection, laughter, and singing.

      Why "Omni" Matters

      Previous AI systems chained multiple models together:

text
GPT-4 Voice Mode (before GPT-4o):
Audio Input → Whisper (STT) → GPT-4 (text) → TTS → Audio Output
Latency: 2-5 seconds per exchange
Lost: tone, emotion, non-verbal cues, interruptions

GPT-4o Architecture:
Audio/Image/Text Input → Single Model → Audio/Image/Text Output
Latency: 320ms average (human conversation speed)
Preserved: tone, emotion, pacing, visual context

      This unified architecture enables capabilities impossible with chained models:

      • Real-time interruption: Cut off GPT-4o mid-sentence, it responds naturally
      • Emotional tone: Expresses excitement, empathy, humor through voice
      • Visual + audio: "What am I looking at?" while showing your phone camera
      • Code narration: Explains code while you scroll through it
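
The latency gap can be made concrete with a toy model. The per-stage millisecond figures for the chained pipeline below are illustrative assumptions chosen to fall inside the 2-5 second range quoted above; only the 320 ms unified figure comes from OpenAI's announcement.

```python
# Toy latency model for the two architectures. The chained-stage figures
# are assumptions for illustration; 320 ms is GPT-4o's reported average.
CHAINED_STAGES_MS = {
    "whisper_stt": 900,   # assumed: speech-to-text transcription
    "gpt4_text": 1500,    # assumed: text completion
    "tts": 600,           # assumed: text-to-speech synthesis
}
UNIFIED_MS = 320          # GPT-4o average, per OpenAI

def chained_latency_ms(stages: dict) -> int:
    """A chained pipeline pays every stage's latency in sequence."""
    return sum(stages.values())

total = chained_latency_ms(CHAINED_STAGES_MS)
print(total)                      # 3000 (ms), inside the 2-5 s range
print(round(total / UNIFIED_MS))  # 9, i.e. ~9x slower than unified
```

The sum is the key point: a chained system can never respond faster than its slowest path through all stages, while a unified model has a single forward pass.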

      Benchmark Performance

GPT-4o matched or exceeded GPT-4 Turbo on text benchmarks while adding vision and audio, at 2x the speed and 50% lower cost:

Benchmark         | GPT-4o | GPT-4 Turbo | Claude 3 Opus | Gemini 1.5 Pro
MMLU              | 88.7%  | 86.4%       | 86.8%         | 85.9%
HumanEval         | 90.2%  | 86.6%       | 84.9%         | 84.1%
MATH              | 76.6%  | 72.6%       | 60.1%         | 67.7%
GPQA              | 53.6%  | 49.9%       | 50.4%         | 46.2%
Multilingual MMLU | 85.7%  | 83.7%       | 82.1%         | 84.2%
Vision (MMMU)     | 69.1%  | 63.1%       | 59.4%         | 62.2%

      The multilingual performance is notable—GPT-4o significantly improves non-English capabilities, making it more accessible globally.

      Pricing Revolution

      GPT-4o's pricing was the most disruptive aspect:

Model          | Input (1M tokens) | Output (1M tokens) | Speed
GPT-4o         | $5.00             | $15.00             | Fast
GPT-4 Turbo    | $10.00            | $30.00             | Medium
Claude 3 Opus  | $15.00            | $75.00             | Slow
Gemini 1.5 Pro | $7.00             | $21.00             | Medium

      50% cheaper than GPT-4 Turbo with better performance. This pricing pressure forced competitors to adjust: Anthropic released Claude 3.5 Sonnet (cheaper than Opus, better performance) and Google cut Gemini API prices.
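
A quick sanity check of the 50% figure, using the per-token prices from the table above (the token counts are an arbitrary example workload):

```python
# USD per 1M tokens (input, output), taken from the pricing table above
PRICES = {
    "gpt-4o":         (5.00, 15.00),
    "gpt-4-turbo":    (10.00, 30.00),
    "claude-3-opus":  (15.00, 75.00),
    "gemini-1.5-pro": (7.00, 21.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the listed per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Example workload: 10k input tokens, 2k output tokens
gpt4o = request_cost("gpt-4o", 10_000, 2_000)        # $0.08
turbo = request_cost("gpt-4-turbo", 10_000, 2_000)   # $0.16
print(f"savings vs GPT-4 Turbo: {1 - gpt4o / turbo:.0%}")
```

Because both input and output rates are exactly halved, the 50% saving holds for any mix of input and output tokens.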

      The Voice Demo That Went Viral

      The May 2024 demo showcased GPT-4o's voice capabilities:

      1. Real-time conversation: Response latency of 320ms average (vs. 2-5s before)
      2. Emotional range: Laughter, excitement, sarcasm, dramatic reading
      3. Singing: GPT-4o sang "Happy Birthday" with pitch variation
      4. Camera integration: Live commentary on what the camera sees
      5. Math tutoring: Guided a student through a problem without giving answers
python
# Using GPT-4o's multimodal API
from openai import OpenAI

client = OpenAI()

# Vision: analyze an image
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's wrong with this code?"},
            {"type": "image_url", "image_url": {
                "url": "data:image/png;base64,..."  # Screenshot
            }}
        ]
    }]
)

# Audio: real-time voice conversation flows through the
# Realtime API WebSocket (released later; see below)

      ChatGPT Desktop App

      Alongside GPT-4o, OpenAI launched a macOS desktop app:

      • Keyboard shortcut: Option+Space to invoke anywhere
      • Screenshot analysis: Share your screen for AI analysis
      • Voice mode: Hands-free conversation
      • Code assistance: Paste code, get explanations and fixes
      • System-wide: Works alongside any application

      GPT-4o mini: The Efficiency King

      Released July 18, 2024, GPT-4o mini became the default model for free ChatGPT users:

Metric      | GPT-4o mini | GPT-3.5 Turbo | GPT-4o
MMLU        | 82.0%       | 70.0%         | 88.7%
HumanEval   | 87.2%       | 48.1%         | 90.2%
MATH        | 70.2%       | 34.1%         | 76.6%
Input cost  | $0.15/1M    | $0.50/1M      | $5.00/1M
Output cost | $0.60/1M    | $1.50/1M      | $15.00/1M

GPT-4o mini is roughly 33x cheaper than GPT-4o on input tokens (25x on output) while scoring far above GPT-3.5 Turbo. This effectively killed GPT-3.5 as a viable choice and made GPT-4-class performance accessible to every developer.
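
The cost ratios follow directly from the table's prices: about 33x on input tokens and 25x on output tokens.

```python
# Per-1M-token prices from the table above
MINI_IN, MINI_OUT = 0.15, 0.60      # GPT-4o mini
GPT4O_IN, GPT4O_OUT = 5.00, 15.00   # GPT-4o

input_ratio = GPT4O_IN / MINI_IN    # ~33.3x cheaper on input
output_ratio = GPT4O_OUT / MINI_OUT # 25x cheaper on output
print(round(input_ratio, 1), round(output_ratio, 1))  # 33.3 25.0
```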

      The Realtime API

      In October 2024, OpenAI released the Realtime API—enabling GPT-4o's voice capabilities for developers:

javascript
// WebSocket-based Realtime API
const ws = new WebSocket('wss://api.openai.com/v1/realtime', {
    headers: {
        'Authorization': `Bearer ${API_KEY}`,
        'OpenAI-Beta': 'realtime=v1'
    }
});

ws.on('open', () => {
    ws.send(JSON.stringify({
        type: 'session.update',
        session: {
            model: 'gpt-4o-realtime-preview',
            voice: 'alloy',
            instructions: 'You are a helpful customer service agent.',
            tools: [{ type: 'function', function: lookupOrder }]
        }
    }));
});

// Send audio chunks directly
ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: base64AudioChunk
}));

      The Realtime API enables:

      • Voice assistants: Build Siri/Alexa-quality assistants
      • Call centers: AI agents handling customer calls
      • Translation: Real-time spoken translation
      • Accessibility: Voice-controlled applications
      • Education: Interactive tutoring with voice
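
The same session setup shown in the JavaScript snippet can be sketched in Python by constructing the JSON events before sending them over the WebSocket. The instructions string is a placeholder; the event shapes mirror the snippet above.

```python
import base64
import json

# Build Realtime API events as plain JSON strings, ready to send over
# an open WebSocket connection (transport omitted in this sketch).
def session_update(model: str, voice: str, instructions: str) -> str:
    return json.dumps({
        "type": "session.update",
        "session": {
            "model": model,
            "voice": voice,
            "instructions": instructions,
        },
    })

def audio_append(pcm_bytes: bytes) -> str:
    """Audio chunks travel base64-encoded in input_audio_buffer.append events."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

event = json.loads(session_update("gpt-4o-realtime-preview", "alloy",
                                  "You are a helpful customer service agent."))
print(event["type"])  # session.update
```

Separating event construction from transport like this makes the payloads easy to unit-test before wiring up the actual audio stream.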

      Impact on the AI Industry

      GPT-4o's release marked several industry shifts:

      1. Multimodal becomes standard: Every major AI lab now builds unified multimodal models
      2. Price compression: 50% reduction forced industry-wide price cuts
3. Voice AI: The conversational-assistant vision of the film "Her" became technically achievable
      4. Free tier quality: GPT-4o mini gives free users GPT-4-class performance
      5. Developer accessibility: Lower prices opened AI to smaller projects and startups

      GPT-4o demonstrated that the future of AI isn't text-only—it's a seamless blend of text, vision, and voice in a single, fast, affordable model.

      Sources: OpenAI GPT-4o, OpenAI API Docs, GPT-4o mini
