GPT-4o: OpenAI's Multimodal AI Model

The Omni Model

On May 13, 2024, OpenAI released GPT-4o ("omni")—a unified model that processes text, vision, and audio natively in a single architecture. Unlike GPT-4V (which added vision to a text model) or the voice mode in ChatGPT (which chained separate speech-to-text and text-to-speech models), GPT-4o is trained end-to-end across all modalities.

"Omni" means everything happens in one model: you can speak to it, show it an image, get a spoken response with emotion, and switch between modalities seamlessly. The demo was jaw-dropping: real-time conversation with emotional inflection, laughter, and singing.

Why "Omni" Matters

Previous AI systems chained multiple models together:

text
GPT-4 Voice Mode (before GPT-4o):
Audio Input → Whisper (STT) → GPT-4 (text) → TTS → Audio Output
Latency: 5.4 seconds on average with GPT-4 (2.8 seconds with GPT-3.5)
Lost: tone, emotion, non-verbal cues, interruptions

GPT-4o Architecture:
Audio/Image/Text Input → Single Model → Audio/Image/Text Output
Latency: 320ms average, as low as 232ms (similar to human response time in conversation)
Preserved: tone, emotion, pacing, visual context

This unified architecture enables capabilities that are difficult or impossible with chained models:

Real-time interruption: Cut off GPT-4o mid-sentence and it responds naturally
Emotional tone: Expresses excitement, empathy, humor through voice
Visual + audio: "What am I looking at?" while showing your phone camera
Code narration: Explains code while you scroll through it
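
The difference between the two pipelines can be sketched with a toy model. This is only an illustration of why a text-only handoff loses information, not a description of how either system is implemented; the `Utterance` type and both pipeline functions are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    words: str  # what was said
    tone: str   # how it was said (toy stand-in for prosody)


def speech_to_text(utterance: Utterance) -> str:
    # A transcript keeps the words and drops everything else.
    return utterance.words


def chained_pipeline(utterance: Utterance) -> dict:
    # The language model stage only ever receives the transcript.
    text = speech_to_text(utterance)
    tone_seen: Optional[str] = None
    return {"words_seen": text, "tone_seen": tone_seen}


def unified_pipeline(utterance: Utterance) -> dict:
    # A single model consumes the audio directly, so tone is still available.
    return {"words_seen": utterance.words, "tone_seen": utterance.tone}


if __name__ == "__main__":
    said = Utterance(words="I'm fine.", tone="sarcastic")
    print(chained_pipeline(said))
    print(unified_pipeline(said))
```

In the chained version the sarcasm never reaches the model, so no amount of downstream quality can recover it.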

Benchmark Performance

GPT-4o matched or beat GPT-4 Turbo on text benchmarks while adding native vision and audio, and its API launched at 2x the speed and half the price of GPT-4 Turbo:

Benchmark	GPT-4o	GPT-4 Turbo	Claude 3 Opus	Gemini 1.5 Pro
MMLU	88.7%	86.4%	86.8%	85.9%
HumanEval	90.2%	86.6%	84.9%	84.1%
MATH	76.6%	72.6%	60.1%	67.7%
GPQA	53.6%	49.9%	50.4%	46.2%
Multilingual MMLU	85.7%	83.7%	82.1%	84.2%
Vision (MMMU)	69.1%	63.1%	59.4%	62.2%

The multilingual performance is notable: GPT-4o scores 2.0 points above GPT-4 Turbo on multilingual MMLU, a gain in non-English capability that makes it more accessible globally.

Pricing Revolution

GPT-4o's pricing was the most disruptive aspect:

Model	Input (1M tokens)	Output (1M tokens)	Speed
GPT-4o	$5.00	$15.00	Fast
GPT-4 Turbo	$10.00	$30.00	Medium
Claude 3 Opus	$15.00	$75.00	Slow
Gemini 1.5 Pro	$7.00	$21.00	Medium

GPT-4o launched at half the price of GPT-4 Turbo with better benchmark performance. Competitors adjusted in the months that followed: Anthropic released Claude 3.5 Sonnet (cheaper than Opus, better performance) and Google cut Gemini API prices.
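
To make the table concrete, here is a minimal sketch that prices one workload on each model. The per-token prices are copied from the table above; the function name `workload_cost` and the example workload (2M input tokens, 500K output tokens) are assumptions chosen for illustration.

```python
# USD per 1M tokens (input, output), copied from the pricing table above.
PRICES = {
    "GPT-4o": (5.00, 15.00),
    "GPT-4 Turbo": (10.00, 30.00),
    "Claude 3 Opus": (15.00, 75.00),
    "Gemini 1.5 Pro": (7.00, 21.00),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of a workload on the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens and 500K output tokens.
    for model in PRICES:
        cost = workload_cost(model, 2_000_000, 500_000)
        print(f"{model}: ${cost:.2f}")
```

For this workload GPT-4o comes to $17.50 against $35.00 for GPT-4 Turbo, which is the 50% gap the table implies.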

The Voice Demo That Went Viral

The May 2024 demo showcased GPT-4o's voice capabilities:

Real-time conversation: Response latency of 320ms average (vs. 2.8-5.4s before)
Emotional range: Laughter, excitement, sarcasm, dramatic reading
Singing: GPT-4o sang "Happy Birthday" with pitch variation
Camera integration: Live commentary on what the camera sees
Math tutoring: Guided a student through a problem without giving answers

python
# Using GPT-4o's multimodal API
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# Vision: send text and an image in a single user message
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's wrong with this code?"},
            {"type": "image_url", "image_url": {
                "url": "data:image/png;base64,..."  # Screenshot
            }}
        ]
    }]
)
print(response.choices[0].message.content)

# Audio: real-time voice conversation is not part of this call.
# It goes through the Realtime API (WebSocket), released later.
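
The vision request above takes its screenshot as a base64 data URL. One way to build that URL from a local file is sketched below, using only the standard library. The helper names `to_data_url` and `image_message` are hypothetical, and the message shape simply mirrors the request above.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: str) -> str:
    """Encode a local image file as a data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(prompt: str, path: str) -> dict:
    """Build a user message with one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": to_data_url(path)}},
        ],
    }
```

The resulting dict can be passed as the single entry in `messages`, for example `messages=[image_message("What's wrong with this code?", "screenshot.png")]`, where the file name is a placeholder.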

ChatGPT Desktop App

Alongside GPT-4o, OpenAI launched a macOS desktop app:

Keyboard shortcut: Option+Space to invoke anywhere
Screenshot analysis: Share your screen for AI analysis
Voice mode: Hands-free conversation
Code assistance: Paste code, get explanations and fixes
System-wide: Works alongside any application

GPT-4o mini: The Efficiency King

Released July 18, 2024, GPT-4o mini became the default model for free ChatGPT users:

Metric	GPT-4o mini	GPT-3.5 Turbo	GPT-4o
MMLU	82.0%	70.0%	88.7%
HumanEval	87.2%	48.1%	90.2%
MATH	70.2%	34.1%	76.6%
Input cost	$0.15/1M	$0.50/1M	$5.00/1M
Output cost	$0.60/1M	$1.50/1M	$15.00/1M

GPT-4o mini is about 33x cheaper than GPT-4o on input tokens and 25x cheaper on output tokens, while being significantly better than GPT-3.5 Turbo. This effectively killed GPT-3.5 as a viable choice and made GPT-4-class performance accessible to every developer.
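
The size of the price gap depends on how a workload splits between input and output tokens. The sketch below works that out from the table's prices; the function names and the 50/50 example split are assumptions for illustration.

```python
# USD per 1M tokens, copied from the table above.
PRICES = {
    "GPT-4o mini": {"input": 0.15, "output": 0.60},
    "GPT-4o": {"input": 5.00, "output": 15.00},
}


def price_ratio(direction: str) -> float:
    """How many times more GPT-4o costs than GPT-4o mini for one direction."""
    return PRICES["GPT-4o"][direction] / PRICES["GPT-4o mini"][direction]


def blended_ratio(input_share: float) -> float:
    """Cost ratio when `input_share` of all tokens are input tokens."""
    def blended_price(model: str) -> float:
        price = PRICES[model]
        return input_share * price["input"] + (1 - input_share) * price["output"]

    return blended_price("GPT-4o") / blended_price("GPT-4o mini")


if __name__ == "__main__":
    print(f"input:  {price_ratio('input'):.1f}x")
    print(f"output: {price_ratio('output'):.1f}x")
    print(f"50/50 mix: {blended_ratio(0.5):.1f}x")
```

The ratio always falls between the output-only and input-only figures, so input-heavy workloads see the largest savings.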

The Realtime API

In October 2024, OpenAI released the Realtime API—enabling GPT-4o's voice capabilities for developers:

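javascript
// WebSocket-based Realtime API (Node.js, using the `ws` package)
import WebSocket from 'ws';

const API_KEY = process.env.OPENAI_API_KEY;

// Hypothetical tool definition, shown for illustration only
const lookupOrder = {
    type: 'function',
    name: 'lookup_order',
    description: 'Look up an order by its ID.',
    parameters: {
        type: 'object',
        properties: { order_id: { type: 'string' } },
        required: ['order_id']
    }
};

// The model is selected in the connection URL
const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
    headers: {
        'Authorization': `Bearer ${API_KEY}`,
        'OpenAI-Beta': 'realtime=v1'
    }
});

ws.on('open', () => {
    // Configure the session once the socket is open
    ws.send(JSON.stringify({
        type: 'session.update',
        session: {
            voice: 'alloy',
            instructions: 'You are a helpful customer service agent.',
            tools: [lookupOrder]
        }
    }));
});

// Send audio chunks directly (call only after the socket is open)
function sendAudio(base64AudioChunk) {
    ws.send(JSON.stringify({
        type: 'input_audio_buffer.append',
        audio: base64AudioChunk
    }));
}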
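
The example above leaves open where the base64 audio chunks come from. As a rough sketch, raw audio bytes can be split into fixed-size pieces and wrapped in `input_audio_buffer.append` events like this. The function name and the 4096-byte chunk size are assumptions, and the code only builds the JSON payloads; it does not open a connection.

```python
import base64
import json
from typing import Iterator


def audio_append_events(audio_bytes: bytes, chunk_size: int = 4096) -> Iterator[str]:
    """Yield JSON-encoded input_audio_buffer.append events for raw audio bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(audio_bytes), chunk_size):
        chunk = audio_bytes[start:start + chunk_size]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


if __name__ == "__main__":
    # 10,000 bytes of placeholder audio split into 4096-byte chunks.
    events = list(audio_append_events(b"\x00\x01" * 5000))
    print(len(events))
```

Each yielded string is ready to send over the socket as-is, so the same generator works with any WebSocket client.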

The Realtime API enables:

Voice assistants: Build Siri/Alexa-quality assistants
Call centers: AI agents handling customer calls
Translation: Real-time spoken translation
Accessibility: Voice-controlled applications
Education: Interactive tutoring with voice

Impact on the AI Industry

GPT-4o's release marked several industry shifts:

Multimodal becomes standard: Every major AI lab now builds unified multimodal models
Price compression: 50% reduction forced industry-wide price cuts
Voice AI: The Her-movie vision became technically achievable
Free tier quality: GPT-4o mini gives free users GPT-4-class performance
Developer accessibility: Lower prices opened AI to smaller projects and startups

GPT-4o demonstrated that the future of AI isn't text-only—it's a seamless blend of text, vision, and voice in a single, fast, affordable model.

Sources: OpenAI GPT-4o, OpenAI API Docs, GPT-4o mini
