GPT-4o: OpenAI's Omni Model Processes Text, Audio, and Vision Together

The Omni Model

On May 13, 2024, OpenAI released GPT-4o ("omni")—a unified model that processes text, vision, and audio natively in a single architecture. Unlike GPT-4V (which added vision to a text model) or the voice mode in ChatGPT (which chained separate speech-to-text and text-to-speech models), GPT-4o is trained end-to-end across all modalities.

The "omni" means everything happens in one model: you can speak to it, show it an image, get a spoken response with emotion, and have it switch between modalities seamlessly. The demo was jaw-dropping—real-time conversation with emotional inflection, laughter, and singing.

Why "Omni" Matters

Previous AI systems chained multiple models together:

text
GPT-4 Voice Mode (before GPT-4o):
Audio Input → Whisper (STT) → GPT-4 (text) → TTS → Audio Output
Latency: 5.4 seconds on average (2.8 seconds with GPT-3.5)
Lost: tone, emotion, non-verbal cues, interruptions

GPT-4o Architecture:
Audio/Image/Text Input → Single Model → Audio/Image/Text Output
Latency: 320 ms on average, as low as 232 ms (comparable to human conversational response time)
Preserved: tone, emotion, pacing, visual context
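
The difference between the two data flows above can be illustrated with a toy model. This is only a sketch: `Utterance`, the three stage functions, and `unified_model` are hypothetical stand-ins rather than OpenAI components, and "mirroring the user's tone" is a deliberate simplification of what an end-to-end model can do. The point is that once a transcription step reduces audio to words, no later stage can recover how those words were said.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """A spoken turn: the words plus how they were said."""
    words: str
    tone: str  # e.g. "excited", "hesitant"


def speech_to_text(utterance: Utterance) -> str:
    # A transcription step returns only the words; tone is dropped here.
    return utterance.words


def text_model(prompt: str) -> str:
    # Stand-in for a text-only model: it only ever sees the transcript.
    return f"reply to: {prompt}"


def text_to_speech(text: str) -> Utterance:
    # The synthesizer never saw the user's tone, so it falls back to a default.
    return Utterance(words=text, tone="neutral")


def chained_pipeline(utterance: Utterance) -> Utterance:
    """Three separate stages, each passing only text to the next."""
    return text_to_speech(text_model(speech_to_text(utterance)))


def unified_model(utterance: Utterance) -> Utterance:
    """Stand-in for one end-to-end model that sees the whole input."""
    return Utterance(words=f"reply to: {utterance.words}", tone=utterance.tone)


user_turn = Utterance(words="I passed my exam", tone="excited")
print(chained_pipeline(user_turn).tone)
print(unified_model(user_turn).tone)
```

In the chained version the reply tone is always the synthesizer's default, while the unified stand-in still has the user's tone available when it produces its reply.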

This unified architecture enables capabilities that are impractical with chained models:

  • Real-time interruption: Cut off GPT-4o mid-sentence and it responds naturally
  • Emotional tone: Expresses excitement, empathy, humor through voice
  • Visual + audio: "What am I looking at?" while showing your phone camera
  • Code narration: Explains code while you scroll through it

Benchmark Performance

GPT-4o matched or exceeded GPT-4 Turbo on text benchmarks while adding native vision and audio, running about 2x faster in the API at 50% lower cost:

Benchmark            GPT-4o    GPT-4 Turbo    Claude 3 Opus    Gemini 1.5 Pro
MMLU                 88.7%     86.4%          86.8%            85.9%
HumanEval            90.2%     86.6%          84.9%            84.1%
MATH                 76.6%     72.6%          60.1%            67.7%
GPQA                 53.6%     49.9%          50.4%            46.2%
Multilingual MMLU    85.7%     83.7%          82.1%            84.2%
Vision (MMMU)        69.1%     63.1%          59.4%            62.2%

The multilingual performance is notable—GPT-4o significantly improves non-English capabilities, making it more accessible globally.
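
To make the size of each gap concrete, the short sketch below computes GPT-4o's lead over GPT-4 Turbo in percentage points. It uses only the scores quoted in the table above; the `scores` dictionary and the `lead_over` helper are illustrative names, not part of any benchmark tooling.

```python
# Scores in percent, copied from the benchmark table above.
scores = {
    "MMLU": {"GPT-4o": 88.7, "GPT-4 Turbo": 86.4},
    "HumanEval": {"GPT-4o": 90.2, "GPT-4 Turbo": 86.6},
    "MATH": {"GPT-4o": 76.6, "GPT-4 Turbo": 72.6},
    "GPQA": {"GPT-4o": 53.6, "GPT-4 Turbo": 49.9},
    "Multilingual MMLU": {"GPT-4o": 85.7, "GPT-4 Turbo": 83.7},
    "Vision (MMMU)": {"GPT-4o": 69.1, "GPT-4 Turbo": 63.1},
}


def lead_over(baseline: str, model: str = "GPT-4o") -> dict[str, float]:
    """Percentage-point lead of `model` over `baseline` on each benchmark."""
    return {
        bench: round(row[model] - row[baseline], 1)
        for bench, row in scores.items()
    }


leads = lead_over("GPT-4 Turbo")
for bench, delta in leads.items():
    print(f"{bench}: +{delta} points")
```

Differences are rounded to one decimal place to avoid floating-point noise in the printed values.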

Pricing Revolution

GPT-4o's pricing was the most disruptive aspect:

Model             Input (per 1M tokens)    Output (per 1M tokens)    Speed
GPT-4o            $5.00                    $15.00                    Fast
GPT-4 Turbo       $10.00                   $30.00                    Medium
Claude 3 Opus     $15.00                   $75.00                    Slow
Gemini 1.5 Pro    $7.00                    $21.00                    Medium

GPT-4o cost half as much as GPT-4 Turbo while outperforming it on the benchmarks above. Competitors' next moves were consistent with that pricing pressure: Anthropic released Claude 3.5 Sonnet in June 2024 (cheaper than Opus, with better performance), and Google cut Gemini API prices.
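
A quick way to see what these list prices mean in practice is to price a fixed workload under each model. The sketch below uses the per-million-token prices from the table above; the workload of 20M input and 5M output tokens per month is a made-up example, and `workload_cost` is an illustrative helper rather than anything from an SDK.

```python
# USD per 1M tokens as (input, output), copied from the pricing table above.
PRICES = {
    "GPT-4o": (5.00, 15.00),
    "GPT-4 Turbo": (10.00, 30.00),
    "Claude 3 Opus": (15.00, 75.00),
    "Gemini 1.5 Pro": (7.00, 21.00),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a workload at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical monthly workload: 20M input tokens, 5M output tokens.
for model in PRICES:
    cost = workload_cost(model, 20_000_000, 5_000_000)
    print(f"{model}: ${cost:,.2f}")
```

Because output tokens cost three to five times as much as input tokens for every model listed, the ranking can shift for workloads that generate much more text than they read.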

The Voice Demo That Went Viral

The May 2024 demo showcased GPT-4o's voice capabilities:

  1. Real-time conversation: Response latency of 320 ms on average (vs. 2.8 to 5.4 s with the earlier Voice Mode)
  2. Emotional range: Laughter, excitement, sarcasm, dramatic reading
  3. Singing: GPT-4o sang "Happy Birthday" with pitch variation
  4. Camera integration: Live commentary on what the camera sees
  5. Math tutoring: Guided a student through a problem without giving answers

python
# Using GPT-4o's multimodal API
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# Vision: analyze an image
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's wrong with this code?"},
            {"type": "image_url", "image_url": {
                "url": "data:image/png;base64,..."  # Screenshot
            }}
        ]
    }]
)
print(response.choices[0].message.content)

# Audio: real-time voice conversation is not part of this endpoint.
# It goes through the Realtime API (released later) over a WebSocket.
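
The `url` field in the vision example takes a base64 data URL. A minimal sketch of building one from raw image bytes with the standard library follows; `to_data_url` and `image_part` are hypothetical helper names, and the eight-byte PNG signature stands in for a real screenshot that would normally be read from disk.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build an image content part shaped like the one in the example above."""
    return {
        "type": "image_url",
        "image_url": {"url": to_data_url(image_bytes, mime_type)},
    }


# Every PNG file starts with this 8-byte signature; used here as stand-in data.
png_signature = b"\x89PNG\r\n\x1a\n"
print(image_part(png_signature))
```

In real use, the bytes would come from something like `Path("screenshot.png").read_bytes()`, with the MIME type matched to the file format.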

ChatGPT Desktop App

Alongside GPT-4o, OpenAI launched a macOS desktop app:

  • Keyboard shortcut: Option+Space to invoke anywhere
  • Screenshot analysis: Share your screen for AI analysis
  • Voice mode: Hands-free conversation
  • Code assistance: Paste code, get explanations and fixes
  • System-wide: Works alongside any application

GPT-4o mini: The Efficiency King

Released July 18, 2024, GPT-4o mini replaced GPT-3.5 Turbo in ChatGPT, including for free users:

Metric         GPT-4o mini    GPT-3.5 Turbo    GPT-4o
MMLU           82.0%          70.0%            88.7%
HumanEval      87.2%          48.1%            90.2%
MATH           70.2%          34.1%            76.6%
Input cost     $0.15/1M       $0.50/1M         $5.00/1M
Output cost    $0.60/1M       $1.50/1M         $15.00/1M

GPT-4o mini is roughly 33x cheaper than GPT-4o on input tokens (25x on output tokens) while scoring far higher than GPT-3.5 Turbo on every benchmark above. This effectively ended GPT-3.5 Turbo as a viable choice and brought near-GPT-4-class performance within reach of most developers.
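
The price multiple is not the same for input and output tokens, which the following sketch works out from the costs in the table above. `PRICES` and `price_ratios` are illustrative names introduced here.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "GPT-4o mini": (0.15, 0.60),
    "GPT-3.5 Turbo": (0.50, 1.50),
    "GPT-4o": (5.00, 15.00),
}


def price_ratios(model: str, baseline: str = "GPT-4o mini") -> tuple[float, float]:
    """Return (input_ratio, output_ratio) of `model` relative to `baseline`."""
    model_in, model_out = PRICES[model]
    base_in, base_out = PRICES[baseline]
    return model_in / base_in, model_out / base_out


for name in ("GPT-4o", "GPT-3.5 Turbo"):
    ratio_in, ratio_out = price_ratios(name)
    print(f"{name} vs GPT-4o mini: {ratio_in:.1f}x input, {ratio_out:.1f}x output")
```

The effective multiple for a real workload falls between the two ratios, depending on its mix of input and output tokens.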

The Realtime API

In October 2024, OpenAI released the Realtime API in public beta, exposing GPT-4o's low-latency voice capabilities to developers:

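javascript
// WebSocket-based Realtime API (Node.js, using the `ws` package)
import WebSocket from 'ws';

const API_KEY = process.env.OPENAI_API_KEY;

// The model is selected through the URL query string
const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
    headers: {
        'Authorization': `Bearer ${API_KEY}`,
        'OpenAI-Beta': 'realtime=v1'
    }
});

ws.on('open', () => {
    // Configure the session once the socket is open
    ws.send(JSON.stringify({
        type: 'session.update',
        session: {
            voice: 'alloy',
            instructions: 'You are a helpful customer service agent.',
            // Realtime tools use a flat schema; lookup_order is a placeholder tool
            tools: [{
                type: 'function',
                name: 'lookup_order',
                description: 'Look up an order by its ID.',
                parameters: {
                    type: 'object',
                    properties: { order_id: { type: 'string' } },
                    required: ['order_id']
                }
            }]
        }
    }));
});

// Send a base64-encoded audio chunk; call this only after the socket is open
function sendAudioChunk(base64AudioChunk) {
    ws.send(JSON.stringify({
        type: 'input_audio_buffer.append',
        audio: base64AudioChunk
    }));
}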
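
The `input_audio_buffer.append` event carries base64-encoded audio, so the client has to split its raw audio into chunks and encode each one before sending. The Python sketch below builds those event payloads without opening a connection. It assumes 16-bit mono PCM at 24 kHz and a chunk size of 4,800 bytes (100 ms at that rate); both values are assumptions for illustration, and `audio_append_events` is a hypothetical helper name.

```python
import base64
import json
from typing import Iterator


def audio_append_events(pcm_bytes: bytes, chunk_size: int = 4800) -> Iterator[str]:
    """Yield JSON `input_audio_buffer.append` events, one per audio chunk."""
    for start in range(0, len(pcm_bytes), chunk_size):
        chunk = pcm_bytes[start:start + chunk_size]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# Half a second of silence: 12,000 samples at 2 bytes each (assumed format).
silence = bytes(12_000 * 2)
events = list(audio_append_events(silence))
print(len(events), "events ready to send")
```

Each string produced this way could be passed directly to a WebSocket client's send method once the session is open.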

The Realtime API enables:

  • Voice assistants: Build Siri/Alexa-quality assistants
  • Call centers: AI agents handling customer calls
  • Translation: Real-time spoken translation
  • Accessibility: Voice-controlled applications
  • Education: Interactive tutoring with voice

Impact on the AI Industry

GPT-4o's release marked several industry shifts:

  1. Multimodal becomes standard: Natively multimodal models became the expected baseline for frontier labs
  2. Price compression: A 50% cut at the top tier put downward pressure on API pricing across the industry
  3. Voice AI: The Her-movie vision became technically achievable
  4. Free tier quality: GPT-4o mini gives free users near-GPT-4-class performance
  5. Developer accessibility: Lower prices opened AI to smaller projects and startups

GPT-4o demonstrated that the future of AI isn't text-only—it's a seamless blend of text, vision, and voice in a single, fast, affordable model.

Sources: OpenAI GPT-4o, OpenAI API Docs, GPT-4o mini