
The Omni Model
On May 13, 2024, OpenAI released GPT-4o ("omni")—a unified model that processes text, vision, and audio natively in a single architecture. Unlike GPT-4V (which added vision to a text model) or the voice mode in ChatGPT (which chained separate speech-to-text and text-to-speech models), GPT-4o is trained end-to-end across all modalities.
The "omni" means everything happens in one model: you can speak to it, show it an image, get a spoken response with emotion, and have it switch between modalities seamlessly. The demo was jaw-dropping—real-time conversation with emotional inflection, laughter, and singing.
Why "Omni" Matters
Previous AI systems chained multiple models together:
```text
GPT-4 Voice Mode (before GPT-4o):
Audio Input → Whisper (STT) → GPT-4 (text) → TTS → Audio Output
Latency: roughly 2-5 seconds per exchange (OpenAI reported averages of 2.8 s with GPT-3.5 and 5.4 s with GPT-4)
Lost: tone, emotion, non-verbal cues, interruptions

GPT-4o Architecture:
Audio/Image/Text Input → Single Model → Audio/Image/Text Output
Latency: 320 ms average (comparable to human response time in conversation)
Preserved: tone, emotion, pacing, visual context
```
This unified architecture enables capabilities that are difficult or impossible with chained models:
- Real-time interruption: Cut off GPT-4o mid-sentence and it responds naturally
- Emotional tone: Expresses excitement, empathy, humor through voice
- Visual + audio: "What am I looking at?" while showing your phone camera
- Code narration: Explains code while you scroll through it
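To make the "lost tone" point concrete, here is a toy sketch of the two designs. It is not how either system is implemented: `Audio`, `speech_to_text`, `text_model`, `text_to_speech`, and `unified_model` are hypothetical stand-ins, and the only thing the sketch models is that a text-only hand-off between stages cannot carry how something was said.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    """Toy stand-in for a spoken utterance: the words plus how they were said."""
    words: str
    tone: str


def speech_to_text(audio: Audio) -> str:
    # The transcript keeps the words only; tone is dropped at this boundary.
    return audio.words


def text_model(prompt: str) -> str:
    # Placeholder for the text-only language model in the middle of the chain.
    return f"Reply to: {prompt}"


def text_to_speech(text: str) -> Audio:
    # The synthesizer never saw the user's tone, so it falls back to a default.
    return Audio(words=text, tone="neutral")


def chained_pipeline(audio: Audio) -> Audio:
    return text_to_speech(text_model(speech_to_text(audio)))


def unified_model(audio: Audio) -> Audio:
    # A single model receives the whole Audio object, so tone is still
    # available when the reply is produced. Mirroring it is purely illustrative.
    return Audio(words=f"Reply to: {audio.words}", tone=audio.tone)


user = Audio(words="I got the job!", tone="excited")
print(chained_pipeline(user).tone)  # neutral
print(unified_model(user).tone)     # excited
```

The chained version also pays the latency of three sequential model calls, which is one plausible reason the older Voice Mode took seconds rather than milliseconds to answer.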
Benchmark Performance
GPT-4o matched or beat GPT-4 Turbo on text benchmarks while adding native vision and audio, at twice the speed and half the price in the API:
| Benchmark | GPT-4o | GPT-4 Turbo | Claude 3 Opus | Gemini 1.5 Pro |
|---|---|---|---|---|
| MMLU | 88.7% | 86.4% | 86.8% | 85.9% |
| HumanEval | 90.2% | 86.6% | 84.9% | 84.1% |
| MATH | 76.6% | 72.6% | 60.1% | 67.7% |
| GPQA | 53.6% | 49.9% | 50.4% | 46.2% |
| Multilingual MMLU | 85.7% | 83.7% | 82.1% | 84.2% |
| Vision (MMMU) | 69.1% | 63.1% | 59.4% | 62.2% |
The multilingual performance is notable: GPT-4o scores highest of the four models on Multilingual MMLU, two points ahead of GPT-4 Turbo, which makes it more useful outside English.
Pricing Revolution
GPT-4o's pricing was the most disruptive aspect:
| Model | Input (1M tokens) | Output (1M tokens) | Speed |
|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | Fast |
| GPT-4 Turbo | $10.00 | $30.00 | Medium |
| Claude 3 Opus | $15.00 | $75.00 | Slow |
| Gemini 1.5 Pro | $7.00 | $21.00 | Medium |
GPT-4o was 50% cheaper than GPT-4 Turbo with better benchmark performance. Competitors moved in the same direction soon after: Anthropic released Claude 3.5 Sonnet (cheaper than Opus, better performance) and Google cut Gemini API prices.
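To see what the pricing table above means for a bill, here is a minimal sketch that prices one workload on each model. The per-token prices come from the table; the workload size (200M input tokens and 50M output tokens per month) and the helper name `workload_cost` are assumptions chosen only for illustration.

```python
# USD per 1M tokens as (input, output), copied from the pricing table.
PRICES = {
    "GPT-4o": (5.00, 15.00),
    "GPT-4 Turbo": (10.00, 30.00),
    "Claude 3 Opus": (15.00, 75.00),
    "Gemini 1.5 Pro": (7.00, 21.00),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of a workload at the listed per-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical monthly workload: 200M input tokens, 50M output tokens.
for model in PRICES:
    cost = workload_cost(model, 200_000_000, 50_000_000)
    print(f"{model:15s} ${cost:,.2f}")
```

With these assumed volumes the same workload costs $1,750 on GPT-4o, $3,500 on GPT-4 Turbo, $2,450 on Gemini 1.5 Pro, and $6,750 on Claude 3 Opus. The gap to Opus widens as the share of output tokens grows, because its output price is five times GPT-4o's.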
The Voice Demo That Went Viral
The May 2024 demo showcased GPT-4o's voice capabilities:
- Real-time conversation: Response latency of 320 ms on average (vs. roughly 2-5 s with the earlier Voice Mode)
- Emotional range: Laughter, excitement, sarcasm, dramatic reading
- Singing: GPT-4o sang "Happy Birthday" with pitch variation
- Camera integration: Live commentary on what the camera sees
- Math tutoring: Guided a student through a problem without giving answers
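For developers, the vision side of this is reachable through the API. The example that follows passes a screenshot as a base64 data URL with the payload elided; one way to build that string from a local file is sketched here. The helper name `image_to_data_url` is an assumption for illustration, not part of the OpenAI SDK, and it uses only the standard library.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a data URL for an image_url content part."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Usage (hypothetical file name):
#   url = image_to_data_url("screenshot.png")
#   {"type": "image_url", "image_url": {"url": url}}
```

Guessing the MIME type from the file extension keeps the sketch short; a stricter version might inspect the file's header bytes instead.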
```python
# Using GPT-4o's multimodal API
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Vision: analyze an image
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's wrong with this code?"},
            {"type": "image_url", "image_url": {
                "url": "data:image/png;base64,..."  # Screenshot
            }}
        ]
    }]
)
print(response.choices[0].message.content)

# Audio: real-time voice conversation goes through the Realtime API
# (released later), which streams audio input/output over a WebSocket.
```
ChatGPT Desktop App
Alongside GPT-4o, OpenAI launched a macOS desktop app:
- Keyboard shortcut: Option+Space to invoke anywhere
- Screenshot analysis: Share your screen for AI analysis
- Voice mode: Hands-free conversation
- Code assistance: Paste code, get explanations and fixes
- System-wide: Works alongside any application
GPT-4o mini: The Efficiency King
Released July 18, 2024, GPT-4o mini replaced GPT-3.5 in ChatGPT, including on the free tier:
| Metric | GPT-4o mini | GPT-3.5 Turbo | GPT-4o |
|---|---|---|---|
| MMLU | 82.0% | 70.0% | 88.7% |
| HumanEval | 87.2% | 48.1% | 90.2% |
| MATH | 70.2% | 34.1% | 76.6% |
| Input cost | $0.15/1M | $0.50/1M | $5.00/1M |
| Output cost | $0.60/1M | $1.50/1M | $15.00/1M |
GPT-4o mini is about 33x cheaper than GPT-4o on input tokens and 25x cheaper on output tokens, while scoring well above GPT-3.5 Turbo on every benchmark in the table. This effectively ended GPT-3.5 Turbo as a viable choice and made near-GPT-4-class performance accessible to every developer.
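The size of the price gap depends on which kind of token you count, which is easy to check from the table. The sketch below computes the input and output ratios separately, plus a blended ratio for a given traffic mix. The 80% input share is an assumed example, and `price_ratio` and `blended_ratio` are illustrative helper names.

```python
# USD per 1M tokens, copied from the GPT-4o mini comparison table.
GPT_4O_MINI = {"input": 0.15, "output": 0.60}
GPT_4O = {"input": 5.00, "output": 15.00}


def price_ratio(kind: str) -> float:
    """How many times cheaper GPT-4o mini is for one token kind."""
    return GPT_4O[kind] / GPT_4O_MINI[kind]


def blended_ratio(input_share: float) -> float:
    """Price ratio for traffic where `input_share` of all tokens are input."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    output_share = 1.0 - input_share
    big = input_share * GPT_4O["input"] + output_share * GPT_4O["output"]
    small = input_share * GPT_4O_MINI["input"] + output_share * GPT_4O_MINI["output"]
    return big / small


print(f"{price_ratio('input'):.1f}x")   # 33.3x
print(f"{price_ratio('output'):.1f}x")  # 25.0x
print(f"{blended_ratio(0.8):.1f}x")     # 29.2x
```

For any realistic mix the saving lands between 25x and 33x, so the headline figure holds only for input-heavy workloads.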
The Realtime API
In October 2024, OpenAI released the Realtime API—enabling GPT-4o's voice capabilities for developers:
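```javascript
// WebSocket-based Realtime API (Node.js, using the `ws` package;
// browser WebSockets cannot set custom headers)
import WebSocket from 'ws';

// The model is selected in the URL query string.
const url = 'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview';
const ws = new WebSocket(url, {
  headers: {
    'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    'OpenAI-Beta': 'realtime=v1'
  }
});

// Illustrative tool definition; the name and schema are placeholders.
const lookupOrder = {
  type: 'function',
  name: 'lookup_order',
  description: 'Look up an order by its ID.',
  parameters: {
    type: 'object',
    properties: { order_id: { type: 'string' } },
    required: ['order_id']
  }
};

ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      voice: 'alloy',
      instructions: 'You are a helpful customer service agent.',
      tools: [lookupOrder]
    }
  }));
});

// Send audio chunks directly (base64-encoded); call only after 'open' fires
function sendAudioChunk(base64AudioChunk) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: base64AudioChunk
  }));
}
```
The Realtime API enables: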
- Voice assistants: Build Siri/Alexa-quality assistants
- Call centers: AI agents handling customer calls
- Translation: Real-time spoken translation
- Accessibility: Voice-controlled applications
- Education: Interactive tutoring with voice
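Most of these use cases start the same way: raw microphone audio has to be cut into chunks and wrapped in `input_audio_buffer.append` events before it goes over the socket. The sketch below builds those JSON messages with the standard library only. It assumes 24 kHz mono 16-bit PCM input and a 100 ms chunk size; both are illustrative choices to check against the current API reference. `audio_append_events` is a hypothetical helper name, and nothing is sent over the network here.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # assumed input format: 24 kHz, mono, 16-bit PCM
BYTES_PER_SAMPLE = 2


def audio_append_events(pcm: bytes, chunk_ms: int = 100) -> Iterator[str]:
    """Yield JSON-encoded input_audio_buffer.append events for raw PCM audio."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# One second of silence becomes ten 100 ms events.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
events = list(audio_append_events(one_second))
print(len(events))  # 10
```

Smaller chunks reduce the delay before the model hears the user but increase the number of messages, so the chunk size is a latency and overhead trade-off rather than a fixed value.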
Impact on the AI Industry
GPT-4o's release marked several industry shifts:
- Multimodal becomes standard: Major AI labs now build natively multimodal models
- Price compression: GPT-4o's 50% price cut was followed by cuts across the industry
- Voice AI: Conversational voice in the style of the film Her became technically achievable
- Free tier quality: GPT-4o mini gives free users GPT-4-class performance
- Developer accessibility: Lower prices opened AI to smaller projects and startups
GPT-4o demonstrated that the future of AI isn't text-only—it's a seamless blend of text, vision, and voice in a single, fast, affordable model.
Sources: OpenAI GPT-4o, OpenAI API Docs, GPT-4o mini


