
The Model That Changed AI Coding
On June 20, 2024, Anthropic released Claude 3.5 Sonnet—a model that immediately became the benchmark for AI coding and established Anthropic as a serious competitor to OpenAI. Updated on October 22, 2024 with significant improvements, Claude 3.5 Sonnet set a new standard for what AI models can do with code.
The surprise wasn't just the benchmarks—it was the practical coding experience. Developers reported that Claude 3.5 Sonnet consistently produced better, more complete code than GPT-4o, with fewer errors and better understanding of context.
Benchmark Performance
Claude 3.5 Sonnet's reported benchmarks (the SWE-bench Verified figure reflects the October update; the June release scored 33.4%):
| Benchmark | Claude 3.5 Sonnet | GPT-4o | Gemini 1.5 Pro | Claude 3 Opus |
|---|---|---|---|---|
| MMLU | 88.7% | 87.2% | 85.9% | 86.8% |
| HumanEval | 92.0% | 90.2% | 84.1% | 84.9% |
| MATH | 71.1% | 76.6% | 67.7% | 60.1% |
| GPQA | 65.0% | 53.6% | 46.2% | 50.4% |
| MGSM | 91.6% | 90.5% | 88.7% | 90.7% |
| SWE-bench Verified | 49.0% | 33.2% | 24.1% | 22.0% |
The SWE-bench Verified score of 49.0% is the standout metric. This benchmark tests whether AI can autonomously resolve real GitHub issues—reading codebases, understanding bugs, writing fixes, and passing tests. Claude 3.5 Sonnet resolves nearly half of these real-world issues without human intervention.
October Update: Even Better
The October 2024 update (claude-3-5-sonnet-20241022) improved significantly:
| Metric | June Release | October Update | Improvement |
|---|---|---|---|
| SWE-bench Verified | 33.4% | 49.0% | +47% |
| TAU-bench (Airline) | 36.0% | 54.0% | +50% |
| Agentic Coding | Good | Best-in-class | Major leap |
| Computer Use | N/A | Supported | New capability |
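The percentages in the Improvement column are relative gains over the June scores, easy to recompute from the table:

```python
# Relative improvement of the October update over the June release,
# using the scores from the table above.
def relative_gain(before: float, after: float) -> int:
    """Percentage improvement of `after` over `before`, rounded to an integer."""
    return round((after - before) / before * 100)

swe_bench = relative_gain(33.4, 49.0)    # SWE-bench Verified
tau_airline = relative_gain(36.0, 54.0)  # TAU-bench (Airline)
print(swe_bench, tau_airline)  # prints: 47 50
```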
The October update also introduced computer use—the ability to see and interact with computer interfaces through screenshots.
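As a sketch of what a computer-use request looks like: the October beta exposes a `computer_20241022` tool that tells Claude the screen dimensions it will see in screenshots. The tool type and beta flag below match Anthropic's October 2024 documentation; the display size and prompt are illustrative, and the request is shown as a plain dict rather than an SDK call.

```python
# Shape of a Messages API request body for the computer-use beta.
# Sent with the beta header "anthropic-beta: computer-use-2024-10-22".
request = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "tools": [
        {
            "type": "computer_20241022",   # screenshot + mouse/keyboard tool
            "name": "computer",
            "display_width_px": 1280,      # illustrative screen size
            "display_height_px": 800,
        }
    ],
    "messages": [
        {"role": "user", "content": "Open the settings page and enable dark mode."}
    ],
}
```

Claude replies with tool_use actions (take a screenshot, click, type) that your client executes locally and feeds back as tool results, looping until the task is done.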
Why Developers Prefer Claude 3.5 Sonnet
Beyond benchmarks, qualitative factors drive developer adoption:
1. Code Completeness: Claude generates more complete solutions. Where GPT-4o might provide a skeleton or pseudocode, Claude produces working, runnable code.
2. Context Understanding: Better at understanding large codebases. Given multiple files, Claude tracks dependencies, imports, and type relationships more accurately.
3. Error Handling: Generates more robust error handling by default. Production-ready code rather than demo-quality snippets.
4. Instruction Following: More precise at following complex specifications. Less "creative interpretation" of requirements.
```python
# Example: Claude 3.5 Sonnet generates production-ready code
# Prompt: "Create a retry decorator with exponential backoff"

import functools
import logging
import time
from typing import Callable, Tuple, Type

logger = logging.getLogger(__name__)


def retry(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    backoff_factor: float = 2.0,
    exceptions: Tuple[Type[Exception], ...] = (Exception,),
):
    """Retry a function with exponential backoff on the given exceptions."""
    def decorator(func: Callable):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    last_exception = e
                    if attempt == max_retries:
                        logger.error(
                            f"{func.__name__} failed after {max_retries + 1} "
                            f"attempts: {e}"
                        )
                        raise
                    delay = min(
                        base_delay * (backoff_factor ** attempt),
                        max_delay,
                    )
                    logger.warning(
                        f"{func.__name__} attempt {attempt + 1} failed: {e}. "
                        f"Retrying in {delay:.1f}s..."
                    )
                    time.sleep(delay)
            raise last_exception  # unreachable, but satisfies type checkers
        return wrapper
    return decorator
```
Artifacts: Interactive Development
Claude 3.5 Sonnet introduced "Artifacts"—interactive previews within the Claude interface:
- React components: Live preview of generated UI components
- HTML/CSS: Rendered web pages inline
- SVG graphics: Visual diagrams and charts
- Code sandboxes: Runnable code with output
- Documents: Formatted markdown with live editing
This transforms Claude from a "code printer" into a development environment where you can see, test, and iterate on generated code.
Pricing and Speed
| Model | Input (1M tokens) | Output (1M tokens) | Speed |
|---|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 | Fast |
| GPT-4o | $5.00 | $15.00 | Fast |
| Gemini 1.5 Pro | $1.25 | $5.00 | Medium |
| Claude 3 Opus | $15.00 | $75.00 | Slow |
Claude 3.5 Sonnet is positioned as the best value for coding: cheaper than GPT-4o on input, identical on output, and significantly stronger on coding benchmarks. It's also far faster and five times cheaper than Claude 3 Opus while outperforming it on most benchmarks.
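At these rates, per-request cost is simple to estimate. The token counts below are illustrative, not from the source:

```python
# Cost estimate at Claude 3.5 Sonnet's listed rates:
# $3.00 per 1M input tokens, $15.00 per 1M output tokens.
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 3.00, output_rate: float = 15.00) -> float:
    """Cost in dollars for one request at per-million-token rates."""
    return (input_tokens / 1_000_000 * input_rate
            + output_tokens / 1_000_000 * output_rate)

# e.g. a 20k-token codebase prompt with a 2k-token reply:
cost = request_cost(20_000, 2_000)
print(f"${cost:.3f}")  # prints: $0.090
```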
Impact on the AI Coding Landscape
Claude 3.5 Sonnet's release shifted the AI coding conversation:
- IDE integration: Cursor IDE and others prioritized Claude integration
- Agentic coding: The 49% SWE-bench score proved AI can handle real engineering
- Anthropic's reputation: Established as the "coder's AI" company
- Competition: Pushed OpenAI to accelerate o1 reasoning model development
- Enterprise adoption: Many companies switched their coding AI pipelines to Claude
The model demonstrated that the AI coding race isn't about general intelligence—it's about specialized capability. Claude 3.5 Sonnet may not be the best at every task, but for software engineering, it set a new bar.
Sources: Anthropic Blog, Claude API Docs, SWE-bench


