
10 Million Token Context
On April 5, 2025, Meta released Llama 4 in two variants: Scout (17B active parameters, 109B total) and Maverick (17B active, 400B total). Both are mixture-of-experts (MoE) models, the first in the Llama family to use that architecture, and together they represent Meta's most ambitious open-weights release to date.
The headline feature is Scout's 10 million token context window, the longest of any production model at release. To put this in perspective, that's several dozen full novels, the entire codebase of a large application, or years of conversation history processed in a single prompt.
Architecture: Mixture of Experts
Llama 4 marks Meta's transition from dense to Mixture of Experts (MoE) architecture:
| Specification | Scout | Maverick |
|---|---|---|
| Active Parameters | 17B | 17B |
| Total Parameters | 109B | 400B |
| Expert Count | 16 | 128 |
| Active Experts | 1 shared + 1 of 16 routed | 1 shared + 1 of 128 routed |
| Context Window | 10M tokens | 1M tokens |
| Training Tokens | ~40T | ~22T |
How MoE works in Llama 4:
```
Input Token
     ↓
Router Network (learned gating)
     ↓
┌─────────┬─────────┬─────────┬─────────┐
│Expert 1 │Expert 2 │Expert 3 │  ...16  │  (Scout)
└─────────┴─────────┴─────────┴─────────┘
     ↓  (only 1 routed expert activated per token)
Output Token

Benefit: 109B total knowledge, 17B compute cost
```

This means Scout runs with the computational cost of a 17B model while accessing the knowledge capacity of a 109B model. For inference, this translates to faster response times and lower hardware requirements than a comparably capable dense model.
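The routing step can be sketched in a few lines. Below is a toy top-1 router plus shared expert in NumPy; the matrix sizes, softmax gate, and gate-weighting are illustrative choices, not Meta's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 8, 16

# Each "expert" is just a linear map in this toy version.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
shared_expert = rng.standard_normal((d_model, d_model))
router = rng.standard_normal((d_model, n_experts))  # learned gating weights

def moe_forward(x):
    """x: (d_model,) token embedding -> (d_model,) output."""
    logits = x @ router                       # router score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax gate
    top = int(np.argmax(probs))               # top-1: only one routed expert runs
    routed = probs[top] * (x @ experts[top])  # gate-weighted expert output
    return routed + x @ shared_expert         # shared expert always runs

token = rng.standard_normal(d_model)
out = moe_forward(token)
```

Only one of the 16 expert matrices is multiplied per token, which is where the 17B-compute / 109B-capacity trade-off comes from.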
Benchmark Performance
Llama 4 models are competitive with much larger models:
| Benchmark | Scout (17B/109B) | Maverick (17B/400B) | Gemma 3 27B | GPT-4o mini |
|---|---|---|---|---|
| MMLU | 79.6 | 85.5 | 82.1 | 86.2 |
| MATH-500 | 82.1 | 90.3 | 76.4 | 84.5 |
| HumanEval | 74.1 | 84.6 | 78.0 | 86.7 |
| IFEval | 84.5 | 88.7 | 79.2 | 83.1 |
| Multilingual MMLU | 80.1 | 84.6 | 78.5 | 83.4 |
Scout achieves near-GPT-4o-mini performance while being an open-weights model that can run locally. Maverick pushes into GPT-4o territory on several benchmarks.
10 Million Token Context: Technical Achievement
The 10M context window in Scout is achieved through several innovations:
- Hierarchical attention: Different layers attend at different scales
- Sparse attention patterns: Not all tokens attend to all other tokens
- Efficient KV-cache: Compressed key-value storage for long contexts
- Training curriculum: Gradually increased context length during training
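One of the ingredients above, sparse attention, can be illustrated with a mask in which each token attends to a local causal window plus a few global "sink" tokens instead of its full prefix. The window and global sizes here are arbitrary, and this shows the general technique rather than Llama 4's exact (unpublished) scheme:

```python
import numpy as np

def sparse_causal_mask(seq_len, window=4, n_global=2):
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, max(0, i - window + 1):i + 1] = True   # local causal window
        mask[i, :min(n_global, i + 1)] = True          # global "sink" tokens
    return mask

mask = sparse_causal_mask(16)
dense_entries = 16 * 17 // 2   # a full causal mask attends to every prefix token
print(int(mask.sum()), "attended pairs vs", dense_entries, "for dense causal")
```

The gap widens rapidly with sequence length: the sparse cost grows linearly while dense causal attention grows quadratically, which is what makes multi-million-token contexts tractable.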
Practical applications for 10M token context:
- Entire codebase analysis: Load a full repository and ask questions
- Legal document review: Process hundreds of contracts simultaneously
- Book-length content: Analyze or generate novel-length text
- Long conversation agents: Maintain context across weeks of interaction
- Multi-document research: Synthesize information from dozens of papers
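For the codebase-analysis use case, a hypothetical helper might concatenate a repository into one prompt and estimate whether it fits in the window. The 4-characters-per-token ratio is a rough assumption; use a real tokenizer for accurate counts:

```python
from pathlib import Path

def pack_repo(root, suffixes=(".py", ".md", ".txt")):
    """Concatenate matching files under root into one prompt string."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            parts.append(f"### FILE: {path.name}\n{path.read_text(errors='ignore')}")
    prompt = "\n\n".join(parts)
    approx_tokens = len(prompt) // 4   # crude chars-per-token heuristic
    return prompt, approx_tokens
```

By this estimate, a repository under roughly 40 MB of text would fit inside Scout's 10M-token window.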
Multimodal Capabilities
Both Scout and Maverick natively support vision tasks:
- Image understanding: Describe, analyze, and reason about images
- Document parsing: Extract information from PDFs and screenshots
- Chart/graph analysis: Interpret data visualizations
- Multi-image reasoning: Compare and analyze multiple images
Meta trained the vision capabilities natively rather than bolting on a separate vision encoder, resulting in more coherent multimodal reasoning.
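As a sketch of how a vision request might look against a locally hosted model, the snippet below builds the JSON payload for Ollama's REST API (POST `http://localhost:11434/api/generate`), which accepts base64-encoded images. The `llama4-scout` tag matches the Ollama command shown below in "Running Locally"; adjust it to whatever tag you actually pulled:

```python
import base64
import json

def build_vision_request(img_bytes, prompt="Describe this image."):
    return json.dumps({
        "model": "llama4-scout",
        "prompt": prompt,
        # Ollama expects images as a list of base64-encoded strings:
        "images": [base64.b64encode(img_bytes).decode()],
        "stream": False,
    })
```

Send the result with any HTTP client; the model's text answer comes back in the `response` field.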
Running Locally
One of Llama 4's biggest advantages is local deployment:
```bash
# Using Ollama (easiest)
ollama run llama4-scout

# Using vLLM for production
pip install vllm
vllm serve meta-llama/Llama-4-Scout-17B-16E --tensor-parallel-size 2 --max-model-len 1048576

# Using llama.cpp for quantized inference
./llama-server -m llama-4-scout-Q4_K_M.gguf --ctx-size 131072 --n-gpu-layers 99
```

Hardware requirements:
| Model | Precision | VRAM Required | Minimum GPU |
|---|---|---|---|
| Scout | FP16 | ~220 GB | 3x A100 80GB |
| Scout | Q4_K_M | ~60 GB | 1x A100 80GB |
| Scout | Q4_K_M | ~60 GB | 2x RTX 4090 |
| Maverick | FP16 | ~800 GB | 10x A100 80GB |
| Maverick | Q4_K_M | ~220 GB | 3x A100 80GB |
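A back-of-the-envelope check of the table: model weights alone need roughly bytes-per-parameter times total parameters. The 0.55 bytes/param figure for Q4_K_M (~4.5 bits including quantization overhead) is an approximation, and KV-cache plus activations add more on top, so treat these as floors:

```python
def weight_gb(total_params_billion, bytes_per_param):
    """Approximate weight storage in GB for a model of the given size."""
    return total_params_billion * bytes_per_param

print(weight_gb(109, 2.0))    # Scout FP16: ~218 GB, matching the ~220 GB row
print(weight_gb(400, 2.0))    # Maverick FP16: ~800 GB
print(weight_gb(109, 0.55))   # Scout Q4_K_M: ~60 GB
```

This is also why MoE helps at inference time: compute scales with the 17B active parameters, but VRAM must still hold all 109B (or 400B) weights.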
Open Source Impact
Llama 4 continues Meta's strategy of open-weights AI, with significant implications:
- Researchers can study MoE architectures at frontier scale
- Startups can build products without API dependency
- Enterprises can deploy on-premise for data privacy
- Developers can fine-tune for domain-specific tasks
The Llama license allows commercial use with some restrictions (companies with 700M+ monthly users need a special license).
Comparison with Competitors
| Feature | Llama 4 Scout | Mixtral 8x7B | DeepSeek V3 | GPT-4o mini |
|---|---|---|---|---|
| Open Weights | Yes | Yes | Yes | No |
| Architecture | MoE | MoE | MoE | Unknown |
| Max Context | 10M | 32K | 128K | 128K |
| Local Deploy | Yes | Yes | Yes | No |
| Vision | Native | No | No | Yes |
| License | Llama License | Apache 2.0 | MIT | Proprietary |
What This Means for AI Development
Llama 4 demonstrates that open-source models can compete with proprietary ones on capability while offering superior flexibility. The MoE architecture shows that scaling doesn't require proportionally more compute at inference time, and the 10M context window opens entirely new application categories.
Sources: Meta AI Blog, Llama 4 Model Card, Hugging Face Meta-Llama


