Meta Llama 4 Scout and Maverick: 10 Million Token Context with MoE Architecture

On April 5, 2025, Meta released Llama 4 in two variants: Scout (17B active params, 109B total) and Maverick (17B active, 400B total MoE). These models represent Meta's most ambitious open-weights AI release yet, introducing mixture-of-experts architecture to the Llama family for the first time.

The headline feature is Scout's 10 million token context window, the longest of any production model at release. To put this in perspective, 10M tokens is roughly 7.5 million words: on the order of 75 average-length novels, the entire codebase of a large application, or years of conversation history processed in a single prompt.
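The back-of-envelope arithmetic behind those figures is simple; the conversion factors below (roughly 0.75 English words per token, ~100k words per novel) are common rules of thumb, not exact values:

```python
# Back-of-envelope context-window arithmetic.
# Assumptions: ~0.75 English words per token, ~100k words per novel.
WORDS_PER_TOKEN = 0.75
WORDS_PER_NOVEL = 100_000

context_tokens = 10_000_000
words = context_tokens * WORDS_PER_TOKEN    # 7,500,000 words
novels = words / WORDS_PER_NOVEL            # ~75 novels

print(f"{words:,.0f} words ≈ {novels:.0f} novels")
```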

Architecture: Mixture of Experts

Llama 4 marks Meta's transition from dense to Mixture of Experts (MoE) architecture:

| Specification | Scout | Maverick |
|---|---|---|
| Active Parameters | 17B | 17B |
| Total Parameters | 109B | 400B |
| Expert Count | 16 | 128 |
| Active Experts | 1 shared + 1 of 16 routed | 1 shared + top-1 of 128 routed |
| Context Window | 10M tokens | 1M tokens |
| Training Tokens | ~40T | ~22T |

How MoE works in Llama 4:

```text
Input Token
    ↓
Router Network (learned gating)
    ↓
┌─────────┬─────────┬─────────┬─────────┐
│Expert 1 │Expert 2 │Expert 3 │  ...16  │  (Scout)
└─────────┴─────────┴─────────┴─────────┘
    ↓ (only 1 routed expert activated per token)
Output Token

Benefit: 109B total knowledge, 17B compute cost
```

This means Scout runs with the computational cost of a 17B model while accessing the knowledge capacity of a 109B model. For inference, this translates to faster response times and lower hardware requirements than a comparably capable dense model.
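The routing described above can be sketched in a few lines. The toy layer below uses hypothetical dimensions and a simplified gate; it is not Meta's implementation, only an illustration of how each token pays for one routed expert plus the shared one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: 16 routed experts + 1 shared expert, top-1 routing.
# Dimensions are illustrative assumptions, not Llama 4's actual config.
D, N_EXPERTS = 8, 16

W_router = rng.standard_normal((D, N_EXPERTS))
experts = rng.standard_normal((N_EXPERTS, D, D))  # one weight matrix per routed expert
shared = rng.standard_normal((D, D))              # shared expert, always active

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to the shared expert plus its top-1 routed expert."""
    logits = x @ W_router                    # (tokens, N_EXPERTS) gating scores
    top1 = logits.argmax(axis=-1)            # top-1 expert index per token
    gate = np.exp(logits - logits.max(-1, keepdims=True))
    gate /= gate.sum(-1, keepdims=True)      # softmax gate weights

    out = x @ shared                         # shared expert runs for every token
    for i, e in enumerate(top1):             # only 1 of 16 routed experts per token
        out[i] += gate[i, e] * (x[i] @ experts[e])
    return out

tokens = rng.standard_normal((4, D))         # 4 tokens
y = moe_forward(tokens)
print(y.shape)                               # (4, 8): same shape, a fraction of the compute
```

Only 1 of the 16 routed expert matrices is multiplied per token, which is the source of the "109B knowledge, 17B compute" asymmetry.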

Benchmark Performance

Llama 4 models are competitive with much larger models:

| Benchmark | Scout (17B/109B) | Maverick (17B/400B) | Gemma 3 27B | GPT-4o mini |
|---|---|---|---|---|
| MMLU | 79.6 | 85.5 | 82.1 | 86.2 |
| MATH-500 | 82.1 | 90.3 | 76.4 | 84.5 |
| HumanEval | 74.1 | 84.6 | 78.0 | 86.7 |
| IFEval | 84.5 | 88.7 | 79.2 | 83.1 |
| Multilingual MMLU | 80.1 | 84.6 | 78.5 | 83.4 |

Scout achieves near-GPT-4o-mini performance while being an open-weights model that can run locally. Maverick pushes into GPT-4o territory on several benchmarks.

10 Million Token Context: Technical Achievement

The 10M context window in Scout is achieved through several innovations:

  1. Hierarchical attention: Different layers attend at different scales
  2. Sparse attention patterns: Not all tokens attend to all other tokens
  3. Efficient KV-cache: Compressed key-value storage for long contexts
  4. Training curriculum: Gradually increased context length during training
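Point 3 matters because a naive KV-cache at 10M tokens is enormous. A rough sizing sketch makes this concrete; the layer and head counts below are illustrative assumptions, not Scout's published configuration:

```python
# Rough KV-cache sizing for a long-context decoder.
# Layer/head counts are illustrative assumptions, not Scout's config.
def kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2 tensors (K and V) per layer, each of shape (tokens, kv_heads, head_dim)
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

gib = kv_cache_bytes(10_000_000) / 2**30
print(f"~{gib:,.0f} GiB of uncompressed KV-cache at 10M tokens")
```

Even with grouped-query attention (the 8 KV heads above), an uncompressed cache at 10M tokens runs to terabytes of memory, which is why compression and sparse attention are required rather than optional.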

Practical applications for 10M token context:

  • Entire codebase analysis: Load a full repository and ask questions
  • Legal document review: Process hundreds of contracts simultaneously
  • Book-length content: Analyze or generate novel-length text
  • Long conversation agents: Maintain context across weeks of interaction
  • Multi-document research: Synthesize information from dozens of papers

Multimodal Capabilities

Both Scout and Maverick natively support vision tasks:

  • Image understanding: Describe, analyze, and reason about images
  • Document parsing: Extract information from PDFs and screenshots
  • Chart/graph analysis: Interpret data visualizations
  • Multi-image reasoning: Compare and analyze multiple images

Meta trained the vision capabilities natively rather than bolting on a separate vision encoder, resulting in more coherent multimodal reasoning.

Running Locally

One of Llama 4's biggest advantages is local deployment:

```bash
# Using Ollama (easiest)
ollama run llama4:scout

# Using vLLM for production
pip install vllm
vllm serve meta-llama/Llama-4-Scout-17B-16E \
  --tensor-parallel-size 2 \
  --max-model-len 1048576

# Using llama.cpp for quantized inference
./llama-server -m llama-4-scout-Q4_K_M.gguf \
  --ctx-size 131072 --n-gpu-layers 99
```

Hardware requirements:

| Model | Precision | VRAM Required | Minimum GPU |
|---|---|---|---|
| Scout | FP16 | ~220 GB | 3x A100 80GB |
| Scout | Q4_K_M | ~60 GB | 1x A100 80GB or 2x RTX 4090 |
| Maverick | FP16 | ~800 GB | 10x A100 80GB |
| Maverick | Q4_K_M | ~220 GB | 3x A100 80GB |
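The VRAM figures above follow from simple weight-size arithmetic (total parameters × bytes per parameter), ignoring activation memory and KV-cache overhead; the ~4.5 bits/weight figure for Q4_K_M is an approximation:

```python
# Approximate weight memory: total params (billions) × bytes per parameter.
# Ignores activations and KV-cache; Q4_K_M ≈ 4.5 bits/weight is approximate.
def weight_gb(total_params_b, bits_per_param):
    return total_params_b * (bits_per_param / 8)

print(f"Scout FP16:    ~{weight_gb(109, 16):.0f} GB")   # ~218 GB
print(f"Scout Q4_K_M:  ~{weight_gb(109, 4.5):.0f} GB")  # ~61 GB
print(f"Maverick FP16: ~{weight_gb(400, 16):.0f} GB")   # ~800 GB
```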

Open Source Impact

Llama 4 continues Meta's strategy of open-weights AI, with significant implications:

  • Researchers can study MoE architectures at frontier scale
  • Startups can build products without API dependency
  • Enterprises can deploy on-premise for data privacy
  • Developers can fine-tune for domain-specific tasks

The Llama license allows commercial use with some restrictions (companies with 700M+ monthly users need a special license).

Comparison with Competitors

| Feature | Llama 4 Scout | Mistral Mixtral | DeepSeek V3 | GPT-4o mini |
|---|---|---|---|---|
| Open Weights | Yes | Yes | Yes | No |
| Architecture | MoE | MoE | MoE | Unknown |
| Max Context | 10M | 32K | 128K | 128K |
| Local Deploy | Yes | Yes | Yes | No |
| Vision | Native | No | No | Yes |
| License | Llama License | Apache 2.0 | MIT | Proprietary |

What This Means for AI Development

Llama 4 demonstrates that open-source models can compete with proprietary ones on capability while offering superior flexibility. The MoE architecture shows that scaling doesn't require proportionally more compute at inference time, and the 10M context window opens entirely new application categories.

Sources: Meta AI Blog, Llama 4 Model Card, Hugging Face Meta-Llama