Safety
HarmBench
Standardized evaluation of model refusal behavior across 510 unique harmful scenarios.
Metric
Percentage
Max score
100
Maintainer
Center for AI Safety
Models scored
19
Why it matters
HarmBench gives red-teaming a common vocabulary and makes safety improvements comparable across models.
Model rankings
- 01 · 92.1% · Claude Opus 4.7
- 02 · 90.5% · Claude Sonnet 4.6
- 03 · 89.7% · Gemini 2.5 Ultra
- 04 · 89.1% · Inflection 3
- 05 · 88.9% · GPT-5 Turbo
- 06 · 88.3% · Command A
- 07 · 88.0% · Llama 4 Behemoth
- 08 · 87.3% · Claude Haiku 4.5
- 09 · 86.7% · Nova Pro
- 10 · 86.5% · Mistral X-Large
- 11 · 86.1% · Gemini 2.5 Flash
- 12 · 85.8% · DeepSeek V4
- 13 · 85.5% · Qwen 3 Max
- 14 · 85.4% · GPT-5 Mini
- 15 · 85.0% · Grok 4
- 16 · 84.6% · Sonar Large
- 17 · 84.1% · Llama 4 Scout
- 18 · 83.0% · Jamba 2
- 19 · 82.2% · Phi-5