Why this benchmark exists, who built it, and the principles that guide it.
Millions of people now ask AI models questions about God, Jesus, and the meaning of life. But no one is systematically measuring whether these models represent orthodox Christianity faithfully.
Existing benchmarks test knowledge (MMLU), reasoning (GPQA), and coding (HumanEval), but none of them tests theological fidelity. GospelBench fills that gap: it is a rigorous, confessional benchmark that checks not just religious trivia, but whether a model will affirm that "Jesus rose from the dead" is true.
We don't pretend to be theologically neutral. GospelBench evaluates from a specific, clearly stated orthodox Protestant position. Transparency over false objectivity.
Multiple tracks, baseline comparison, 13-dimension scoring, hard-fail conditions. This is a research instrument, not a popularity contest.
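As a rough illustration of how per-dimension scores and hard-fail conditions could combine, here is a minimal Python sketch. Only the count of 13 dimensions comes from the description above. The 0-10 scale, the unweighted mean, the choice to zero the score on any hard fail, and every name in the code are assumptions made for illustration, not GospelBench's published rubric.

```python
from dataclasses import dataclass, field
from statistics import mean

# The text states 13 scoring dimensions; their names are not given here.
NUM_DIMENSIONS = 13


@dataclass
class JudgedResponse:
    """One judged model answer (hypothetical structure)."""
    # One score per dimension. The 0-10 scale is an assumption.
    dimension_scores: list[float]
    # Labels of any hard-fail conditions the judge triggered (hypothetical).
    hard_fails: list[str] = field(default_factory=list)


def overall_score(resp: JudgedResponse) -> float:
    """Combine dimension scores, letting any hard fail override them.

    Assumption: a hard fail zeroes the response and the remaining
    dimensions are averaged without weights.
    """
    if len(resp.dimension_scores) != NUM_DIMENSIONS:
        raise ValueError(
            f"expected {NUM_DIMENSIONS} scores, got {len(resp.dimension_scores)}"
        )
    if resp.hard_fails:
        return 0.0
    return mean(resp.dimension_scores)


if __name__ == "__main__":
    clean = JudgedResponse(dimension_scores=[8.0] * NUM_DIMENSIONS)
    failed = JudgedResponse(
        dimension_scores=[9.0] * NUM_DIMENSIONS,
        hard_fails=["example-hard-fail"],  # placeholder label
    )
    print(overall_score(clean))
    print(overall_score(failed))
```

The point of the gate is that a high average cannot mask a hard fail: the second response scores well on every dimension but is still zeroed.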
Questions, baselines, rubrics, and judge prompts are all published. Anyone can verify or challenge our methodology.
A single snapshot is interesting. Quarterly tracking over years is valuable. GospelBench is designed for the long term.