What GospelBench is, what it isn't, and why it exists.
GospelBench is a benchmark that tests how AI language models engage with essential Protestant Christian theology.
It is not a theology exam — it documents what models do, not what they "should" do. Every finding is a behavioral observation. High scores and low scores are equally valuable data points.
11 models from 8 labs and 3 countries (China, France, US). 880 prompts administered. 303 scored responses.
"Here's what the model does" — never "here's where it failed." Every score is behavioral data, not a theological verdict.
GospelBench is explicitly Protestant in doctrinal stance: penal substitutionary atonement, sola fide (justification by faith alone), and ordinance ecclesiology. It states this position openly and does not claim broad "mere Christianity" neutrality.
The Reformed perspective informs what questions are asked and how engagement is measured — but never determines how models are scored. All findings are behavioral data, not pass/fail assessments.
| Metric | Value |
| --- | --- |
| Models tested | 11 |
| AI labs | 8 |
| Countries (labs per country) | China (5), France (1), US (2) |
| Prompts administered | 880 |
| Scored responses | 303 |
| Questions per model | 16, each asked as 5 question types (80 prompts total) |
| Scoring consistency | Over 96% agreement across judges |
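The prompt counts above follow from simple multiplication, which the short sketch below spells out. It assumes every model receives the full question set once per question type; the function and constant names are illustrative and are not taken from the GospelBench pipeline.

```python
# Figures copied from the summary table; names here are hypothetical.
MODELS = 11
QUESTIONS = 16
QUESTION_TYPES = 5


def prompts_per_model(questions: int, question_types: int) -> int:
    """Each question is asked once per question type."""
    return questions * question_types


def total_prompts(models: int, questions: int, question_types: int) -> int:
    """Assumes every model receives the full prompt set."""
    return models * prompts_per_model(questions, question_types)


if __name__ == "__main__":
    print(prompts_per_model(QUESTIONS, QUESTION_TYPES))      # 80
    print(total_prompts(MODELS, QUESTIONS, QUESTION_TYPES))  # 880
```

Both results match the published figures: 80 prompts per model and 880 prompts administered across 11 models.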
GospelBench is expanding its model coverage. If you build or host AI models, here is what participation means.
Your models receive rigorous, reproducible scoring across 16 questions and 5 question types. Results are published with full methodology — no black boxes, no editorial spin.
Your models appear on the results leaderboard alongside models from 8 leading labs in 3 countries. Every visitor sees your scores in context.
We handle the testing pipeline end to end. You provide API access, or we use your public endpoints. No engineering effort is required from your team.
Interested? Contact trent@gospelbench.com to discuss partnership.
GospelBench was created to answer a simple question: when people ask AI about core Christian doctrine, what do they actually get back? The answer turned out to be more nuanced — and more universal — than expected.
The questions are locked. The scoring criteria are published. Every finding is reproducible. See the methodology page for the full approach.