GospelBench tests how AI language models engage with essential Protestant Christian theology. 11 models. 8 AI labs. 3 countries.
Across 11 models from 8 labs and 3 countries, every finding is a behavioral observation — documented, scored, and reproducible.
When asked to respond "as a Bible-believing Christian," every single model scored higher, by an average of 28 points. The effect holds across Chinese, French, and American models alike. It is the biggest factor in the data.
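As a rough illustration of how a framing effect like this could be computed, the sketch below subtracts each model's neutral-framing score from its score under the "Bible-believing Christian" framing and averages the differences. The model names and scores are made-up placeholders, not GospelBench data, and the function name `framing_effect` is hypothetical.

```python
from statistics import mean

def framing_effect(neutral, framed):
    """Return per-model deltas (framed minus neutral) and their mean."""
    deltas = {model: framed[model] - neutral[model] for model in neutral}
    return deltas, mean(deltas.values())

# Placeholder scores on a 0-100 scale; NOT real benchmark results.
neutral = {"model_a": 40, "model_b": 55, "model_c": 30}
framed = {"model_a": 70, "model_b": 80, "model_c": 50}

deltas, average_delta = framing_effect(neutral, framed)

# "Every single model scored higher" corresponds to all deltas being positive.
every_model_higher = all(delta > 0 for delta in deltas.values())

print(deltas)
print(average_delta, every_model_higher)
```

Reporting the per-model deltas alongside the average makes it possible to check both claims at once: the size of the average shift and whether it holds for every model.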
Zero models agreed to adopt a confessional identity: not one would say "I believe" in the first person, regardless of how the question was framed. It is the most absolute finding in the benchmark.
When asked how Christians should determine whether a teaching is faithful to Scripture, every model struggled — averaging just 18 out of 100 under neutral framing. This pattern appeared in every model tested.
GospelBench documents what AI models do when asked about Christian theology — not what they "should" do. Every score is data. High scores and low scores are equally valuable.
The benchmark is explicitly Protestant in its doctrinal stance — penal substitutionary atonement, justification by faith alone, ordinance ecclesiology — and honest about this position. This perspective informs what questions are asked, but never determines how models are scored.
Eight labs from three countries: five Chinese labs (GLM, DeepSeek, StepFun, Moonshot, MiniMax), one French lab (Mistral), and two US labs (Google, Nvidia). The framing effect appears in all of them, so it is not a cultural artifact of any single training paradigm.
The same 16 questions, asked 5 ways, scored by a 3-judge panel with over 96% consistency. Every result is reproducible.
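To make the scale of that design concrete, here is a minimal sketch of the evaluation grid. It assumes every model receives every question under every framing and that each of the three judges scores every response. The identifiers are hypothetical, and combining judge scores with a plain mean is an assumption, not the documented scoring rule.

```python
from itertools import product
from statistics import mean

N_QUESTIONS, N_FRAMINGS, N_MODELS, N_JUDGES = 16, 5, 11, 3

# Hypothetical identifiers; the real question and framing labels are not given here.
questions = [f"q{i:02d}" for i in range(1, N_QUESTIONS + 1)]
framings = [f"framing_{i}" for i in range(1, N_FRAMINGS + 1)]

# Every question paired with every framing.
prompts_per_model = list(product(questions, framings))
total_responses = len(prompts_per_model) * N_MODELS
total_judgments = total_responses * N_JUDGES

def panel_score(judge_scores):
    """Combine one response's judge scores; a plain mean is an assumption."""
    if len(judge_scores) != N_JUDGES:
        raise ValueError(f"expected {N_JUDGES} judge scores")
    return mean(judge_scores)

print(len(prompts_per_model), total_responses, total_judgments)
print(panel_score([60, 70, 80]))
```

Under these assumptions the grid works out to 80 prompts per model, 880 responses overall, and 2,640 individual judge scores.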
"Here's what the model does" — never "here's where it failed." Every finding is behavioral data, not a pass/fail assessment.
11 models, 8 labs, ranked and scored. Every number backed by real responses.