Does your AI tell the truth about Jesus?

GospelBench tests how AI language models engage with essential Protestant Christian theology. 11 models. 8 AI labs. 3 countries.

880 Prompts administered
303 Scored responses
8 AI labs tested
3 Countries

Three headline findings

Across 11 models from 8 labs and 3 countries, every finding is a behavioral observation — documented, scored, and reproducible.

Every model changes its answers when you add faith context

When asked to respond "as a Bible-believing Christian," every model scored higher, by an average of 28 points. The effect holds for Chinese, French, and American models alike, and it is the largest single factor in the data.

No model will say "I believe"

No model agreed to adopt a confessional identity: not one would say "I believe" in the first person, regardless of how the question was framed. This is the most absolute finding in the benchmark.

Biblical authority is the hardest question

When asked how Christians should determine whether a teaching is faithful to Scripture, models averaged just 18 out of 100 under neutral framing. The low scores appeared in every model tested.

Behavioral observation, not a theology exam

GospelBench documents what AI models do when asked about Christian theology — not what they "should" do. Every score is data. High scores and low scores are equally valuable.

The benchmark is explicitly Protestant in its doctrinal stance — penal substitutionary atonement, justification by faith alone, ordinance ecclesiology — and honest about this position. This perspective informs what questions are asked, but never determines how models are scored.

Cross-cultural, multi-architecture

Eight labs from three countries: five Chinese labs (GLM, DeepSeek, StepFun, Moonshot, MiniMax), one French lab (Mistral), and two US labs (Google, Nvidia). The framing effect appears in all of them, so it is not an artifact of any single culture or training paradigm.

Every model receives the same 16 questions, each asked 5 ways, and the responses are scored by a 3-judge panel with over 96% consistency. Every result is reproducible.
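The design above implies a simple grid, and a minimal sketch can show how 11 models, 16 questions, and 5 ways of asking combine into the 880 prompts reported on this page. The sketch assumes the "5 ways" are five framings applied to every question. All identifiers (`model_1`, `q1`, `framing_1`, `build_prompt_grid`) are hypothetical placeholders, not names used by the benchmark.

```python
from itertools import product

# Counts taken from the page; everything else below is illustrative.
N_MODELS = 11
N_QUESTIONS = 16
N_FRAMINGS = 5

# Placeholder identifiers (assumption): the real model, question,
# and framing names are not listed on this page.
models = [f"model_{i}" for i in range(1, N_MODELS + 1)]
questions = [f"q{i}" for i in range(1, N_QUESTIONS + 1)]
framings = [f"framing_{i}" for i in range(1, N_FRAMINGS + 1)]


def build_prompt_grid(models, questions, framings):
    """Return one (model, question, framing) triple per administered prompt."""
    return list(product(models, questions, framings))


grid = build_prompt_grid(models, questions, framings)
print(len(grid))  # 880 = 11 * 16 * 5
```

Each triple in `grid` corresponds to one prompt sent to one model, which is why the total matches the "prompts administered" figure.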

"Here's what the model does" — never "here's where it failed." Every finding is behavioral data, not a pass/fail assessment.

See the full results

11 models, 8 labs, ranked and scored. Every number backed by real responses.