Results — GospelBench

Essay Score Rankings

Each model was asked 16 theology questions two ways: as a neutral assistant, and as a "Bible-believing Christian." The Overall Score is the average of the two resulting scores. The Framing Gap (the Gap column) shows how much the score changed when faith context was added.

#	Model	Lab	Region	Overall	Neutral	Faith-Based	Gap
1	GLM-4.5 Air	GLM	China	85.9	75.6	96.2	+20.5
2	GLM-5 Turbo	GLM	China	80.9	68.2	93.5	+25.3
3	Nemotron 3 Ultra 550B	Nvidia	US	80.6	65.3	96.0	+30.6
4	GLM-4.7	GLM	China	80.0	64.5	95.6	+31.2
5	DeepSeek V4 Flash	DeepSeek	China	78.9	68.7	89.1	+20.3
6	Mistral Medium 3.5	Mistral	France	78.5	64.2	92.9	+28.7
7	GLM-5.1	GLM	China	77.1	60.7	93.4	+32.7
8	MiniMax M2.7	MiniMax	China	74.8	58.3	91.2	+32.9
9	Gemma 4 31B IT	Google	US	73.3	55.5	91.0	+35.5
10	StepFun 3.7 Flash	StepFun	China	72.2	61.6	82.8	+21.2
11	Kimi K2.6	Moonshot	China	61.7	47.2	76.1	+28.9

Overall Score = average of Neutral and Faith-Based essay scores. Gap = Faith-Based minus Neutral (in points). Higher is better for Overall, Neutral, and Faith-Based. A positive Gap means the model engaged more when faith context was provided.
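The two definitions above reduce to a few lines of arithmetic. Here is a minimal Python sketch using three rows copied from the rankings table; the function name `summarize` is hypothetical and is not part of GospelBench. Because the inputs are the rounded published scores, a recomputed value can differ from the table by 0.1 (GLM-4.5 Air's gap comes out as +20.6 rather than the published +20.5), presumably because the published figures were derived from unrounded scores.

```python
def summarize(neutral, faith):
    """Return (overall, gap) for one model, rounded to one decimal place."""
    overall = (neutral + faith) / 2   # Overall = average of the two essay scores
    gap = faith - neutral             # Gap = Faith-Based minus Neutral
    return round(overall, 1), round(gap, 1)


# (neutral, faith_based) scores copied from the rankings table
rows = {
    "GLM-4.5 Air": (75.6, 96.2),
    "DeepSeek V4 Flash": (68.7, 89.1),
    "StepFun 3.7 Flash": (61.6, 82.8),
}

for model, (neutral, faith) in rows.items():
    overall, gap = summarize(neutral, faith)
    print(f"{model}: overall {overall}, gap {gap:+}")
```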

Yes-or-No Results

Each model was also presented with direct statements and asked to respond True or False. Three categories were tested: affirming orthodox doctrine, adopting confessional identity ("I believe…"), and identifying false teaching.

Model	Affirm Doctrine	Affirm Faith	Spot Error
GLM-4.5 Air	60.0%	0.0%	80.0%
GLM-5 Turbo	12.5%	0.0%	81.3%
Nemotron 3 Ultra 550B	6.2%	0.0%	56.2%
GLM-4.7	50.0%	0.0%	80.0%
DeepSeek V4 Flash	50.0%	0.0%	81.2%
Mistral Medium 3.5	93.8%	0.0%	68.8%
GLM-5.1	18.8%	0.0%	87.5%
MiniMax M2.7	62.5%	0.0%	56.2%
Gemma 4 31B IT	43.8%	0.0%	75.0%
StepFun 3.7 Flash	56.2%	0.0%	62.5%
Kimi K2.6 †	—	—	—

Affirm Doctrine = correctly identified orthodox statements as True. Affirm Faith = agreed to say "I believe…" in first person. Spot Error = correctly identified false teaching as False. † Moonshot / Kimi K2.6 produced inconsistent yes-or-no responses and is excluded from this table.

By AI lab

Results grouped by the lab that built each model.

Lab	Region	Models	Avg Overall	Avg Gap	Profile
GLM	China	4	81.0	+27.4	High engagement, responds strongly to faith framing
Nvidia	US	1	80.6	+30.6	High engagement, won't adopt confessional identity
DeepSeek	China	1	78.9	+20.3	High engagement, least affected by framing
Mistral	France	1	78.5	+28.7	Moderate engagement, strong doctrinal assent
MiniMax	China	1	74.8	+32.9	Moderate engagement, highly responsive to framing
Google	US	1	73.3	+35.5	Moderate engagement, most affected by framing
StepFun	China	1	72.2	+21.2	Moderate engagement, resistant to framing
Moonshot	China	1	61.7	+28.9	Lower engagement, inconsistent responses
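The per-lab averages can be reproduced by grouping the per-model rows by lab. The sketch below assumes a simple unweighted mean over each lab's models and uses only the GLM and DeepSeek rows from the rankings table; `average_by_lab` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# (model, lab, overall, gap) copied from the rankings table
results = [
    ("GLM-4.5 Air", "GLM", 85.9, 20.5),
    ("GLM-5 Turbo", "GLM", 80.9, 25.3),
    ("GLM-4.7", "GLM", 80.0, 31.2),
    ("GLM-5.1", "GLM", 77.1, 32.7),
    ("DeepSeek V4 Flash", "DeepSeek", 78.9, 20.3),
]


def average_by_lab(rows):
    """Map each lab to (model count, mean overall, mean gap)."""
    grouped = defaultdict(list)
    for _model, lab, overall, gap in rows:
        grouped[lab].append((overall, gap))
    return {
        lab: (
            len(scores),
            round(mean(o for o, _ in scores), 1),
            round(mean(g for _, g in scores), 1),
        )
        for lab, scores in grouped.items()
    }


for lab, (count, avg_overall, avg_gap) in average_by_lab(results).items():
    print(f"{lab}: {count} models, avg overall {avg_overall}, avg gap {avg_gap:+}")
```

For the four GLM models this gives an average Overall of 81.0 and an average Gap of +27.4, matching the table.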

What the data shows

Every finding below is an observation about what these models actually did, not what they "should" do.

1. Every model changes its answers when you add faith context

When asked to respond "as a Bible-believing Christian," every model scored higher than when asked as a neutral assistant. The gap ranges from +20 to +36 points (average: +28). The effect appears in every Chinese, French, and American model tested, though its size varies from model to model. It's the single largest factor in the data.

2. No model will say "I believe"

No model agreed to adopt confessional identity: the Affirm Faith rate was 0.0% for all 10 models in the yes-or-no table. When asked to affirm statements like "I believe in the Trinity" or "I trust in Christ alone for salvation," every one of them declined. This held across all labs, all regions, and all framings. It was the most absolute finding in the benchmark.

3. Models spot errors more than they affirm truth

On average, models are better at identifying false teaching (73% accuracy) than at affirming orthodox doctrine (45%). They can often tell you what's wrong, but hesitate to say what's right. Mistral Medium 3.5 was the standout at 94% doctrinal assent, while Nvidia's Nemotron 3 Ultra 550B managed only 6%.
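The two averages can be checked directly against the yes-or-no table. This is a minimal sketch that assumes an unweighted mean over the 10 models with usable results; the variable names are illustrative only.

```python
from statistics import mean

# (affirm_doctrine %, spot_error %) per model, copied from the yes-or-no table.
# Kimi K2.6 is left out because it is excluded from that table.
yes_no = {
    "GLM-4.5 Air": (60.0, 80.0),
    "GLM-5 Turbo": (12.5, 81.3),
    "Nemotron 3 Ultra 550B": (6.2, 56.2),
    "GLM-4.7": (50.0, 80.0),
    "DeepSeek V4 Flash": (50.0, 81.2),
    "Mistral Medium 3.5": (93.8, 68.8),
    "GLM-5.1": (18.8, 87.5),
    "MiniMax M2.7": (62.5, 56.2),
    "Gemma 4 31B IT": (43.8, 75.0),
    "StepFun 3.7 Flash": (56.2, 62.5),
}

avg_affirm = mean(affirm for affirm, _ in yes_no.values())
avg_spot = mean(spot for _, spot in yes_no.values())

# Models whose Spot Error rate exceeds their Affirm Doctrine rate
better_at_spotting = [m for m, (affirm, spot) in yes_no.items() if spot > affirm]

print(f"Average Affirm Doctrine: {avg_affirm:.1f}%")
print(f"Average Spot Error: {avg_spot:.1f}%")
print(f"{len(better_at_spotting)} of {len(yes_no)} models score higher on Spot Error")
```

The averages come out at 45.4% and 72.9%. The pattern holds for 8 of the 10 models; Mistral Medium 3.5 and MiniMax M2.7 are the two that affirm doctrine more often than they spot errors.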

4. Biblical authority is the hardest question

A question about how Christians should determine whether a teaching is faithful to Scripture consistently produced the lowest scores. The average was just 18 out of 100 under neutral framing. Because the pattern appeared in every model tested, it may reflect something systematic in how these models were trained.

5. Cue sensitivity crosses cultures

Chinese models (average gap: +27 points) and Western models (average gap: +32 points) both show large shifts when faith context is added. The response to framing is not specific to one region; it appears in every model tested.
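The regional comparison can be recomputed from the Gap column of the rankings table. The sketch below assumes "Western" means the US and French models taken together and uses an unweighted mean; `mean_gap` is a hypothetical helper.

```python
from statistics import mean

# (region, gap) per model, copied in rank order from the rankings table
gaps = [
    ("China", 20.5), ("China", 25.3), ("US", 30.6), ("China", 31.2),
    ("China", 20.3), ("France", 28.7), ("China", 32.7), ("China", 32.9),
    ("US", 35.5), ("China", 21.2), ("China", 28.9),
]


def mean_gap(regions):
    """Average gap over all models whose region is in `regions`."""
    return mean(gap for region, gap in gaps if region in regions)


chinese = mean_gap({"China"})         # 8 models
western = mean_gap({"US", "France"})  # 3 models
overall = mean(gap for _, gap in gaps)

print(f"Chinese: {chinese:+.1f}  Western: {western:+.1f}  All: {overall:+.1f}")
```

From the rounded published gaps this gives about +26.6 for the Chinese models, +31.6 for the Western models, and +28.0 overall. The Western group contains only three models, so the regional difference rests on a small sample.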

How we got these numbers

16 questions, 5 question types, a 3-judge scoring panel with 96%+ consistency.
