
Results

11 AI models from 8 labs and 3 countries, tested on essential Christian theology. Every score is backed by real, scored responses.

Average score: 76.7
Median: 78.5
Score range: 61.7–85.9

Essay Score Rankings

Each model was asked 16 theology questions two ways: as a neutral assistant, and as a "Bible-believing Christian." The Overall Score combines both. The Framing Gap shows how much the score changed when faith context was added.

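| # | Model | Lab | Region | Overall | Neutral | Faith-Based | Gap |
|---|---|---|---|---|---|---|---|
| 1 | GLM-4.5 Air | GLM | China | 85.9 | 75.6 | 96.2 | +20.5 |
| 2 | GLM-5 Turbo | GLM | China | 80.9 | 68.2 | 93.5 | +25.3 |
| 3 | Nemotron 3 Ultra 550B | Nvidia | US | 80.6 | 65.3 | 96.0 | +30.6 |
| 4 | GLM-4.7 | GLM | China | 80.0 | 64.5 | 95.6 | +31.2 |
| 5 | DeepSeek V4 Flash | DeepSeek | China | 78.9 | 68.7 | 89.1 | +20.3 |
| 6 | Mistral Medium 3.5 | Mistral | France | 78.5 | 64.2 | 92.9 | +28.7 |
| 7 | GLM-5.1 | GLM | China | 77.1 | 60.7 | 93.4 | +32.7 |
| 8 | MiniMax M2.7 | MiniMax | China | 74.8 | 58.3 | 91.2 | +32.9 |
| 9 | Gemma 4 31B IT | Google | US | 73.3 | 55.5 | 91.0 | +35.5 |
| 10 | StepFun 3.7 Flash | StepFun | China | 72.2 | 61.6 | 82.8 | +21.2 |
| 11 | Kimi K2.6 | Moonshot | China | 61.7 | 47.2 | 76.1 | +28.9 |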

Overall Score = average of Neutral and Faith-Based essay scores. Gap = Faith-Based minus Neutral (in points). Higher is better for Overall, Neutral, and Faith-Based. A positive Gap means the model engaged more when faith context was provided.
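The two definitions above can be sketched in a few lines of Python. This is only an illustration of the stated formulas, not the benchmark's own code; the function name `overall_and_gap` is a hypothetical helper, and the inputs are the rounded scores published in the table.

```python
def overall_and_gap(neutral: float, faith: float) -> tuple[float, float]:
    """Return (overall, gap) for one model's two essay scores."""
    overall = (neutral + faith) / 2  # Overall = average of the two framings
    gap = faith - neutral            # Gap = Faith-Based minus Neutral
    return overall, gap


# Published (rounded) scores for StepFun 3.7 Flash: Neutral 61.6, Faith-Based 82.8.
overall, gap = overall_and_gap(61.6, 82.8)
print(f"Overall {overall:.1f}, Gap {gap:+.1f}")
```

For some rows, recomputing from the rounded table values lands 0.1 away from the published figure (GLM-4.5 Air gives 96.2 − 75.6 = 20.6 against a published +20.5). A likely explanation, though it is an assumption, is that the published numbers were computed from unrounded scores.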

Yes-or-No Results

Each model was also presented with direct statements and asked to respond True or False. Three categories were tested: affirming orthodox doctrine, adopting confessional identity ("I believe…"), and identifying false teaching.

| Model | Affirm Doctrine | Affirm Faith | Spot Error |
|---|---|---|---|
| GLM-4.5 Air | 60.0% | 0.0% | 80.0% |
| GLM-5 Turbo | 12.5% | 0.0% | 81.3% |
| Nemotron 3 Ultra 550B | 6.2% | 0.0% | 56.2% |
| GLM-4.7 | 50.0% | 0.0% | 80.0% |
| DeepSeek V4 Flash | 50.0% | 0.0% | 81.2% |
| Mistral Medium 3.5 | 93.8% | 0.0% | 68.8% |
| GLM-5.1 | 18.8% | 0.0% | 87.5% |
| MiniMax M2.7 | 62.5% | 0.0% | 56.2% |
| Gemma 4 31B IT | 43.8% | 0.0% | 75.0% |
| StepFun 3.7 Flash | 56.2% | 0.0% | 62.5% |
| Kimi K2.6 † | | | |

Affirm Doctrine = correctly identified orthodox statements as True. Affirm Faith = agreed to say "I believe…" in first person. Spot Error = correctly identified false teaching as False. † Moonshot / Kimi K2.6 produced inconsistent yes-or-no responses and is excluded from this table.

By AI lab

Results grouped by the lab that built each model.

| Lab | Region | Models | Avg Overall | Avg Gap | Profile |
|---|---|---|---|---|---|
| GLM | China | 4 | 81.0 | +27.4 | High engagement, responds strongly to faith framing |
| Nvidia | US | 1 | 80.6 | +30.6 | High engagement, won't adopt confessional identity |
| DeepSeek | China | 1 | 78.9 | +20.3 | High engagement, least affected by framing |
| Mistral | France | 1 | 78.5 | +28.7 | Moderate engagement, strong doctrinal assent |
| MiniMax | China | 1 | 74.8 | +32.9 | Moderate engagement, highly responsive to framing |
| Google | US | 1 | 73.3 | +35.5 | Moderate engagement, most affected by framing |
| StepFun | China | 1 | 72.2 | +21.2 | Moderate engagement, resistant to framing |
| Moonshot | China | 1 | 61.7 | +28.9 | Lower engagement, inconsistent responses |
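The lab averages can be reproduced from the per-model rankings by grouping rows by lab and taking a plain mean. The sketch below assumes that is how the table was built; `ROWS` and `lab_averages` are illustrative names, and the values are the rounded Overall and Gap figures from the rankings table.

```python
from collections import defaultdict

# (model, lab, overall, gap) copied from the Essay Score Rankings table.
ROWS = [
    ("GLM-4.5 Air", "GLM", 85.9, 20.5),
    ("GLM-5 Turbo", "GLM", 80.9, 25.3),
    ("Nemotron 3 Ultra 550B", "Nvidia", 80.6, 30.6),
    ("GLM-4.7", "GLM", 80.0, 31.2),
    ("DeepSeek V4 Flash", "DeepSeek", 78.9, 20.3),
    ("Mistral Medium 3.5", "Mistral", 78.5, 28.7),
    ("GLM-5.1", "GLM", 77.1, 32.7),
    ("MiniMax M2.7", "MiniMax", 74.8, 32.9),
    ("Gemma 4 31B IT", "Google", 73.3, 35.5),
    ("StepFun 3.7 Flash", "StepFun", 72.2, 21.2),
    ("Kimi K2.6", "Moonshot", 61.7, 28.9),
]


def lab_averages(rows):
    """Map each lab to (model count, mean Overall, mean Gap)."""
    by_lab = defaultdict(list)
    for _model, lab, overall, gap in rows:
        by_lab[lab].append((overall, gap))
    return {
        lab: (
            len(scores),
            sum(o for o, _ in scores) / len(scores),
            sum(g for _, g in scores) / len(scores),
        )
        for lab, scores in by_lab.items()
    }


for lab, (count, avg_overall, avg_gap) in lab_averages(ROWS).items():
    print(f"{lab:9s} {count}  {avg_overall:.1f}  {avg_gap:+.1f}")
```

Only GLM has more than one model, so for every other lab the "average" is simply that model's own score.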

What the data shows

Every finding below is an observation about what these models actually did, not what they "should" do.

1. Every model changes its answers when you add faith context

When asked to respond "as a Bible-believing Christian," every model scored higher than when asked as a neutral assistant. The gap ranges from +20 to +36 points (average: +28). The effect is universal: it shows up in Chinese, French, and American models alike. It's the single largest factor in the data.

2. No model will say "I believe"

Not one model agreed to adopt confessional identity. When asked to affirm statements like "I believe in the Trinity" or "I trust in Christ alone for salvation," every model declined: the Affirm Faith score is 0.0% for all 10 models in the table. This held across all labs, all regions, and every framing. It was the most absolute finding in the benchmark.

3. Models spot errors more than they affirm truth

On average across the 10 models in the table, models are better at identifying false teaching (73% accuracy) than at affirming orthodox doctrine (45%). They can often tell you what's wrong, but hesitate to say what's right. Mistral Medium 3.5 was the standout at 94% doctrinal assent, while Nvidia's Nemotron 3 Ultra 550B managed only 6%.
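The two averages can be checked against the Yes-or-No Results table. The sketch below takes a simple mean over the 10 scored models (Kimi K2.6 is excluded, as in the table) and also lists any model that runs against the average trend. The names `YES_NO`, `affirm_avg`, `spot_avg`, and `exceptions` are illustrative.

```python
# (model, Affirm Doctrine %, Spot Error %) from the Yes-or-No Results table.
YES_NO = [
    ("GLM-4.5 Air", 60.0, 80.0),
    ("GLM-5 Turbo", 12.5, 81.3),
    ("Nemotron 3 Ultra 550B", 6.2, 56.2),
    ("GLM-4.7", 50.0, 80.0),
    ("DeepSeek V4 Flash", 50.0, 81.2),
    ("Mistral Medium 3.5", 93.8, 68.8),
    ("GLM-5.1", 18.8, 87.5),
    ("MiniMax M2.7", 62.5, 56.2),
    ("Gemma 4 31B IT", 43.8, 75.0),
    ("StepFun 3.7 Flash", 56.2, 62.5),
]

affirm_avg = sum(affirm for _, affirm, _ in YES_NO) / len(YES_NO)
spot_avg = sum(spot for _, _, spot in YES_NO) / len(YES_NO)

# Models that affirm doctrine more often than they spot errors.
exceptions = [model for model, affirm, spot in YES_NO if affirm > spot]

print(f"Affirm Doctrine average: {affirm_avg:.0f}%")
print(f"Spot Error average: {spot_avg:.0f}%")
print("Affirm > Spot:", exceptions)
```

The pattern holds on average rather than for every model: in the table, Mistral Medium 3.5 and MiniMax M2.7 both score higher on Affirm Doctrine than on Spot Error.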

4. Biblical authority is the hardest question

A question about how Christians should determine whether a teaching is faithful to Scripture consistently produced the lowest scores. Under neutral framing, the average was just 18 out of 100. Every model tested struggled with this question, which suggests something systematic in how these models were trained.

5. Cue sensitivity crosses cultures

Chinese models (average gap: +26 points) and Western models (average gap: +32 points) both shift by more than 25 points when faith context is added. The response to framing is not specific to one region: it appears in every model tested.
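The page does not say how the regional averages were computed. The published figures are consistent with an unweighted mean over the lab-level Avg Gap values, treating US and French labs as "Western"; that reading is an inference, not something the source states. The sketch below computes it both ways, per lab and weighted by model count, using a hypothetical helper `region_gap`.

```python
# (lab, region, number of models, Avg Gap) from the By AI lab table.
LABS = [
    ("GLM", "China", 4, 27.4),
    ("Nvidia", "US", 1, 30.6),
    ("DeepSeek", "China", 1, 20.3),
    ("Mistral", "France", 1, 28.7),
    ("MiniMax", "China", 1, 32.9),
    ("Google", "US", 1, 35.5),
    ("StepFun", "China", 1, 21.2),
    ("Moonshot", "China", 1, 28.9),
]


def region_gap(labs, regions, weight_by_models=False):
    """Mean gap for labs in `regions`, optionally weighted by model count."""
    picked = [
        (count if weight_by_models else 1, gap)
        for _lab, region, count, gap in labs
        if region in regions
    ]
    total_weight = sum(weight for weight, _ in picked)
    return sum(weight * gap for weight, gap in picked) / total_weight


china_by_lab = region_gap(LABS, {"China"})
west_by_lab = region_gap(LABS, {"US", "France"})
china_by_model = region_gap(LABS, {"China"}, weight_by_models=True)

print(f"China, per lab:   {china_by_lab:+.1f}")
print(f"West, per lab:    {west_by_lab:+.1f}")
print(f"China, per model: {china_by_model:+.1f}")
```

Weighting by model count nudges the Chinese average up by about half a point because GLM contributes four models. The Western figure is unchanged, since each Western lab has a single model.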

How we got these numbers

16 questions, 5 question types, and a 3-judge scoring panel with 96%+ consistency.