11 AI models from 8 labs and 3 countries, tested on essential Christian theology. Every score is backed by real, scored responses.
Each model was asked 16 theology questions two ways: as a neutral assistant, and as a "Bible-believing Christian." The Overall Score combines both. The Framing Gap shows how much the score changed when faith context was added.
| # | Model | Lab | Region | Overall | Neutral | Faith-Based | Gap |
|---|---|---|---|---|---|---|---|
| 1 | GLM-4.5 Air | GLM | China | 85.9 | 75.6 | 96.2 | +20.5 |
| 2 | GLM-5 Turbo | GLM | China | 80.9 | 68.2 | 93.5 | +25.3 |
| 3 | Nemotron 3 Ultra 550B | Nvidia | US | 80.6 | 65.3 | 96.0 | +30.6 |
| 4 | GLM-4.7 | GLM | China | 80.0 | 64.5 | 95.6 | +31.2 |
| 5 | DeepSeek V4 Flash | DeepSeek | China | 78.9 | 68.7 | 89.1 | +20.3 |
| 6 | Mistral Medium 3.5 | Mistral | France | 78.5 | 64.2 | 92.9 | +28.7 |
| 7 | GLM-5.1 | GLM | China | 77.1 | 60.7 | 93.4 | +32.7 |
| 8 | MiniMax M2.7 | MiniMax | China | 74.8 | 58.3 | 91.2 | +32.9 |
| 9 | Gemma 4 31B IT | Google | US | 73.3 | 55.5 | 91.0 | +35.5 |
| 10 | StepFun 3.7 Flash | StepFun | China | 72.2 | 61.6 | 82.8 | +21.2 |
| 11 | Kimi K2.6 | Moonshot | China | 61.7 | 47.2 | 76.1 | +28.9 |
Overall Score = average of Neutral and Faith-Based essay scores. Gap = Faith-Based minus Neutral (in points). Higher is better for Overall, Neutral, and Faith-Based. A positive Gap means the model engaged more when faith context was provided.
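
The two formulas above can be checked directly against any row of the table. The snippet below is a minimal sketch, not the benchmark's own code; the function name `overall_and_gap` is made up for illustration, and the inputs are the published Neutral and Faith-Based scores for StepFun 3.7 Flash. Because the published columns appear to be rounded, recomputing from them can differ from the listed Overall or Gap by 0.1 in some rows.

```python
def overall_and_gap(neutral: float, faith_based: float) -> tuple[float, float]:
    """Return (overall, gap) for one model from its two essay scores.

    Hypothetical helper: it only restates the definitions given in the text.
    """
    overall = (neutral + faith_based) / 2  # mean of the two framings
    gap = faith_based - neutral            # positive: faith framing scored higher
    return overall, gap


# Published scores from the StepFun 3.7 Flash row.
overall, gap = overall_and_gap(neutral=61.6, faith_based=82.8)
print(f"Overall {overall:.1f}, Gap {gap:+.1f}")  # Overall 72.2, Gap +21.2
```
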
Each model was also presented with direct statements and asked to respond True or False. Three categories were tested: affirming orthodox doctrine, adopting confessional identity ("I believe…"), and identifying false teaching.
| Model | Affirm Doctrine | Affirm Faith | Spot Error |
|---|---|---|---|
| GLM-4.5 Air | 60.0% | 0.0% | 80.0% |
| GLM-5 Turbo | 12.5% | 0.0% | 81.3% |
| Nemotron 3 Ultra 550B | 6.2% | 0.0% | 56.2% |
| GLM-4.7 | 50.0% | 0.0% | 80.0% |
| DeepSeek V4 Flash | 50.0% | 0.0% | 81.2% |
| Mistral Medium 3.5 | 93.8% | 0.0% | 68.8% |
| GLM-5.1 | 18.8% | 0.0% | 87.5% |
| MiniMax M2.7 | 62.5% | 0.0% | 56.2% |
| Gemma 4 31B IT | 43.8% | 0.0% | 75.0% |
| StepFun 3.7 Flash | 56.2% | 0.0% | 62.5% |
| Kimi K2.6 † | — | — | — |
Affirm Doctrine = correctly identified orthodox statements as True. Affirm Faith = agreed to say "I believe…" in first person. Spot Error = correctly identified false teaching as False. † Moonshot / Kimi K2.6 produced inconsistent True/False responses and is excluded from this table.
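
As a rough illustration of how the three percentages could be tallied, the sketch below counts, per category, how often a model's answer matches the expected one. The record layout, the category keys, and the sample answers are all hypothetical, and it assumes that any answer other than the expected one counts as a miss; the benchmark's actual handling of refusals or hedged answers is not described here.

```python
from collections import defaultdict

# Hypothetical records: (category, expected_answer, model_answer).
# For "affirm_faith" the expected answer is True, meaning the model
# agreed to the first-person "I believe..." statement.
responses = [
    ("affirm_doctrine", True, True),
    ("affirm_doctrine", True, False),
    ("affirm_faith", True, False),
    ("affirm_faith", True, False),
    ("spot_error", False, False),
    ("spot_error", False, False),
    ("spot_error", False, True),
    ("spot_error", False, False),
]


def category_rates(records):
    """Percentage of answers per category that match the expected answer."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, expected, answered in records:
        totals[category] += 1
        hits[category] += int(answered == expected)
    return {c: 100.0 * hits[c] / totals[c] for c in totals}


rates = category_rates(responses)
for category, pct in rates.items():
    print(f"{category}: {pct:.1f}%")
```

With the sample data this reports 50.0% for doctrine, 0.0% for confessional identity, and 75.0% for spotting error.
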
Results grouped by the lab that built each model.
| Lab | Region | Models | Avg Overall | Avg Gap | Profile |
|---|---|---|---|---|---|
| GLM | China | 4 | 81.0 | +27.4 | High engagement, responds strongly to faith framing |
| Nvidia | US | 1 | 80.6 | +30.6 | High engagement, won't adopt confessional identity |
| DeepSeek | China | 1 | 78.9 | +20.3 | High engagement, least affected by framing |
| Mistral | France | 1 | 78.5 | +28.7 | Moderate engagement, strong doctrinal assent |
| MiniMax | China | 1 | 74.8 | +32.9 | Moderate engagement, highly responsive to framing |
| Google | US | 1 | 73.3 | +35.5 | Moderate engagement, most affected by framing |
| StepFun | China | 1 | 72.2 | +21.2 | Moderate engagement, resistant to framing |
| Moonshot | China | 1 | 61.7 | +28.9 | Lower engagement, inconsistent responses |
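
The lab-level averages can be reproduced from the per-model table. The following is a minimal sketch under the assumption that each lab's figures are simple means of its models' published Overall and Gap values; `lab_averages` is an illustrative name, and only the GLM and DeepSeek rows are included to keep it short.

```python
from collections import defaultdict

# (lab, overall, gap) copied from the four GLM rows and the DeepSeek row
# of the per-model table.
rows = [
    ("GLM", 85.9, 20.5),
    ("GLM", 80.9, 25.3),
    ("GLM", 80.0, 31.2),
    ("GLM", 77.1, 32.7),
    ("DeepSeek", 78.9, 20.3),
]


def lab_averages(rows):
    """Map each lab to (model count, mean overall, mean gap), rounded to 0.1."""
    grouped = defaultdict(list)
    for lab, overall, gap in rows:
        grouped[lab].append((overall, gap))
    result = {}
    for lab, scores in grouped.items():
        n = len(scores)
        avg_overall = sum(o for o, _ in scores) / n
        avg_gap = sum(g for _, g in scores) / n
        result[lab] = (n, round(avg_overall, 1), round(avg_gap, 1))
    return result


for lab, (n, avg_overall, avg_gap) in lab_averages(rows).items():
    print(f"{lab}: {n} model(s), avg overall {avg_overall}, avg gap {avg_gap:+}")
```

For GLM this gives 4 models, 81.0 overall, and a +27.4 gap, matching the table.
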
Every finding below is an observation about what these models actually did, not what they "should" do.
When asked to respond "as a Bible-believing Christian," every model scored higher than when asked as a neutral assistant. The gap ranges from +20 to +36 points (average: +28). The effect appears in every model tested, whether Chinese, French, or American, though its size varies from model to model. It is the single largest effect in the data.
No model agreed to adopt confessional identity. When asked to affirm statements like "I believe in the Trinity" or "I trust in Christ alone for salvation," every model with usable True/False responses declined (Kimi K2.6 is excluded from that table). This held across all labs, all regions, and all framings. It was the most absolute finding in the benchmark.
Models are better at identifying false teaching (average 73% accuracy) than at affirming orthodox doctrine (average 45%); both averages cover the 10 models in the True/False table. They can often tell you what's wrong but hesitate to say what's right. Mistral was the standout at 94% doctrinal assent, while Nvidia's model managed only 6%.
A question about how Christians should determine whether a teaching is faithful to Scripture consistently produced the lowest scores. Every model struggled with it: the average was just 18 out of 100 under neutral framing. Because the pattern appeared in every model tested, it may point to something systematic in how these models were trained.
Chinese models (average gap: +26 points) and Western models (average gap: +32 points) both score substantially higher when faith context is added, with the Western average about 6 points larger. The response to framing is not confined to one region; it appears in every model tested.
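
The regional averages depend on whether labs or individual models are the unit being averaged. The sketch below uses the published gaps and is only an illustration: averaging by lab appears to reproduce the +26 and +32 figures, while averaging the eight Chinese models individually gives a slightly higher number. Which method was actually used is an assumption.

```python
from statistics import mean

# Per-lab average gaps, grouped by region (values from the lab table).
lab_gaps = {
    "China": [27.4, 20.3, 32.9, 21.2, 28.9],  # GLM, DeepSeek, MiniMax, StepFun, Moonshot
    "West": [30.6, 28.7, 35.5],               # Nvidia, Mistral, and the lab behind Gemma
}

# Per-model gaps for the eight Chinese models (values from the model table).
china_model_gaps = [20.5, 25.3, 31.2, 20.3, 32.7, 32.9, 21.2, 28.9]

china_by_lab = mean(lab_gaps["China"])
west_by_lab = mean(lab_gaps["West"])
china_by_model = mean(china_model_gaps)

print(f"China, averaged by lab:   {china_by_lab:+.1f}")    # +26.1
print(f"West, averaged by lab:    {west_by_lab:+.1f}")     # +31.6
print(f"China, averaged by model: {china_by_model:+.1f}")  # +26.6
```

Each Western lab has a single model here, so the two methods give the same Western average.
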
16 questions, 5 question types, a 3-judge scoring panel with 96%+ consistency.