How GospelBench tests AI models on Christian theology — what we ask, how we score, and why it's fair.
Every model gets the same 16 questions about essential Christian doctrine. Each question is asked five different ways. That's 80 prompts per model — the same 80 every time. The questions don't change between models, and they won't change in the future. That's what makes the results comparable.
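The arithmetic behind the 80 prompts can be sketched as a fixed grid of question and approach pairs. This is only an illustration: the numeric IDs and the name `build_prompt_grid` are hypothetical, and the actual question and approach labels are not given here.

```python
from itertools import product

NUM_QUESTIONS = 16   # stated in the text
NUM_APPROACHES = 5   # stated in the text


def build_prompt_grid():
    """Return every (question_id, approach_id) pair in a fixed order.

    The integer IDs are placeholders for illustration only.
    """
    questions = range(1, NUM_QUESTIONS + 1)
    approaches = range(1, NUM_APPROACHES + 1)
    return list(product(questions, approaches))


grid = build_prompt_grid()
print(len(grid))  # 80
```

Because the grid is built the same way on every run, each model sees the same pairs in the same order.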
The same question gets asked five different ways. Each approach reveals something different about how the model handles theology.
For the essay-style approaches (1 and 2), a three-judge panel evaluates each response across five dimensions. Two judges score independently, then a third resolves any disagreement.
| What We Evaluate | What It Means |
|---|---|
| Theological Accuracy | Does the response get the doctrine right? |
| Christ-Centeredness | Does the response center on Christ and the gospel, or just moral advice? |
| Completeness | Does it address all the essential parts of the question? |
| Grace vs. Works | Does it frame salvation as grace through faith, or as earning God's favor? |
| Clarity & Tone | Is the response clear, coherent, and respectful? |
Consistency you can trust. The two independent judges agree with each other over 96% of the time. When they disagree, a consolidating judge reviews both scores and makes the final call. The same judges and the same rubric evaluate every model: no special treatment, no moving goalposts.
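The judging flow described above might look roughly like the sketch below. It rests on several assumptions that the text does not confirm: that "agreement" means an exact score match on a dimension, that scores are plain numbers, and that the consolidating judge can be modeled as a callback. The names `consolidate`, `agreement_rate`, and `resolve` are hypothetical.

```python
# Dimension keys mirror the rubric table; the key spellings are illustrative.
DIMENSIONS = (
    "theological_accuracy",
    "christ_centeredness",
    "completeness",
    "grace_vs_works",
    "clarity_and_tone",
)


def consolidate(judge_a, judge_b, resolve):
    """Merge two independent score dicts into one final score dict.

    `resolve(dim, a, b)` stands in for the consolidating judge and is
    called only when the two scores differ (assumed: exact-match agreement).
    """
    final = {}
    for dim in DIMENSIONS:
        a, b = judge_a[dim], judge_b[dim]
        final[dim] = a if a == b else resolve(dim, a, b)
    return final


def agreement_rate(pairs):
    """Fraction of (score_a, score_b) pairs that match exactly."""
    if not pairs:
        raise ValueError("need at least one pair of scores")
    return sum(a == b for a, b in pairs) / len(pairs)


# Made-up scores; the min() resolver is a placeholder, not the real judge.
judge_a = {dim: 4 for dim in DIMENSIONS}
judge_b = dict(judge_a, completeness=3)
final = consolidate(judge_a, judge_b, resolve=lambda dim, a, b: min(a, b))
print(final["completeness"])  # 3
```

In practice the resolver would be a third model call that sees both scores; the callback keeps that detail out of the merging logic.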
Three numbers tell the story. Here's what they mean and how to read them.
| Score | What It Tells You | How to Read It |
|---|---|---|
| Overall Score | The average of the model's scores on the two essay types | Out of 100. Higher means stronger theological engagement. |
| Framing Gap | How much the score changed when faith context was added | Positive means the model engaged more when asked "as a Christian." Larger gaps mean more context-dependence. |
| Yes-or-No Rates | Percentage of correct True/False answers in each category | Higher is better. Affirm Faith was 0% for every model. |
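As a minimal sketch, the three numbers in the table could be computed as follows. It assumes the two essay types are a neutral prompt and a faith-framed ("as a Christian") prompt, which is inferred from the Framing Gap description rather than stated outright. It also assumes a refusal (represented as `None`) counts as not correct. All function names and example values are hypothetical.

```python
def overall_score(neutral_essay, faith_essay):
    """Average of the two essay-type scores, each on a 0-100 scale."""
    return (neutral_essay + faith_essay) / 2


def framing_gap(neutral_essay, faith_essay):
    """Positive when the faith-framed prompt scored higher than the neutral one."""
    return faith_essay - neutral_essay


def yes_no_rate(answers, expected):
    """Percentage of answers matching the expected True/False value.

    Assumption: a refusal is stored as None and never counts as correct.
    """
    if not expected or len(answers) != len(expected):
        raise ValueError("answers and expected must be non-empty and equal length")
    correct = sum(a is not None and a == e for a, e in zip(answers, expected))
    return 100 * correct / len(expected)


# Made-up values, for illustration only.
print(overall_score(60, 70))  # 65.0
print(framing_gap(60, 70))    # 10
```

Under these assumptions, a model scoring 60 on the neutral essay and 70 on the faith-framed essay has an Overall Score of 65 and a Framing Gap of +10.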
Two safeguards ensure the results are trustworthy.
If a model refuses to answer yes-or-no questions more than 80% of the time, its yes-or-no data is excluded from the results. This prevents models that dodge direct questions from skewing the averages. One model — Moonshot / Kimi K2.6 — triggered this rule and is listed in essay results only.
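The exclusion rule can be expressed as a simple threshold check. The sketch below assumes a refusal is recorded as `None`. The 80% threshold comes from the text; the names `refusal_rate` and `include_yes_no_data` are hypothetical.

```python
REFUSAL_THRESHOLD = 0.80  # "more than 80% of the time", per the rule above


def refusal_rate(responses):
    """Fraction of yes-or-no responses that are refusals (None, by assumption)."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(r is None for r in responses) / len(responses)


def include_yes_no_data(responses):
    """Keep a model's yes-or-no data unless it refused more than 80% of the time."""
    return refusal_rate(responses) <= REFUSAL_THRESHOLD


# Made-up responses: nine refusals out of ten exceeds the threshold.
mostly_refused = [None] * 9 + [True]
print(include_yes_no_data(mostly_refused))  # False
```

A model that refuses exactly 80% of the time is kept, since the rule as written excludes only rates above 80%.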
The questions, prompts, and scoring criteria are locked. They won't change between models or over time. This means a score from today can be fairly compared to a score from next year. Reproducibility is the whole point — if you run the same test, you should get the same answer.
The 16 questions cover the core doctrines of Protestant Christian theology. Every question is asked across all five approaches. The specific question wording is kept private to protect the integrity of future testing.
The Triune nature of God, the deity and humanity of Christ, and the person and work of the Holy Spirit.
Penal substitutionary atonement, justification by faith alone, the exclusivity of Christ for salvation, and the nature of repentance.
The bodily resurrection of Christ and His promised return.
The authority and sufficiency of Scripture, and the nature and mission of the Church.
Questions are designed to surface behavioral differences, not to serve as a theology exam. Full question text is available to research partners under an evaluation agreement.