Methodology

How GospelBench tests AI models on Christian theology — what we ask, how we score, and why it's fair.

The Test

Every model gets the same 16 questions about essential Christian doctrine. Each question is asked five different ways. That's 80 prompts per model — the same 80 every time. The questions don't change between models, and they won't change in the future. That's what makes the results comparable.

16 theology questions × 5 ways of asking = 80 prompts per model
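The arithmetic above can be sketched as a fixed grid of question and approach pairs. This is a minimal illustration, not GospelBench's actual harness: the question IDs (`Q01` to `Q16`), the approach labels, and `build_prompt_grid` are hypothetical names, and the real question wording is not public.

```python
from itertools import product

# Hypothetical placeholders: the real question wording is private.
QUESTION_IDS = [f"Q{n:02d}" for n in range(1, 17)]  # 16 questions

# Assumed short labels for the five ways of asking.
APPROACHES = [
    "neutral",
    "faith_based",
    "affirm_doctrine",
    "affirm_faith",
    "spot_error",
]

def build_prompt_grid(question_ids, approaches):
    """Return every (question, approach) pair in a fixed, repeatable order."""
    return list(product(question_ids, approaches))

grid = build_prompt_grid(QUESTION_IDS, APPROACHES)
print(len(grid))  # 80
```

Because the grid is built from two fixed lists, every model sees the same 80 pairs in the same order.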

Five Ways of Asking

The same question gets asked five different ways. Each approach reveals something different about how the model handles theology.

1. Neutral Prompt (no faith context)
   The model is asked as a neutral assistant, with no mention of Christianity. Measures how it engages theology on its own.

2. Faith-Based Prompt (explicit faith context)
   The model is asked to respond "as a Bible-believing Christian." Measures how much adding faith context changes the answer.

3. Affirm Doctrine (True or False)
   The model is shown a statement of orthodox doctrine and asked: True or False? Measures willingness to affirm what Christians believe.

4. Affirm Faith (True or False)
   The model is asked to affirm a belief statement in first person ("I believe…"). Measures willingness to adopt confessional identity.

5. Spot Error (True or False)
   The model is shown a statement of false teaching and asked: True or False? Measures ability to identify theological error.

How We Score

For the essay-style prompts (approaches 1 and 2, Neutral and Faith-Based), a three-judge panel evaluates each response across five dimensions. Two judges score independently, then a third resolves any disagreement.

| What We Evaluate | What It Means |
| --- | --- |
| Theological Accuracy | Does the response get the doctrine right? |
| Christ-Centeredness | Does the response center on Christ and the gospel, or just moral advice? |
| Completeness | Does it address all the essential parts of the question? |
| Grace vs. Works | Does it frame salvation as grace through faith, or as earning God's favor? |
| Clarity & Tone | Is the response clear, coherent, and respectful? |

Consistency you can trust. The two independently scoring judges agree with each other over 96% of the time. When they disagree, the third, consolidating judge reviews both scores and makes the final call. The same judges and the same rubric evaluate every model: no special treatment, no moving goalposts.
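The judging flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the benchmark's actual code: the dimension keys, the numeric score scale, the `tolerance` parameter, the averaging of agreeing scores, and the `resolve` callback standing in for the consolidating judge are all hypothetical.

```python
# Assumed keys for the five rubric dimensions.
DIMENSIONS = [
    "theological_accuracy",
    "christ_centeredness",
    "completeness",
    "grace_vs_works",
    "clarity_tone",
]

def consolidate(judge_a, judge_b, resolve, tolerance=0):
    """Combine two independent score sheets into one final sheet.

    judge_a, judge_b: dicts mapping dimension -> numeric score.
    resolve: callable(dimension, a, b) -> score; stands in for the third judge.
    tolerance: largest gap still treated as agreement (assumed; 0 = exact match).
    """
    final = {}
    for dim in DIMENSIONS:
        a, b = judge_a[dim], judge_b[dim]
        if abs(a - b) <= tolerance:
            final[dim] = (a + b) / 2  # judges agree: keep their shared score
        else:
            final[dim] = resolve(dim, a, b)  # judges disagree: third judge decides
    return final

def agreement_rate(pairs, tolerance=0):
    """Fraction of (a, b) score pairs on which the two judges agreed."""
    agreed = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return agreed / len(pairs)
```

Here `agreement_rate` is one plausible way to arrive at an agreement figure like the one quoted above; how that figure is actually computed is not specified.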

Understanding the Scores

Three numbers tell the story. Here's what they mean and how to read them.

| Score | What It Tells You | How to Read It |
| --- | --- | --- |
| Overall Score | The average of the model's scores on the two essay-style prompts (Neutral and Faith-Based) | Out of 100. Higher means stronger theological engagement. |
| Framing Gap | How much the score changed when faith context was added | Positive means the model engaged more when asked "as a Christian." Larger gaps mean more context-dependence. |
| Yes-or-No Rates | Percentage of correct True/False answers in each category | Higher is better. Affirm Faith was 0% for every model. |
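A minimal sketch of how the three numbers relate, assuming each essay type already has a single score out of 100. The function names are hypothetical, the sample inputs are made up for illustration, and counting a refusal (`None`) as incorrect is an assumption rather than a documented rule.

```python
def overall_score(neutral, faith_based):
    """Average of the two essay-type scores (each out of 100)."""
    return (neutral + faith_based) / 2

def framing_gap(neutral, faith_based):
    """Positive when the faith-based prompt scored higher than the neutral one."""
    return faith_based - neutral

def yes_no_rate(answers, expected):
    """Percent of answers that match the expected True/False value.

    A refusal is represented as None and counted as incorrect (assumed).
    """
    correct = sum(1 for got, want in zip(answers, expected) if got == want)
    return 100 * correct / len(expected)

# Made-up inputs, for illustration only.
print(overall_score(62, 78))  # 70.0
print(framing_gap(62, 78))    # 16
```

For Affirm Doctrine and Affirm Faith the expected answer would be True; for Spot Error it would be False.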

Keeping the Data Honest

Two safeguards ensure the results are trustworthy.

Excluding unreliable responses

If a model refuses to answer yes-or-no questions more than 80% of the time, its yes-or-no data is excluded from the results. This prevents models that dodge direct questions from skewing the averages. One model — Moonshot / Kimi K2.6 — triggered this rule and is listed in essay results only.
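The exclusion rule can be sketched as a simple threshold check. The representation of a refusal as `None` and the names below are assumptions for illustration; only the 80% threshold comes from the rule itself.

```python
REFUSAL_THRESHOLD = 0.80  # from the rule above: "more than 80%" means excluded

def refusal_rate(answers):
    """Share of yes-or-no answers that are refusals (None, by assumption)."""
    return sum(1 for a in answers if a is None) / len(answers)

def include_yes_no_data(answers, threshold=REFUSAL_THRESHOLD):
    """True if the model's yes-or-no data stays in the results."""
    return refusal_rate(answers) <= threshold
```

With three yes-or-no approaches across 16 questions, a model answers 48 such prompts, so under this sketch 39 or more refusals (about 81%) would trigger the exclusion and 38 or fewer would not.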

The same test, every time

The questions, prompts, and scoring criteria are locked. They won't change between models or over time. This means a score from today can be fairly compared to a score from next year. Reproducibility is the whole point — if you run the same test, you should get the same answer.

What We Ask About

The 16 questions cover the core doctrines of Protestant Christian theology. Every question is asked across all five approaches. The specific question wording is not published, to protect the integrity of future testing.

God & the Trinity

The Triune nature of God, the deity and humanity of Christ, and the person and work of the Holy Spirit.

Salvation & Atonement

Penal substitutionary atonement, justification by faith alone, the exclusivity of Christ for salvation, and the nature of repentance.

Resurrection & Eschatology

The bodily resurrection of Christ and His promised return.

Scripture & the Church

The authority and sufficiency of Scripture, and the nature and mission of the Church.

Questions are designed to surface behavioral differences, not to serve as a theology exam. Full question text is available to research partners under an evaluation agreement.
