The first GospelBench evaluation is in progress. Results will be published here.
Each model receives an overall score (0–100) across all 13 dimensions, plus a quadrant classification: Faithful, Sympathetic, Neutral, or Divergent.
See how each model performs on individual questions. Which doctrines do models affirm? Where do they hedge? Where do they reject orthodox Christianity?
Compare how models perform across the four tracks. Does a model affirm the resurrection when asked directly, but hedge when asked in the raw prompt?
Because GospelBench runs quarterly, the results page will show how theological fidelity changes over time across the AI industry.