Scientific Frames for Evaluating AI in Clinical Practice · UBC Research Day · April 17, 2026
The problem is usually not that AI papers are incomprehensible.
It is that they can feel clearer than they really are.
A 📝 paper often gives you:
What 🧑‍⚕️ you still need to ask:
In the near term, you should be able to look at an AI paper and ask, almost automatically:
---
config:
  look: handDrawn
  theme: neutral
---
flowchart LR
A[Clinical problem] --> B[Target / label]
B --> C[Data]
C --> D[Model]
D --> E[Metric]
E --> F[Claim]
F --> G[Decision in practice]
Tip
Your job is to inspect every arrow, not just the model box.
Many AI papers look similar, but they are asking very different questions.
These are not interchangeable forms of evidence.
Before you look at the model, restate the target in plain English.
🤔 Ask:

🤔 Remember
“This model does not predict deterioration.
It predicts a label that we are using as a stand-in for deterioration.”
That one sentence often clarifies the whole paper.

Tip
OK, good. Risk of pneumonia increases with age.
Important
Uh oh, bad. Risk of pneumonia decreases if you have asthma??
Note
Whenever a clinical label might depend on earlier care, ask whether the model is accidentally learning the effects of that care rather than the underlying risk.
That is often invisible in the abstract.
Before you care about performance, ask:
For ophthalmology papers, we typically want to know:
In machine learning, we usually partition the available data into three splits: training, validation, and test.
Warning
This is often not enough.
Warning
Potential problems:
Note
The question is not just “Was there a test set?”
It is “What information could still leak across the splits?”
%%{init: {
"theme": "base",
"fontSize": 22,
"themeVariables": {
"fontSize": "22px"
}
}}%%
flowchart TB
D[Ophthalmic dataset] --> P[Patient]
P --> E[Eye]
E --> V[Visit]
V --> I[Image]
S[Site / device] --> I
I --> TR[Train split]
I --> TE[Test split]
%% Leakage routes
P -. same patient across splits .-> TR
P -. same patient across splits .-> TE
E -. fellow eyes across splits .-> TR
E -. fellow eyes across splits .-> TE
V -. serial visits split apart .-> TR
V -. serial visits split apart .-> TE
S -. shared acquisition signature .-> TR
S -. shared acquisition signature .-> TE
TR --> AP[Apparent high performance]
TE --> AP
AP -. may fail to reproduce .-> EV[Poor external validity]
classDef hierarchy fill:#ffffff,stroke:#64748b,stroke-width:2px,color:#0f172a;
classDef split fill:#e0f2fe,stroke:#38bdf8,stroke-width:2px,color:#0c4a6e;
classDef split2 fill:#dcfce7,stroke:#22c55e,stroke-width:2px,color:#166534;
classDef warn fill:#fef2f2,stroke:#ef4444,stroke-width:2px,color:#991b1b;
classDef ctx fill:#fdf2f8,stroke:#ec4899,stroke-width:2px,color:#9d174d;
class D,P,E,V,I hierarchy;
class S ctx;
class TR split;
class TE split2;
class AP,EV warn;
linkStyle 7,8,9,10,11,12,13,14 stroke:#dc2626,stroke-width:3px,stroke-dasharray: 6 4;
linkStyle 17 stroke:#dc2626,stroke-width:3px,stroke-dasharray: 6 4;
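To make the leakage routes above concrete, here is a minimal sketch of a patient-level (grouped) split, assuming scikit-learn and toy stand-in data; a real ophthalmic pipeline would also need to respect eye, visit, and site/device structure.

```python
# Minimal sketch: a grouped (patient-level) split, so all images from one
# patient land on the same side of the train/test boundary.
# Toy data only; patient_id, X, and y are hypothetical stand-ins.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_images = 200
patient_id = rng.integers(0, 50, size=n_images)  # ~4 images per patient
X = rng.normal(size=(n_images, 16))              # stand-in image features
y = rng.integers(0, 2, size=n_images)            # stand-in labels

# A naive image-level split can place the same patient in train and test.
# Grouping on patient_id keeps each patient's images together.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

# No patient should appear on both sides of the split.
assert set(patient_id[train_idx]).isdisjoint(patient_id[test_idx])
print(f"train images: {len(train_idx)}, test images: {len(test_idx)}")
```

The same idea extends to eyes, visits, and sites by grouping on whichever identifier leaks the most information.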
A strong clinical AI paper often needs more than one experiment.
Ideally some combination of:
A 2026 review identified 75 published silent trials in medical AI from 2015 to 2025 (Tikhomirov et al. 2026).
Silent trials do something deeply valuable: they approximate the real deployment setting without (yet) changing care.
That lets teams see whether:
If you only read retrospective ‘methods’ papers, you will consistently overestimate readiness.
What you want is evidence that gets progressively closer to your actual environment:
Note
AI papers are like social media: everyone reports their 🤩 highlights, and that gives you a false sense of reality.
If the paper gives you one number, be suspicious.
A single metric is almost never enough to evaluate a clinical AI system.
For classification, at minimum we usually want some subset of:
Four formulas worth remembering:
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Specificity} = \frac{TN}{TN + FP} \]
\[ \text{PPV} = \frac{TP}{TP + FP} \qquad \text{NPV} = \frac{TN}{TN + FN} \]
Even with the same sensitivity and specificity, PPV and NPV move with prevalence.
So a system that looks excellent in one setting may behave very differently in another.
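A small worked sketch of that prevalence effect, using illustrative numbers rather than figures from any particular paper:

```python
# Minimal sketch: PPV and NPV shift with prevalence even when sensitivity
# and specificity are fixed. Numbers are illustrative only.
def ppv_npv(sens: float, spec: float, prev: float) -> tuple[float, float]:
    """Bayes' rule for a test with the given sensitivity and specificity."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

for prev in (0.30, 0.05, 0.01):  # referral clinic vs screening vs population
    ppv, npv = ppv_npv(sens=0.90, spec=0.90, prev=prev)
    print(f"prevalence {prev:.2f}: PPV {ppv:.2f}, NPV {npv:.3f}")
# Same 90%/90% test: PPV ≈ 0.79 at 30% prevalence, ≈ 0.32 at 5%, ≈ 0.08 at 1%.
```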
With \(\color{red}{Y\in\{0,1\}}\) (vision-threatening disease), and \(\color{blue}{\hat{p}}\) the model’s predicted probability of vision-threatening disease, we want:
\[ P(\color{red}{Y=1} \mid \color{blue}{\hat{p} \approx 0.8}) \approx 0.8 \]
If not, then even a model with good ranking performance may mislead treatment thresholds, triage, or counselling.
Warning
In clinical settings, a badly calibrated “high risk” label can be operationally more damaging than a mediocre AUROC.
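One rough way to check calibration is to bin the predicted probabilities and compare each bin's mean prediction with the observed event rate. A minimal sketch using scikit-learn's calibration_curve on simulated stand-in predictions:

```python
# Minimal sketch: compare mean predicted risk with observed event rate per bin.
# y_prob and y_true are simulated stand-ins; in practice use held-out or
# external test data, never the training set.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(1)
y_prob = rng.uniform(0, 1, size=1000)                         # predicted risks
y_true = (rng.uniform(0, 1, size=1000) < y_prob).astype(int)  # simulated outcomes
# (simulated so the model is well calibrated by construction)

frac_observed, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(mean_predicted, frac_observed):
    print(f"predicted ~{pred:.2f} -> observed {obs:.2f}")
# For a well-calibrated model the two columns track each other,
# e.g. P(Y=1 | p_hat ~ 0.8) is approximately 0.8.
```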
Believe those who are seeking the truth; doubt those who find it.
André Gide
We trust papers more when they report:
The absence of these often signals 💪 overconfidence or 🤩 hype-chasing.
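One concrete habit behind trustworthy reporting is quantifying uncertainty rather than quoting a single point estimate. A minimal sketch of a bootstrap 95% confidence interval for AUROC, on synthetic stand-in predictions:

```python
# Minimal sketch: a bootstrap 95% confidence interval for AUROC, rather than
# a single point estimate. y_true and y_prob are synthetic stand-ins.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 500
y_true = rng.integers(0, 2, size=n)
y_prob = 0.3 * y_true + rng.uniform(0.0, 0.7, size=n)  # toy risk scores

aucs = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)        # resample cases with replacement
    if len(np.unique(y_true[idx])) < 2:     # skip degenerate resamples
        continue
    aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUROC {roc_auc_score(y_true, y_prob):.2f} (95% CI {lo:.2f}-{hi:.2f})")
```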
Responsible AI often emphasizes both:
At first glance, these values align. But in practice, they can be in tension:
Paradox
The more we protect privacy by not collecting sensitive data, the harder it becomes to see and correct bias. But the more data we collect for fairness audits, the more we risk privacy harms.

From Fusar-Poli et al. (2022).
Not all bias is illegal or even (always) wrong: a spam filter, for example, should be biased against phishing emails.
The hard question: which systematic errors are acceptable, for whom, and who decides?
In ophthalmology, image-based AI can look beautifully clean in the abstract while hiding messy operational realities:
Note
This is why implementation papers and human-centered studies matter so much (Beede et al. 2020; Teng et al. 2026).
A 2025 age-related macular degeneration study in JAMA Network Open is useful because it goes beyond a single held-out test set (Chen et al. 2025).
Findings (not just ‘accuracy’, not just once):

A 2026 Nature Medicine randomized study tested whether LLMs help members of the public make medical decisions in realistic scenarios (Bean et al. 2026).

Bernstein et al. (2023) compared ophthalmologists to a chatbot on online eye-care questions. Reviewers rated the quality of many answers similarly, and distinguished human from chatbot responses with only 61% accuracy.
Interesting, but still ask:

Believe those who are seeking the truth; doubt those who find it.
André Gide
Thank you.

UBC Research Day 2026 · Resident Session