
UBC Research Day · April 17, 2026
I have not worked in AI + ophthalmology, except tangentially.
I have worked in AI + surgery, which uses similar techniques (in academia, industry, and standards), and I currently work in AI Safety, which flavours some of this talk.


Between January 2022 and September 2025, researchers published 4,609 peer-reviewed papers on large language models in clinical medicine.
Of those, 19 were prospective randomized trials (S. F. Chen et al. 2026).

The limiting problem in medical AI is often not the model.
It is the repeated substitution of one thing for another:
🤔 Claim
🎉 Hype and 😱 fear are both downstream symptoms of these substitutions.
```mermaid
---
config:
  look: handDrawn
  theme: neutral
---
flowchart LR
  A[Compelling demo] --> B[Benchmark headlines]
  B --> C[Pilot deployment]
  C --> D[Workflow friction<br/>missed edge cases<br/>trust problems]
  D --> E[Backlash / fear]
  E --> F[Reset]
  F --> A
  style D fill:#F7F4EC,stroke:#5B534A,stroke-width:3px,color:red
```
Note
The point is not that the enthusiasm is irrational.
The point is that we seem to expect things to always work perfectly.
This expectation arises from a combination of ‘category mistakes’…
A 2026 Nature Medicine randomized study tested whether LLMs actually help members of the public make medical decisions in realistic scenarios (Bean et al. 2026).

💡 Lesson - “PEBKAC”
A model can perform well in isolation and still fail as a tool for human decision-making.
This is not a weird corner case.
Nor is it isolated to non-experts.
It is exactly what we should expect when test conditions and deployment conditions are different.
A 2026 JAMA Ophthalmology “Eye on AI” theme explicitly argued that ophthalmology is one of the places where AI may be transformative, especially through image analysis and natural language processing (Liu and Bressler 2026).
AI in ophthalmology makes sense:

In a 2023 eye-care study, ophthalmologists and a chatbot produced answers to 200 patient forum questions with broadly similar reviewer-rated quality.
But that still does not answer the harder questions:
A practical standard
For clinical adoption, we should care much more about:
than about a single headline number.
How do we change how we evaluate AI in medicine?
A 2024 Lancet Digital Health scoping review found 86 RCTs of AI in clinical practice; 70 of 86 (81%) reported positive primary endpoints (Han et al. 2024).
That is genuinely encouraging, but the same review flags reasons for caution:
Positive RCTs are good news.
They are not yet proof that results will hold across sites, populations, and workflows.
Negative results are not (necessarily) bad news.
For example, 50% accuracy feels like a failing grade, but does it beat human performance? And does it still let us prioritize patient care if it safely saves time?
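To make that concrete, here is a minimal sketch of comparing a headline accuracy against the relevant human baseline rather than judging it in isolation. The case counts and sample size are entirely made up for illustration.

```python
from math import sqrt

# Hypothetical numbers for illustration only: a model and a human reader
# graded on the same 400 cases.
model_correct, human_correct, n_cases = 200, 180, 400  # 50% vs 45%

p_model = model_correct / n_cases
p_human = human_correct / n_cases

# Two-proportion z-test (normal approximation) for the difference.
p_pool = (model_correct + human_correct) / (2 * n_cases)
se = sqrt(2 * p_pool * (1 - p_pool) / n_cases)
z = (p_model - p_human) / se

print(f"model {p_model:.0%} vs human {p_human:.0%}, z = {z:.2f}")
# A "failing" 50% can still be a meaningful gain over the relevant baseline,
# or not, if the sample is too small to tell the two apart.
```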
For deployment, a useful scorecard should usually include multiple families of outcomes:
A 2025 JAMA Network Open study on age-related macular degeneration built an AI-assisted workflow and evaluated clinicians across four rounds, alternating manual diagnosis and diagnosis with AI assistance (Q. Chen et al. 2025).
Findings (not just ‘accuracy’, not just once):

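As a sketch of what “not just accuracy, not just once” can look like in practice, here is a small, entirely hypothetical tabulation of a multi-round evaluation that alternates manual and AI-assisted reads. The table, column names, and numbers are placeholders, not data from the study.

```python
import pandas as pd

# Hypothetical results table; column names and values are assumptions.
df = pd.DataFrame({
    "round":     [1, 1, 2, 2, 3, 3, 4, 4],
    "arm":       ["manual", "manual", "ai_assisted", "ai_assisted",
                  "manual", "manual", "ai_assisted", "ai_assisted"],
    "correct":   [1, 0, 1, 1, 0, 1, 1, 1],
    "read_secs": [95, 110, 60, 72, 88, 102, 55, 64],
})

# Report more than one family of outcomes, per round and per arm,
# instead of a single pooled accuracy number.
summary = df.groupby(["round", "arm"]).agg(
    accuracy=("correct", "mean"),
    mean_read_secs=("read_secs", "mean"),
)
print(summary)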
From weakest to strongest:
Most arguments in public discourse are made as if the first rung or two already implies the top of the ladder.

A moment with which to sit 🪑
Think of one AI system you have recently heard about, been pitched, or seen used in a clinical setting.
Which rung of that ladder does the evidence for it occupy?
⬇️ Top-down or bottom-up ⬆️?
Do you feel that this AI system is being forced on you, or did it originate organically?
Note
This is a classic misalignment problem.
A widely used health-care algorithm looked “accurate”, but it predicted future cost, not future health need.
Fixing the target would have increased the proportion of Black patients flagged for extra help from 17.7% to 46.5% (Obermeyer et al. 2019).
Translation
🧮 The model may be optimizing exactly what you asked for.
🫵 That does not mean you asked for the right thing.
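A minimal synthetic sketch of that misalignment, assuming a made-up population in which two groups have identical health need but different historical cost: flagging the top decile by the cost proxy under-selects one group, while flagging by need does not. The numbers and variable names are illustrative, not a reanalysis of the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic illustration only: true health need is identical across two groups,
# but group B historically incurs lower cost for the same need (less access).
group_b = rng.random(n) < 0.5
need = rng.gamma(shape=2.0, scale=1.0, size=n)            # "future health need"
cost = need * np.where(group_b, 0.5, 1.0) + rng.normal(0, 0.1, n)

def flagged_share_of_group_b(score):
    """Share of group B among the top 10% by a given risk score."""
    top = score >= np.quantile(score, 0.90)
    return group_b[top].mean()

print("target = cost:", flagged_share_of_group_b(cost))   # group B under-flagged
print("target = need:", flagged_share_of_group_b(need))   # ~50%, matching true need
```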
Note
Another canonical example: the pneumonia risk model of Caruana et al. (2015).
The model appeared to learn that asthma lowered pneumonia mortality risk.

What the model had learned was a channel effect: patients with asthma and pneumonia were more likely to be treated aggressively, including ICU admission.
So the model captured part of the care pathway, not just the disease process.
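A minimal synthetic sketch of how that can happen: if one condition reliably triggers escalated care, and escalated care lowers mortality, a naive model will learn a “protective” effect for the condition. The feature, effect sizes, and data below are invented for illustration, not taken from the original study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000

# Synthetic illustration of a care-pathway (channel) effect, not real data.
asthma = rng.random(n) < 0.1
base_risk = 0.10 + 0.05 * asthma                     # asthma truly raises risk
aggressive_care = asthma | (rng.random(n) < 0.05)    # asthma patients get escalated
death = rng.random(n) < base_risk * np.where(aggressive_care, 0.3, 1.0)

model = LogisticRegression().fit(asthma.reshape(-1, 1), death)
print("coefficient for asthma:", model.coef_[0][0])
# The coefficient comes out negative: the model has learned the care pathway
# (asthma -> escalation -> better outcomes), not the underlying disease risk.
```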
When people say “bias” in medical AI, they often compress very different problems into one word.
At least four distinct layers matter:

In ophthalmology, bias and failure can arise before pathology interpretation even begins.
Examples:
This is one reason autonomous diabetic-retinopathy systems require not just approval but thoughtful implementation (Abràmoff et al. 2018; Teng et al. 2026).
😱 Fear that overgeneralizes from bad examples carries its own harms:
The symmetric categorical error
🎉 Hype overgeneralizes from good examples.
😱 Fear overgeneralizes from bad ones.
Both collapse different interventions into a single moral object called “AI”.
🟢 Lower-risk uses
🟡 Medium-risk uses
🔴 Higher-risk uses
Before trusting a model, paper, or vendor pitch, ask:
The next wave of medical AI papers should spend less energy on:
and more energy on:
Institutions should treat AI less like a gadget and more like a clinical intervention plus infrastructure decision.
That means:
🧠 Think
Not
“Is AI ready for medicine?”
But this
“Which uses of AI are justified, for whom, under what evidence, and with what continuing accountability?”
The cycle is not inevitable.
The macular degeneration workflow study by Q. Chen et al. (2025) is a useful model.
Note
That is not a study about whether the model is impressive.
It is a study about whether the system — the human plus the model — improves care.
The patient trajectory (pun intended) that avoids the cycle tends to share a structure:
Note
This is not slower than the Sisyphean rush to the peak.
It is faster — because it does not end in a backlash that resets the clock.
Khattak et al. (2025) provide an MLHOps checklist for making clinical ML deployable, monitorable, and maintainable by:
Choose local, free-range, organic AI!

The cycle becomes Sisyphean when we forget, again and again through our haste, to build in (and validate!) our ⛔️safeguards⛔️.
These safeguards represent choices about what counts as evidence, what questions get asked before deployment, and how we hand off responsibilities.
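As one deliberately small example of such a safeguard, here is a sketch of a drift check on a deployed model’s output scores. The file names, threshold, and choice of test are assumptions for illustration, not a prescribed method.

```python
import numpy as np
from scipy.stats import ks_2samp

# Post-deployment safeguard sketch: does the distribution of the model's scores
# this month still look like the distribution it was validated on?
validation_scores = np.load("validation_scores.npy")   # assumed saved at sign-off
recent_scores = np.load("recent_scores.npy")           # assumed exported weekly

stat, p_value = ks_2samp(validation_scores, recent_scores)
if p_value < 0.01:
    # Drift alone does not prove the model is wrong, but it should trigger
    # a human review before anyone keeps trusting the outputs.
    print(f"Score distribution has shifted (KS={stat:.3f}); flag for review.")
```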
🤔 Stop and think
Don’t be “anti-AI” or “pro-AI”.
Be more mindful of the “big picture”.
Thank you.

UBC Research Day 2026 · Keynote Lecture