The Sisyphean Cycle of Hype and Fear in Medical AI

UBC Research Day · April 17, 2026

Frank Rudzicz

Disclaimer

Except tangentially, I have not worked in AI + ophthalmology.

I have worked in AI + surgery, which uses similar techniques (in academia, industry, and standards) and I currently work in AI Safety, which flavours some of this talk.

A gap worth examining

Between January 2022 and September 2025, researchers published 4,609 peer-reviewed papers on large language models in clinical medicine.

Of those, 19 were prospective randomized trials (S. F. Chen et al. 2026).

Hypothes-ish

The limiting problem in medical AI is often not the model.

It is the repeated substitution of one thing for another:

  1. 🔬 benchmark performance for 🏥 workflow performance
  2. 🎯 accuracy for 👩🏽‍⚕️ clinical utility
  3. 🚚 deployment for 👋 the end of evaluation

🤔 Claim

🎉Hype and 😱 fear are both downstream symptoms of these substitutions.

1. The hype ↔︎ fear cycle

The Sisyphean loop

---
config:
  look: handDrawn
  theme: neutral
---
flowchart LR
A[Compelling demo] --> B[Benchmark headlines]
B --> C[Pilot deployment]
C --> D[Workflow friction<br/>missed edge cases<br/>trust problems]
D --> E[Backlash / fear]
E --> F[Reset]
F --> A
style D fill:#F7F4EC,stroke:#5B534A,stroke-width:3px,color:red

Note

The point is not that the enthusiasm is irrational.

The point is that we seem to expect things to always work perfectly.

This expectation arises from a combination of ‘category mistakes’…

Four category mistakes

  1. The benchmark mistake
    A model answers exam-style or tightly scripted questions well, so we infer it will help real people in messy settings.
  2. The metric mistake
    AUROC, accuracy, or F1 go up, so we infer patient care improved.
  3. The target mistake
    The model predicts something measurable, so we infer it predicts what matters.
  4. The lifecycle mistake
    A model clears validation, so we infer evaluation is basically complete.

The benchmark mistake

A 2026 Nature Medicine randomized study tested whether LLMs actually help members of the public make medical decisions in realistic scenarios (Bean et al. 2026).

  • Tested alone, the models looked very strong:
    • 94.9% correct for condition identification
    • 56.3% correct for disposition
  • But people using those same models performed rather poorly.
    • fewer than 34.5% identified the relevant condition
    • fewer than 44.2% got the disposition right (Bean et al. 2026)

The benchmark mistake

💡 Lesson - “PEBKAC”

A model can perform well in isolation and still fail as a tool for human decision-making.

This is not a weird corner case.

Nor is it isolated to non-experts.

It is exactly what we should expect when test conditions and deployment conditions are different.

AI in Ophthalmology

A 2026 JAMA Ophthalmology “Eye on AI” theme explicitly argued that ophthalmology is one of the places where AI may be transformative, especially through image analysis and natural language processing (Liu and Bressler 2026).

AI in ophthalmology makes sense:

  • visually rich data,
  • specialist bottlenecks,
  • screening needs,
  • and increasingly digital workflows.

Tempting results \(\neq\) deployment evidence

In a 2023 eye-care study, ophthalmologists and a chatbot answered 200 patient forum questions with broadly similar reviewer-rated quality (Bernstein et al. 2023).

But that result still does not answer the harder questions:

  • What happens with real patient history and liability?
  • What gets omitted?
  • Who catches subtle harms?
  • What workflow does this create in clinic?

So what counts as “good enough”?

A practical standard

For clinical adoption, we should care much more about:

  • human performance with the system
  • workflow effects
  • downstream harms
  • generalization across settings
  • post-deployment monitoring

than about a single headline number.

How do we change how we evaluate AI in medicine?

2. What evidence usually misses

Accuracy \(\neq\) utility

A 2024 Lancet Digital Health scoping review found 86 RCTs of AI in clinical practice; 70 of 86 (81%) reported positive primary endpoints (Han et al. 2024).

That is genuinely encouraging, but the same review flags reasons for caution:

  • most trials were single-centre
  • race or ethnicity was reported in only 22 of 86 trials
  • operational efficiency was inconsistently measured and reported
  • findings may reflect publication bias as much as clinical signal (Han et al. 2024)

Positive RCTs are good news.

They are not yet proof that results will hold across sites, populations, and workflows.

Negative results are not (necessarily) bad news.

For example, 50% accuracy feels like a failing grade, but does it beat human performance? If it safely saves time, does it let us redirect that time to patient care?

📊 A better scorecard

For deployment, a useful scorecard should usually include multiple families of outcomes (a minimal sketch in code follows this list):

  • discrimination
    sensitivity, specificity, AUROC, F1
  • calibration
    do predicted risks mean what they seem to mean?
  • workflow
    time saved, interruptions, handoff burden, override rate
  • equity
    subgroup performance, gradability, access effects
  • clinical impact
    did decisions improve? did outcomes improve?
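
The sketch promised above, assuming scikit-learn and entirely hypothetical y_true, y_prob, and group arrays; it covers the discrimination, calibration, and equity families, while workflow and clinical-impact measures have to come from the deployment itself.

```python
# Minimal multi-family scorecard sketch (illustrative only).
# y_true, y_prob, and group are hypothetical arrays standing in for a real evaluation set.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss, confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)                                   # hypothetical labels
y_prob = np.clip(0.6 * y_true + rng.normal(0.2, 0.2, 500), 0, 1)   # hypothetical predicted risks
group = rng.choice(["site_A", "site_B"], 500)                      # hypothetical subgroups
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
scorecard = {
    # discrimination
    "auroc": roc_auc_score(y_true, y_prob),
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    # calibration: lower Brier score means predicted risks track observed event rates
    "brier_score": brier_score_loss(y_true, y_prob),
    # equity: report discrimination per subgroup, not just overall
    **{f"auroc_{g}": roc_auc_score(y_true[group == g], y_prob[group == g])
       for g in np.unique(group)},
}
for name, value in scorecard.items():
    print(f"{name:>15}: {value:.3f}")
# Workflow and clinical-impact measures (time saved, overrides, outcomes)
# come from the deployment itself, not from this held-out evaluation.
```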

A recent ophthalmology example

A 2025 JAMA Network Open study on age-related macular degeneration built an AI-assisted workflow and evaluated clinicians across four rounds, alternating manual diagnosis and diagnosis with AI assistance (Q. Chen et al. 2025).

Findings (not just ‘accuracy’, not just once):

  • AI assistance improved accuracy for 23 of 24 clinicians
  • mean F1 increased from 37.71 to 45.52
  • AI assistance saved roughly 10 seconds per patient initially
  • after further model development, external performance improved
  • but clinician + AI did not always outperform AI alone (Q. Chen et al. 2025)

A useful evidence ladder for medical AI

From weakest to strongest:

  1. benchmark or exam-style performance
  2. retrospective validation
  3. temporal or external validation
  4. human-AI comparative workflow study
  5. silent prospective evaluation
  6. live clinical trial or monitored deployment

Most arguments in public discourse are made as if rung 1 or 2 already implies rung 6.

Pause

A moment to sit with 🪑

Think of one AI system you have recently heard about, been pitched, or seen used in a clinical setting.

Which rung of that ladder does the evidence for it occupy?

⬇️ Top-down or bottom-up ⬆️?

Do you feel that this AI system is being forced on you, or did it originate organically?

The target mistake

Note

This is a classic misalignment problem.

A widely used health-care algorithm looked “accurate”, but it predicted future cost, not future health need.

  • Because cost reflects access and spending patterns, not just illness burden, Black patients were systematically underrated for extra care (Obermeyer et al. 2019).

Fixing the target would have increased the proportion of Black patients flagged for extra help from 17.7% to 46.5% (Obermeyer et al. 2019).
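
A small synthetic simulation can make the mechanism concrete (illustrative only, not the Obermeyer et al. data): both groups have the same distribution of health need, but one incurs lower cost for the same need, so ranking by predicted cost flags far fewer of its members than ranking by need would.

```python
# Synthetic illustration of the target mistake: optimizing cost instead of need.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
group = rng.choice(["A", "B"], n)
need = rng.gamma(2.0, 1.0, n)                       # true health need, identical across groups
cost = need * np.where(group == "A", 1.0, 0.6)      # group B incurs less cost per unit of need
cost += rng.normal(0.0, 0.1, n)                     # measurement noise

k = int(0.03 * n)                                   # flag the top 3% for extra care
for label, score in [("ranked by cost", cost), ("ranked by need", need)]:
    flagged = np.argsort(-score)[:k]
    share_b = np.mean(group[flagged] == "B")
    print(f"{label}: share of group B among flagged = {share_b:.1%}")
```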

The target mistake

Translation

🧮 The model may be optimizing exactly what you asked for.

🫵 That does not mean you asked for the right thing.

The intervention-history mistake

Note

Another canonical example: the pneumonia risk model of Caruana et al. (2015).

The model appeared to learn that asthma lowered pneumonia mortality risk.

What the model had learned was a channel effect: patients with asthma and pneumonia were more likely to be treated aggressively, including ICU admission.

So the model captured part of the care pathway, not just the disease process.
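
A similarly synthetic sketch of the mechanism (illustrative only, not the Caruana et al. data): when asthma both raises underlying risk and triggers more aggressive care, the recorded outcomes can make asthma look protective.

```python
# Synthetic illustration of the intervention-history effect.
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
asthma = rng.binomial(1, 0.15, n)
aggressive_care = rng.binomial(1, np.where(asthma == 1, 0.9, 0.3))    # asthma triggers ICU-level care
base_risk = 0.10 + 0.05 * asthma                                      # asthma raises underlying risk
observed_risk = base_risk * np.where(aggressive_care == 1, 0.3, 1.0)  # aggressive care lowers mortality
died = rng.binomial(1, observed_risk)

print(f"observed mortality, asthma:    {died[asthma == 1].mean():.3f}")
print(f"observed mortality, no asthma: {died[asthma == 0].mean():.3f}")
# The asthma group looks lower-risk in the data, even though its underlying risk is higher,
# because the labels encode the care pathway as well as the disease.
```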

Bias is not one thing

When people say “bias” in medical AI, they often compress very different problems into one word.

At least four distinct layers matter:

  • representation bias
    who is in the data?
  • label bias
    what exactly do labels encode?
  • deployment bias
    how does the system change behaviour?
  • adoption bias
    who gets access to the tool, and under what conditions?

Ophthalmology-specific reminder

In ophthalmology, bias and failure can arise before pathology interpretation even begins.

Examples:

  • nonmydriatic gradability
  • camera and acquisition differences
  • site-specific workflow
  • population and prevalence shifts
  • who gets imaged at all
  • whether clinicians actually trust or act on the output

This is one reason autonomous diabetic-retinopathy systems require not just approval but thoughtful implementation (Abràmoff et al. 2018; Teng et al. 2026).

3. From hype and fear to action

The cost of not adopting is also real

😱 Fear that overgeneralizes from bad examples carries its own harms:

  • diabetic retinopathy screening could have reached populations with no specialist access
  • ambient documentation could have reduced burden on overstretched clinicians
  • risk stratification could have surfaced patients who would otherwise be missed

The symmetric category mistake

🎉 Hype overgeneralizes from good examples.
😱 Fear overgeneralizes from bad ones.

Both collapse different interventions into a single moral object called “AI”.

A simple risk-based view

🟢 Lower-risk uses

  • draft notes
  • summarize long records
  • retrieve policies or guidelines
  • suggest patient-friendly wording

🟡 Medium-risk uses

  • prioritize review
  • support image interpretation
  • suggest differential or referral urgency

🔴 Higher-risk uses

  • autonomous diagnosis
  • patient-facing triage or advice
  • treatment recommendation
  • denial, rationing, or eligibility decisions

What we should ask of any new system

  1. What decision is this system trying to improve?
  2. What is the real comparator?
    junior clinician? specialist? current workflow? no support?
  3. Which metrics matter in practice?
    not just AUROC — also calibration, throughput, overrides, subgroup performance.
  4. What happens out-of-distribution?
  5. What is the plan for monitoring after launch?

For 🧑‍⚕️ clinicians

Before trusting a model, paper, or vendor pitch, ask:

  • What was the model actually trained to predict?
  • How close is the study setting to my workflow?
  • Is this evidence about 🤖 the model, or about 🧑‍🤝‍🧑 people using the model?
  • What gets 👍 better besides the number in the abstract?
  • What new 👎 failure modes does this create?

For 🧑‍🔬 researchers

The next wave of medical AI papers should spend less energy on:

  • squeezing out another small benchmark gain,
  • comparing one foundation model to another on synthetic cases,

and more energy on:

  • multisite prospective studies,
  • human-AI interaction,
  • silent trials,
  • subgroup consequences,
  • workflow redesign,
  • and what happens six months after deployment (Wiens et al. 2019; Tikhomirov et al. 2026).

For 🏥 institutions

Institutions should treat AI less like a gadget and more like a clinical intervention plus infrastructure decision.

That means:

  • procurement scrutiny
  • local validation 👈
  • audit trails
  • consent and communication
  • clinician review
  • decommissioning plans
  • governance after launch, not just before it

Our practical reframing

🧠 Think

Not

“Is AI ready for medicine?”

But this

“Which uses of AI are justified, for whom, under what evidence, and with what continuing accountability?”

4. Sisyphus at rest

The cycle can break

The cycle is not inevitable.

The macular degeneration workflow study by Q. Chen et al. (2025) is a useful model.

  • it measured human performance with and without AI, across multiple rounds, with attention to time, calibration, and external validation

Note

That is not a study about whether the model is impressive.

It is a study about whether the system — the human plus the model — improves care.

What a responsible deployment looks like

The patient trajectory (pun intended) that avoids the cycle tends to share a structure:

  1. Narrowly defined task with a real comparator
  2. Silent prospective evaluation before any care changes
  3. Local validation on the population that will actually use it
  4. Monitored rollout with audit trails and override tracking
  5. Pre-committed criteria for decommissioning

Note

This is not slower than the Sisyphean rush to the peak.

It is faster — because it does not end in a backlash that resets the clock.

Suggestions

Khattak et al. (2025) provide an MLHOps checklist for making clinical ML deployable, monitorable, and maintainable by:

  • preparing the data
  • engineering the pipeline
  • deploying with the workflow in mind
  • monitoring for drift (see the sketch after this list)
  • updating deliberately
  • governing for bias, privacy, and trust
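
To make the drift-monitoring step concrete, here is a minimal illustrative sketch (not the MLHOps reference implementation): it compares live risk scores against the validation-time distribution with a Kolmogorov-Smirnov test and a population stability index. The score arrays and alert thresholds are hypothetical.

```python
# Minimal drift-monitoring sketch (illustrative only).
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, observed, bins=10):
    """Population stability index between two score distributions."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_counts = np.histogram(np.clip(expected, edges[0], edges[-1]), edges)[0]
    o_counts = np.histogram(np.clip(observed, edges[0], edges[-1]), edges)[0]
    e_frac = e_counts / len(expected) + 1e-6
    o_frac = o_counts / len(observed) + 1e-6
    return float(np.sum((o_frac - e_frac) * np.log(o_frac / e_frac)))

rng = np.random.default_rng(1)
validation_scores = rng.beta(2.0, 5.0, 2000)   # hypothetical risk scores at validation time
live_scores = rng.beta(2.6, 5.0, 400)          # hypothetical risk scores from the live system

ks_stat, p_value = ks_2samp(validation_scores, live_scores)
drift_index = psi(validation_scores, live_scores)
print(f"KS statistic: {ks_stat:.3f}   p-value: {p_value:.4f}   PSI: {drift_index:.3f}")

if p_value < 0.01 or drift_index > 0.2:        # hypothetical alert thresholds
    print("ALERT: score distribution has shifted; trigger a review before continued use.")
```

In practice, the live scores would come from the audit trail, and the thresholds would be agreed during governance review, with a clear owner for the resulting alerts.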

Choose local, free-range, organic AI!

Final thought

The cycle becomes Sisyphean when we forget, again and again in our haste, to build in (and validate!) our ⛔️safeguards⛔️.

These safeguards represent choices about what counts as evidence, what questions get asked before deployment, and how we hand off responsibilities.

🤔 Stop and think

Don’t be “anti-AI” or “pro-AI”.

Be more mindful of the “big picture”.



Thank you.

References

Abràmoff, Michael D., Philip T. Lavin, Michele Birch, Nilay Shah, and James C. Folk. 2018. “Pivotal Trial of an Autonomous AI-Based Diagnostic System for Detection of Diabetic Retinopathy in Primary Care Offices.” Npj Digital Medicine 1: 39. https://doi.org/10.1038/s41746-018-0040-6.
Bean, Andrew M., Rebecca E. Payne, Guy Parsons, et al. 2026. “Reliability of LLMs as Medical Assistants for the General Public: A Randomized Preregistered Study.” Nature Medicine 32: 609–15. https://doi.org/10.1038/s41591-025-04074-y.
Bernstein, Isaac A., Youchen Zhang, Devendra Govil, et al. 2023. “Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions.” JAMA Network Open 6 (8): e2330320. https://doi.org/10.1001/jamanetworkopen.2023.30320.
Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noémie Elhadad. 2015. “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-Day Readmission.” In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721–30. https://doi.org/10.1145/2783258.2788613.
Chen, Qian, T. D. L. Keenan, Elvira Agron, et al. 2025. “AI Workflow, External Validation, and Development in Eye Disease Diagnosis.” JAMA Network Open 8 (7): e2517204. https://doi.org/10.1001/jamanetworkopen.2025.17204.
Chen, Sully F., Anton Alyakin, Andreas Seas, et al. 2026. “LLM-Assisted Systematic Review of Large Language Models in Clinical Medicine.” Nature Medicine 32: 1152–59. https://doi.org/10.1038/s41591-026-04229-5.
Han, Ryan, Julián N. Acosta, Zahra Shakeri, John P. A. Ioannidis, Eric J. Topol, and Pranav Rajpurkar. 2024. “Randomised Controlled Trials Evaluating Artificial Intelligence in Clinical Practice: A Scoping Review.” The Lancet Digital Health 6 (5): e367–73. https://doi.org/10.1016/S2589-7500(24)00047-5.
Khattak, Faiza, Vallijah Subasri, Amrit Krishnan, Chloe Pou-Prom, Sedef Akinli-Kocak, Elham Dolatabadi, Deval Pandya, Laleh Seyyed-Kalantari, and Frank Rudzicz. 2025. “MLHOps: Machine Learning Health Operations.” IEEE Access. https://doi.org/10.1109/ACCESS.2024.3521279.
Liu, T. Y. Alvin, and Neil M. Bressler. 2026. “JAMA Ophthalmology—Eye on AI.” JAMA Ophthalmology 144 (2): 119–20. https://doi.org/10.1001/jamaophthalmol.2025.5454.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53. https://doi.org/10.1126/science.aax2342.
Teng, Clare W., Saawan D. Patel, Andrew J. Barkmeier, T. Y. Alvin Liu, David Myung, Jeffrey Henderer, James Liu, Eric Hansen, and Lama A. Al-Aswad. 2026. “Autonomous Artificial Intelligence in Diabetic Retinopathy Testing—Lessons Learned on Successful Health System Adoption.” Ophthalmology Science 6 (1): 100935. https://doi.org/10.1016/j.xops.2025.100935.
Tikhomirov, Lana, C. Semmler, N. Prizant, et al. 2026. “A Scoping Review of Silent Trials for Medical Artificial Intelligence.” Nature Health. https://doi.org/10.1038/s44360-025-00048-z.
Wiens, Jenna, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X. Liu, Finale Doshi-Velez, Kenneth Jung, et al. 2019. “Do No Harm: A Roadmap for Responsible Machine Learning for Health Care.” Nature Medicine 25 (9): 1337–40. https://doi.org/10.1038/s41591-019-0548-6.