Responsible AI

What is standard, and what should be

Frank Rudzicz

2025-12-10

Why this talk, why now

  • AI is now embedded in:
    • 🧠 diagnosis & decision support
    • 🩻 imaging & monitoring
    • ✍️ clinical documentation and scribes
    • 📱 patient-facing tools and chatbots
  • 🏃🏻‍♀️“Standard practice” is emerging fast, often driven by vendors and early adopters.
  • ⛑️ But what clinicians actually need for safe, equitable care is often beyond what is deployed.
  • Today: we’ll contrast what is standard with what should be for AI+health.

Learning goals

By the end of this session, you will be able to:

  1. Identify different kinds of explainable AI relevant to clinical work.
  2. Recognize how bias manifests in AI models and workflows.
  3. Understand why bias and privacy can be at odds, and what good practice looks like.

A clinical vignette

Consider a widely discussed study of mortality risk prediction using “intelligible” models and patients presenting with pneumonia (Caruana et al. 2015).

  • This uses a generalized additive model with pairwise interactions (GA2M), so clinicians can see how each factor affects risk (see the sketch after this list).
  • 😲 Asthma appears to decrease the risk of death from pneumonia. 😲
  • Clinically, this makes no sense
  • 💡 In practice, patients with asthma often receive more aggressive and timely care (ICU, closer monitoring)
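
To make the vignette concrete, here is a minimal sketch of fitting a GA2M-style model, assuming the open-source `interpret` package (whose Explainable Boosting Machine implements GAMs with pairwise interactions). The CSV path and feature columns are hypothetical placeholders, not the study's actual data.

```python
# A minimal sketch, assuming the open-source `interpret` package.
# The file name and column names are illustrative placeholders.
import pandas as pd
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier

df = pd.read_csv("pneumonia_admissions.csv")               # one row per admission
X = df[["age", "o2_saturation", "temperature", "asthma"]]  # hypothetical features
y = df["died"]                                             # mortality label

ebm = ExplainableBoostingClassifier(interactions=10)       # GA2M: GAM + pairwise terms
ebm.fit(X, y)

# Global explanation: one shape curve per feature. This is where a
# counter-intuitive curve (e.g., asthma apparently lowering risk) shows up.
show(ebm.explain_global())
```

The point is less the library than the habit: inspect the shape curves before trusting the score.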

From cool tech → clinical tool

  • Many AI projects stall in the gap between:
    1. 🔬 Model performance (AUC, F1, ROC curves), and
    2. 🩺 Clinical usefulness (safety, workflow fit, equity, trust).
  • “Standard” often means:
    • A high AUC (on select data)
    • A glossy vendor demo
    • Minimal transparency for clinicians
  • “Should be standard” includes:
    • Evidence on patients like yours
    • Better explanations
    • Bias and privacy risks explicitly managed
    • Real governance and monitoring

Part 1 – What is “standard” in 2025?

Where AI shows up in care today

Examples you may already see:

  • Triage & risk
    • Sepsis early warning scores, ward-deterioration prediction, and EHR-based readmission models are already in clinical evaluation and deployment (Henry et al. 2015; Churpek et al. 2016; Rajkomar et al. 2018).
  • Diagnosis & imaging
    • Deep learning systems reach dermatologist-level performance in skin cancer classification and clinical-grade performance in computational pathology, with radiology now a major focus of review and regulation (Esteva et al. 2017; Campanella et al. 2019; Rajpurkar et al. 2022).
  • Operations
    • Machine learning is applied to patient flow, bed management, and staffing/scheduling to anticipate crowding and align resources (El-Bouri et al. 2021; Knight, Aakre, et al. 2023; Renggli et al. 2025).
  • Documentation
    • Ambient AI scribes and NLP-based tools support real-time note generation and coding, with early evidence of reduced documentation burden (Tierney et al. 2024; You, Rotenstein, et al. 2025; Ji et al. 2024).
  • Patient-facing
    • Symptom checkers, LLM-assisted replies to patient messages, and fully generative mental-health chatbots are now studied in practice and even randomized trials (Semigran et al. 2015; Wallace et al. 2022; Small, Serenyi, et al. 2024; Heinz et al. 2025).

“Standard” adoption is often piecemeal, vendor-driven, and opaque.

Clinicians are often looped in late—if at all.

Existing standards & guidance (1/2)

Many frameworks now define a minimum bar:

  • 🌎 WHO 🔗guidance on AI for health
    • Principles: autonomy, safety, transparency, accountability, inclusiveness, data protection.
    • Also touches on liability and governance
  • 🌎 WHO 🔗guidance on Generative AI in health
    • Recommends governance for large models used in clinical or public health contexts.
  • Professional bodies (e.g., CMA, specialty colleges)
    • Emphasize equity, human values, and patient well-being in AI adoption.

These shape what should count as “standard” in health AI but can be incredibly high-level.

Existing standards & guidance (2/2)

Technical and regulatory frameworks:

  • 🏈 🔗NIST AI Risk Management Framework (AI RMF)
    • Trustworthy AI: valid & reliable, safe, fair, transparent, accountable, privacy-enhancing, secure.
  • 🇪🇺 🔗EU AI Act
    • Treats many health AI systems as high-risk, requiring:
      • Risk management
      • Data governance
      • Documentation & logging
      • Human oversight
  • Privacy & data laws (PHIA, PIPEDA, HIPAA, GDPR, etc.)
    • Set guardrails on data collection, use, and sharing.

The gap between paper and practice

Common reality on the ground:

  • AI tools piloted with little transparency to front-line staff.
  • Limited or no:
    • 👥 Subgroup performance reporting
    • 🔎 Ongoing monitoring
    • 📈 Clear escalation pathways when models misbehave
  • Privacy teams may focus on consent forms and data-sharing agreements, while equity and explainability get less attention.

👉 We need to better understand how these work technically 👈

Part 2 – Explainable AI

Why explainability matters

  • Regulators increasingly demand it
  • The public increasingly expects clarity and fairness
  • Organizations need it to maintain trust
  • Internal teams need it for debugging and risk management

📚 (Doshi-Velez and Kim 2017; Lipton 2018; Barredo Arrieta et al. 2020)

Explanations ≠ Truth

“But men may construe things after their fashion, / Clean from the purpose of the things themselves” (I. iii. 34–35).
— Cicero (via William Shakespeare)

Note

Our pneumonia model “explains” that asthma lowers mortality risk. Is that a helpful explanation—or a red flag?

Properties of Good Explanations

  • Faithfulness: Matches the model’s true logic/behaviour
  • Plausibility: Intuitively satisfying to a human
  • Contextual: Tailored to user knowledge level
  • Contrastive: “Why A, not B?”
  • Actionable: Can guide decisions or fixes

Warning

Misuse risk: Plausible stories can mislead if they aren’t faithful. Don’t mistake rationalizations for reasons. Beware 🔗 confirmation bias.

Three levels of explanation

  1. 🧮 Model-level (‘explanation’)
    • Overall: What generally drives predictions?
    • e.g., Caruana’s curves for age, O₂ saturation, asthma.
  2. 🔭 Prediction-level (‘interpretation’)
    • Individual: Why was this patient flagged?
    • e.g., a bar chart of contributions, or counterfactuals, for this pneumonia patient.
  3. 🏥 System-level (‘transparency’)
    • Workflow: Where in the care pathway does AI influence decisions? Who can override it?
    • e.g., use-case docs, governance charts, escalation rules.

We need all three in clinical settings.

Visual Explanations

Heatmaps (for images)

From Cinà et al. (2023)

Text attention overlays

From Feng, Shaib, and Rudzicz (2020)

Explaining our pneumonia example

  • Our pneumonia model outputs a risk score for each admission.
  1. Global explanation (GA2M curves)
    • Age: risk rises sharply after ~65
    • Low oxygen saturation: risk increases steeply below a threshold
    • Very high temperature: risk increases
    • ⚠️ Asthma: appears to decrease risk → a clinical red flag 🚩
  2. Local explanation (e.g., SHAP; see the sketch after this list)
    • Shows which features push this patient’s risk up or down
    • Ask: “Do these reasons match what I know about this patient?”
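
As a companion to the local-explanation bullet, here is a hedged sketch using the `shap` package (assumed installed); the classifier, file path, and columns are the same hypothetical placeholders as in the earlier GA2M sketch.

```python
# Sketch of a local, per-patient explanation with SHAP.
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("pneumonia_admissions.csv")               # placeholder path
X = df[["age", "o2_saturation", "temperature", "asthma"]]
y = df["died"]

model = GradientBoostingClassifier().fit(X, y)             # any fitted classifier
explainer = shap.Explainer(model, X)                       # picks a suitable explainer
shap_values = explainer(X)

# Waterfall plot: which features push *this* patient's risk up or down.
shap.plots.waterfall(shap_values[0])
```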

Takeaways

  • Explanations help debug models.
  • They do not guarantee fairness or causal truth.
  • If we had simply dropped “asthma”, the problem might persist, but we’d be blind to it.

🔬 LLM self-explanations

LLMs can produce explanations along with their responses, called self-explanations (Huang et al. 2023).

For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as “fantastic” and “memorable” in the review). How good are these automatically generated self-explanations?

🔬 LLM self-explanations

Self-explanations cannot be assumed to be faithful without structured validation (Madsen, Chandar, and Reddy 2024).

  • LLMs can produce convincing self-explanations (e.g., chain-of-thought)
  • Faithfulness varies by task, model, and explanation type
  • Risk: humans tend to over-trust fluent rationales
  • Solution?: Get the predicting model to produce 🔄 counterfactuals, and then run those counterfactuals (see the sketch below).
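
One way the counterfactual check could be operationalized is sketched below. `call_llm` is a hypothetical stand-in for whatever LLM client you use; here it returns canned text so the logic of the check stays visible and runnable.

```python
# Sketch of a counterfactual faithfulness check for LLM self-explanations.
def call_llm(prompt: str) -> str:
    # Placeholder: wire in a real chat/completions client here.
    return "POSITIVE (placeholder response)"

review = "The plot was fantastic and the acting memorable."

# 1. Prediction plus self-explanation.
labelled = call_llm(
    "Classify this review as POSITIVE or NEGATIVE and list the words "
    f"that drove your decision:\n{review}"
)

# 2. Counterfactual: the smallest edit the model claims would flip its label.
counterfactual = call_llm(
    "Rewrite the review with the smallest edit that would flip your label, "
    f"returning only the rewritten text:\n{review}"
)

# 3. Re-classify the counterfactual. If the label does not actually flip,
#    the self-explanation was not faithful.
recheck = call_llm(f"Classify this review as POSITIVE or NEGATIVE:\n{counterfactual}")
print(labelled, "->", recheck)
```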

When they help, when they hurt

👍 When explanations help

  • Boundary cases where you might be swayed either way.
  • Education: showing trainees how features interact.
  • Trust calibration: seeing when the model behaves sensibly.
  • Error detection: spotting obviously bizarre patterns.

👎 When explanations mislead

  • Cognitive overload: too much detail in a busy clinic = ignored explanations.
  • False reassurance: mathematically correct visualizations that hide structural bias (e.g., making asthma look protective)
  • Post-hoc methods: can oversimplify, and can oversell a weak or biased model

Trade-offs

| Factor       | Black-box Models        | Interpretable Models |
|--------------|-------------------------|----------------------|
| Accuracy     | Often higher            | Sometimes lower      |
| Transparency | Low                     | High                 |
| Trust        | Requires justification  | Implicit             |
| Flexibility  | High                    | Often limited        |

  • 🤔 Where would you accept a small performance hit in exchange for clarity?
    • ⬆️ When stakes are high: prefer explainability
    • ⬇️ When stakes are low: go for performance

Questions to ask about explainability

Before using any AI tool, clinicians can ask:

  1. What will I actually see on screen when using this model?
  2. Is there a simple local interpretation for each prediction (like a contribution chart)?
  3. Do global explanations (like the pneumonia curves) make clinical sense—and if not, who reviews them?
  4. How easy is it to override or ignore the model when it conflicts with clinical judgment?
    • When do we revisit clinical judgment itself?

Part 3 – Bias vs privacy in AI

Bias in AI: what it is, why it matters

  • AI bias: systematic error in model outputs that disproportionately harms or benefits specific groups
  • Common sources:
    • Data – unrepresentative cohorts; missing marginalized groups
    • Labels – proxies like cost instead of need; noisy clinical judgment
    • Modelling – features that proxy race, income, language, disability
    • Deployment – who gets the tool, how it’s used, how it’s monitored
  • Why it matters in health:
    • Bias can amplify historical inequities
    • Harms may concentrate in already under-served communities

Where bias enters the pipeline

From Fusar-Poli et al. (2022).

Clinical consequences of bias

From Seyyed-Kalantari et al. (2021)

Underdiagnosis rates were largest among female, 0–20-year-old, Black, and Medicaid-insured patients.

Why privacy is hard

  • 😴 Simple “de-ident” (e.g., removing names) is not enough
    • Combination of quasi-identifiers (age, postal code, dates, rare conditions) can re-identify people
    • External data (social media, fitness apps, location traces) makes linkage easier
  • ⚔️ Tension:
    • Fairness work often needs protected attributes (race, gender, etc.)
      • You can’t show a model is fair if you can’t tell to whom it’s unfair
    • Privacy practice often wants to strip those fields

Layered technical safeguards

No single technique is perfect.
We combine porous layers:

  1. \(k\)-anonymity and related de-identification methods
  2. Obfuscation (adding “noise” at the datum level)
  3. Differential Privacy (adding formal noise at the model level)
  4. Federated Learning (keep data local, move the model)

Note

Together, these form a privacy-preserving AI toolkit that must still be checked for fairness impacts.

1. \(k\)-anonymity & de-identification

  • \(k\)-anonymity
    • A dataset is “\(k\)-anonymous” if each record is indistinguishable from at least \(k-1\) others
    • Implemented via generalization (e.g., age 37 → age band 30–39)
      and suppression (drop rare combinations)
  • 👍 Pros
    • Reduces re-identification risk from linkage attacks
    • Connects to regulatory guidance (e.g., HIPAA Safe Harbor lists)
  • 👎 Cons
    • Utility can drop sharply when \(k\) is large
    • Does not protect against all inference attacks
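
A minimal sketch of the generalization-and-suppression recipe above, using pandas; the file and quasi-identifier columns are illustrative, and real de-identification needs considerably more care.

```python
# Sketch: checking and enforcing k-anonymity over quasi-identifiers.
import pandas as pd

df = pd.read_csv("cohort.csv")                      # hypothetical extract

# Generalization: exact age -> 10-year band; postal code -> first 3 characters.
df["age_band"] = (df["age"] // 10) * 10
df["fsa"] = df["postal_code"].str[:3]

quasi_identifiers = ["age_band", "fsa", "sex"]
group_sizes = df.groupby(quasi_identifiers).size()
print(f"Dataset is {int(group_sizes.min())}-anonymous over {quasi_identifiers}")

# Suppression: drop records whose quasi-identifier combination is rarer than k.
k = 5
safe = df.groupby(quasi_identifiers).filter(lambda g: len(g) >= k)
```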

2. Obfuscation

  • Obfuscation: adding “noise” at the level of individual behaviour or content
    • Randomized clicks, fake queries, dummy GPS traces, etc.
  • Practically:
    • Too permissive (too little noise): re-identification is easy
    • Too restrictive (too much noise): clinical detail is lost
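
One classic instance of datum-level noise is randomized response, sketched below as an illustration of this trade-off (it is an example of the general idea, not a scheme named on the slide).

```python
# Sketch: randomized response, adding noise at the level of each answer.
import random

def randomized_response(true_answer: bool, p_truth: float = 0.75) -> bool:
    """Report the truth with probability p_truth; otherwise report a coin flip."""
    if random.random() < p_truth:
        return true_answer
    return random.random() < 0.5

# Any single report is deniable, but the aggregate rate can still be
# estimated: the "too little vs. too much noise" tension in miniature.
reports = [randomized_response(True) for _ in range(1000)]
print(sum(reports) / len(reports))
```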

3. Differential Privacy

  • 💡Solution: Don’t obfuscate the data, obfuscate the model
  • Differential Privacy (DP) ensures that the probability of any output is nearly the same whether or not an individual’s data is included in the dataset.
    • This guarantees indistinguishability: attackers cannot confidently tell if a specific person’s data was used.
  • It accomplishes this by adding calibrated noise to the model (e.g., to gradients during training)

(Dwork and Roth 2014)

From 🔗here
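
For models, the usual route is noisy gradients during training (DP-SGD); the core mechanism is easiest to see on a simple count query, as in this minimal sketch of the Laplace mechanism from Dwork and Roth (2014).

```python
# Sketch: the Laplace mechanism on a count query.
import numpy as np

def dp_count(records, epsilon: float) -> float:
    """Release a noisy count; sensitivity is 1 because adding or removing
    one person changes a count by at most 1."""
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return len(records) + noise

cohort = ["patient"] * 128
print(dp_count(cohort, epsilon=0.5))   # smaller epsilon = more noise = stronger guarantee
```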

3. Differential Privacy

  • U.S. Census (2020): First national census to implement DP at scale. Protected sensitive sub-population counts but raised debates over accuracy in small communities.
  • Canadian Research 🇨🇦: Applied DP to health datasets (Nova Scotia, Ontario) to enable epidemiological studies without exposing patients.
    • Aligns with values of data minimization & consent in PIPEDA, proposed CPPA, and Nova Scotia’s PHIA.
    • ⚠️ But increasing privacy can decrease fairness! ⚠️ (Dadsetan et al. 2024)

4. Federated Learning

  1. Local training
    • Each device/institution trains a model update using its own data.
  2. Aggregation
    • Updates (gradients, parameters) are sent to a central server.
    • The server aggregates updates into a global model.
  3. Privacy layers
    • Differential Privacy: noise added to updates.
    • Secure aggregation: cryptographic protocols ensure only aggregated results are visible.

From 🔗here
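
A toy NumPy sketch of the federated-averaging loop described above, with three synthetic "sites" and a logistic-regression-style local update; it is not any particular production framework, and a real deployment would add the DP and secure-aggregation layers listed in step 3.

```python
# Sketch: federated averaging (FedAvg) with a linear model and synthetic sites.
import numpy as np

rng = np.random.default_rng(0)
sites = [(rng.normal(size=(100, 5)), rng.integers(0, 2, 100)) for _ in range(3)]
global_w = np.zeros(5)

def local_update(w, X, y, lr=0.1, epochs=5):
    """One site's training pass: gradient steps on its own data only."""
    w = w.copy()
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))          # logistic predictions
        w -= lr * X.T @ (p - y) / len(y)      # gradient of the logistic loss
    return w

for round_ in range(10):
    # Each site computes an update locally; raw records never leave the site.
    local_models = [local_update(global_w, X, y) for X, y in sites]
    # The server aggregates only the parameters (weight by site size if unequal).
    global_w = np.mean(local_models, axis=0)
```

Only the parameter vectors cross institutional boundaries, which is the privacy argument for the approach (and also its attack surface, as the next slide notes).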

4. To be continued…

(Zeng and Rudzicz 2025)

A bias checklist for clinicians

  • 1. Who is in the data?
    • Are patients like ours represented?
    • Are some groups (e.g., Indigenous, racialized, rurally located) rare?
  • 2. What is the model optimizing?
    • Exactly which outcome is it predicting (mortality, readmission, cost, workload…)?
    • Does the score gate access?
  • 3. Are groups evaluated separately?
    • Are performance metrics reported by age, sex, gender, race, …? (see the sketch after this checklist)
  • 4. How are sensitive fields handled?
    • Is PHI kept under strict governance for fairness and safety monitoring, rather than simply deleted “for privacy”?
  • 5. Who is accountable over time?
    • Is there a plan for ongoing monitoring, recalibration, and escalation when inequities are found? What team owns this?
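
For checklist item 3, a minimal sketch of subgroup reporting with scikit-learn; the file and column names (`group`, `y_true`, `risk_score`) are hypothetical.

```python
# Sketch: report discrimination (AUC) separately for each patient subgroup.
import pandas as pd
from sklearn.metrics import roc_auc_score

results = pd.read_csv("predictions.csv")        # one row per patient

for group_value, subset in results.groupby("group"):
    if subset["y_true"].nunique() < 2:          # AUC is undefined with a single class
        print(f"{group_value}: n={len(subset)}  (too few outcomes to evaluate)")
        continue
    auc = roc_auc_score(subset["y_true"], subset["risk_score"])
    print(f"{group_value}: n={len(subset)}  AUC={auc:.3f}")
```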

Tip

You rarely get both fairness and privacy “for free”.
Ask explicitly how the system manages both risks before you endorse deployment.

Part 4 – Next steps

What you can do next week

Concrete actions for different roles:

  • 👩🏽‍⚕️ Clinicians
    • Ask the checklist questions in committees, procurement processes, and pilots.
    • Document when AI outputs conflict with clinical judgment.
  • 👩‍💼 Leaders / administrators
    • Establish or strengthen an AI governance group with clinical, ethics, legal, and patient representation.
    • Require vendors to provide model cards, subgroup performance, and monitoring plans.
  • 👨🏻‍🔬 Researchers / informatics
    • Build explainability for clinicians, not just for technical audiences.
    • Integrate equity and privacy trade-offs into study designs.

Q&A

  1. What’s one AI system you already use that concerns you?
  2. Where in your workflow would better explainability actually help?
  3. What is one equity or privacy concern in your setting that you’d like AI to improve?

Thank you.

References

Barredo Arrieta, Alejandro, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, et al. 2020. “Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges Toward Responsible AI.” Information Fusion 58 (June): 82–115. https://doi.org/10.1016/j.inffus.2019.12.012.
Campanella, Gabriele, Matthew G Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J Busam, Edi Brogi, Victor E Reuter, David S Klimstra, and Thomas J Fuchs. 2019. “Clinical-Grade Computational Pathology Using Weakly Supervised Deep Learning on Whole Slide Images.” Nature Medicine 25 (8): 1301–9. https://doi.org/10.1038/s41591-019-0508-1.
Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noémie Elhadad. 2015. “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-Day Readmission.” In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721–30. ACM. https://doi.org/10.1145/2783258.2788613.
Churpek, Matthew M, Timothy C Yuen, Caryn Winslow, David O Meltzer, Michael W Kattan, and Dana P Edelson. 2016. “Multicenter Comparison of Machine Learning Methods and Conventional Regression for Predicting Clinical Deterioration on the Wards.” Critical Care Medicine 44 (2): 368–74.
Cinà, Giovanni, Tabea Röber, Rob Goedhart, and Ilker Birbil. 2023. “Semantic Match: Debugging Feature Attribution Methods in XAI for Healthcare.” https://doi.org/10.48550/arXiv.2301.02080.
Dadsetan, Ali, Dorsa Soleymani, Xijie Zeng, and Frank Rudzicz. 2024. “Can Large Language Models Be Privacy Preserving and Fair Medical Coders?” arXiv. https://doi.org/10.48550/arXiv.2412.05533.
Doshi-Velez, Finale, and Been Kim. 2017. “Towards A Rigorous Science of Interpretable Machine Learning.” arXiv. https://doi.org/10.48550/arXiv.1702.08608.
Dwork, Cynthia, and Aaron Roth. 2014. “The Algorithmic Foundations of Differential Privacy.” Foundations and Trends in Theoretical Computer Science 9 (3–4): 211–407. https://doi.org/10.1561/0400000042.
El-Bouri, Raja et al. 2021. “Machine Learning in Patient Flow: A Review.” Progress in Biomedical Engineering 3 (2): 022002.
Esteva, Andre, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. 2017. “Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks.” Nature 542 (7639): 115–18. https://doi.org/10.1038/nature21056.
Feng, Jinyue, Chantal Shaib, and Frank Rudzicz. 2020. “Explainable Clinical Decision Support from Text.” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1478–89. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.115.
Fusar-Poli, Paolo, Mirko Manchia, Nikolaos Koutsouleris, David Leslie, Christiane Woopen, Monica E. Calkins, Michael Dunn, et al. 2022. “Ethical Considerations for Precision Psychiatry: A Roadmap for Research and Clinical Practice.” European Neuropsychopharmacology 63 (October): 17–34. https://doi.org/10.1016/j.euroneuro.2022.08.001.
Heinz, Michelle V, Donald M Mackin, Benjamin M Trudeau, et al. 2025. “Randomized Trial of a Generative AI Chatbot for Mental Health Treatment.” NEJM AI.
Henry, Katharine E, David N Hager, Peter J Pronovost, and Suchi Saria. 2015. “A Targeted Real-Time Early Warning Score (TREWScore) for Septic Shock.” Science Translational Medicine 7 (299): 299ra122. https://doi.org/10.1126/scitranslmed.aab3719.
Huang, Shiyuan, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, and Leilani H. Gilpin. 2023. “Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations.” arXiv. https://doi.org/10.48550/arXiv.2310.11207.
Ji, Z. et al. 2024. “A Unified Review of Deep Learning for Automated Medical Coding.” ACM Computing Surveys.
Knight, Daniel R, Christopher A Aakre, et al. 2023. “Artificial Intelligence for Patient Scheduling in the Real-World Health Care Setting: A Metanarrative Review.” Health Policy and Technology 12 (4).
Lipton, Zachary C. 2018. “The Mythos of Model Interpretability.” Queue 16 (3). https://doi.org/10.1145/3236386.
Madsen, Andreas, Sarath Chandar, and Siva Reddy. 2024. “Are Self-Explanations from Large Language Models Faithful?” arXiv. https://doi.org/10.48550/arXiv.2401.07927.
Rajkomar, Alvin, Eyal Oren, Kai Chen, Andrew M Dai, Noemie Hajaj, Michaela Hardt, et al. 2018. “Scalable and Accurate Deep Learning with Electronic Health Records.” Npj Digital Medicine 1: 18. https://doi.org/10.1038/s41746-018-0029-1.
Rajpurkar, Pranav, Emily Chen, Imon Banerjee, and Matthew P Lungren. 2022. “AI in Radiology: Current Applications and Future Directions.” Radiology 302 (3): 473–87.
Renggli, Florian J, Theresa Huber, Seraina Gysin, et al. 2025. “Integrating Nurse Preferences into AI-Based Scheduling Methods: Qualitative Study.” JMIR Formative Research 9: e67747.
Semigran, Hannah L, Jeffrey A Linder, Courtney Gidengil, and Ateev Mehrotra. 2015. “Evaluation of Symptom Checkers for Self Diagnosis and Triage: Audit Study.” BMJ 351: h3480. https://doi.org/10.1136/bmj.h3480.
Seyyed-Kalantari, Laleh, Guanxiong Liu, Matthew McDermott, Irene Chen, and Marzyeh Ghassemi. 2021. “Medical Imaging Algorithms Exacerbate Biases in Underdiagnosis.” Preprint. In Review. https://doi.org/10.21203/rs.3.rs-151985/v1.
Small, William R, Adam Serenyi, et al. 2024. “Large Language Model-Based Responses to Patients’ in-Basket Messages.” JAMA Network Open 7 (7): e2422399.
Tierney, Amanda A, Christopher A Longhurst, Lisa S Rotenstein, et al. 2024. “Ambient Artificial Intelligence Scribes to Alleviate the Burden of Clinical Documentation.” NEJM Catalyst Innovations in Care Delivery 5 (3). https://doi.org/10.1056/CAT.23.0404.
Wallace, W. et al. 2022. “The Diagnostic and Triage Accuracy of Digital and Online Symptom Checker Tools: A Systematic Review.” Npj Digital Medicine 5: 56.
You, Jason G, Lisa S Rotenstein, et al. 2025. “Ambient Documentation Technology in Clinician Office Visits and Clinician Experience of Documentation Burden and Burnout.” JAMA Network Open.
Zeng, Xijie, and Frank Rudzicz. 2025. “How to Recover Long Audio Sequences Through Gradient Inversion Attack With Dynamic Segment-Based Reconstruction.” In Interspeech, 5118–22. Rotterdam, The Netherlands. https://doi.org/10.21437/Interspeech.2025-244.