The Biased Oracle: Assessing LLMs' Understandability and Empathy in Medical Diagnoses

1 Dept. of Chemistry and Applied Biosciences, ETH Zurich · 2 Dept. of Computer Science, ETH Zurich · 3 Dept. of Health Sciences and Technology, ETH Zurich
* Equal contribution · † Corresponding author

Abstract

Large language models (LLMs) show promise for supporting clinicians in diagnostic communication by generating explanations and guidance for patients. Yet their ability to produce outputs that are both understandable and empathetic remains uncertain. We evaluate two leading LLMs on medical diagnostic scenarios, assessing understandability using readability metrics as a proxy and empathy through LLM-as-a-Judge ratings compared to human evaluations.

The results indicate that LLMs adapt explanations to socio-demographic variables and patient conditions. However, they also generate overly complex content and display biased affective empathy, leading to uneven accessibility and support. These patterns underscore the need for systematic calibration to ensure equitable patient communication.

Evaluation Framework

Key Findings

📚 Overcomplexity Problem

Both models generate text at 9th-13th grade reading level, well above the recommended 6th-8th grade for public health materials. This overcomplexity may reinforce health literacy disparities and limit accessibility for general patient populations.
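The page does not specify which readability formulas were used; as a minimal sketch, the Flesch-Kincaid grade level (one widely used grade-level proxy, here via the `textstat` package) can flag over-complex patient explanations:

```python
# Minimal sketch: estimating the reading grade level of an explanation.
# Assumption: Flesch-Kincaid grade is one plausible proxy; the study's
# exact metric choice is not specified on this page.
import textstat

explanation = (
    "Your echocardiogram shows a reduced ejection fraction, indicating "
    "that the left ventricle is not pumping blood as efficiently as expected."
)

grade = textstat.flesch_kincaid_grade(explanation)
print(f"Flesch-Kincaid grade level: {grade:.1f}")

# Flag outputs above the 6th-8th grade range recommended for
# public health materials.
if grade > 8:
    print("Above recommended readability for patient-facing material.")
```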

🧠 Cognitive vs. Affective Empathy Split

Cognitive empathy remains consistently high and stable (~2.8-3.0) across all demographic groups. In contrast, affective empathy shows substantial variation, shaped by medical diagnosis, patient education level, and the choice of evaluator model.

🏥 Diagnosis as Primary Driver

Medical condition has the strongest and most consistent effect on affective empathy. Alzheimer's disease receives the highest empathy scores (~2.2-3.0), while chronic heart disease receives the lowest (~1.6-2.3).

🎓 Education Inverse Relationship

Patients with medical degrees receive lower affective empathy than those with high school education. This suggests LLMs shift to a more technical, less emotionally expressive communication style when addressing medically trained individuals.

⚖️ Systematic Self-Evaluation Bias

GPT inflates its own affective empathy scores (+0.333 points), while Claude deflates its own (-0.256 points). These self-evaluation biases hold consistently across all demographic groups, revealing systematic patterns in model self-assessment.
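As an illustration of how such a bias can be quantified (a sketch with a hypothetical ratings file and column names, not the authors' code), compare each model's scores on its own outputs against the scores the other evaluator assigns to the same outputs:

```python
# Sketch: quantifying self-evaluation bias from a ratings table.
# Assumed (hypothetical) file and columns: generator, evaluator,
# affective_score — one row per (output, evaluator) pair.
import pandas as pd

ratings = pd.read_csv("empathy_ratings.csv")  # hypothetical file

for model in ["gpt", "claude"]:
    own = ratings[(ratings.generator == model) & (ratings.evaluator == model)]
    cross = ratings[(ratings.generator == model) & (ratings.evaluator != model)]
    bias = own.affective_score.mean() - cross.affective_score.mean()
    print(f"{model}: self-evaluation bias = {bias:+.3f}")
```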

👥 Overconfidence vs. Human Judges

GPT rates its own outputs significantly higher than human annotators do (all p < 0.05). Critically, LLMs fail to detect demographic biases that human evaluators identify, such as lower empathy toward African female patients compared with European female patients in the GPT responses selected for human evaluation.
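The page reports p < 0.05 but does not name the test; a sketch of one plausible option, a Wilcoxon signed-rank test on paired per-scenario scores (the data below are synthetic placeholders):

```python
# Sketch: testing whether GPT's self-ratings exceed human ratings of the
# same outputs. The test choice and the data are assumptions; the study
# does not state its procedure on this page.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Hypothetical paired scores on a 0-3 scale, one pair per scenario.
gpt_self = rng.integers(1, 4, size=156).astype(float)
human = np.clip(gpt_self - rng.integers(0, 2, size=156), 0, 3)

stat, p = wilcoxon(gpt_self, human, alternative="greater")
print(f"Wilcoxon statistic={stat:.1f}, p={p:.4f}")
```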

Scenario Design

Our evaluation framework encompasses 156 diagnostic scenarios that systematically vary across multiple dimensions (an illustrative grid enumeration follows the list):

- Medical condition (e.g., Alzheimer's disease, chronic heart disease)
- Patient education level (e.g., high school vs. medical degree)
- Socio-demographic attributes such as ethnicity and gender
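A sketch of how such a factorial scenario grid can be enumerated; the dimension values below are illustrative placeholders drawn from the findings above and do not reproduce the exact 156-scenario design:

```python
# Sketch: enumerating a factorial scenario grid.
# The dimension values are illustrative and do not reproduce the
# study's exact 156-scenario design.
from itertools import product

diagnoses = ["Alzheimer's disease", "chronic heart disease"]
education = ["high school", "medical degree"]
ethnicity = ["African", "European"]
gender = ["female", "male"]

scenarios = [
    {"diagnosis": d, "education": e, "ethnicity": eth, "gender": g}
    for d, e, eth, g in product(diagnoses, education, ethnicity, gender)
]
print(len(scenarios), "scenarios in this illustrative grid")
```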

Evaluation Methodology

We assess model outputs using two complementary approaches (a judge-prompt sketch follows the list):

- Understandability: readability metrics (reading grade level) serve as a proxy for how accessible an explanation is.
- Empathy: cognitive and affective empathy are rated via LLM-as-a-Judge and compared against human evaluations.
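As a minimal sketch of what an LLM-as-a-Judge call can look like (the prompt wording, the 0-3 scale, the `gpt-4o` judge, and the OpenAI client are all assumptions; the study's actual rubric is not reproduced here):

```python
# Sketch of an LLM-as-a-Judge empathy rubric.
# Assumptions: prompt wording, 0-3 scale, and judge model are
# illustrative; the study's actual rubric is not shown on this page.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """\
Rate the following diagnostic explanation on two dimensions,
each from 0 (absent) to 3 (strong):
1. Cognitive empathy: does it acknowledge and reason about the
   patient's perspective and concerns?
2. Affective empathy: does it express emotional warmth and support?

Explanation:
{explanation}

Answer with JSON only: {{"cognitive": <int>, "affective": <int>}}
"""

def judge_empathy(explanation: str) -> dict:
    """Ask the judge model to score one explanation and parse its JSON."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(explanation=explanation)}],
    )
    return json.loads(response.choices[0].message.content)
```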

Broader Impacts

Our study reveals that LLMs, if deployed in medical contexts without careful safeguards, risk amplifying existing health inequities through excessive complexity and biased empathy responses. Transparent, bias-aware evaluation is critical before any clinical integration.

Important Note: This study evaluates LLM-generated synthetic diagnostic communications as an exploratory framework for investigating potential biases. It does not endorse the use of LLMs in actual clinical settings. All scenarios are synthetic; no real patient information is used.

BibTeX

@inproceedings{yao2025the,
    title={The Biased Oracle: Assessing {LLM}s{\textquoteright} Understandability and Empathy in Medical Diagnoses},
    author={Jianzhou Yao and Shunchang Liu and Guillaume Drui and Rikard Pettersson and Alessandro Blasimme and Sara Kijewski},
    booktitle={The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance},
    year={2025},
    url={https://openreview.net/forum?id=mhtDi2d4ZC}
}