
Investigating racial bias in AI-driven cardiac imaging


Artificial intelligence is playing an increasingly prominent role in automating segmentation in cardiac magnetic resonance imaging. However, concerns persist about racial bias stemming from imbalanced training datasets. Tiarna Lee, a PhD candidate at King’s College London, is exploring the root causes of this bias through her research, discussed here, to develop fairer and more equitable models.

Cardiac magnetic resonance (CMR) imaging is widely used for diagnosing and predicting cardiovascular conditions. However, studies have shown that AI-based segmentation of cardiac structures can exhibit bias.1

When training datasets are imbalanced, AI models tend to perform better on individuals from majority groups than those from minority groups. This disparity can lead to less accurate clinical biomarkers for under-represented populations, resulting in poorer diagnosis, prognosis and treatment. Increasing the diversity of training data can improve model performance for these under-represented groups.

Bias and the limitations of AI models

AI models are highly sensitive to the data on which they are trained. For example, if a model is trained on 99 images of healthy patients and only one image of a patient with a particular disease, it will become much better at recognising healthy patients than diseased ones.
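
To make this concrete, here is a minimal, purely illustrative sketch (synthetic data, not our study's model) of how such an imbalance plays out: a classifier trained on a 99:1 split can report excellent overall accuracy while missing almost every case in the rare class.

```python
# Toy illustration of the 99:1 imbalance described above (synthetic data):
# a classifier can score ~99% accuracy while misclassifying nearly every
# example of the rare "diseased" class.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)

# 990 "healthy" samples and 10 "diseased" samples with overlapping features
X_healthy = rng.normal(loc=0.0, scale=1.0, size=(990, 5))
X_diseased = rng.normal(loc=0.7, scale=1.0, size=(10, 5))
X = np.vstack([X_healthy, X_diseased])
y = np.array([0] * 990 + [1] * 10)

clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)

print(f"Overall accuracy: {accuracy_score(y, pred):.2f}")  # looks excellent
print(f"Diseased recall:  {recall_score(y, pred):.2f}")    # often near zero
```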

A key concern is that AI models may rely on superficial patterns, such as race, as a shortcut to a diagnosis. This phenomenon, known as shortcut learning, occurs when a model bases its decisions on irrelevant or misleading features rather than the true underlying indicators of disease.

For example, imagine two hospital wards: one uses Scanner A to image severely ill Covid-19 patients, while the other uses Scanner B for milder cases. The AI model might learn to associate Scanner A with severe illness, not because it understands the disease, but because it picks up on subtle scanner-specific image artefacts. As a result, if a severely ill patient is examined using Scanner B, the model might wrongly classify them as less sick due to this incorrect shortcut.

This type of flawed reasoning can also apply to race. If disease prevalence differs between racial groups in the training data, the model might learn to use race as a proxy for diagnosis. Such bias can undermine clinical decision-making, potentially leading to misdiagnosis or unequal treatment depending on the scanner or other spurious correlations learned by the AI.
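
The scanner scenario above can be reproduced in miniature. The sketch below is a hedged toy example with synthetic features: a weak true disease signal and a strong scanner artefact that coincides with severity during training. When that pairing is reversed at test time, performance collapses.

```python
# Minimal sketch of shortcut learning (synthetic data, illustrative only):
# during training, "Scanner A" (modelled as a strong artefact feature) happens
# to coincide with severe cases, so the model latches onto the artefact rather
# than the disease signal and fails when the scanner/severity pairing flips.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_data(n, severe_on_scanner_a):
    severity = rng.integers(0, 2, size=n)                 # 1 = severe
    scanner = severity if severe_on_scanner_a else 1 - severity
    disease_signal = severity * 0.3 + rng.normal(0, 1, n)  # weak true signal
    artefact = scanner * 2.0 + rng.normal(0, 0.1, n)       # strong scanner cue
    return np.column_stack([disease_signal, artefact]), severity

X_train, y_train = make_data(1000, severe_on_scanner_a=True)
X_test, y_test = make_data(1000, severe_on_scanner_a=False)  # pairing reversed

clf = LogisticRegression().fit(X_train, y_train)
print(f"Train accuracy: {clf.score(X_train, y_train):.2f}")  # high: shortcut works
print(f"Test accuracy:  {clf.score(X_test, y_test):.2f}")    # collapses below chance
```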

What did our cardiac imaging AI study entail?

Our study used a race classification model to determine whether CMR images could be used to identify a patient’s race.2 We found that not only the raw images but also their segmentations contained enough information for the model to classify race accurately.

To understand which features the model was using, we analysed the regions contributing most to the classifications. Surprisingly, the model relied on areas outside the heart, such as subcutaneous fat and image artefacts, rather than cardiac structures. When the images were cropped to focus solely on the heart, effectively removing surrounding fat and artefacts, the model’s ability to classify race dropped significantly. Conversely, when we used a dataset in which the heart was blurred out entirely, essentially removing it, the model could still classify race accurately. This suggests that most of the race-related information recognisable to this model was located outside the heart.
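
As a rough illustration of the blurring experiment, the sketch below shows how such a test could be set up. The names `images`, `heart_masks`, `labels` and `race_classifier` are placeholders standing in for the study's actual data and model, not published code.

```python
# Hedged sketch of the "blur out the heart" experiment. Assumes `images`
# (N, H, W), binary `heart_masks` (N, H, W) derived from ground-truth
# segmentations, integer `labels`, and a trained `race_classifier` with a
# scikit-learn-style predict() method -- all hypothetical names.
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_heart(image, mask, sigma=8.0):
    """Replace the heart region with a heavily blurred version of the image."""
    blurred = gaussian_filter(image, sigma=sigma)
    return np.where(mask > 0, blurred, image)

def accuracy_without_heart(images, heart_masks, labels, race_classifier):
    blurred = np.stack([blur_heart(im, m) for im, m in zip(images, heart_masks)])
    preds = race_classifier.predict(blurred)
    return (preds == labels).mean()

# If accuracy stays high on heart-blurred images (as reported above), the
# race-related signal the model exploits must lie outside the heart itself.
```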

We investigated whether cropping the images could reduce bias in segmentation models. Despite removing features such as the subcutaneous fat from the images, the segmentation models still exhibited racial bias. We then performed confounder analysis and found that factors such as the MRI scan year and high-density lipoprotein cholesterol were not correlated with segmentation performance for White subjects but were correlated for Black participants. This indicates that underlying biases persist even after removing visible race-related features.
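
A confounder analysis of this kind can be sketched as follows, assuming per-subject results are gathered in a pandas DataFrame; the column names here are illustrative rather than taken from our study.

```python
# Illustrative per-group confounder analysis. Assumes a DataFrame `df` with
# columns 'dice' (segmentation performance), 'race', 'scan_year' and
# 'hdl_cholesterol' (hypothetical names, not the study's actual schema).
import pandas as pd
from scipy.stats import spearmanr

def confounder_correlations(df, confounders=("scan_year", "hdl_cholesterol")):
    rows = []
    for race, group in df.groupby("race"):
        for conf in confounders:
            rho, p = spearmanr(group["dice"], group[conf])
            rows.append({"race": race, "confounder": conf, "rho": rho, "p": p})
    return pd.DataFrame(rows)

# A confounder that correlates with Dice for one group but not another (as
# found for Black vs White subjects here) points to residual group-specific bias.
```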

Increasing the representation of Black subjects in the training data from 0% to 25% substantially improved model performance for this group without compromising accuracy for the White majority group. This highlights the importance of diverse and balanced datasets in developing fairer AI models.

How effective was cropping images around the heart in reducing AI bias?

Cropping the images around the heart reduced the model’s ability to classify race, with accuracy dropping to 55% – close to the 50% chance level for binary classification. This suggests that images of the heart alone contain little race-related information, and that cropping can reduce bias in classification models.

However, this effect did not translate to segmentation models, where bias remained even after cropping. It is important to note that the cropping in our study was based on ground-truth segmentations, which are not available when a new patient undergoes cardiac imaging in real-world clinical settings. To implement this approach in practice, an additional method would be needed to automatically identify and crop the cardiac region; one potential solution is a bounding box detection model that localises and extracts the heart region before segmentation, as sketched below.
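
For the cropping step itself, a minimal sketch looks like the following; in our study the binary mask came from ground-truth segmentations, whereas in deployment it would come from a detection or segmentation model. Names and the margin value are illustrative.

```python
# Minimal sketch: crop an image to the bounding box of a binary heart mask.
import numpy as np

def crop_to_heart(image, heart_mask, margin=10):
    """Crop a 2D image to the heart mask's bounding box plus a pixel margin."""
    ys, xs = np.nonzero(heart_mask)
    if ys.size == 0:
        raise ValueError("Mask is empty: no heart region found.")
    y0, x0 = max(ys.min() - margin, 0), max(xs.min() - margin, 0)
    y1 = min(ys.max() + 1 + margin, image.shape[0])
    x1 = min(xs.max() + 1 + margin, image.shape[1])
    return image[y0:y1, x0:x1]
```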

Bias mitigation methods can help ensure that AI models perform fairly across different demographic groups. For example, we found that oversampling under-represented groups effectively improved model performance for minority populations, bringing it in line with the performance seen in the majority group.3 Oversampling works by presenting images from the under-represented group to the model more frequently during training, giving each group equal importance and helping the model learn balanced representations.
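
In practice, oversampling is straightforward to implement. The sketch below shows one common recipe using PyTorch's WeightedRandomSampler; this is an assumption about tooling rather than the exact code used in our work, and `dataset` and `group_labels` are placeholders.

```python
# Hedged sketch of group-level oversampling with PyTorch: samples from
# under-represented groups are drawn more often, so each group contributes
# roughly equally per epoch. `dataset` and `group_labels` are placeholders.
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, group_labels, batch_size=16):
    groups, counts = np.unique(group_labels, return_counts=True)
    group_weight = {g: 1.0 / c for g, c in zip(groups, counts)}  # rarer => heavier
    sample_weights = torch.tensor(
        [group_weight[g] for g in group_labels], dtype=torch.double
    )
    sampler = WeightedRandomSampler(
        weights=sample_weights, num_samples=len(dataset), replacement=True
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```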

In contrast, other bias mitigation techniques, and even combinations of them, were less effective at reducing disparities. This highlights the importance of choosing the right mitigation strategy for the specific context and data.

The potential risk of bias beyond CMR scans

Biases are prevalent across various medical imaging modalities. For instance, chest X-ray classification models were found to be less accurate for certain demographic groups: female patients, patients under 20 years of age, and Black and Hispanic patients all experienced higher rates of underdiagnosis.4

Another study found that these models performed more accurately for male and older patients, highlighting disparities based on age and sex.5

Similar biases have been observed in dermatology.6 Model performance has varied based on age, sex and skin tone, with notably poorer results for individuals with darker skin tones.7 However, fine-tuning models using more diverse datasets helped reduce these performance gaps.

Research has also shown that self-identified race can be predicted from chest X-ray images alone, indicating that these findings generalise to other imaging domains.8 The authors highlight that adopting a ‘colourblind’ approach to training AI models may not be feasible, as the models may still infer race from subtle, non-obvious features.

Machine learning models are often ‘black boxes’, meaning the reasons behind their decisions are neither interpretable nor transparent. To build trust in machine learning models, developers must implement methods that make these decisions more apparent to clinicians and patients. This will also help uncover potential biases in the systems.
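
Gradient-based saliency maps are one simple example of such a method, and only one of many; it is not necessarily the attribution technique used in any given study. A minimal PyTorch sketch, assuming a trained classifier `model` and an input `image` of shape (1, C, H, W):

```python
# Simple gradient-based saliency sketch: highlights the input pixels whose
# small changes would most affect the predicted score for `target_class`.
import torch

def saliency_map(model, image, target_class):
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]
    score.backward()
    # Pixels with large gradient magnitude most influence the prediction.
    return image.grad.abs().max(dim=1)[0].squeeze(0)  # (H, W) heatmap
```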

Conclusion

Our study explored the root causes of bias in AI-driven CMR segmentation. We found that the differences between White and Black subjects mainly stem from variations in the images rather than the segmentations. These variations appear to be linked to differences in body fat composition outside the heart, which likely drive the distributional shift and resulting bias. While cropping the images to focus on the heart reduces this bias, it does not eliminate it. These findings will be important for researchers striving to develop fairer AI models for CMR segmentation.

These insights are not limited to cardiac imaging: similar biases have been observed across other modalities, including chest X-rays and dermatology, underscoring the generalisability of our findings.

It is crucial to develop and adopt bias mitigation strategies, ensure diversity in training data and foster transparency through explainable models. Multidisciplinary collaboration between data scientists, clinicians and imaging specialists will also be key to building equitable and trustworthy AI tools that serve all patient populations fairly.

Author

Tiarna Lee MEng
School of Biomedical Engineering and Imaging Sciences, King’s College London, UK

References

  1. Lee T et al. A systematic study of race and sex bias in CNN-based cardiac MR segmentation. STACOM 2022. Lecture Notes in Computer Science 2023.
  2. Lee T et al. An investigation into the causes of race bias in artificial intelligence-based cine magnetic resonance segmentation. Eur Heart J Digit Health 2025;6(3):350–8.
  3. Lee T et al. Does a rising tide lift all boats? Bias mitigation for AI-based CMR segmentation. arXiv:2503.17089 [Preprint]. 2025.
  4. Seyyed-Kalantari L et al. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med 2021;27:2176–82.
  5. Seyyed-Kalantari L et al. CheXclusion: fairness gaps in deep chest X-ray classifiers. Pac Symp Biocomput 2021;26:232–43.
  6. Abbasi-Sureshjani S et al. Risk of training diagnostic algorithms on data with demographic bias. Lecture Notes in Computer Science 2020;12446:183–92.
  7. Daneshjou R et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci Adv 2022;8(32):eabq6147.
  8. Gichoya JW et al. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit Health 2022;4(6):e406–14.