Abstract
This study assesses the impact of domain shift on chest X-ray classification accuracy and analyzes the influence of ground truth label quality and demographic factors such as age group, sex, and study year. We used a DenseNet121 model pre-trained on the MIMIC-CXR dataset for deep learning-based multi-label classification, with ground truth labels extracted from radiology reports by the CheXpert and CheXbert labelers. We compared performance on the 14 chest X-ray labels between the MIMIC-CXR dataset and the Veterans Healthcare Administration chest X-ray dataset (VA-CXR). Validation of the ground truth and assessment of multi-label classification performance across the NLP extraction tools showed that the VA-CXR dataset exhibited lower disagreement rates than the MIMIC-CXR dataset. Additionally, AUC scores differed notably between models using CheXpert labels and models using CheXbert labels. When evaluating multi-label classification performance across datasets, minimal domain shift was observed in the unseen VA-CXR dataset, except for the label “Enlarged Cardiomediastinum.” Among the subgroups, study year showed the largest variation in multi-label classification performance. These findings underscore the importance of considering domain shift in chest X-ray classification tasks, with particular attention to the temporality of the exam. Our study shows that domain shift and demographic factors affect chest X-ray classification, emphasizing the need for improved transfer learning and robust model development. Addressing these challenges is crucial for advancing medical imaging research and improving patient care.