Background
Diagnostic errors committed by radiologists can have serious consequences for patient outcomes. Such errors are particularly consequential in oncology, where most diagnostic errors occur during radiological diagnosis. Radiologist fatigue is a significant contributor to diagnostic errors and is expected to worsen as physician workloads continue to climb. Both eye gaze and speech patterns have been studied outside the field of radiology as possible indicators of fatigue, and recent developments in machine learning have enabled more accurate fatigue models. Southwest Research Institute previously collaborated on a project to perform automated pulmonary nodule detection using data from radiology readings. That project produced a dataset of speech, gaze, and fatigue-survey data collected from 23 radiologists as they read chest radiographs containing potential nodules. This dataset provides the opportunity to build novel models that predict radiologist fatigue and to explore its impact on diagnostic errors.
Approach
Radiologist performance was measured by the intersection over union (IoU) of each radiologist's annotations against ground truth. The Swedish Occupational Fatigue Inventory (SOFI) captured self-reported fatigue on five dimensions: lack of energy, physical exertion, physical discomfort, lack of motivation, and sleepiness. We built custom tools to extract salient gaze features (i.e., saccade, fixation, and pupillometry parameters) and speech features (i.e., frequency, energy, and spectral parameters). Each feature set was used independently to compare the mean subject-fold cross-validation performance of a 12-layer feed-forward neural network, a support vector classifier, linear regression, and a DeepConvLSTM. Combinations of speech, gaze, and fatigue features were then used to predict radiologist performance and fatigue levels to determine the best model and feature set.
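As an illustration of this evaluation protocol, the sketch below pairs an intersection-over-union check of reader annotations against ground-truth boxes with a leave-one-subject-out ("subject-fold") comparison of two scikit-learn classifiers. The box format, the synthetic feature matrix, and the specific estimators are assumptions for illustration only, not the project's actual tooling.

    # Minimal sketch (assumed data layout, not the project's pipeline): IoU scoring
    # of annotations against ground truth, plus subject-fold cross-validation in
    # which each fold holds out one radiologist's readings.
    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter) if inter > 0 else 0.0

    def subject_fold_scores(X, y, subjects, model):
        """Accuracy per fold, holding out all readings from one radiologist."""
        scores = []
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
            model.fit(X[train_idx], y[train_idx])
            scores.append(model.score(X[test_idx], y[test_idx]))
        return np.mean(scores), np.std(scores)

    # Placeholder gaze/speech feature matrix, binary labels, and radiologist IDs.
    rng = np.random.default_rng(0)
    X = rng.random((230, 64))
    y = rng.integers(0, 2, size=230)
    subjects = np.repeat(np.arange(23), 10)

    models = {
        "12-layer feed-forward NN": MLPClassifier(hidden_layer_sizes=(64,) * 12, max_iter=500),
        "support vector classifier": SVC(),
    }
    for name, model in models.items():
        mean_acc, std_acc = subject_fold_scores(X, y, subjects, model)
        print(f"{name}: mean accuracy {mean_acc:.2f} (std {std_acc:.2f})")

Grouping the folds by radiologist keeps readings from the same subject out of both the training and test splits, which would otherwise inflate the reported accuracy.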
Accomplishments
A suite of scripts was developed to analyze physician performance on the pulmonary nodule detection task, identify diagnostic errors, and extract gaze and speech features. Statistical analysis of the fatigue surveys revealed that physical discomfort was significantly correlated with task performance (recall: r = 0.31, p < 0.05; F1: r = 0.31, p < 0.05). Across the models individually trained on gaze and speech features, the 12-layer feed-forward neural network performed best. When the 12-layer feed-forward neural network was trained to predict performance and fatigue levels on combinations of these feature sets (gaze, speech, gaze-speech, gaze-fatigue, speech-fatigue, gaze-speech-fatigue), the highest-performing model used the combined gaze and fatigue (gaze-fatigue) feature set. The model that predicted performance levels had a mean accuracy of 0.74, a standard deviation of 0.10, and a maximum accuracy of 0.89 across the subject folds. Additionally, the model that predicted fatigue levels (excluding physical discomfort) had a mean accuracy of 0.98 and a standard deviation of 0.04.
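A minimal sketch of the kind of correlation test reported above, assuming hypothetical per-radiologist arrays of SOFI physical-discomfort scores and detection F1 scores (the random placeholders stand in for the real survey and performance data):

    # Hypothetical per-radiologist values; replace with the SOFI survey scores and
    # the F1/recall values produced by the performance-analysis scripts.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    physical_discomfort = rng.uniform(0, 6, size=23)   # SOFI physical-discomfort ratings
    f1_scores = rng.uniform(0.4, 0.9, size=23)         # nodule-detection F1 per radiologist

    r, p = pearsonr(physical_discomfort, f1_scores)
    print(f"physical discomfort vs. F1: r = {r:.2f}, p = {p:.3f}")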