Results

Figure 3: Model performance in the holdout test dataset and external validation dataset as a function of observation period
(A) AUROC as a function of time after ICU admission and (B) AUROC as a function of time before ICU discharge. The metrics for each timepoint in the graphs are displayed in the tables below with 95% CIs in parentheses. AUROC=area under the receiver operating characteristic curve. MCC=Matthews correlation coefficient. PPV=positive predictive value. NPV=negative predictive value. LRP=likelihood ratio positive. LRN=likelihood ratio negative. Prop=proportion of the total number of test patients admitted at a given timepoint. ICU=intensive care unit.
Table: Baseline characteristics of the ICU patients in the training, test, and external validation datasets
Data are n (%) or median (IQR). For patients with multiple admissions, the data provided are from the first admission. ICU=intensive care unit.

Figure 4: The impact of the input features on predictions
(A) The model includes both continuous and binary input features. Continuous features vary from low to high values, whereas binary features are either present or absent. Each dot represents the impact of a feature on the mortality prediction for one patient at a given point in time. Because new mortality estimates are generated hourly, any given patient can be represented multiple times, depending on the duration of the ICU admission. Dots to the left represent patients with feature values that pull the prediction towards survival, and dots to the right represent patients with feature values that pull the prediction towards non-survival. (B) The theoretical impact of the input features in the original SAPS III model on predictions. The calculations are based on a distribution of SAPS III admission scores between 25 and 110. The calibration for northern Europe is used. ICU=intensive care unit. dept=department. NYHA=New York Heart Association. SAPS=Simplified Acute Physiology Score. *Binary feature. †Combined kidney and pancreas, or other transplantation.
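The dot-level view in panel A corresponds to per-prediction SHAP values. As a rough, self-contained sketch of how such a summary (beeswarm) plot can be produced, the example below trains a toy gradient-boosting classifier on synthetic data in place of the actual LSTM; all names, features, and data are illustrative only.

```python
# Self-contained sketch of a SHAP summary ("beeswarm") plot like panel A.
# A toy gradient-boosting classifier on synthetic data stands in for the LSTM;
# the point is only how per-prediction feature impacts are computed and plotted.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                    # 500 patient-hours, 5 illustrative features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic 90-day mortality label

clf = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(clf)              # exact SHAP values for tree ensembles
shap_values = explainer.shap_values(X)           # one value per feature per prediction

# Dots left of zero pull the prediction towards survival, dots right of zero
# towards non-survival; colour encodes the underlying feature value.
shap.summary_plot(shap_values, X,
                  feature_names=[f"feature_{i}" for i in range(5)])
```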

Figure 5: Impact of input features on the dynamic mortality prediction for a single patient using our model (A) and the original SAPS III score (B) for the first 3 days of ICU admission
(A) Lines show the mortality predictions from our model and the SAPS III score as they evolve over time (ie, higher values indicate higher mortality risk and lower values indicate a higher chance of survival). The shaded areas show the three most important input features driving the 90-day mortality prediction towards either non-survival (orange) or survival (blue) during the first 72 h of the ICU admission. High opacity reflects high relative feature importance. Numbers are used to identify the features; labels are added whenever a feature is outranked by another. (B) The mortality prediction using the SAPS III model with the three most important features driving the prediction towards non-survival. For the depicted patient, there are no features pulling in the direction of survival. ICU=intensive care unit. SAPS=Simplified Acute Physiology Score. *In the machine learning model, the SAPS III oxygenation variable is split into its sub-components of mechanical ventilation and PaO2/FiO2 ratio.

Figure 6: Impact of input features on the dynamic mortality prediction for the full holdout test dataset population (2825 admissions) for the first 3 days of ICU admission
The relative importance of all 44 features on the mortality predictions for the full holdout test dataset during the first 72 h of ICU admission. The left column shows contributions driving predictions towards non-survival whereas the right column shows those driving towards survival. ICU=intensive care unit. dept=department. NYHA=New York Heart Association. SAPS=Simplified Acute Physiology Score. *Combined kidney and pancreas, or other transplantation.
Discussion
In this study, we developed a risk prediction model providing dynamic, individual predictions of 90-day mortality for ICU patients. The model was trained on 44 binary and continuous SAPS III variables from more than 9000 patients hospitalised in four ICUs in the Capital Region of Denmark between 2011 and 2016. The model was updated at 1-h intervals and calibrated for more reliable predictions. Model performance increased over time, reaching an AUROC of 0·88 (95% CI 0·87–0·89) and an MCC of 0·57 (0·54–0·60) at the time of discharge. The model was made explainable, and the top features driving the mortality prediction were identified both for individual patients and for the full holdout test dataset. Importantly, in the analysis of individual mortality predictions over time, we found that one feature can drive the prediction towards survival at one timepoint and towards non-survival at another. The predicted outcome itself varies: at the time of admission, it encompasses in-ICU, in-hospital, and post-discharge mortality. Patients who die in the ICU are probably quite different from patients who are discharged and die at home before the 90-day mark. Thus, the model adapts to account for the changing nature of the predicted outcome, making it more useful than one-off scores such as SAPS, which are computed only once with data obtained during the first day of ICU admission. This finding underpins the importance of continuously updated decision support tools, which adapt to the evolving clinical picture and provide real-time guidance to clinicians. Such dynamic tools are likely to be more useful than the static scores that are currently implemented. Clinical decision making in the early stages of admission—eg, whether to commence treatment and how aggressively to treat a patient—is very different from decisions made later on, such as whether to withdraw life-sustaining treatment.
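For readers who want to reproduce discrimination metrics of this kind, the sketch below shows how an AUROC with a percentile-bootstrap 95% CI and an MCC could be computed with scikit-learn; the outcome and probability arrays are synthetic placeholders, and the bootstrap is only one possible way to obtain CIs such as those reported here.

```python
# Illustrative computation of AUROC (with a percentile-bootstrap 95% CI) and MCC
# at a single timepoint. y_true and y_prob are synthetic stand-ins for observed
# 90-day mortality and the model's predicted probabilities in a test set.
import numpy as np
from sklearn.metrics import roc_auc_score, matthews_corrcoef

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, size=1000), 0.0, 1.0)

auroc = roc_auc_score(y_true, y_prob)
mcc = matthews_corrcoef(y_true, (y_prob >= 0.5).astype(int))   # MCC requires a hard threshold

boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))       # resample with replacement
    if len(np.unique(y_true[idx])) < 2:                        # need both classes for AUROC
        continue
    boot.append(roc_auc_score(y_true[idx], y_prob[idx]))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"AUROC {auroc:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), MCC {mcc:.2f}")
```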
We also found that certain features could compensate for one another. For example, the negative effect of a low minimum SBP could be countered by a high maximum SBP within the same hour. Thus, the occurrence of low SBP values has less impact if the condition is correctable, which makes sense from a clinical point of view. Overall, we see that features have complex interactions over time, and this ambiguity of features further emphasises the need for real-time, machine learning-based decision support. We note that the features interact in a complex, non-linear manner, unlike the two-way or three-way interactions often used in generalised linear models. The LSTM architecture allows us to model complex, multidimensional interactions, but this also makes clinical interpretation of the results more difficult and thus requires caution.
The choice of method reflects the nature of a dynamic patient-level prediction problem from the perspective of the clinician: a patient's mortality risk is constantly evolving and depends on the past as well as the current condition. An LSTM network is a special kind of recurrent neural network composed of LSTM units that capture long-range dependencies in sequential data via gated cells, which determine whether or not to retain information based on the importance assigned to it. In this way, an LSTM-based machine learning prediction model—unlike, for example, a logistic regression model—learns both from the temporal development of the features and from the interactions between them.
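To make the LSTM idea concrete, here is a minimal Keras sketch of a recurrent classifier over hourly feature sequences. The 44-feature input follows the description above, but the layer sizes, 72-hour window, and training settings are assumptions made for illustration, not the configuration used in this study.

```python
# Minimal sketch of an LSTM network for dynamic mortality prediction on hourly
# ICU data (44 SAPS III-derived features per hour). Architecture and settings
# are illustrative assumptions, not the published model configuration.
import numpy as np
import tensorflow as tf

N_FEATURES = 44      # binary and continuous SAPS III-derived inputs
MAX_HOURS = 72       # example fixed-length training window

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_HOURS, N_FEATURES)),
    tf.keras.layers.Masking(mask_value=0.0),         # skip padded hours in shorter stays
    tf.keras.layers.LSTM(64),                        # gated cells keep or forget past information
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of 90-day mortality
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])

# Toy data: 32 admissions, each a 72-hour sequence of 44 features, with a binary outcome.
X = np.random.rand(32, MAX_HOURS, N_FEATURES).astype("float32")
y = np.random.randint(0, 2, size=(32, 1))
model.fit(X, y, epochs=1, verbose=0)

# At prediction time, the same network can be queried hour by hour as data accrue,
# yielding an updated mortality estimate after each new hour of observations.
```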
External validation studies of the original SAPS III model by Moreno and colleagues have revealed AUROCs of 0·69 (95% CI 0·63–0·75) in Denmark (Christensen and colleagues) and 0·81 (0·79–0·93) in Norway (Strand and colleagues). Our model had a higher predictive performance compared with these results, which we confirmed using an external dataset obtained after the study was completed.
As in our study, Meyer and colleagues base their model on routinely collected data. However, they only report the discriminative performance of the model, not its calibration, which is an important issue if the model is intended for making predictions for single patients. Furthermore, the matter of model explainability is not addressed.
In contrast to our study, Meiring and colleagues report that the best performance is achieved around 2 days into the ICU admission. The reason for this discrepancy might be that they only use daily measurements for each feature and that their neural network architecture is not well suited to time-series data. Hence, the full information hidden in the temporal trends of the data was not exploited.
Automated decision models based on machine learning also raise legal and ethical concerns (Cohen and colleagues; Regulation [EU] 2016/679 of the European Parliament and of the Council, the General Data Protection Regulation), and there are several examples of dubious conclusions drawn from such models. An advantage of our model is that SHAP made it explainable, both in terms of the importance of individual features for ICU patient survival in general and at the level of the individual patient at any given timepoint. Thus, using such a model for decision support gives the clinician real-time information on the patient's risk of dying and on the specific features currently pulling towards non-survival. The importance of a continuously updated mortality prediction is shown by our finding that one feature can drive predictions towards either survival or non-survival depending on the timepoint of prediction during the ICU stay.
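A hedged sketch of the patient-level view described here: given per-feature SHAP values for one patient at one timepoint (computed, for example, as in the earlier SHAP sketch), the features currently pulling hardest in either direction can be ranked by absolute value. The feature names and numbers below are invented for illustration.

```python
# Rank the features that currently drive one patient's prediction, given one
# SHAP value per feature at a single timepoint. All values here are invented.
import numpy as np

feature_names = np.array(["min SBP", "max SBP", "mechanical ventilation", "age", "pH"])
shap_values_t = np.array([0.8, -0.5, 1.2, 0.3, -0.1])    # positive = towards non-survival

order = np.argsort(-np.abs(shap_values_t))               # most influential first
for name, value in zip(feature_names[order][:3], shap_values_t[order][:3]):
    direction = "non-survival" if value > 0 else "survival"
    print(f"{name}: pulls towards {direction} (SHAP {value:+.2f})")
```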
LOCF can be problematic in at least two ways: it can distort temporal covariate trends, causing misclassification of exposures and bias in unpredictable ways, and it can introduce statistically dependent replicates. We would argue that it makes sense in our case, because the absence of a datapoint is not necessarily devoid of information. Indeed, much can be inferred when it comes to measurements, especially in a setting as controlled as that of the ICU: the very absence of, for example, a pH value might simply mean that the physician actively chose not to run the analysis again because there was no need. This is arguably often the reality, so carrying the most recent values forward as proxies for missing values might be clinically meaningful. Along the same lines, artificial missingness was introduced by the up-sampling of variables not measured every hour. In this case, the use of LOCF simply imitates the reasoning of a medical professional, who would derive a clinical assessment on the basis of the available knowledge. The use of LOCF yields replicates that are not statistically independent and could be considered a form of pseudo-replication. This could affect model performance and generalisability, but through cross-validation and regularisation during training, we expect this to have little real effect.
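As a small illustration of LOCF on an hourly grid, the pandas sketch below forward-fills variables that are measured less often than hourly; the variables and values are invented for the example.

```python
# Illustrative LOCF (forward fill) on an hourly grid with pandas.
# Variables measured less often than hourly keep their most recent value
# until a new measurement arrives; the values below are invented.
import numpy as np
import pandas as pd

hourly = pd.DataFrame(
    {
        "hour": [0, 1, 2, 3],
        "pH": [7.31, np.nan, np.nan, 7.38],        # arterial pH measured sporadically
        "lactate": [2.4, np.nan, 1.9, np.nan],     # lactate measured sporadically
    }
).set_index("hour")

filled = hourly.ffill()    # last observation carried forward along the time axis
print(filled)
```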
Our study is retrospective and based on ICU data from a densely populated but geographically small area of Denmark. Hence, the model might be biased and reflect the clinical guidelines and treatment decisions made in this area. However, using data from fairly homogeneous settings can render the model useful for exactly the kinds of patients that physicians encounter in their daily work. Moreover, the physicians who recorded the data, ordered the tests, and intervened when they saw fit did so with the aim of providing the best possible patient care and did not consider that the data might subsequently be used for prediction purposes. In this way, although retrospective, the data are unlikely to reflect the information bias that is otherwise known to haunt prospective studies.
During training, the model was optimised to predict the chance of survival 90 days from ICU admission given data series with a fixed length. The trained model is able to make predictions on new data of varying lengths between admission and discharge, but it is not very accurate at the point of ICU admission. At this point, few datapoints are available to the model, whereas LSTM models are explicitly optimised for time-series prediction and therefore rely on temporal trends that have not yet emerged. We acknowledge that an ideal model should also perform better at early stages, but at admission it is not possible to extract information about temporal trends; such a model would serve another purpose than the present one and would require a different design or a combination of methods.
The present model is based on relatively few variables taken from SAPS III and, as such, the study is intended as a proof of concept. However, replication in a large validation dataset obtained from another geographical region could verify its robustness and confirm whether it is ready to be turned into a clinical decision support tool to be tested in a randomised controlled trial. We intend to do future studies using this model and are working on implementing it in our new electronic medical record system (Epic; Verona, WI, USA).
Compared with the model reported by Nielsen and colleagues, our model provides a more accurate prediction of mortality, probably because of the high granularity of the ICU data included in our LSTM model. Thus, adding more detailed disease history to the present model might increase performance even further, including at the time of ICU admission. Additionally, much information is hidden in the clinical notes in which physicians and other health-care professionals record detailed phenotypic data. Adding such data might also improve the predictive ability.
The next step in the process of establishing clinically applicable machine learning models is randomised clinical trials (Shah and colleagues).
In conclusion, we developed an explainable LSTM model for ICU 90-day mortality prediction from a total dataset of more than 14 000 admissions of 11 000 patients from four mixed ICUs in Copenhagen, Denmark, with external validation. The predictive performance improved over the timecourse of an ICU stay. Model interpretation showed that input features can interact and compensate for one another and can pull towards survival at one timepoint and towards non-survival at another. None of these observations can be obtained from current static prognostic scores. Yet, before this kind of model can be used as a bedside tool, the results need to be confirmed in a randomised clinical trial.
SB and AP conceived the study, which was designed in detail by H-CT-M. H-CT-M did the data analysis, which was interpreted by H-CT-M, ABN, APN, BSK-H, PT, JS, TS, KB, SB, and AP. PJC, MH, LD, LS, TS, and PH extracted and handled the data. All authors contributed to the preparation of the Article and approved the final version.
LD, LS, and PH are employed by Daintel (as of Jan 1, 2020, Cambio has purchased Daintel), which is taking part in the BigTempHealth project funded by the Danish Innovation Fund, grant 5184–00102B. AP reports grants from Ferring and the Novo Nordisk Foundation. SB reports personal fees from Intomics and Proscion, as well as grants from the Novo Nordisk Foundation (grants NNF17OC0027594 and NNF14CC0001). All other authors declare no competing interests.