For model development, we obtained data on 12 616 patients with 15 615 ICU admissions. 11 492 patients with 14 190 ICU admissions were eligible for inclusion in the model development dataset, of whom we allocated approximately 20% (2299 patients with 2825 admissions) to the holdout test dataset (figure 2). The table shows baseline characteristics of the patients in the training dataset and holdout test dataset, using the data from their first ICU admission. In the development dataset, the median age was 65 years (IQR 52–75) and 4816 (41·9%) patients were female. 1815 (15·7%) patients died in the ICU, 3389 (29·5%) in hospital, and 3802 (33·1%) by 90 days after ICU admission.
When predicting 90-day mortality after ICU admission, the predictive performance of our model increased over time (figure 3). The AUROC upon ICU admission in the holdout test dataset was 0·73 (95% CI 0·71–0·74) and increased to 0·85 (0·84–0·87) at 72 h. The corresponding MCCs were 0·29 (0·25–0·33) and 0·50 (0·46–0·53), respectively. When evaluating performance relative to time elapsed since admission, the time of prediction will, for some patients, approach the time of death. To deal with this issue, we also evaluated the predictive performance relative to the time of discharge (figure 3). At time of discharge from the ICU, the model achieved an AUROC of 0·88 (0·87–0·89) in the holdout test dataset, whereas 24 h before discharge the AUROC was 0·82 (0·80–0·84); the corresponding MCCs were 0·57 (0·54–0·60) and 0·46 (0·42–0·50), respectively.
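For reference, the MCC summarises a binarised prediction's confusion matrix in a single chance-corrected coefficient ranging from −1 to 1. The following is a minimal, self-contained sketch of the calculation (the function name and data are ours for illustration, not part of the study's evaluation pipeline):

```python
from math import sqrt

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # convention: MCC is 0 when any marginal is empty
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, the MCC is robust to the class imbalance typical of mortality outcomes, which is presumably why it was reported alongside the AUROC.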
We assessed the external validity of the model on our external validation dataset, comprising 5827 unique ICU patients with a total of 6748 admissions (figure 2). Overall, the model was less accurate in the external validation dataset according to both AUROC and MCC, with an MCC of 0·29 (95% CI 0·27–0·32) and an AUROC of 0·75 (0·73–0·76) at admission, 0·41 (0·39–0·44) and 0·80 (0·79–0·81) after 24 h, 0·46 (0·43–0·48) and 0·82 (0·81–0·83) after 72 h, and 0·47 (0·44–0·49) and 0·83 (0·82–0·84) at the time of discharge (figure 3). However, in the initial part of the ICU stay, the model performed slightly better in the external validation dataset than in the holdout test dataset. Given differences in mortality rates and premorbid status of the patients in the two populations, such as a substantially lower proportion of patients with heart failure in the training data (table), a decrease in model performance was to be expected.
Table: Baseline characteristics of the ICU patients in the training, test, and external validation datasets
Data are n (%) or median (IQR). For patients with multiple admissions, the data provided are from the first admission. ICU=intensive care unit.
The calibration plots show that the hourly predictions, taken at face value, consistently overestimate the risk, whereas the isotonic predictions lie snugly around the diagonal (appendix p 3); early and late predictions deviate more, probably due to less available information (early predictions) and fewer patients (late predictions). Very early predictions are generally inferior (ie, less well calibrated), but the estimates converge after a few hours (appendix p 4). The compound isotonic calibration slope of 1·00 (95% CI 0·99 to 1·01) and intercept of −0·01 (−0·01 to −0·01) are close to ideal and robust to changes in binning of predictions (appendix p 7).
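Isotonic calibration fits a non-decreasing map from raw model outputs to observed event rates; the classic fitting procedure is the pool-adjacent-violators (PAV) algorithm. Below is a simplified plain-Python sketch of PAV (the function name is ours; the study presumably used a library implementation such as scikit-learn's IsotonicRegression rather than this code):

```python
def pav_calibrate(scores, outcomes):
    """Pool adjacent violators: merge neighbouring blocks whose observed
    event rates violate monotonicity, yielding a non-decreasing step
    function from raw score ranges to calibrated probabilities."""
    pairs = sorted(zip(scores, outcomes))
    blocks = []  # each block: [sum_of_outcomes, count, low_score, high_score]
    for score, outcome in pairs:
        blocks.append([outcome, 1, score, score])
        # merge while the previous block's event rate exceeds the current one's
        # (cross-multiplied to avoid division)
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, n, _, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][3] = hi
    return [(lo, hi, s / n) for s, n, lo, hi in blocks]
```

A calibration slope near 1 and intercept near 0, as reported above, indicate that this monotone remapping leaves predicted risks in close agreement with observed mortality.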
When considering the contribution of each of the 44 features in the model (figure 4), it is not surprising that age at admission has the greatest impact on the predictions, with older age driving the predictions towards non-survival and younger age driving the predictions towards survival. This is in keeping with the fact that age is the variable potentially yielding the second-most points in the SAPS III score. Most binary features, when present, influence mortality prediction in a unidirectional manner towards either survival or non-survival (eg, an admission type of scheduled surgery pulls the prediction towards survival). For non-binary features in general, low values drive mortality prediction towards either survival or non-survival and high values drive the prediction in the opposite direction, although exceptions exist. An example is low median heart frequency (second top feature; figure 4), which generally drives mortality prediction towards survival but can be seen to drive predictions towards non-survival for some patients. For comparison, figure 4B illustrates the contributions of the features in the original SAPS III score. There are two marked differences: for all but one feature, the effect on mortality is unidirectional in the SAPS III score, whereas in our model, features can drive the prediction in either direction; and in the SAPS III score, individual features can generally have a greater effect on the prediction than in our model.
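The per-feature contributions described above are SHAP values, which approximate classic Shapley values: a prediction is attributed to features by averaging each feature's marginal contribution over all orderings in which features are revealed. A brute-force toy illustration of the underlying idea (the study used the SHAP library with a deep model, not this code; function names and the linear toy model are ours):

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values for a small feature set: average the marginal
    contribution of each feature over all orderings, with features not yet
    'revealed' held at a baseline (reference) value."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]     # reveal feature i
            val = model(current)
            phi[i] += val - prev  # marginal contribution in this ordering
            prev = val
    return [p / len(perms) for p in phi]
```

By construction the attributions sum to the difference between the prediction and the baseline prediction, which is what lets contributions be drawn as ribbons pulling the risk up or down, as in figures 4 and 5.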
The dynamic risk prediction can also be explained at any given time for a particular patient; we illustrate a representative case from the holdout test cohort (figure 5). The patient was an 83-year-old female with a history of hypertension and paroxysmal atrial fibrillation. She was transferred from the medical ward to the ICU with hypoxic respiratory failure due to a community-acquired pneumonia, with a SAPS III score of 75 points at admission. She was initially treated with intermittent non-invasive ventilation but intubated 26 h after admission due to insufficient treatment response. Due to sedation and atrial fibrillation with a rapid ventricular response, the patient developed hypotension and vasopressor treatment was initiated 36 h after admission. Her condition gradually deteriorated from 40 h onwards, with an increasing oxygen demand and development of delirium. The patient died in the ICU 98 h after admission. In this case, age at admission drives mortality prediction towards non-survival throughout (dark orange ribbon), whereas median heart frequency is the most important feature pulling the prediction down towards survival for the bulk of the stay, along with median SBP and leucocytes. Some features can drive the prediction towards survival at one timepoint and towards non-survival at other timepoints (eg, minimum Glasgow Coma Scale [GCS]) or vice versa (eg, leucocytes), and others can oscillate between the two (eg, minimum SBP; figure 5A). In the same patient, the three most important contributions to the SAPS III model prediction at each hour all drive towards non-survival, with age at admission the most influential, followed by intra-hospital location before ICU and oxygenation, which are occasionally overtaken in importance by maximum heart frequency and minimum SBP (figure 5B). 
When considering the relative importance of all included features on the predictions for the full holdout test dataset over time, as illustrated by the mean rank, we note that mechanical ventilation gains importance (relative to the other features) over time, whereas creatinine, leucocytes, and platelets seem to lose importance, with their respective ranks diminishing (figure 6).
We further detail the importance of selected features and provide a visual example of how they interact (appendix p 9). When considering the GCS, it is clear that lower GCS is associated with a higher relative risk of non-survival (appendix p 9). Yet, due to feature interactions, the range of relative risks within each GCS level is quite wide; the imputed GCS values (using the population mean, which lay between 12 and 13) have essentially no impact on the prediction. The same pattern is observed for minimum and maximum hourly SBP, with low SBP associated with increased relative risk of non-survival and vice versa (appendix p 9). The final graph shows how the contributions from minimum and maximum SBP interact: the negative effect of having low minimum SBP can be countered by high maximum SBP within the same hour.
We additionally made a decision curve analysis to quantify the potential benefit of guiding treatment based on predictions from our model (appendix p 5).
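For context, a decision curve plots net benefit against the risk threshold at which one would act, weighing true positives against false positives penalised by the odds of that threshold. A hypothetical sketch of the net-benefit calculation (function name and data are ours, for illustration only):

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit at a given risk threshold (0 < threshold < 1), the
    quantity plotted on the y-axis of a decision curve."""
    n = len(y_true)
    treat = [p >= threshold for p in y_prob]
    tp = sum(1 for t, d in zip(y_true, treat) if t == 1 and d)
    fp = sum(1 for t, d in zip(y_true, treat) if t == 0 and d)
    # false positives are weighted by the odds of the threshold
    return tp / n - fp / n * threshold / (1 - threshold)
```

A model is potentially useful over the range of thresholds where its net benefit exceeds both the treat-all and treat-none strategies.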
In this study, we developed a risk prediction model providing dynamic, individual predictions of 90-day mortality for ICU patients. The model was trained on 44 binary and continuous SAPS III variables from more than 9000 patients hospitalised in four ICUs in the Capital Region of Denmark between 2011 and 2016. The model was updated at 1-h intervals and calibrated for more reliable predictions. Model performance increased over time, reaching an AUROC of 0·88 (95% CI 0·87–0·89) and an MCC of 0·57 (0·54–0·60) at time of discharge. The model was made explainable, and the top features driving mortality prediction were identified both for individual patients and for the full holdout test dataset. Importantly, in the analysis of individual mortality predictions over time, we found that one feature can drive the prediction towards survival at one timepoint and towards non-survival at another. The predicted outcome itself varies: at time of admission, it encompasses in-ICU, in-hospital, and post-discharge mortality. Patients who die while in the ICU are probably quite different from patients who are discharged and die at home before the 90-day mark. Thus, the model adapts to account for the changing nature of the predicted outcome, making it more useful than one-off scores such as SAPS that are computed only once with data obtained during the first day of ICU admission. This finding underpins the importance of continuously updated decision support tools, which adapt to the evolving clinical picture and provide real-time guidance to clinicians. Such dynamic tools are likely to be more useful than the static scores currently implemented. Clinical decision making in the early stages of admission—eg, whether to commence treatment and how aggressively to treat a patient—is very different from decisions made later on, such as whether to withdraw life-sustaining treatment.
We also found that certain features could compensate for one another. An example was that the negative effect of a low minimum SBP could be countered by a high maximum SBP within the same hour. Thus, the occurrence of low SBP values has less impact if the condition is correctable, which makes sense from a clinical point of view. Overall, we see that features have complex interactions over time, and this ambiguity of features further emphasises the need for real-time machine learning-based decision support. We note that the features interact in a complex, non-linear manner, unlike the two-way or three-way interactions often used in generalised linear models. The LSTM architecture allows us to model complex, multidimensional interactions, but this also makes clinical interpretation of the results more difficult and thus requires caution.
Some input features did not have the anticipated impact on the predictions. For instance, the comorbidities of metastatic cancer and AIDS did not alter predictions much; however, when they did, they often pulled the prediction towards survival. The reason for this counterintuitive association might be that the model was unable to learn the true importance of the conditions due to their low prevalence in the training dataset. Another possible explanation is that patients with these conditions die early in their ICU stay due to physiological derangements that cannot be mitigated, and thus the comorbidities become less discriminating. Furthermore, patients with metastatic cancer or AIDS are, to some extent, selectively triaged to the ICU—ie, only the younger and non-terminally ill patients will be admitted. Another example of features not having the anticipated impact is seen in figure 6. An almost undetectable SBP (near 0 mm Hg) is only modestly associated with increased mortality, and an SBP as high as 300 mm Hg appears to have a beneficial effect. Again, a plausible explanation might be that the model is unable to learn the importance of the extreme values due to their low prevalence in the training dataset. Additionally, the extreme SBP measurements are to some extent artefactual in the clinical setting and likely to have occurred due to flushing or occlusion of the arterial line. The model thus learns to moderate the impact of the extreme values.
The choice of method reflects the nature of a dynamic patient-level prediction problem from the perspective of the clinician: a patient’s mortality risk is constantly evolving and depends on the past as well as the current condition. An LSTM network is a special kind of recurrent neural network composed of LSTM units, which capture long-range dependencies in sequential data via gated cells that determine whether to retain or discard information according to the importance assigned to it. In this way, an LSTM-based machine learning prediction model—unlike, for example, a logistic regression model—learns both from the temporal development of the features and from the interactions between them.
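The gating mechanism can be sketched in a few lines. The following toy scalar LSTM step (names and weight layout are ours, for illustration; real implementations operate on vectors and learned weight matrices) shows how the forget, input, and output gates decide what to keep from the past cell state and what to expose at each timestep:

```python
from math import exp, tanh

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step for scalar input and state. `w` maps each gate name
    ('f', 'i', 'o', 'g') to a (weight_x, weight_h, bias) triple."""
    f = sigmoid(w['f'][0] * x + w['f'][1] * h_prev + w['f'][2])  # forget gate
    i = sigmoid(w['i'][0] * x + w['i'][1] * h_prev + w['i'][2])  # input gate
    o = sigmoid(w['o'][0] * x + w['o'][1] * h_prev + w['o'][2])  # output gate
    g = tanh(w['g'][0] * x + w['g'][1] * h_prev + w['g'][2])     # candidate value
    c = f * c_prev + i * g  # cell state: retained memory plus new information
    h = o * tanh(c)         # hidden state passed to the next timestep
    return h, c
```

Because the cell state `c` is carried forward and only multiplicatively attenuated by the forget gate, information from early in an admission can still influence predictions many hours later.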
Because three of the SAPS III variables are missing from our data, and because we chose 90-day overall mortality as the outcome measure, we are not able to make a direct comparison with SAPS III. However, in the original SAPS III study, Moreno and colleagues found that in a cohort from northern Europe, the SAPS III model had an AUROC of 0·814.
External validation studies have revealed AUROCs of 0·69 (95% CI 0·63–0·75) and 0·81 (0·79–0·93) in Denmark and Norway, respectively.
Our model had a higher predictive performance compared with these results, which we confirmed using an external dataset obtained after the study was completed.
Previous studies have applied similar methods to accomplish real-time predictions in an ICU setting. In a recent study, Meyer and colleagues described a recurrent neural network-based model for real-time prediction of bleeding, renal failure, and mortality in a cohort of cardiac surgery patients.
As in our study, they base their model on routinely collected data. However, they only report the discriminative performance of the model, not the calibration. This is an important issue if the model is intended for making predictions for single patients. Furthermore, the matter of model explainability is not addressed.
Meiring and colleagues showed that ICU prognostication can be improved by applying a dynamic approach accounting for changes in physiological parameters over the course of several days.
In contrast to our study, they report that the best performance is accomplished around 2 days into the ICU admission. The reason for this discrepancy might be that they only use daily measurements for each feature and their neural network architecture is not well suited for dealing with time-series data. Hence, the full information hidden in temporal trends in the data was not exploited.
The European General Data Protection Regulation of 2018 communicates concerns with black-box predictions. It states that individuals have the right to “meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing” when automated decision making is used.
There are several examples of dubious conclusions drawn from automated decision models based on machine learning. An advantage of our model is that SHAP made it explainable, both in terms of the importance of individual features for ICU patient survival in general and at the level of the individual patient at any given timepoint. Thus, using such a model for decision support gives the clinician real-time information on the patient’s risk of dying and the specific features currently pulling towards non-survival. The importance of a continuously updated mortality prediction is shown by our finding that one feature can drive predictions towards either survival or non-survival depending on the timepoint of prediction during the ICU stay.
As in all secondary uses of health-care data, we had some missing data. We used LOCF to impute missing data, although this approach is generally not a recommended imputation method.
LOCF can be problematic in at least two ways: it can distort temporal covariate tendencies causing misclassification of exposures and bias in unpredictable ways, and it can introduce statistically dependent replicates. We would argue that it makes sense in our case, because the absence of a datapoint is not necessarily void of information. Indeed, much can be inferred when it comes to measurements, especially in a setting as controlled as that of the ICU: the very absence of, for example, a pH value might simply mean that the physician actively chose not to run the analysis again because there was no need. This is arguably often the reality, so carrying the most recent values forward as proxies for missing values might be clinically meaningful. Along this line, artificial missingness was introduced because of the up-sampling of variables not measured every hour. In this case, the use of LOCF is just an imitation of the reasoning of a medical professional, who would derive a clinical assessment on the basis of the available knowledge. The use of LOCF yields replicates that are not statistically independent and could be considered a form of pseudo-replication. This could affect model performance and generalisability, but through cross-validation and regularisation during training, we expect this to have little real effect.
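The imputation scheme described above is mechanically simple; a minimal sketch (function name and the pH example values are ours; the study's handling of leading gaps, eg the population-mean GCS, is approximated by the `fallback` argument):

```python
def locf(series, fallback=None):
    """Last observation carried forward: replace each missing value (None)
    with the most recent observed value. Gaps before the first observation
    receive `fallback` (eg, a population mean)."""
    filled, last = [], fallback
    for value in series:
        if value is not None:
            last = value  # remember the latest actual measurement
        filled.append(last)
    return filled
```

As argued above, carrying the last pH forward mimics the clinician's reasoning that an unrepeated test most often means the previous value was still considered valid.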
Our study is retrospective and based on ICU data from a densely populated, but rather small geographical area in Denmark. Hence, the model might be biased and reflect the clinical guidelines and treatment decisions made in this area. However, using data from fairly homogeneous settings can render the model useful for exactly the kinds of patients that physicians encounter in their daily work. Besides, the physicians who recorded the data, ordered the tests, and intervened when they saw fit did this with the aim of providing the best possible patient care and did not consider that the data might subsequently be used for prediction purposes. In this way, although retrospective, the data are unlikely to reflect information bias that is otherwise known to haunt prospective studies.
During training, the model was optimised to predict the chance of survival 90 days from ICU admission given data series of a fixed length. The trained model is able to make predictions on new data of varying lengths between admission and discharge, but it is not very accurate at the point of ICU admission. At this point, few datapoints are available, and LSTM models are explicitly optimised for time-series prediction. We acknowledge that an ideal model should have better performance at early stages as well, but with so little data it would not be possible to extract information about temporal trends; such a model would serve another purpose than that of this model and would require a different design or a combination of methods.
The present model is based on relatively few variables taken from SAPS III and, as such, the study is intended as a proof of concept. However, replication in a large validation dataset obtained from another geographical region could verify its robustness and confirm that it is ready to be turned into a clinical decision support tool to be tested in a randomised controlled trial. We intend to do future studies using this model, and are working on implementing the model into our new electronic medical record system (Epic; Verona, WI, USA).
A recent study successfully combined 10 years of disease history before ICU admission with measures from the first 24 h of ICU stay to predict mortality.
Our model provides a more accurate prediction of mortality, probably due to the high granularity of the ICU data included in our LSTM model. Thus, adding more detailed disease history to the present model might increase the performance even further and increase the performance upon ICU admission. Additionally, much information is hidden in the clinical notes in which physicians and other health-care professionals collect detailed phenotypic data. Adding such data might also improve the predictive ability.
Many machine learning methods are still opaque, and we have made progress using SHAP to open up and gauge what drives predictions. SHAP values, however, cannot resolve algorithmic bias should such prevail. Algorithmic bias is a genuine concern in the context of machine learning prediction models and comes about because these models have no underlying causal structure: they make predictions entirely on the basis of what humans have done before. This lack of a causal structure also means that prediction models can perform suboptimally when applied to minority populations, because the algorithm has seen only a few such patients. Thus, albeit explainable, our model is not necessarily fully actionable: age, for example, was the most important feature, but cannot be manipulated by the clinician. Furthermore, we cannot know whether clinicians will act, or whether this action—eg, further correction of low blood pressure—will change the outcome, even though low blood pressure strongly influences predictions. During training, the model learned a lot about correlations but nothing about causality. However, these new insights into complex feature interactions might guide our search for causal relations. To achieve actionable models, we would need to build the statistical model on a causal model of how physiological factors interact and react to interventions. Causal models reflect our best guess for the data-generating process and allow for counterfactual reasoning; this notion is not new but is yet to converge with powerful machine learning methods. Because the ICU is a fairly controlled environment with many objective measurements available, it could be an interesting setting for combining these two disciplines. We gauge model performance by several different measures—eg, AUROC and MCC—but none of these measures encapsulates whether a model prediction will result in a favourable change in patient care and outcome.
The next step in the process of establishing clinically applicable machine learning models is randomised clinical trials.
In conclusion, we developed an explainable LSTM model for ICU 90-day mortality prediction from a total dataset of more than 14 000 admissions of 11 000 patients from four mixed ICUs in Copenhagen, Denmark, with external validation. The predictive performance improved over the timecourse of an ICU stay. Model interpretation showed that input features can interact and compensate for one another and can pull towards survival at one timepoint and towards non-survival at another. None of these observations can be obtained from current static prognostic scores. Yet, before this kind of model can be used as a bedside tool, the results need to be confirmed in a randomised clinical trial.
SB and AP conceived the study, which was designed in detail by H-CT-M. H-CT-M did the data analysis, which was interpreted by H-CT-M, ABN, APN, BSK-H, PT, JS, TS, KB, SB, and AP. PJC, MH, LD, LS, TS, and PH extracted and handled the data. All authors contributed to the preparation of the Article and approved the final version.
LD, LS, and PH are employed by Daintel (as of Jan 1, 2020, Cambio has purchased Daintel), which is taking part in the BigTempHealth project funded by the Danish Innovation Fund, grant 5184–00102B. AP reports grants from Ferring and the Novo Nordisk Foundation. SB reports personal fees from Intomics and Proscion, as well as grants from the Novo Nordisk Foundation (grants NNF17OC0027594 and NNF14CC0001). All other authors declare no competing interests.