      Promises of Big Data and Artificial Intelligence in Nephrology and Transplantation

Editorial


          Abstract

Kidney diseases are among the major health burdens worldwide, linked to substantial economic costs, mortality, and morbidity. The importance of collecting large quantities of health-related data among human cohorts, what scholars refer to as "big data", has been increasingly recognized, with the establishment of large cohorts and the use of electronic health records (EHRs) in nephrology and transplantation. These data are valuable and can potentially be utilized by researchers to advance knowledge in the field. Furthermore, progress in big data is stimulating the flourishing of artificial intelligence (AI), an excellent tool for handling, and subsequently processing, large amounts of data, which may be applied to shed more light on the effectiveness of medicine in kidney-related complications for the purpose of more precise phenotyping and outcome prediction. In this article, we discuss the advances and challenges in big data, the use of EHRs, and AI, with particular emphasis on their use in nephrology and transplantation.

Most cited references (110)


          A Clinically Applicable Approach to Continuous Prediction of Future Acute Kidney Injury

The early prediction of deterioration could have an important role in supporting healthcare professionals, as an estimated 11% of deaths in hospital follow a failure to promptly recognize and treat deteriorating patients [1]. Achieving this goal requires predictions of patient risk that are continuously updated and accurate, and delivered at an individual level with sufficient context and enough time to act. Here we develop a deep learning approach for the continuous risk prediction of future deterioration in patients, building upon recent work that models adverse events from electronic health records [2–17] and using acute kidney injury, a common and potentially life-threatening condition [18], as an exemplar. Our model was developed on a large, longitudinal dataset of electronic health records that cover diverse clinical environments, comprising 703,782 adult patients across 172 inpatient and 1,062 outpatient sites. Our model predicts 55.8% of all inpatient episodes of acute kidney injury, and 90.2% of all acute kidney injury that requires subsequent administration of dialysis, with a lead time of up to 48 h and a ratio of 2 false alerts for every true alert. In addition to predicting future acute kidney injury, our model provides confidence assessments and a list of the clinical features that are most salient to each prediction, alongside predicted future trajectories for clinically relevant blood tests [9]. Although the recognition and prompt treatment of acute kidney injury is known to be challenging, our approach may offer opportunities for identifying patients at risk within a time window that enables early treatment.

Adverse events and clinical complications are a major cause of mortality and poor patient outcomes, and substantial effort has been made to improve their recognition [18,19]. Few predictors have found their way into routine clinical practice, either because they lack effective sensitivity and specificity, or because they report already existing damage [20]. One example relates to AKI, a potentially life-threatening condition affecting approximately 1 in 5 US inpatient admissions [21]. Although a substantial proportion of cases is thought to be preventable with early treatment [22], current AKI detection algorithms depend on changes in serum creatinine as a marker of acute decline in renal function. Elevation of serum creatinine lags behind renal injury, resulting in delayed access to treatment. This supports a case for preventative 'screening'-type alerts, but there is no evidence that current rule-based alerts improve outcomes [23]. For predictive alerts to be effective, they must empower clinicians to act before major clinical decline has occurred by: (i) delivering actionable insights on preventable conditions; (ii) being personalised for specific patients; (iii) offering sufficient contextual information to inform clinical decision-making; and (iv) being generally applicable across patient populations [24]. Promising recent work on modelling adverse events from EHR [2–17] suggests that the incorporation of machine learning may enable early prediction of AKI. Existing examples of sequential AKI risk models have either not demonstrated a clinically applicable level of predictive performance [25] or have focused on predictions across a short time horizon, leaving little time for clinical assessment and intervention [26].
Our proposed system is a recurrent neural network that operates sequentially over individual electronic health records, processing the data one step at a time and building an internal memory that keeps track of relevant information seen up to that point. At each time point the model outputs a probability of AKI occurring at any stage of severity within the next 48 hours, although our approach can be extended to other time windows or AKI severities (see Extended Data Table 1). When the predicted probability exceeds a specified operating-point threshold, the prediction is considered positive. This model was trained using data curated from a multisite retrospective dataset of 703,782 adult patients from all available sites at the US Department of Veterans Affairs (VA), the largest integrated health care system in the United States. The dataset consisted of information available from the hospital EHR in digital format. The total number of independent entries in the dataset was approximately 6 billion, including 620,000 features. Patients were randomised across training (80%), validation (5%), calibration (5%) and test (10%) sets. A ground-truth label for the presence of AKI at any given point in time was added using the internationally accepted Kidney Disease: Improving Global Outcomes (KDIGO) criteria [18]; the incidence of KDIGO AKI was 13.4% of admissions. (Detailed descriptions of the model and dataset are provided in the Methods and Extended Data Figures 1, 2 and 3.)

Figure 1 shows the use of our model. At every point throughout an admission, the model provides updated estimates of future AKI risk, along with an associated degree of uncertainty. Providing the uncertainty associated with a prediction may help clinicians distinguish ambiguous cases from predictions fully supported by the available data. Identifying an increased risk of future AKI sufficiently in advance is critical, as longer lead times may allow preventative action to be taken. This is possible even when clinicians may not be actively intervening with, or monitoring, a patient (see Supplementary Information section A for examples).

With our approach, 55.8% of inpatient AKI events of any severity were predicted early, within a window of up to 48 hours in advance, with a ratio of two false predictions for every true positive. This corresponds to an area under the receiver operating characteristic curve (ROC AUC) of 92.1% and an area under the precision-recall curve (PR AUC) of 29.7%. Set at this threshold, our predictive model would, if operationalised, trigger a daily clinical assessment in 2.7% of hospitalised patients in this cohort (Extended Data Table 2). Sensitivity was particularly high in patients who went on to develop lasting complications as a result of AKI. The model provided early predictions correctly in 84.3% of episodes where administration of in-hospital or outpatient dialysis was required within 30 days of the onset of AKI of any stage, and 90.2% of cases where regular outpatient administration of dialysis was scheduled within 90 days of the onset of AKI (Extended Data Table 3). Figure 2 shows the corresponding ROC and PR curves, as well as a spectrum of different operating points of the model. An operating point can be chosen to either further increase the proportion of AKI predicted early, or reduce the percentage of false predictions at each step, according to clinical priority (Figure 3).
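To make the predict-and-threshold mechanism concrete, here is a minimal sketch of the continuous alerting loop over a patient's 6-hour time steps. The `risk_model` callable and `dummy_model` are hypothetical stand-ins for a trained sequence model, and the 0.5 threshold is an illustrative operating point, not the published one.

```python
# Minimal sketch of continuous thresholded alerting over 6-hour steps.
from typing import Iterable, List, Tuple

def continuous_alerts(
    step_features: Iterable[dict],
    risk_model,                      # maps (features, state) -> (p_aki_48h, state)
    threshold: float = 0.5,          # operating point, chosen on a calibration set
) -> List[Tuple[int, float]]:
    """Return (step_index, risk) pairs at which an alert would fire."""
    alerts = []
    state = None                     # recurrent state carried across steps
    for t, features in enumerate(step_features):
        p_aki_48h, state = risk_model(features, state)
        if p_aki_48h >= threshold:   # operating point exceeded -> positive prediction
            alerts.append((t, p_aki_48h))
    return alerts

# Dummy stand-in model: risk rises with a fake creatinine feature.
def dummy_model(features, state):
    return min(1.0, features.get("creatinine", 0.0) / 3.0), state

steps = [{"creatinine": c} for c in (0.9, 1.1, 1.8, 2.6)]
print(continuous_alerts(steps, dummy_model))  # alerts fire on the later steps
```

In this framing, moving the threshold trades episode-level sensitivity against the per-step false-alert rate, which is why a spectrum of operating points is reported.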
Applied to stage 3 AKI, 84.1% of inpatient events were predicted up to 48 hours in advance, with a ratio of two false predictions for every true positive (Extended Data Table 4). To respond to these alerts on a daily basis, clinicians would need to attend to approximately 0.8% of in-hospital patients (Extended Data Table 2). The model correctly identifies substantial future increases in seven auxiliary biochemical tests in 88.5% of cases (Supplement B), and provides information about the factors that are most salient to the computation of each risk prediction. The greatest saliency was identified for laboratory tests known to be relevant to renal function (see Supplement C).

The predictive performance of our model was maintained across time and hospital sites, demonstrated by additional experiments that show generalisability to data acquired at time points after the model was trained (Extended Data Table 5). Our approach significantly outperformed (p < 0.001) established state-of-the-art baseline models (Supplement D). For example, we implemented a baseline model with gradient-boosted trees using manually curated features that are known to be relevant for modelling kidney function and in the delivery of routine care (Supplementary Information, sections E and F), combined with aggregate statistical information on trends observed in the recent history of the patient. This yielded 3,599 clinically relevant features provided to the baselines at each step (see Methods). For the same level of precision, this baseline model was able to detect 36.0% of all inpatient AKI episodes up to 48 hours ahead of time, compared to 55.8% for our model.

Of the false positive alerts made by our model, 24.9% were positive predictions made even earlier than the 48-hour window in patients who subsequently developed AKI (Extended Data Figure 4); 57.1% of these occurred in patients with pre-existing chronic kidney disease (CKD), who are at a higher risk of developing AKI. Of the remaining false positive alerts, 24.1% were trailing predictions that occurred after an AKI episode had already begun; such alerts can be filtered out in clinical practice. For positive risk predictions where no AKI was subsequently observed in this retrospective dataset, it is probable that many occurred in patients at risk of AKI for whom appropriate preventative treatment was administered, averting subsequent AKI. In addition to these early and trailing predictions, 88% of the remaining false positive alerts occurred in patients with severe renal impairment, known renal pathology, or evidence in the EHR that the patient required clinical review (Extended Data Figure 4).

Our aim is to provide risk predictions that enable personalised preventative action to be delivered at a large scale. The way these predictions are used may vary by clinical setting: a trainee doctor could be alerted in real time to each patient under their care, while a specialist nephrologist or rapid response teams [27] can identify high-risk patients to prioritise their response. This is possible because performance was consistent across multiple clinically important groups, notably those at an elevated risk of AKI (Supplement G). Our model is designed to complement existing routine care, as it is trained specifically to predict AKI that happened in this retrospective dataset despite existing best practices.
Although we demonstrate a model trained and evaluated on a clinically representative set of patients from the entire VA health care system, the demographic is not representative of the global population. Female patients comprised 6.38% of patients in the dataset, and model performance was lower for this demographic (Extended Data Table 6). Validating the predictive performance of the proposed system on a general population would require training and evaluating the model on additional representative datasets. Future work will need to address the under-representation of sub-populations in the training data [28] and overcome the impact of potential confounding factors related to hospital processes [29]. KDIGO is an indicator of AKI that lags long after the initial renal impairment, and model performance could be enhanced by improvements in the ground-truth definition of AKI and data quality [30]. Despite the state-of-the-art retrospective performance of our model compared to the existing literature, future work should now prospectively evaluate and independently validate the proposed model to establish its clinical utility and effect on patient outcomes, as well as explore the role of the model in researching strategies for delivering preventative care for AKI.

In summary, we demonstrate a deep learning approach for the continuous prediction of AKI within a clinically actionable window of up to 48 hours in advance. We report performance on a clinically diverse population and across a large number of sites to show that our approach may allow for the delivery of potentially preventative treatment, prior to the physiological insult itself, in a large number of cases. Our results open up the possibility for deep learning to guide the prevention of clinically important adverse events. With the possibility of risk predictions delivered in clinically actionable windows, alongside the increasing size and scope of EHR datasets, we now shift to a regime where the role for machine learning in clinical care can grow rapidly, supplying new tools to enhance the patient and clinician experience, and potentially becoming a ubiquitous and integral part of routine clinical pathways.

Methods

Data Description

The clinical data used in this study was collected by the US Department of Veterans Affairs and transferred to DeepMind in de-identified format. No personal information was included in the dataset, which met HIPAA "Safe Harbor" criteria for de-identification. The US Department of Veterans Affairs (VA) serves a population of over nine million veterans and their families across the entire United States of America. The VA is composed of 1,243 health care facilities (sites), including 172 VA Medical Centers and 1,062 outpatient facilities [31]. Data from these sites is aggregated into 130 data centres, of which 114 had data on the inpatient admissions that we used in this study. Four sites were excluded since they had fewer than 250 admissions during the five-year time period. No other patients were excluded based on location. The data comprised all patients aged between 18 and 90 admitted for secondary care to medical or surgical services from the beginning of October 2011 to the end of September 2015, including laboratory data, and where there was at least one year of EHR data prior to admission. The data included medical records with entries up to 10 years prior to each admission date and up to two years afterwards, where available.
Where available in the VA database, data included outpatient visits, admissions, diagnoses as International Statistical Classification of Diseases and Related Health Problems (ICD-9) codes, procedures as Current Procedural Terminology (CPT) codes, laboratory results (including but not limited to biochemistry, haematology, cytology, toxicology, microbiology and histopathology), medications and prescriptions, orders, vital signs, health factors and note titles. Free text, and diagnoses that were rare (fewer than 12 distinct patients with at least one occurrence in the VA database), were excluded to ensure all potential privacy concerns were addressed. In addition, conditions that were considered sensitive were excluded prior to transfer, such as patients with HIV/AIDS, sexually transmitted diseases, substance abuse, and those admitted to mental health services. Following this set of inclusion criteria, the final dataset comprised 703,782 patients, providing 6,352,945,637 clinical event entries. Each clinical entry denoted a single procedure, laboratory test result, prescription, diagnosis, etc., with 3,958,637,494 coming from outpatient events and the remaining 2,394,308,143 events from admissions. Extended Data Table 6 contains an overview of patient demographics in the data, as well as the prevalence of conditions associated with AKI across the data splits. The final dataset was randomly divided into training (80% of observations), validation (5%), calibration (5%) and testing (10%) sets. All data for a single patient was assigned to exactly one of these splits.

Data Preprocessing

Feature Representation

Every patient in the dataset was represented by a sequence of events, with each event providing the patient information that was recorded within a 6-hour period; i.e., each day was broken into four 6-hour periods and all records occurring within the same 6-hour period were grouped together. The available data within these six-hour windows, along with additional summary statistics and augmentations, formed a feature set that was used as input to our predictive models. Extended Data Figure 1 provides a diagrammatic view of a patient sequence and its temporal structure. We did not perform any imputation of missing numerical values, because explicit imputation of missing values does not always provide consistent improvements to predictive models based on electronic health records [32]. Instead, we associated each numerical feature with one or more discrete presence features to enable our models to distinguish between the absence of a numerical value and an actual value of zero. Additionally, these presence features encoded whether a particular numerical value is considered to be normal, low, high, very low or very high. For some data points, the explicit numerical values were not recorded (usually when the values were considered normal), and the provision of this encoding of the numerical data allowed our models to process these measurements even in their absence. Discrete features like diagnostic or procedural codes were also encoded as binary presence features. All numerical features were normalised to the [0, 1] range after capping the extreme values at the 1st and 99th percentiles. This prevents the normalisation from being dominated by potentially large data entry errors while preserving most of the signal. Each clinical feature was mapped onto a corresponding high-level concept, such as procedure, diagnosis, prescription, lab test, vital sign, admission, transfer, etc.
A total of 29 such high-level concepts were present in the data. At each step, a histogram of the frequencies of these concepts among the clinical entries taking place at that step was provided to the models, along with the numerical and binary presence features. The approximate age of each patient in days, as well as which 6-hour period of the day the data is associated with, were provided as explicit features to the models. In addition, we provided some simple features that make it easier for the models to predict the risk of developing AKI. In particular, we provided the median yearly creatinine baseline and the minimum 48-hour creatinine baseline as additional numerical features. These are the baseline values used in the KDIGO criteria, and they help give the models important context on how to interpret new serum creatinine measurements as they become available.

We additionally computed three historical aggregate feature representations at each step: one for the past 48 hours, one for the past 6 months, and one for the past 5 years. All histories were optionally provided to the models, and the decision on which combination of historical data to include was based on the model performance on the validation set. We did this historical aggregation for discrete features by including whether they were observed in the historical interval or not. For numerical features we included the count, mean, median, standard deviation, minimum and maximum value observed in the interval, as well as simple trend features such as the difference between the last observed value and the minimum or maximum, and the average difference between subsequent steps (which measures the short-term temporal variability of the measurement). Supplementary Information section H provides the effect of the volume and recency of available data on model performance.

Because patient measurements are made irregularly, not all 6-hour time periods in a day will have new data associated with them. Our models operate at regular time intervals regardless, and all time periods without new measurements include only the available metadata and, optionally, the historical aggregate features. This approach makes continuous risk predictions possible, and allows our models to utilise the patterns of missingness in the data during the training process. For about 35% of all entries, the day on which they occurred was known, but not the specific time during the day. For each day in the sequence of events, we aggregated these unknown-time entries into a specific bucket that was appended to the end of the day. This ensured that our models could iterate over this information without potentially leaking information from the future. Our models were not allowed to make predictions from these surrogate points, and they were not factored into the evaluation. The models can utilise the information contained within the surrogate points on the next time step, corresponding to the first interval of the following day.

Diagnoses in the data are sometimes known to be recorded in the EHR prior to the time when the actual diagnosis was made clinically. To avoid leaking future information to the models, we shifted all of the diagnoses within each admission to the very end of that admission and only provided them to the models at that point, where they can be factored in for future admissions. This discards potentially useful information, so the performance obtained in this way is conservative by design, and it is possible that in reality the models would be able to perform better with this information provided in a consistent way.
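As an illustration of the preprocessing described in this section, the following minimal sketch shows percentile capping with [0, 1] normalisation, a presence flag in place of imputation, and simple historical aggregates. The function names and the exact statistics are assumptions for demonstration, not the published pipeline.

```python
import numpy as np

def cap_and_normalise(values, p_low=1, p_high=99):
    """Cap extremes at the 1st/99th percentiles, then scale to [0, 1]."""
    lo, hi = np.nanpercentile(values, [p_low, p_high])
    capped = np.clip(values, lo, hi)
    return (capped - lo) / (hi - lo) if hi > lo else np.zeros_like(capped)

def encode_numeric(value):
    """Pair a raw numeric value with a presence flag instead of imputing.

    A missing value (None) becomes (0.0, 0), which the model can
    distinguish from an actual measurement of zero, (0.0, 1).
    """
    return (0.0, 0) if value is None else (float(value), 1)

def history_aggregates(window_values):
    """Aggregate statistics over one historical window (e.g. past 48 h)."""
    v = np.asarray(window_values, dtype=float)
    if v.size == 0:
        return {"count": 0}
    return {
        "count": v.size, "mean": v.mean(), "median": float(np.median(v)),
        "std": v.std(), "min": v.min(), "max": v.max(),
        "last_minus_min": v[-1] - v.min(),   # simple trend feature
        "mean_step_diff": float(np.mean(np.diff(v))) if v.size > 1 else 0.0,
    }
```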
Ground Truth Labels using KDIGO

The patient AKI states were computed at each time step based on the KDIGO criteria [18], the recommendations of which are based on systematic reviews of relevant trials. KDIGO accepts three definitions of AKI: an increase in serum creatinine of 0.3 mg/dl (26.5 μmol/l) within 48 hours; an increase in serum creatinine to 1.5 times the patient's baseline creatinine level, known or presumed to have occurred within the prior 7 days; or a urine output of <0.5 ml/kg/h over 6 hours [18]. The first two definitions were used to provide ground-truth labels for the onset of an AKI; the third definition could not be used, as urine output was not recorded digitally in the majority of sites that formed part of this work. A baseline of median annualised creatinine was used where previous measurements were available; where these were not present, the Modification of Diet in Renal Disease (MDRD) formula was applied to estimate a baseline creatinine. Using the KDIGO criteria based on serum creatinine and its corresponding definitions for AKI severity, three AKI categories were obtained: 'all AKI' (KDIGO stages 1, 2 and 3), 'moderate and severe AKI' (KDIGO stages 2 and 3), and 'severe AKI' (KDIGO stage 3).

The AKI stages were computed at times when there was a serum creatinine measurement present in the sequence and then copied forward in time until the next creatinine measurement, at which time the ground-truth AKI state was updated accordingly. To avoid basing the current estimate of the KDIGO AKI stage on a previous measurement that may no longer be reliable, the AKI states were propagated for at most 4 days forward if no new creatinine measurements were observed. From that point onwards, AKI states were marked as unknown. Patients experiencing acute kidney injury tend to be closely monitored and their levels of serum creatinine are measured regularly, so an absence of a measurement for multiple days in such cases is uncommon. A gap of 4 days between subsequent creatinine measurements represents the 95th percentile of the distribution of time between two consecutive creatinine measurements.

The prediction target at each point in time is a binary variable that is positive if the AKI category of interest (e.g., all AKI) occurs within a chosen future time horizon. If no AKI state was recorded within the chosen horizon, this was interpreted as a negative. We use eight future time horizons, 6 h, 12 h, 18 h, 24 h, 36 h, 48 h, 60 h and 72 h ahead, all of which are available at each time point. Event sequences of patients undergoing renal replacement therapy (RRT) were excluded from the target labels heuristically, based on the EHR data entries indicating that RRT procedures were performed, for the duration of dialysis administration. We excluded entire subsequences of events between RRT procedures that occurred within a week of each other. The edges of each subsequence were also appropriately excluded from label computations.
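A minimal sketch of the two creatinine-based KDIGO triggers described above, assuming timestamped creatinine measurements in mg/dl and a precomputed baseline; this is an illustrative reading of the criteria, not the study's exact labelling code.

```python
def kdigo_aki_onset(measurements, baseline):
    """Flag AKI from serum creatinine using the two KDIGO triggers:

    - a rise of >= 0.3 mg/dl relative to a measurement within 48 hours, or
    - reaching >= 1.5x the baseline creatinine (presumed within 7 days).

    `measurements` is a time-sorted list of (time_in_hours, creatinine_mg_dl)
    pairs; `baseline` is the median annualised creatinine (or an MDRD-derived
    estimate when no history exists).
    """
    positives = []
    for i, (t_i, cr_i) in enumerate(measurements):
        # Trigger 1: absolute rise versus any measurement in the past 48 h.
        rise_48h = any(
            cr_i - cr_j >= 0.3
            for t_j, cr_j in measurements[:i]
            if t_i - t_j <= 48
        )
        # Trigger 2: relative rise versus the baseline creatinine.
        relative_rise = cr_i >= 1.5 * baseline
        if rise_48h or relative_rise:
            positives.append(t_i)
    return positives

# Example: baseline 1.0 mg/dl, creatinine climbing over three days.
series = [(0, 1.0), (24, 1.2), (48, 1.6), (72, 2.1)]
print(kdigo_aki_onset(series, baseline=1.0))  # -> [48, 72]
```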
Models for predicting AKI

Our predictive system operates sequentially over the electronic health record. At each time point, the input features described above were provided to a statistical model whose output is a probability of an AKI of any severity occurring in the next 48 hours. If this probability exceeds a chosen operating threshold, we make a positive prediction that can then trigger an alert. This is a general framework within which existing approaches also fit, and we describe the baseline methods in the next section. The novelty of this work lies in the design of the particular model used and its training procedure, and in the demonstration of its effectiveness, on a large-scale EHR dataset and across many different regimes, in making useful predictions of future AKI.

Extended Data Figure 2 gives a schematic view of our model, which makes predictions by first transforming the input features using an embedding module. This embedding is fed into a multi-layer recurrent neural network, the output of which at every time point is fed into a prediction module that provides the probability of future AKI at the time horizon for which the model is trained. The entire model can be trained end-to-end; i.e., the parameters can be learned jointly without pretraining any parts of the model. To provide useful predictions, we train an ensemble of predictors to estimate the model's confidence, and the resulting ensemble predictions are then calibrated using isotonic regression to reflect the frequency of observed outcomes [33].

Embedding modules. The embedding layers transform the high-dimensional and sparse input features into a lower-dimensional continuous representation that makes subsequent prediction easier. We use a deep multilayer perceptron with residual connections and rectified-linear (ReLU) activations. We use L1 regularisation on the embedding parameters to prevent overfitting and to ensure that our model focuses on the most salient features. We compared simpler linear transformations, which did not perform as well as the multi-layer version we used. We also compared unsupervised approaches such as factor analysis, standard auto-encoders and variational auto-encoders, but did not find any significant advantages in using these methods.

Recurrent neural network core. Recurrent neural networks (RNNs) run sequentially over the EHR entries and are able to implicitly model the historical context of a patient by modifying an internal representation (or state) through time. We use a stacked multiple-layer recurrent network with highway connections between each layer [34], which at each time step takes the embedding vector as an input. We use the Simple Recurrent Unit (SRU) network as the RNN architecture, with tanh activations. We chose this from a broad range of alternative RNN architectures, specifically the long short-term memory (LSTM) [35], update gate RNN (UGRNN) and Intersection RNN [36], simple recurrent units (SRU) [37,38], gated recurrent units (GRU) [39], the Neural Turing Machine (NTM) [40], the memory-augmented neural network (MANN) [41], the Differentiable Neural Computer (DNC) [42], and the Relational Memory Core (RMC) [43]. These alternatives did not provide significant performance improvements over the SRU architecture (see Supplement D).

Prediction targets and training objectives. The output of the RNN is fed to a final linear prediction layer that makes predictions over all 8 future prediction windows (6-hour windows from 6 hours ahead to 72 hours ahead). We use a cumulative distribution function (CDF) layer across the different time windows to encourage monotonicity, since the presence of AKI within a shorter time window implies the presence of AKI within a longer time window.
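The following sketch illustrates the overall embedding-RNN-prediction shape in PyTorch. It is a simplified stand-in, not the published model: an LSTM replaces the SRU core and its highway connections, the residual embedding is reduced to a plain MLP, and monotonicity across the 8 horizons is enforced here by accumulating non-negative increments, which is one plausible reading of the CDF layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AKIRiskModel(nn.Module):
    """Embedding MLP -> recurrent core -> monotone multi-horizon risk head."""

    def __init__(self, n_features, embed_dim=400, rnn_units=200,
                 rnn_layers=3, n_horizons=8):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(n_features, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
        )
        # LSTM stands in for the paper's stacked SRU with highway connections.
        self.rnn = nn.LSTM(embed_dim, rnn_units, num_layers=rnn_layers,
                           batch_first=True)
        self.head = nn.Linear(rnn_units, n_horizons)

    def forward(self, x, state=None):
        # x: (batch, time, n_features) per-step feature vectors.
        h = self.embed(x)
        out, state = self.rnn(h, state)
        raw = self.head(out)
        # Monotone risks across horizons: an unconstrained logit for the
        # shortest window plus cumulative non-negative increments, so that
        # P(AKI within 6 h) <= P(AKI within 12 h) <= ... <= P(within 72 h).
        base = raw[..., :1]
        gaps = torch.cumsum(F.softplus(raw[..., 1:]), dim=-1)
        logits = torch.cat([base, base + gaps], dim=-1)
        return torch.sigmoid(logits), state

probs, _ = AKIRiskModel(n_features=32)(torch.randn(2, 5, 32))
print(probs.shape)  # (2, 5, 8); non-decreasing along the horizon axis
```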
Each of the resulting eight outputs provides a binary prediction for AKI at a specific time window and is compared to the ground-truth label using the cross-entropy loss function (Bernoulli log-likelihood). We also make a set of auxiliary numerical predictions: at each step, we additionally predict the maximum future observed value of a set of laboratory tests over the same set of time intervals as used to make the future AKI predictions. The laboratory tests predicted are ones known to be relevant to kidney function, specifically: creatinine, urea nitrogen, sodium, potassium, chloride, calcium and phosphate. This multitask approach results in better generalisation and more robust representations, especially under class imbalance [44–46]. The overall improvement we observed from including the auxiliary task was around 3% PR AUC in most cases (see Supplement A for more details). Our overall loss function is the weighted sum of the cross-entropy loss from the AKI predictions and the squared loss for each of the seven laboratory-test predictions. We investigated the use of oversampling and overweighting of the positive labels to account for class imbalance. For oversampling, each mini-batch contains a larger percentage of positive samples than the average in the entire dataset. For overweighting, predictions for positive labels contribute proportionally more to the total loss.

Training and hyperparameters. We selected our proposed model architecture among several alternatives based on the validation set performance (see Supplement D) and subsequently performed an ablation analysis of the design choices (see Supplement I). All variables are initialised via normalised (Xavier) initialisation [47] and trained using the Adam optimisation scheme [48]. We employ exponential learning rate decay during training. The best validation results were achieved using an initial learning rate of 0.001, decayed every 12,000 training steps by a factor of 0.85, with a batch size of 128 and a backpropagation-through-time window of 128. The embedding layer is of size 400 for each of the numerical and presence input features (800 in total when concatenated) and uses 2 layers. The best-performing RNN architecture used a cell size of 200 units per layer and 3 layers. A detailed overview of the different hyperparameter combinations evaluated in the experiments is available in Supplement J. We conducted extensive hyperparameter explorations of dropout rates for different kinds of dropout to determine the best model regularisation. We considered input dropout, output dropout, embedding dropout, cell state dropout and variational dropout. None of these led to improvements, so dropout is not included in our model.
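A minimal sketch of the multitask objective just described: binary cross-entropy over the eight AKI horizons plus squared error on the auxiliary laboratory-value predictions, with positive-label overweighting. The weighting constants are illustrative assumptions, not the published values.

```python
import torch
import torch.nn.functional as F

def multitask_loss(aki_probs, aki_labels, lab_preds, lab_targets,
                   pos_weight=5.0, aux_weight=1.0):
    """Weighted sum of AKI cross-entropy and auxiliary squared loss.

    aki_probs, aki_labels:  (batch, time, 8) horizon risks and binary labels.
    lab_preds, lab_targets: (batch, time, 7) max-future-lab-value predictions.
    """
    # Overweighting: positive labels contribute pos_weight times more.
    weights = 1.0 + (pos_weight - 1.0) * aki_labels
    bce = F.binary_cross_entropy(aki_probs, aki_labels, weight=weights)
    mse = F.mse_loss(lab_preds, lab_targets)
    return bce + aux_weight * mse
```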
Competitive Baseline Methods

Established models for future AKI prediction make use of L1-regularised logistic regression or gradient-boosted trees (GBTs), trained on a clinically relevant set of features known to be important either for routine clinical practice or for the modelling of kidney function. A curated set of clinically relevant features was chosen using the existing AKI literature (see Supplement F) and the consensus opinion of six clinicians: three senior attending physicians with over twenty years of expertise (one nephrologist and two intensive care specialists), and three clinical residents with expertise in nephrology, internal medicine and surgery. This set was further extended to include 36 of the most salient features discovered by our deep learning model that were not in the original list, to give further predictive signal to the baseline. The final curated dataset contained 315 base features covering demographics, admission information, vital sign measurements, select laboratory tests and medications, and diagnoses of chronic conditions directly associated with an increased risk of AKI. The full feature set is listed in Supplement E. We additionally computed a set of manually engineered features (yearly and 48-hourly baseline creatinine levels, consistent with the KDIGO guidelines; the ratio of blood urea nitrogen to serum creatinine; grouped severely reduced glomerular filtration rate, corresponding to stages 3a to 5; and a flag for diabetic patients obtained by combining ICD-9 codes and measured haemoglobin A1c values) and a representation of the short-term and long-term history of a patient (see 'Feature Representation'). These features were provided explicitly, since the interaction terms and historical trends might not have been recovered by the simpler models. This resulted in a total of 3,599 possible features for the baseline model. We provide a table with the full set of baseline comparisons in Supplement D.

Evaluation

The data was split into training, validation, calibration and test sets in such a way that information from a given patient is present in only one split. The training split was used to train the proposed models. The validation set was used to iteratively improve the models by selecting the best model architectures and hyperparameters. The models selected on the validation set were recalibrated on the calibration set in order to further improve the quality of the risk predictions. Deep learning models with softmax or sigmoid outputs trained with cross-entropy loss are prone to miscalibration, and recalibration ensures that consistent probabilistic interpretations of the model predictions can be made [49]. For calibration we considered Platt scaling [50] and isotonic regression [33]. To compare uncalibrated predictions to recalibrated ones, we used the Brier score [51] and reliability plots [52]. The best models were finally evaluated on the independent test set that was held out during model development.

The main metrics used in model selection and the final report are: the AKI episode sensitivity, the area under the precision-recall curve (PR AUC), the area under the receiver operating characteristic curve (ROC AUC), and the per-step precision, per-step sensitivity and per-step specificity. The AKI episode sensitivity corresponds to the percentage of all AKI episodes that were correctly predicted ahead of time within the corresponding time windows of up to 48 hours. In contrast, the precision is computed per step, since the predictions are made at each step, to account for the rate of false alerts over time. Due to the sequential nature of making predictions, the total number of positive steps does not directly correspond to the total number of distinct AKI episodes. Multiple positive alerting opportunities may be associated with a single AKI episode, and different AKI episodes may offer a different number of such early alerting steps depending on how late they occur within the admission. AKI occurring later during an in-hospital stay can be predicted earlier than AKI that occurs immediately upon admission.
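To make the distinction between the two headline metrics concrete, here is a minimal sketch: precision is computed over individual steps, while sensitivity is credited per episode if any alert fires within the preceding 48-hour window. The helper names and time encoding are illustrative assumptions.

```python
import numpy as np

def step_precision(preds, labels):
    """Per-step precision: the fraction of positive steps that are true alerts."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    alerts = preds == 1
    return labels[alerts].mean() if alerts.any() else float("nan")

def episode_sensitivity(episode_onsets, alert_times, lead_window=48):
    """Fraction of AKI episodes with at least one alert in the preceding
    `lead_window` hours. `episode_onsets` are AKI onset times in hours;
    `alert_times` are the times at which the model fired."""
    alert_times = np.asarray(alert_times)
    hits = sum(
        ((alert_times >= onset - lead_window) & (alert_times < onset)).any()
        for onset in episode_onsets
    )
    return hits / len(episode_onsets) if episode_onsets else float("nan")
```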
To better assess the clinical applicability of the proposed model, we explicitly compute the AKI episode sensitivity for different levels of step-wise precision. Given that the models were designed for continuous monitoring and risk prediction, they were evaluated at each 6-hour time step within all of the admissions for each patient, except for the steps within AKI episodes, which were ignored. The models were not evaluated on outpatient events. All steps where there was no record of AKI occurring in the relevant future time window were considered negative examples. Approximately 2% of the individual time steps presented to the models sequentially were associated with a positive AKI label, so the AKI prediction task is class-imbalanced. For per-step performance metrics, we report both the area under the receiver operating characteristic curve (ROC AUC) and the area under the precision-recall curve (PR AUC). PR AUC is known to be more informative for class-imbalanced predictive tasks [53], as it is more sensitive to changes in the number of false positive predictions.

To gauge uncertainty in a trained model's performance, we calculated 95% confidence intervals with the pivot bootstrap estimator [54]. This was done by sampling the entire validation and test datasets with replacement 200 times. Because bootstrapping assumes the resampling of independent events, we resample entire patients instead of resampling individual admissions or time steps. Where appropriate, we also compute a two-sided Mann-Whitney U test [55] on the samples for the respective models. To quantify the uncertainty in model predictions (as opposed to overall performance), we trained an ensemble of 100 models with a fixed set of hyperparameters but different initial seeds. This follows similar uncertainty approaches in supervised learning [56] and medical imaging predictions [57]. The prediction confidence was assessed by inspecting the variance over the 100 model predictions from the ensemble. This confidence reflected the accuracy of a prediction: the mean standard deviation of false positive predictions was higher than the mean standard deviation of true positive predictions, and similarly for false negative versus true negative predictions (p < 0.01, see Supplement K).
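As an illustration of the patient-level bootstrap described above, here is a minimal sketch that resamples whole patients (not steps) to obtain a 95% interval for the per-step PR AUC. The pivot-interval formula and grouping scheme are a straightforward reading of the text, not the study's code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def bootstrap_pr_auc_ci(labels_by_patient, scores_by_patient,
                        n_boot=200, seed=0):
    """95% pivot bootstrap CI for PR AUC, resampling whole patients."""
    rng = np.random.default_rng(seed)
    patients = list(labels_by_patient)
    y = np.concatenate([labels_by_patient[p] for p in patients])
    s = np.concatenate([scores_by_patient[p] for p in patients])
    estimate = average_precision_score(y, s)

    stats = []
    for _ in range(n_boot):
        # Resample patients with replacement; keep all of each patient's steps.
        sample = rng.choice(patients, size=len(patients), replace=True)
        yb = np.concatenate([labels_by_patient[p] for p in sample])
        sb = np.concatenate([scores_by_patient[p] for p in sample])
        stats.append(average_precision_score(yb, sb))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    # Pivot (basic) interval: reflect the bootstrap quantiles around the estimate.
    return estimate, (2 * estimate - hi, 2 * estimate - lo)
```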
Reporting Summary

Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.

Ethics and Information Governance

This work, and the collection of data under implied consent, received Tennessee Valley Healthcare System Institutional Review Board (IRB) approval from the US Department of Veterans Affairs. De-identification was performed in line with the Health Insurance Portability and Accountability Act (HIPAA) and validated by the US Department of Veterans Affairs Central Database and Information Governance departments. Only de-identified retrospective data was used for research, without the active involvement of patients.

Code Availability

We make use of several open-source libraries to conduct our experiments, namely the machine learning framework TensorFlow (https://github.com/tensorflow/tensorflow) along with the TensorFlow library Sonnet (https://github.com/deepmind/sonnet), which provides implementations of individual model components [58]. Our experimental framework makes use of proprietary libraries and we are unable to publicly release this code. We detail the experiments and implementation details in the Methods section and in the supplementary figures to allow for independent replication.

Data Availability

The clinical data used for the training, validation and test sets was collected at the US Department of Veterans Affairs and transferred to a secure data centre with strict access controls in de-identified format. Data was used with both local and national permissions. It is not publicly available, and restrictions apply to its use. The de-identified dataset, or a test subset, may be available from the US Department of Veterans Affairs subject to local and national ethical approvals.

Extended Data

Extended Data Figure 1 | The sequential representation of EHR data. All EHR data available for each patient was structured into a sequential history for both inpatient and outpatient events in six-hourly blocks, shown here as circles. In each 24-hour period, events without a recorded time were included in a fifth block. Apart from the data present at the current time step, the models optionally receive an embedding of the previous 48 hours and the longer history of 6 months or 5 years.

Extended Data Figure 2 | The proposed model architecture. The best performance was achieved by a multitask deep recurrent highway network architecture on top of an L1-regularised deep residual embedding component that learns the best data representation end-to-end without pre-training.

Extended Data Figure 3 | Calibration. a, b, The predictions were recalibrated using isotonic regression; model predictions are shown before (a) and after (b) calibration. Model predictions were grouped into 20 buckets, with the mean model risk prediction plotted against the percentage of positive labels in that bucket. The diagonal line indicates ideal calibration.

Extended Data Figure 4 | Analysis of false positive predictions. a, For prediction of any AKI within 48 h at 33% precision, nearly half of all false positive predictions are trailing, made after the AKI has already occurred (orange bars), or early, made more than 48 h prior (blue bars). The histogram shows the distribution of these trailing and early false positives. Incorrect predictions are mapped to their closest preceding or following episode of AKI (whichever is closer) if that episode occurs in an admission. For ±1 day, 15.2% of false positives correspond to observed AKI events within 1 day after the prediction (model reacted too early) and 2.9% correspond to observed AKI events within 1 day before the prediction (model reacted too late). b, Subgroup analysis for all false-positive alerts. In addition to the 49% of false-positive alerts that were made in admissions during which there was at least one episode of AKI, many of the remaining false-positive alerts were made in patients who had evidence of clinical risk factors in their available electronic health record data. These risk factors are shown here for the proposed model that predicts any stage of AKI occurring within the next 48 h.

Extended Data Table 1 | Model performance for predicting AKI within the full range of possible prediction windows from 6 to 72 hours. On shorter time windows, closer to the actual onset of AKI, the model achieves a higher ROC AUC (a) but a lower PR AUC (b). This stems from the different numbers of positive steps within windows of different lengths. These differences affect both the model precision and the false positive rate. When making predictions across shorter time windows there is more uncertainty in the exact time of AKI onset due to minor physiological fluctuations, and this results in a lower precision being needed in order to achieve high sensitivity.
95% bootstrap pivot confidence intervals are calculated using n = 200 bootstrap samples.

a. ROC AUC [95% CI]
Time window    Any AKI                 AKI stages 2 and 3      AKI stage 3
24 h           93.4% [93.3, 93.6]      97.1% [96.9, 97.3]      98.8% [98.7, 98.9]
48 h           92.1% [91.9, 92.3]      95.7% [95.5, 96.0]      98.0% [97.8, 98.2]
72 h           91.4% [91.1, 91.6]      94.7% [94.4, 95.0]      97.3% [97.2, 97.6]

b. PR AUC [95% CI]
Time window    Any AKI                 AKI stages 2 and 3      AKI stage 3
24 h           25.9% [24.6, 27.0]      36.8% [35.1, 38.7]      47.6% [45.1, 49.7]
48 h           29.7% [28.5, 30.8]      37.8% [36.1, 39.6]      48.7% [46.4, 51.1]
72 h           31.7% [30.6, 32.8]      37.4% [35.6, 39.1]      48.0% [46.1, 49.9]

Extended Data Table 2 | Daily frequency of true and false positive alerts when predicting different stages of AKI. The frequency of alerts and its standard deviation are shown for a time window of 48 hours and an operating point corresponding to a 1:2 TP:FP ratio (N = 5,101 days). On an average day, when predicting future AKI of any severity, clinicians would receive true positive alerts of AKI predicted to occur within a window of 48 hours ahead in 0.85% of all in-hospital patients, and a false positive prediction of a future AKI in 1.89% of patients. Assuming none of the false positives can be filtered out and immediately discarded, clinicians would need to attend to approximately 2.7% of all in-hospital patients. For the most severe stage of AKI, the model alerts on an average day in 0.8% of all patients; of those, 0.27% are true positives and 0.56% are false positives. Note that there are multiple time steps at which the predictions are made within each day, so the TP:FP ratio of the daily alerts differs slightly from the step-wise ratio.

a. Predicting any stage of AKI
Alert type               Frequency
True positive alerts     0.85% ± 0.71
False positive alerts    1.89% ± 1.20
No alerts                97.26% ± 1.63

b. Predicting KDIGO AKI stages 2 and above
Alert type               Frequency
True positive alerts     0.30% ± 0.35
False positive alerts    0.64% ± 0.55
No alerts                99.06% ± 0.75

c. Predicting KDIGO AKI stage 3
Alert type               Frequency
True positive alerts     0.27% ± 0.33
False positive alerts    0.56% ± 0.85
No alerts                99.17% ± 0.96

Extended Data Table 3 | Model performance on patients requiring subsequent dialysis. Model performance only in AKI cases where either in-hospital or outpatient administration of dialysis is required within 30 days of the onset of AKI, or where regular outpatient administration of dialysis is scheduled within 90 days. The model successfully predicts a large proportion of these AKI cases early: 84.3% of AKI cases where any dialysis is administered within 30 days, and 90.2% of cases where regular outpatient administration of dialysis occurs within 90 days.

Subgroup                                          Sensitivity (AKI episode)   PR AUC   ROC AUC   Sensitivity (step)   Specificity (step)
In-hospital/outpatient dialysis within 30 days    84.3%                       70.5%    83.5%     67.7%                83.3%
Outpatient dialysis within 90 days                90.2%                       71.9%    83.8%     76.5%                76.3%

Extended Data Table 4 | Operating points for predicting AKI up to 48 hours ahead of time. (a) For prediction of any AKI, the model correctly identifies 55.8% of all AKI episodes early if allowing for two false positives for every true positive, and 34.7% if allowing for one false positive for every true positive.
For more severe AKI stages it is possible to achieve a higher sensitivity at any fixed level of precision. Performance increases for prediction of (b) AKI stages 2 and 3, and (c) AKI stage 3 alone. 95% bootstrap pivot confidence intervals are calculated using n = 200 bootstrap samples for all tables.

a. Operating points for predicting any AKI up to 48 hours ahead of time
Precision   TP:FP   Sensitivity [95% CI] (AKI episode)   Sensitivity [95% CI] (step)   Specificity [95% CI] (step)
20.0%       1:4     76.7% [75.6, 77.8]                   58.3% [56.9, 59.8]            94.8% [94.6, 95.1]
25.0%       1:3     68.2% [66.9, 69.7]                   47.7% [46.1, 49.4]            96.8% [96.6, 97.0]
33.0%       1:2     55.8% [53.9, 57.7]                   35.0% [33.3, 36.7]            98.4% [98.3, 98.5]
40.0%       2:3     46.6% [44.5, 49.0]                   27.1% [25.2, 28.9]            99.1% [99.0, 99.2]
50.0%       1:1     34.7% [32.0, 37.2]                   18.5% [16.7, 20.3]            99.6% [99.5, 99.6]
60.0%       3:2     24.7% [21.8, 27.3]                   12.4% [10.5, 13.9]            99.8% [99.8, 99.8]
75.0%       3:1     12.0% [9.3, 14.6]                    5.5% [3.9, 7.0]               100.0% [99.9, 100.0]

b. Operating points for predicting AKI stages 2 and 3 up to 48 hours ahead of time
Precision   TP:FP   Sensitivity [95% CI] (AKI episode)   Sensitivity [95% CI] (step)   Specificity [95% CI] (step)
20.0%       1:4     82.0% [80.6, 83.5]                   65.8% [64.0, 67.9]            98.5% [98.4, 98.6]
25.0%       1:3     77.8% [76.3, 79.7]                   60.4% [58.3, 62.8]            99.0% [98.9, 99.1]
33.0%       1:2     71.4% [69.6, 73.7]                   51.8% [49.6, 54.8]            99.4% [99.4, 99.5]
40.0%       2:3     65.2% [63.0, 67.7]                   44.6% [42.1, 47.3]            99.6% [99.6, 99.7]
50.0%       1:1     56.2% [54.0, 59.2]                   35.8% [33.5, 38.9]            99.8% [99.8, 99.8]
60.0%       3:2     45.1% [42.2, 48.6]                   26.3% [23.8, 29.4]            99.9% [99.9, 99.9]
75.0%       3:1     27.5% [24.2, 31.5]                   13.8% [11.7, 16.3]            100.0% [100.0, 100.0]

c. Operating points for predicting AKI stage 3 up to 48 hours ahead of time
Precision   TP:FP   Sensitivity [95% CI] (AKI episode)   Sensitivity [95% CI] (step)   Specificity [95% CI] (step)
20.0%       1:4     91.2% [90.4, 92.3]                   80.3% [78.4, 82.4]            98.8% [98.7, 98.9]
25.0%       1:3     88.8% [87.7, 90.1]                   75.8% [73.7, 78.3]            99.1% [99.0, 99.2]
33.0%       1:2     84.1% [82.4, 85.9]                   68.3% [65.7, 71.0]            99.5% [99.4, 99.5]
40.0%       2:3     79.5% [77.4, 81.8]                   61.1% [57.9, 64.5]            99.7% [99.6, 99.7]
50.0%       1:1     71.3% [68.3, 74.4]                   50.2% [46.4, 53.8]            99.8% [99.8, 99.8]
60.0%       3:2     61.2% [57.6, 64.9]                   39.9% [35.7, 43.8]            99.9% [99.9, 99.9]
75.0%       3:1     40.5% [36.5, 46.1]                   23.2% [19.6, 27.2]            100.0% [100.0, 100.0]

Extended Data Table 5 | Future and cross-site generalisability experiments. (a) Model performance when trained before the time point tP and tested after tP, both on the entirety of the future patient population and on subgroups of patients for which the model has or has not seen historical information during training. The model maintains a comparable level of performance on unseen future data, with a higher sensitivity of 59% for a time window of 48 hours ahead of time and a precision of two false positives per step for each true positive. The ranges correspond to bootstrap pivotal 95% confidence intervals with n = 200. Note that this experiment is not a replacement for a prospective evaluation of the model. (b) Cohort statistics for (a), shown both before and after the temporal split tP that was used to simulate model performance on future data. (c) Comparison of model performance when applied to data from previously unseen hospital sites. Data was split across sites so that 80% of the data was in group A and 20% in group B. No site from group B was present in group A and vice versa.
The data was split into training, validation, calibration and test in the same way as in the other experiments. The table reports the performance of a model trained on site group A when evaluated on the test set within site group A versus the test set within site group B, for predicting all AKI severities up to 48 hours ahead of time. Comparable performance is seen across all key metrics. 95% bootstrap pivot confidence intervals are calculated using n = 200 bootstrap samples. Note that the model would still need to be retrained to generalise outside of the VA population to a different demographic, and to a different set of clinical pathways and hospital processes elsewhere.

a. Patient cohorts
Metric [95% CI]             Before tP (test)        New admissions after tP (test)   Subsequent admissions after tP   All patients after tP
Sensitivity (AKI episode)   55.09 [54.01, 56.06]    59.00 [57.11, 60.71]             59.04 [58.38, 59.63]             58.97 [58.33, 59.52]
ROC AUC                     92.25 [92.01, 92.42]    90.19 [89.76, 90.77]             89.98 [89.83, 90.17]             89.98 [89.81, 90.14]
PR AUC                      29.97 [28.61, 31.15]    30.75 [28.65, 32.81]             31.54 [30.87, 32.30]             31.28 [30.44, 32.02]
Sensitivity (step)          34.26 [33.17, 35.28]    36.87 [35.20, 38.85]             37.23 [36.67, 37.88]             37.08 [36.40, 37.65]
Specificity (step)          98.55 [98.50, 98.60]    97.66 [97.54, 97.76]             97.63 [97.58, 97.68]             97.64 [97.59, 97.68]
Precision                   32.51 [31.44, 33.21]    32.66 [31.20, 34.03]             32.97 [32.52, 33.47]             32.84 [32.28, 33.33]

b. Cohort statistics
                                       Before tP              After tP
Patients
Number of patients                     599,871                246,406
Average age*                           61.3                   64.2
Admissions within a given period
Unique admissions                      2,134,544              364,778
ICU admissions                         226,585 (10.62%)       40,102 (10.99%)
Medical admissions                     1,040,923 (48.77%)     170,383 (46.71%)
Surgical admissions                    373,823 (17.51%)       67,617 (18.54%)
No creatinine measured                 458,486 (21.48%)       52,115 (14.29%)
Any chronic kidney disease             774,883 (36.30%)       156,181 (42.82%)
Any AKI present                        282,398 (13.23%)       41,950 (14.59%)

c. Site groups
Metric [95% CI]             Site group A           Site group B
Sensitivity (AKI episode)   55.6% [54.5, 56.6]     54.6% [52.8, 56.3]
ROC AUC                     91.8% [91.6, 92.1]     91.3% [90.8, 91.7]
PR AUC                      30.0% [28.6, 31.2]     30.6% [28.3, 32.7]
Sensitivity (step)          34.3% [33.1, 35.2]     34.7% [32.6, 36.2]
Specificity (step)          98.5% [98.4, 98.5]     98.3% [98.2, 98.4]

Extended Data Table 6 | Summary statistics for the data. A breakdown of the training (80%), validation (5%), calibration (5%) and test (10%) datasets by both unique patients and individual admissions. Where appropriate, the percentage of the total dataset size is reported in parentheses. The dataset was representative of the overall VA population for clinically relevant demographics and diagnostic groups associated with renal pathology. *Average age after taking into account exclusion criteria and statistical noise added to meet HIPAA Safe Harbor criteria. **CKD stage 1 is evidence of renal parenchymal damage with a normal glomerular filtration rate (GFR). This is rarely recorded in our dataset; instead, the numbers for stage 1 CKD have been estimated from admissions that carried an ICD-9 code for CKD but where GFR was normal. For this reason these numbers may under-represent the true prevalence in the population. ***172 VA inpatient sites and 1,062 outpatient sites were eligible for inclusion; 130 data centres aggregate data from one or more of these facilities, of which 114 had data on the inpatient admissions used in this study. While the exact number of sites included was not provided in the dataset for this work, no patients were excluded based on location.
                                   Training             Validation          Calibration         Test
Patients
Unique patients                    562,507              35,277              35,317              70,681
Average age*                       62.4                 62.5                62.4                62.3
Ethnicity: Black                   106,299 (18.9%)      6,544 (18.6%)       6,675 (18.6%)       13,183 (18.7%)
Ethnicity: Other                   456,208 (81.1%)      28,733 (81.4%)      28,642 (81.4%)      57,498 (81.3%)
Gender: Female                     35,855 (6.4%)        2,300 (6.5%)        2,252 (6.4%)        4,519 (6.4%)
Gender: Male                       526,652 (93.6%)      32,977 (93.5%)      33,065 (93.6%)      66,162 (93.6%)
Diabetes                           56,958 (10.1%)       3,599 (10.2%)       3,702 (10.5%)       7,093 (10.0%)
Admissions within a five-year period
Data centre sites                  130***               130***              130***              130***
Unique admissions                  2,004,217            124,255             125,928             252,492
Admissions per patient: average    3.6                  3.5                 3.6                 3.6
Admissions per patient: median     2                    2                   2                   2
Duration (days): average           9.6                  9.6                 9.6                 9.6
Duration (days): median            3.2                  3.2                 3.2                 3.2
ICU admissions                     214,644 (10.7%)      13,161 (10.6%)      13,411 (10.6%)      26,739 (10.6%)
Medical admissions                 971,527 (48.5%)      60,762 (48.9%)      61,281 (48.7%)      121,675 (48.2%)
Surgical admissions                354,008 (17.7%)      21,857 (17.6%)      22,093 (17.5%)      44,766 (17.7%)
Renal replacement therapy          22,284 (1.1%)        1,367 (1.1%)        1,384 (1.1%)        2,784 (1.1%)
No creatinine measured             408,927 (20.4%)      25,162 (20.3%)      25,503 (20.3%)      51,484 (20.4%)
Chronic kidney disease
Any                                746,692 (37.3%)      46,677 (37.5%)      46,622 (37.0%)      94,105 (37.3%)
Stage 1**                          8,409 (0.4%)         515 (0.4%)          576 (0.5%)          1,103 (0.4%)
Stage 2                            429,990 (21.5%)      27,162 (21.9%)      26,927 (21.4%)      54,476 (21.6%)
Stage 3A                           156,720 (7.8%)       9,837 (7.9%)        9,803 (7.8%)        19,548 (7.7%)
Stage 3B                           77,801 (3.9%)        4,675 (3.8%)        4,823 (3.7%)        9,760 (3.9%)
Stage 4                            50,535 (2.5%)        3,004 (2.5%)        3,066 (2.5%)        6,223 (2.5%)
Stage 5                            31,646 (1.6%)        1,999 (1.6%)        2,003 (1.6%)        4,098 (1.6%)
AKI present
Any AKI                            267,396 (13.3%)      16,671 (13.4%)      16,760 (13.3%)      33,759 (13.4%)
Stage 1                            207,441 (10.4%)      12,794 (10.3%)      12,951 (10.3%)      26,215 (10.4%)
Stage 2                            43,446 (2.2%)        2,780 (2.2%)        2,783 (2.2%)        5,575 (2.2%)
Stage 3                            66,734 (3.3%)        4,267 (3.4%)        4,162 (3.3%)        8,453 (3.3%)

Supplementary Material

            Clinical Decision Support in the Era of Artificial Intelligence


              The global burden of chronic kidney disease: estimates, variability and pitfalls

              Chronic kidney disease (CKD) is currently defined by abnormalities of kidney structure or function assessed using a matrix of variables — including glomerular filtration rate (GFR), thresholds of albuminuria and duration of injury — and is considered by many to be a common disorder globally.

                Author and article information

Journal
Journal of Clinical Medicine (J Clin Med)
Publisher: MDPI
ISSN: 2077-0383
Published: 13 April 2020 (April 2020 issue)
Volume 9, Issue 4, Article 1107
                Affiliations
[1] Division of Nephrology, Department of Medicine, Mayo Clinic, Rochester, MN 55905, USA; charat.thongprayoon@gmail.com (C.T.); api.che@hotmail.com (A.C.)
[2] Department of Military and Community Medicine, Phramongkutklao College of Medicine, Bangkok 10400, Thailand; wisitnephro@gmail.com
[3] Division of Nephrology, Department of Medicine, University of Mississippi Medical Center, Jackson, MS 39216, USA; kkovvuru@umc.edu (K.K.); skanduri@umc.edu (S.R.K.); mgonzalezsuarez@umc.edu (M.L.G.-S.)
[4] Department of Internal Medicine, University of Pittsburgh Medical Center Pinnacle, Harrisburg, PA 17105, USA; hansrivijitp@upmc.edu
[5] Department of Internal Medicine, University of Arizona, Tucson, AZ 85721, USA; tarunjacobb@gmail.com
[6] Department of Nephrology, Department of Medicine, Saint Luke's Health System, Kansas City, MO 64111, USA; napat.leeaphorn@gmail.com
Author notes
[*] Correspondence: wcheungpasitporn@gmail.com; Tel.: +1-601-984-5670; Fax: +1-601-984-5765
                Author information
                https://orcid.org/0000-0003-2920-7235
                https://orcid.org/0000-0002-1707-9988
                https://orcid.org/0000-0002-8930-4611
                https://orcid.org/0000-0001-9954-9711
Article
Article ID: jcm-09-01107
DOI: 10.3390/jcm9041107
PMCID: PMC7230205
PMID: 32294906
                © 2020 by the authors.

                Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license ( http://creativecommons.org/licenses/by/4.0/).

History
Received: 01 April 2020
Accepted: 09 April 2020
                Categories
                Editorial

artificial intelligence, machine learning, big data, nephrology, transplantation, kidney transplantation, acute kidney injury, chronic kidney disease
