Temporal Label Smoothing
for Early Prediction of Adverse Events

Hugo Yèche^{1,*}, Alizée Pace^{1,2,3,*}, Gunnar Rätsch^{1,†}, Rita Kuznetsova^{1,†}

{hyeche,alpace,raetsch,mkuznetsova}@inf.ethz.ch

^{1} Department of Computer Science, ETH Zürich

^{2} ETH AI Center, ETH Zürich

^{3} Max Planck Institute for Intelligent Systems, Tübingen, Germany

^{*} Equal contribution. ^{†} Co-supervised.

Abstract

Models that can predict adverse events ahead of time with low false-alarm rates are critical to the acceptance of decision support systems in the medical community. This challenging machine learning task remains typically treated as simple binary classification, with few bespoke methods proposed to leverage temporal dependency across samples. We propose Temporal Label Smoothing (TLS), a novel learning strategy that modulates smoothing strength as a function of proximity to the event of interest. This regularization technique reduces model confidence at the class boundary, where the signal is often noisy or uninformative, thus allowing training to focus on clinically informative data points away from this boundary region. From a theoretical perspective, we also show that our method can be framed as an extension of multi-horizon prediction, a learning heuristic proposed in other early prediction work. TLS empirically matches or outperforms considered competing methods on various early prediction benchmark tasks. In particular, our approach significantly improves performance on clinically-relevant metrics such as event recall at low false-alarm rates.

1 Introduction

Early prediction of adverse events is key to safety-critical operations such as clinical care [1] or environmental monitoring [2]. In particular, adverse event prediction is highly relevant to clinical decision-making, as the deployment of in-patient risk stratification models can significantly improve patient outcomes and facilitate resource planning [1]. For instance, the National Early Warning Score (NEWS), a simple rule-based model predicting acute deterioration in critical care units, has been demonstrated to reduce in-patient mortality [3; 4].

Deteriorating patient signals are often identified by mining large quantities of existing medical data and associated patient outcomes, which has sparked a growing interest in machine learning and medical literature. Applications of such adverse event prediction models include alarm systems for delirium [5], septic shock [6], as well as circulatory or kidney failure in the intensive care unit (ICU) [7; 8].

Adverse event prediction remains a challenging modelling task requiring specific technical solutions. Recent years have seen the development of deep learning architectures for electronic health records (EHR), which help tackle the high dimensionality, irregular sampling, and informative missingness patterns in patient covariates [6; 9; 10; 8]. Still, adverse clinical events are often noisy, infrequent, and, as illustrated in Figure 1, must be predicted with enough anticipation to allow for appropriate physician response – yet early prediction remains largely considered a simple binary classification task [7; 9; 8].

As a result, current decision support models often suffer from high false positive prediction rates, with associated risks of alarm fatigue and thus limited physician engagement [11; 12; 1]. As highlighted in Figure 1(a), the traditional cross-entropy objective results in highest error rates near the class boundary, corresponding to the prediction horizon before the event. Data in this boundary region dominates the loss but may not be clinically discriminative of patient deterioration patterns. Motivated by this observation, we propose Temporal Label Smoothing (TLS), a novel regularization strategy making label smoothing [13] time-dependent to better match prediction uncertainty patterns over time. As visualized in Figure 1(b), our method is designed to reduce model confidence with stronger smoothing at the class boundary, allowing training to focus on more clinically informative data points away from this noisily labelled region.

Contributions.

The contributions of our work are threefold: (i) In Section 3.2, we introduce a novel label smoothing method, which leverages the temporal structure of early prediction tasks to focus training and model confidence on areas with stronger predictive signal (all code is made publicly available at https://anonymous.4open.science/r/tls/). (ii) In Section 5, we show that our approach improves prediction performance over previously proposed objectives, particularly for clinically relevant criteria. (iii) In Section 3.3, we bridge the gap between prior work on multi-horizon prediction (MHP) [8] and label smoothing [13] by showing the former is equivalent to a special case of TLS under reasonable assumptions that we verify empirically.

Figure 1(a): Timestep performance of regular cross-entropy training for decompensation on MIMIC-III.

2 Related work

Recent years have seen the development of custom machine learning methods to predict expected patient evolution and support clinical decision-making [14; 15; 16; 7]. Amongst these, early prediction of adverse clinical events is a particularly complex task due to their typically rare occurrence and noisy label definition, which induces challenging, highly imbalanced datasets for model training [8]. As a result, prediction systems often suffer from high false-alarm rates with limited usefulness in the clinical context [1]. Prior works on early event prediction have adopted various approaches to tackle this issue, which we compare in Table 1 and formalize in Appendix A.3. We also discuss similarities and distinctions between our task and the framework of survival analysis [17] in Appendix A.2.

Learning objectives for imbalanced datasets.

Class imbalance is often addressed through loss reweighting techniques. Static class reweighting was used for sepsis or circulatory failure prediction [16; 7] through a balanced cross-entropy, which assigns a higher weight to samples from the minority class [18]. Still, performance improvements with this objective remain limited on highly imbalanced prediction tasks [19]. In contrast, dynamic reweighting methods such as focal loss and extensions [20; 21] induce a learning bias towards samples with high model uncertainty, typically harder to classify. This approach can improve the prediction of disease progression from imbalanced datasets [22] but does not consider patterns of sample informativeness over time.

Multi-horizon prediction.

In contrast, other early prediction models learn to leverage temporal trends in the data by outputting event predictions over several horizons [8; 23; 24]. This training heuristic improves prediction performance on the horizon of interest but scales poorly with the number of output horizons. In Section 3.3, we highlight that TLS can induce a similar temporal bias in learning while overcoming scalability limitations.

Related work	Temporal inductive bias	Computationally scalable	Impacts sample optimum	Loss for class $c \in \{0, 1\}$
Cross-entropy loss	✗	✓	✗	$\delta_{y = c} \log(\hat{y})$
Balanced cross-entropy loss [18]	✗	✓	✗	$\omega_{y} \, \delta_{y = c} \log(\hat{y})$
Focal loss [20]	✗	✓	✗	$\omega_{y} (1 - \hat{y})^{\zeta} \, \delta_{y = c} \log(\hat{y})$
Label smoothing [13]	✗	✓	✓	$q^{\mathrm{LS}}(c \mid y) \log(\hat{y})$
Multi-horizon prediction [8]	✓	✗	✓	$\sum_{h} y^{h} \log(\hat{y}^{h})$
Temporal label smoothing	✓	✓	✓	$q^{\mathrm{TLS}}(c \mid y, t) \log(\hat{y})$

Table 1: Related work. Comparison to different training objectives for binary early prediction tasks. $y \in \{0, 1\}$ corresponds to a sample’s true label at time $t$ and $\hat{y} \in [0, 1]$ to the model’s prediction.
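To make the reweighted objectives in Table 1 concrete, the following NumPy sketch implements a single per-sample loss whose special cases are cross-entropy, balanced cross-entropy and focal loss. The function name `weighted_focal_loss` and the arguments `w_pos`, `w_neg` and `eps` are illustrative assumptions rather than names from the paper, and $\hat{y}$ is assumed to be the predicted probability of the positive class.

```python
import numpy as np

def weighted_focal_loss(y, y_hat, w_pos=1.0, w_neg=1.0, zeta=0.0, eps=1e-12):
    """Per-sample loss -w_y * (1 - p_true)^zeta * log(p_true).

    zeta = 0 gives balanced (weighted) cross-entropy; additionally setting
    w_pos = w_neg = 1 gives plain cross-entropy.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    # Predicted probability assigned to the true class of each sample.
    p_true = np.where(y == 1, y_hat, 1.0 - y_hat)
    # Class weight (hypothetical values; typically larger for the minority class).
    w = np.where(y == 1, w_pos, w_neg)
    return -w * (1.0 - p_true) ** zeta * np.log(p_true)
```

With `zeta > 0`, confidently classified samples contribute less to the loss than uncertain ones, which is the dynamic reweighting behaviour described above.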

Label smoothing.

For greater generalization of models applied to heterogeneous real-world data, another well-known training strategy is to avoid model overconfidence through label smoothing [13]. This regularization technique improves both the calibration of deep learning models [25] and their performance under noisy labelling [26; 25]. Still, despite extensions including novel prior distributions over classes [27] or modifications to the objective itself [28; 29], label smoothing remains designed for classification problems with i.i.d. samples and is ill-adapted to the time-dependent nature of our data. To the best of our knowledge, this is the first work to explore adding a temporal dependence to label smoothing and to empirically demonstrate the added value of this approach.

Whereas reweighted loss functions only bias learning towards minority or uncertain data points, multi-horizon prediction and label smoothing approaches alter the individual sample optimum. As a consequence, these approaches avoid model overconfidence and are thus more robust to noisy labelling [26]. In this work, we propose to combine the respective advantages of these established methods in a novel way to improve early prediction of adverse events.

3 Method

We first formalize the problem of early adverse event prediction and introduce temporal label smoothing. We then highlight how MHP can be framed as a special case of TLS.

3.1 Problem formalism

We assume access to a dataset of $N$ patient stays. These consist of irregular time series of high-dimensional patient covariates $X_{i, t} = [x_{i, 0}, \dots, x_{i, t}]$ and binary event labels $e_{i, t}$ encoding whether a patient of index $i$ is undergoing an adverse event of interest at time $t$. For each patient, we thus have a sequence $\{(x_{i, 1}, e_{i, 1}), \dots, (x_{i, T_{i}}, e_{i, T_{i}})\}$ of length $T_{i}$.

Our early prediction task consists of modelling a binary target variable $y_{i, t}$, which is positive if the event occurs within a given prediction horizon $h$. For labelling purposes, we define a time-to-event variable at each time point as the time of the next event, $t_{e}(i, t) = \min \{\tau \geq t : e_{i, \tau} = 1\}$. If patient $i$ never undergoes any event, we set $t_{e}(i, t) = +\infty$. Thus, writing $t_{e}$ for $t_{e}(i, t)$, we have:

$$
y_{i, t} = \begin{cases} 0 & \text{if } t < t_{e} - h \\ 1 & \text{if } t_{e} - h \leq t < t_{e} \\ \text{NaN} & \text{if } t_{e} = t \end{cases} \tag{1}
$$

As our task focuses specifically on early modeling for clinical relevance, no prediction is carried out if the patient is currently undergoing the event. Then, as for any binary deep learning problem, we define a model $f$ parameterized by $\theta$ with $\hat{y}_{i, t} = f_{\theta}(X_{i, t}) = p_{\theta}(y_{i, t} = 1)$. We denote the optimal set of parameters minimizing the objective function as $\theta^{*}$, giving $y_{i, t}^{*} = f_{\theta^{*}}(X_{i, t})$.
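The labelling rule in Equation 1 can be sketched as follows for a single stay. This is a minimal illustration under stated assumptions: timesteps are integer indices on a regular grid, the horizon is expressed in steps, and the boundary step $t = t_{e} - h$ is treated as positive. The function names `time_to_event` and `early_prediction_labels` are hypothetical.

```python
import numpy as np

def time_to_event(events):
    """For each step t, index of the next step tau >= t with events[tau] == 1 (inf if none)."""
    events = np.asarray(events)
    t_e = np.full(len(events), np.inf)
    next_event = np.inf
    # Scan backwards so that each step sees the closest upcoming event.
    for t in range(len(events) - 1, -1, -1):
        if events[t] == 1:
            next_event = t
        t_e[t] = next_event
    return t_e

def early_prediction_labels(events, horizon):
    """Labels y_t: 1 within `horizon` steps of the next event, 0 otherwise, NaN during events."""
    t = np.arange(len(events))
    dist = time_to_event(events) - t          # t_e - t, inf when no event follows
    labels = np.where(dist <= horizon, 1.0, 0.0)
    labels[dist == 0] = np.nan                # no prediction while the event is ongoing
    return labels

# Example stay: an event spanning steps 5 and 6, with a horizon of 2 steps.
labels = early_prediction_labels([0, 0, 0, 0, 0, 1, 1, 0], horizon=2)
```

Steps labelled NaN would be masked out of the loss and of the evaluation metrics.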

Temporal structure.

An important distinction must be made with the classification tasks typically addressed with label smoothing. In adverse event prediction, data is not independent and identically distributed (i.i.d.) as each sample $x_{i, t}$ depends on a timestep $t$ and a patient stay indexed as $i$ . Contiguous samples within a common stay are thus dependent in time:

$$
p(y_{i, t + d} = 1) \geq p(y_{i, t} = 1) \quad \forall d \in [0, t_{e}(i, t) - t) \tag{2}
$$

Our goal is to leverage this structure in our data to focus training on relevant timesteps and help address issues of noisy label boundaries and class imbalance, which are inherent to our choice of real-world medical datasets.

3.2 Temporal label smoothing

As introduced by Szegedy et al. [13], label smoothing consists of substituting the original label distribution, $\delta_{y_{i} = c}$ for class $c$, with a smooth version $q^{\mathrm{LS}}(c \mid y_{i})$ in the cross-entropy objective $L_{i} = L^{\mathrm{CE}}(y_{i}, \hat{y}_{i})$. For binary tasks, label smoothing becomes a linear interpolation:

$$
q^{\mathrm{LS}}(1 \mid y_{i}) = (1 - \alpha) \, y_{i} + \alpha \, (1 - y_{i}) \tag{3}
$$

where parameter $\alpha$ controls the smoothing strength.
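As a small illustration of Equation 3, the sketch below smooths binary labels and evaluates the cross-entropy against the smoothed target. It also checks numerically, on a grid of candidate predictions, that this loss is smallest when the prediction equals the smoothed target rather than the hard label. The helper names and the `eps` clipping constant are assumptions made for this example.

```python
import numpy as np

def smooth_labels(y, alpha):
    """Binary label smoothing: q(1 | y) = (1 - alpha) * y + alpha * (1 - y)."""
    y = np.asarray(y, dtype=float)
    return (1.0 - alpha) * y + alpha * (1.0 - y)

def cross_entropy(q, y_hat, eps=1e-12):
    """Cross-entropy between a soft target q and a predicted probability y_hat."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(q * np.log(y_hat) + (1.0 - q) * np.log(1.0 - y_hat))

# A positive label smoothed with alpha = 0.1 becomes a target of 0.9.
q = smooth_labels([1], alpha=0.1)[0]

# Scan candidate predictions and locate the minimizer of the smoothed loss.
grid = np.linspace(0.01, 0.99, 99)
best_prediction = grid[np.argmin(cross_entropy(q, grid))]
```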

By shifting the minimum of the objective function away from $y_{i}^{*} = y_{i}$ towards $y_{i}^{*} = q^{\mathrm{LS}}$, label smoothing prevents models from becoming overconfident during training. This approach should therefore help improve the robustness of early prediction models against the inherently noisy nature of the task [26] but does not account for the time dependency between samples of a given stay. For this purpose, we propose temporal label smoothing, an approach to modulate smoothing based on time $t$ to infuse this prior knowledge into the training objective. We define the corresponding surrogate distribution similarly to label smoothing:

$$
q^{\mathrm{TLS}}(1 \mid i, t) = 1 - \alpha(i, t) \tag{4}
$$

For early prediction of events, to enforce the temporal inductive bias in Equation 2, we parametrize $\alpha(i, t)$ as a monotonically decreasing function of $t \in [0, t_{e}(i, t)]$. In practice, as illustrated in Figure 2(a), this increases smoothing strength around the label boundary $t = t_{e} - h$, reducing prediction certainty in this region prone to high error rates, as shown in Figure 1(a).

Figure 2(a): Parametrization $\alpha^{\mathrm{exp}}$ (Equation 5).

Smoothing parametrizations.

We propose various temporal smoothing parametrizations for $\alpha(i, t)$ in Appendix A.2. Experimental results suggest that an exponential parametrization, defined as follows, performs best on considered tasks. Corresponding smoothed labels $q^{\mathrm{exp}}(1 \mid i, t)$ can be visualized in Figure 1(b).

$$
\alpha^{\mathrm{exp}}(i, t) = \begin{cases} 1 - e^{-\gamma (t_{e}(i, t) - t - d)} - A & \text{if } h_{\min} < t_{e}(i, t) - t < h_{\max} \\ 0 & \text{if } t_{e}(i, t) - t \leq h_{\min} \\ 1 & \text{if } t_{e}(i, t) - t \geq h_{\max} \end{cases} \tag{5}
$$

Parameters $h_{\min}$ and $h_{\max}$ define the time range over which we apply smoothing, namely $[t_{e} - h_{\max}, t_{e} - h_{\min}]$. Under this constraint, parameters $\{d, A\}$ are defined to enforce $\alpha(i, t)$ to be continuous at boundary points (see Appendix A.2). Finally, $\gamma$ controls the smoothing strength at a given time.
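The exponential parametrization in Equation 5 can be sketched as below. Rather than computing $d$ and $A$ explicitly, this example substitutes the two continuity conditions ($\alpha = 0$ at $t_{e} - t = h_{\min}$ and $\alpha = 1$ at $t_{e} - t = h_{\max}$) into the middle branch, which gives $\alpha = 1 - \frac{e^{-\gamma x} - e^{-\gamma h_{\max}}}{e^{-\gamma h_{\min}} - e^{-\gamma h_{\max}}}$ with $x = t_{e} - t$. This closed form is our own derivation from the stated constraints and may be written differently in Appendix A.2; the function names and the example values are assumptions.

```python
import numpy as np

def alpha_exp(dist, h_min, h_max, gamma):
    """Smoothing strength alpha as a function of dist = t_e - t (gamma must be nonzero)."""
    dist = np.asarray(dist, dtype=float)
    lo = np.exp(-gamma * h_min)
    hi = np.exp(-gamma * h_max)
    # Clipping reproduces the two constant branches: alpha = 0 for dist <= h_min
    # and alpha = 1 for dist >= h_max (including dist = inf when no event follows).
    x = np.clip(dist, h_min, h_max)
    return 1.0 - (np.exp(-gamma * x) - hi) / (lo - hi)

def tls_targets(dist, h_min, h_max, gamma):
    """Smoothed positive-class targets q(1 | i, t) = 1 - alpha(i, t), as in Equation 4."""
    return 1.0 - alpha_exp(dist, h_min, h_max, gamma)

# Hypothetical setting: horizon h = 12 hours, so (h_min, h_max) = (0, 2h) = (0, 24).
dist = np.array([0.0, 6.0, 12.0, 24.0, 30.0, np.inf])
targets = tls_targets(dist, h_min=0, h_max=24, gamma=0.2)
```

The resulting targets replace the hard labels in the cross-entropy objective, so no change to the model architecture is needed.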

3.3 Link with multi-horizon prediction

As motivated above, temporal label smoothing adapts the contribution of each sample to reflect prior knowledge about the temporal structure of event prediction labels. In this section, we find that MHP leverages the same information in Equation 2 to teach the model to predict events over multiple horizons [8]. Under simplifying assumptions justified empirically in Section 5.2, we show that this approach can be seen as a special case of temporal label smoothing with a ‘staircase’ parametrization.

In this framework, the unique label $y_{i, t}$ associated with patient covariates $X_{i, t}$, for a horizon of interest $h$, is replaced by a vector $\mathbf{y}_{i, t} = [y_{i, t}^{h_{1}}, \dots, y_{i, t}^{h}, \dots, y_{i, t}^{h_{H}}]$ corresponding to $H$ distinct horizons. The prediction model is thus adapted to output $\hat{\mathbf{y}}_{i, t} = [\hat{y}_{i, t}^{h_{1}}, \dots, \hat{y}_{i, t}^{h}, \dots, \hat{y}_{i, t}^{h_{H}}]$. For temporal consistency between samples, Tomašev et al. [8] enforce predictions to be monotonically increasing with the horizon, such that $h_{u} \leq h_{v} \implies \hat{y}_{i, t}^{h_{u}} \leq \hat{y}_{i, t}^{h_{v}}$. With these additional components, the training objective for patient $i$ at time $t$ becomes $L_{i}^{\mathrm{MHP}} = - \frac{1}{H} \sum_{k = 1}^{H} \left[ y_{i, t}^{h_{k}} \log(\hat{y}_{i, t}^{h_{k}}) + (1 - y_{i, t}^{h_{k}}) \log(1 - \hat{y}_{i, t}^{h_{k}}) \right]$.

Proposition 1.

Under the assumption that model outputs $\{\hat{y}_{i, t}^{h_{k}}\}_{k}$ are equal for all $\{h_{k}\}_{k}$ (rather than monotonically increasing), MHP is equivalent to temporal label smoothing parameterized with $\alpha^{\mathrm{step}}(i, t)$. This function, illustrated in Figure 2(b), is defined as the following sequence of step functions in time:

$$
\alpha^{\mathrm{step}}(i, t) = \begin{cases} \frac{k}{H} & \text{if } h_{k} < t_{e}(i, t) - t \leq h_{k + 1}, \ \forall k \leq H - 1 \\ 0 & \text{if } t_{e}(i, t) - t \leq h_{1} \\ 1 & \text{if } t_{e}(i, t) - t > h_{H} \end{cases} \tag{6}
$$

Proof.

See Appendix A.1.∎

Proposition 1 frames MHP as a special case of TLS with step-function parametrization. We empirically justify the equal-output assumption through an ablation study in Section 5.2.
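The equivalence stated in Proposition 1 can be checked numerically at a single timestep. The sketch below is an illustration rather than the proof in Appendix A.1: it assumes that a label $y^{h_{k}}$ is positive when $t_{e} - t \leq h_{k}$, forces all $H$ outputs to share one value $\hat{y}$, and compares the averaged MHP loss with the cross-entropy against the target $1 - \alpha^{\mathrm{step}}$. All function names and the example horizons are hypothetical.

```python
import numpy as np

def bce(q, y_hat):
    """Binary cross-entropy between a target q and a prediction y_hat in (0, 1)."""
    return -(q * np.log(y_hat) + (1.0 - q) * np.log(1.0 - y_hat))

def mhp_loss_shared_output(dist, horizons, y_hat):
    """MHP loss at one timestep when every horizon head outputs the same y_hat."""
    # y^{h_k} = 1 when the event falls within horizon h_k (assumed inclusive boundary).
    labels = (dist <= np.asarray(horizons)).astype(float)
    return np.mean(bce(labels, y_hat))

def alpha_step(dist, horizons):
    """Staircase smoothing: k / H, with k the number of horizons shorter than dist."""
    horizons = np.asarray(horizons)
    return np.sum(horizons < dist) / len(horizons)

horizons = [4, 8, 12, 16]   # example horizons in hours
y_hat = 0.3                 # arbitrary shared prediction
for dist in [2.0, 6.0, 12.0, 20.0, np.inf]:
    tls_loss = bce(1.0 - alpha_step(dist, horizons), y_hat)
    assert np.isclose(mhp_loss_shared_output(dist, horizons, y_hat), tls_loss)
```

The check passes because averaging the binary cross-entropy over horizons with a shared prediction equals the cross-entropy against the mean label, which is exactly $1 - k/H$.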

4 Experimental setup

4.1 Early prediction tasks

We demonstrate the effectiveness of our method on three clinical early prediction tasks, inspired by existing literature and published benchmarks. All tasks deal with electronic health records from the ICU, where early prediction of organ failure or acute deterioration is critical to patient management [1].

Our work is first benchmarked on the prediction of acute circulatory failure and mild respiratory failure within the next $h = 12$ hours. These tasks are part of HiRID-ICU-Benchmark (HiB) [19], built on the publicly available HiRID dataset [7]. The dataset contains high-resolution observations of over 33,000 ICU admissions. Our third evaluation task consists of early prediction of patient mortality, or decompensation, within a horizon of $h = 24$ hours. Although less clinically relevant, this task has been widely studied in the machine learning literature [30]. Defined in the MIMIC-III Benchmark (M3B) [31], this task originates from the widely used MIMIC-III dataset [32], counting approximately 40,000 patient stays.

All three clinical events are labelled following internationally accepted criteria as in Harutyunyan et al. [31] and Yèche et al. [19]. Positive label prevalence is 4.3%, 38.6%, and 2.1% of timepoints for circulatory, respiratory failure, and decompensation prediction respectively – with rarer events associated with more severe states, in this instance. Further details on task definition and data pre-processing are provided in Appendix B.

4.2 Benchmarking strategy

Baselines.

We quantify the added value of our method by comparing its performance to alternative learning approaches used for early event prediction, discussed in Section 2. Our first baselines consist of balanced cross-entropy [18] and focal loss [20], popular sample reweighting methods for imbalanced tasks. We also implement multi-horizon prediction as a multi-output model trained to predict event occurrence over different horizons between $0$ and $2h$. Note that for a fair comparison, we set $(h_{\min}, h_{\max}) = (0, 2h)$ in TLS. As in Tomašev et al. [8], a cumulative distribution function layer on logits enforces monotonicity of predictions (Eq. 2). Finally, we also compare our method to conventional label smoothing [13] to confirm that a temporal dependency does improve performance.

Hyperparameter tuning.

Hyperparameters introduced by our method, such as strength term $\gamma$ in smoothing parametrization $\alpha^{\mathrm{exp}}$ (Equation 5), are optimized through grid searches on the validation set. The same approach is adopted for hyperparameters specific to each baseline, as shown in Figure 4.

Architecture choice.

As our method and baselines are model-agnostic and only vary in terms of optimization objective, a unique model architecture is used for each task, selected through a random search on cross-entropy validation performance. Following a published benchmark on the HiRID dataset [19], we use a GRU [33] and transformer [34] architecture for the circulatory and respiratory failure tasks respectively. For decompensation prediction, transformers outperform the LSTM-based models [35] originally proposed in the M3B benchmark [31], and are thus used in our work. As recommended by Tomašev et al. [8], we apply $\ell_{1}$-regularization to input embedding layers, which improves performance on all tasks. Further implementation details are provided in Appendix C.

4.3 Evaluation metrics

To account for the imbalanced nature of clinical early prediction tasks, model performance is often reported through the area under the receiver operating characteristic curve (AUROC). Although this widely-used metric can be informative for moderate imbalances, the area under the precision-recall curve (AUPRC) provides more insight for our tasks: under a low prevalence of positive samples, precision is more sensitive to false alarms than specificity [36]. Still, "area under the curve" metrics can be poorly representative of clinical usefulness, as improvements in low precision regions can dominate such global metrics but remain incompatible with the low false alarm rates required for clinical deployment. Thus, to better assess model performance in this context, we also measure performance at a clinically motivated operating point through recall at 50% precision [23].

In addition to timestep-level metrics, which measure prediction performance at each data point, we also evaluate models in an event-based approach. Following the definition of Tomašev et al. [8], an event prediction is positive if the model outputs a positive prediction at any time over the $h$ hours before the event. The threshold defining a positive prediction is chosen based on a precision lower bound: in practice, we use a 50% stepwise precision criterion. This allows us to measure the event recall of our approach in comparison to published baselines. Unless stated otherwise, we always report mean performance with 95% confidence intervals computed over ten training runs.
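The two operating-point metrics described above can be sketched as follows. This is a simplified illustration, not the benchmark implementation: it assumes timesteps with undefined labels have already been removed, that onsets are integer step indices within a single stay, and that the horizon is expressed in steps. The function names are hypothetical.

```python
import numpy as np

def recall_at_precision(y_true, y_score, min_precision=0.5):
    """Highest timestep recall among thresholds with precision >= min_precision.

    Returns (recall, threshold); a step is predicted positive if score >= threshold.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="stable")      # sort by decreasing score
    y_sorted, s_sorted = y_true[order], y_score[order]
    tp = np.cumsum(y_sorted)                         # true positives at each cut-off
    precision = tp / np.arange(1, len(y_sorted) + 1)
    recall = tp / y_true.sum()
    # Only cut after the last sample of each group of tied scores.
    is_last_of_tie = np.r_[s_sorted[1:] != s_sorted[:-1], True]
    ok = np.flatnonzero(is_last_of_tie & (precision >= min_precision))
    if ok.size == 0:
        return 0.0, np.inf
    best = ok[-1]                                    # recall is non-decreasing in the cut-off
    return recall[best], s_sorted[best]

def event_recall(y_score, onsets, horizon, threshold):
    """Fraction of events with at least one alarm in the `horizon` steps before onset."""
    y_score = np.asarray(y_score, dtype=float)
    detected = [
        bool(np.any(y_score[max(0, t_e - horizon):t_e] >= threshold))
        for t_e in onsets
    ]
    return float(np.mean(detected)) if detected else float("nan")
```

In practice, the threshold returned by `recall_at_precision` on timestep-level predictions would be passed to `event_recall`, mirroring the stepwise precision criterion used to define a positive prediction.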

5 Results

Method	Circulatory Failure AUPRC	Circulatory Failure Recall	Decompensation AUPRC	Decompensation Recall	Respiratory Failure AUPRC	Respiratory Failure Recall
Cross-entropy	39.1 ± 0.4	29.3 ± 0.9	34.5 ± 0.4	28.2 ± 0.5	60.5 ± 0.2	77.3 ± 0.5
Label Smoothing [13]	39.3 ± 0.4	29.9 ± 0.8	33.9 ± 0.3	27.7 ± 0.5	60.1 ± 0.2	76.6 ± 0.5
Multi-horizon [8]	39.6 ± 0.5	30.3 ± 1.0	34.9 ± 0.3	28.6 ± 0.5	60.3 ± 0.1	76.6 ± 0.5
Temporal Label Smoothing	40.6 ± 0.3	32.3 ± 0.7	35.5 ± 0.3	29.3 ± 0.4	60.4 ± 0.2	77.0 ± 0.3
$p$-value ($H_{0}$: TLS > MHP)	0.00	0.00	0.00	0.02	0.15	0.14

Table 2: Timestep-level performance on different early prediction tasks. Recall is reported at a 50% precision. Circulatory and respiratory failure are predicted on the HiB dataset, decompensation on M3B. In bold, we report methods within the confidence interval of the best performing one and statistically significant $p$-values ($< 0.05$) from paired Student’s t-tests [37].

5.1 Prediction performance

Overall, our results highlight that TLS improves performance over other approaches proposed to address the challenges of early clinical prediction. In Table 2, we find TLS to outperform other baselines across all metrics for both circulatory failure and decompensation. Despite overlapping confidence intervals between multi-horizon and TLS on decompensation due to individual training run variability, our method remains statistically superior under a t-test. Full precision-recall curves are given in Figures 4(a) and 13. We discuss the trade-offs and limitations imposed by these custom objectives, as evidenced by the lack of improvement on the respiratory failure task, in Section 5.3.

In contrast, as illustrated in Figure 4, loss reweighting methods designed to tackle class imbalance were found to reduce performance on all tasks relative to traditional cross-entropy. For weighted cross-entropy, we attribute this to the increase in false alarms resulting from the drive to improve recall, which further reduces the already low precision of all models and thus negatively affects the AUPRC (as visualized in Appendix D.3). Focal loss, on the other hand, down-weights confident samples in training, constraining the model to focus on samples with uncertain predictions. In the context of noisy labeling, as is the case close to our class boundary, data points with ambiguous signals cannot be correctly predicted and thus dominate the loss, impeding improvements in other regions of input space. We analyze model performance over time in Section 5.2 to further support this hypothesis.

Figure 4: Performance loss with class reweighting methods, on the validation set for circulatory failure prediction. Weighted cross-entropy corresponds to $\zeta = 0$.

Clinically-relevant performance.

We also compare the full precision-recall curve of models trained with these different objectives in Figure 4(a) – note that we obtain comparable results for decompensation prediction in Appendix D. In addition to visually confirming the numerical results in Table 2, we find that our training objective affords particular performance improvements in the clinically-relevant region corresponding to high precision or low false-alarm rates [1].

(a) Precision-recall curve. Inset shows the clinically-applicable region with precision $> 50$ %.

Event-based analysis.

Finally, as highlighted in Figure 4(b), TLS improves performance in terms of predicting overall adverse event episodes throughout a stay on all prediction tasks. This suggests that performance improvements at the timestep level affect a large number of events and translate to better event detection. Indeed, we demonstrate in Section 5.2 that TLS affords larger performance gains close to the event time, thus leading to a better recall of imminent events.

5.2 Illustrative insights

We propose ablations and analyses to build intuition around our proposed method. In particular, we aim to highlight how temporal smoothing works and why it outperforms other training approaches for early prediction tasks.

Performance over time.

In Figure 6, we compare the performance difference between our method, TLS, and the regular cross-entropy objective over time – previously studied in Figure 1(a). We perform the same analysis in Appendix D for other tasks. As expected, the prediction model trained with TLS is less competitive where label smoothing is strongest, near $t_{e} - h$ , but this performance loss remains minor even with significant smoothing. This result validates our hypothesis that the signal is too noisy in the boundary region for any model to recover the original label distribution. In contrast, away from the label boundary, TLS results in a significant increase in true positive and negative rates. From a clinical perspective, errors made in the boundary region are less critical, as they result in the latest false positives or earliest false negatives. Consequently, TLS not only improves global event prediction performance but allows these gains to occur at more critical times for clinicians.

Empirical comparison to multi-horizon prediction.

In our theoretical discussion in Section 3.3, we demonstrated how MHP is a restriction of label smoothing with a step function $\alpha^{\mathrm{step}}(i, t)$. This claim relies on the constraint to produce a unique prediction across all considered horizons, reflecting the design of our method. We verify the impact of this assumption by measuring performance gains afforded by learning distinct predictions per horizon. As shown in Table 3, with full precision-recall curves in Figure 18, we find no statistical evidence for performance gain over using $\alpha^{\mathrm{step}}$ on all tasks and studied metrics. Thus, models do not appear to leverage this additional flexibility offered by MHP. With superior results on all timestep- and event-based experiments, and greater scalability thanks to the single prediction horizon modeled, we find temporal label smoothing to be a superior training objective to MHP in early prediction tasks.

Method	Circulatory Failure AUPRC	Circulatory Failure Recall	Decompensation AUPRC	Decompensation Recall	Respiratory Failure AUPRC	Respiratory Failure Recall
MHP	39.6 ± 0.5	30.3 ± 1.0	34.9 ± 0.3	28.6 ± 0.5	60.3 ± 0.1	76.6 ± 0.5
TLS ($\alpha^{\mathrm{step}}$)	39.3 ± 0.2	29.4 ± 0.8	35.2 ± 0.3	29.2 ± 0.4	60.5 ± 0.1	77.4 ± 0.5
$p$-value ($H_{0}$)	0.11	0.10	0.95	0.97	0.99	0.98

Table 3: Do MHP’s multiple outputs improve performance over TLS with $q^{\mathrm{step}}$? We provide $p$-values for the paired Student-t test [37] on the null hypothesis $H_{0} : \mu_{\mathrm{MHP}} \geq \mu_{\mathrm{step}}$. With no statistically significant improvements ($p < 0.05$), we justify our assumption in Proposition 1.

5.3 Trade-offs and limitations

Despite the demonstrated advantage of our training paradigm for two distinct early prediction tasks, we observed no performance gain over traditional cross-entropy when predicting respiratory failure on HiB in Table 2. Although no other baseline improved learning on this task either, this observation motivated an analysis of the specific problem settings in which our objective helps.

Respiratory failure events are much more frequent than circulatory failure or decompensation, with the majority of ICU patients undergoing approximately two such events during their stay, as quantified in Appendix B. We hypothesize that this reduced class imbalance leads to sufficient discriminative information within the label boundary region. This hypothesis is supported by the larger performance loss close to $t_{e} - h$ with TLS compared to the other tasks, with a 1% drop in true positive rate (TPR) in Figure 7. However, as expected by design, our method improves recall (+1% TPR) over cross-entropy close to the event. This also leads to a non-negligible 0.5% improvement in event recall, visualized in Appendix D.2. Overall, this analysis reveals that whereas TLS has little impact on global metrics for close-to-balanced tasks, which remain quite rare in clinical decision support efforts [8], it still results in clinically meaningful performance improvements along per-horizon and event-based metrics.

Figure 7: Performance improvement over time for TLS over traditional cross-entropy, for respiratory failure. True positive rates (TPR) are computed for a precision of $0.5$ over 2-hour bins.

6 Conclusion

Early prediction of adverse events is paramount to the development of clinical decision support systems, with a demonstrated potential to improve patient outcomes [3]. Still, this task remains poorly studied in the machine learning literature, with few training solutions tailored to address its challenges. Based on typically rare and noisy labels, models must learn to discriminate a predictive signal in anticipation of events to allow an adequate medical response.

After highlighting the limitations of traditional classification objectives and methods designed to address class imbalance, we propose a novel training framework that leverages trends in event signals over time. We show that multi-horizon prediction, a heuristic used to improve early prediction, can be formalized as a restriction of our framework. Simple but effective, temporal label smoothing empirically matches or outperforms all considered baselines on various tasks and datasets, with significant improvements on clinically-relevant evaluation metrics. Performance gains are limited, as with other baselines, for respiratory failure prediction in which higher event prevalence provides sufficient informative data points for the model to learn through a conventional cross-entropy objective. In further work, we aim to explicitly adapt the temporal inductive bias to the task at hand and to combine temporal label smoothing with recent objectives designed to directly optimize AUPRC, such as minimum precision constraint [38] or dice-based loss functions [39].

Looking ahead, we expect that temporal label smoothing will be leveraged to develop more clinically reliable systems for risk prediction of rare adverse events. Further research on tailored machine learning solutions to improve real-world decision support holds promise for better clinical care and operations management.

References

[1] Reed T Sutton, David Pincock, Daniel C Baumgart, Daniel C Sadowski, Richard N Fedorak, and Karen I Kroeker. An overview of clinical decision support systems: benefits, risks, and strategies for success. NPJ digital medicine, 3(1):1–10, 2020.
[2] Francesca Di Giuseppe, Florian Pappenberger, Fredrik Wetterhall, Blazej Krzeminski, Andrea Camia, Giorgio Libertá, and Jesus San Miguel. The potential predictability of fire danger provided by numerical weather prediction. Journal of Applied Meteorology and Climatology, 55(11):2469–2491, 2016. doi: 10.1175/JAMC-D-15-0297.1. URL https://journals.ametsoc.org/view/journals/apme/55/11/jamc-d-15-0297.1.xml.
[3] Gary B Smith, David R Prytherch, Paul Meredith, Paul E Schmidt, and Peter I Featherstone. The ability of the national early warning score (news) to discriminate patients at risk of early cardiac arrest, unanticipated intensive care unit admission, and death. Resuscitation, 84(4):465–470, 2013.
[4] Anne Pullyblank, Alison Tavaré, Hannah Little, Emma Redfern, Hein le Roux, Matthew Inada-Kim, Kate Cheema, and Adam Cook. Implementation of the national early warning score in patients with suspicion of sepsis: evaluation of a system-wide quality improvement project. British Journal of General Practice, 70(695):e381–e388, 2020.
[5] Andrew Wong, Albert T Young, April S Liang, Ralph Gonzales, Vanja C Douglas, and Dexter Hadley. Development and validation of an electronic health record–based machine learning model to estimate delirium risk in newly hospitalized patients without known cognitive impairment. JAMA network open, 1(4):e181018–e181018, 2018.
[6] Josef Fagerström, Magnus Bång, Daniel Wilhelms, and Michelle S Chew. Lisep lstm: a machine learning algorithm for early detection of septic shock. Scientific reports, 9(1):1–8, 2019.
[7] Stephanie L Hyland, Martin Faltys, Matthias Hüser, Xinrui Lyu, Thomas Gumbsch, Cristóbal Esteban, Christian Bock, Max Horn, Michael Moor, Bastian Rieck, et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nature medicine, 26(3):364–373, 2020.
[8] Nenad Tomašev, Xavier Glorot, Jack W Rae, Michal Zielinski, Harry Askham, Andre Saraiva, Anne Mottram, Clemens Meyer, Suman Ravuri, Ivan Protsyuk, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature, 572(7767):116–119, 2019.
[9] Max Horn, Michael Moor, Christian Bock, Bastian Rieck, and Karsten M. Borgwardt. Set functions for time series. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 4353–4363. PMLR, 2020. URL http://proceedings.mlr.press/v119/horn20a.html.
[10] Satya Narayan Shukla and Benjamin M. Marlin. Multi-time attention networks for irregularly sampled time series. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=4c0J6lwQ4_.
Cvach [2012] Maria Cvach. Monitor alarm fatigue: an integrative review. Biomedical instrumentation & technology, 46(4):268–277, 2012.
Sendelbach and Funk [2013] Sue Sendelbach and Marjorie Funk. Alarm fatigue: a patient safety concern. AACN advanced critical care, 24(4):378–386, 2013.
Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818–2826. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.308. URL https://doi.org/10.1109/CVPR.2016.308.
Kourou et al. [2015] Konstantina Kourou, Themis P. Exarchos, Konstantinos P. Exarchos, Michalis V. Karamouzis, and Dimitrios I. Fotiadis. Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal, 13:8–17, 2015. ISSN 2001-0370. doi: https://doi.org/10.1016/j.csbj.2014.11.005. URL https://www.sciencedirect.com/science/article/pii/S2001037014000464.
Xiao et al. [2019] Jing Xiao, Ruifeng Ding, Xiulin Xu, Haochen Guan, Xinhui Feng, Tao Sun, Sibo Zhu, and Zhibin Ye. Comparison and development of machine learning tools in the prediction of chronic kidney disease progression. Journal of translational medicine, 17(1):1–13, 2019.
Futoma et al. [2017] Joseph Futoma, Sanjay Hariharan, and Katherine A. Heller. Learning to detect sepsis with a multitask gaussian process RNN classifier. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1174–1182. PMLR, 2017. URL http://proceedings.mlr.press/v70/futoma17a.html.
Cox [1972] D. R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society. Series B (Methodological), 34(2):187–220, 1972. ISSN 00359246. URL http://www.jstor.org/stable/2985181.
King and Zeng [2001] Gary King and Langche Zeng. Logistic regression in rare events data. Political analysis, 9(2):137–163, 2001.
Yèche et al. [2021] Hugo Yèche, Rita Kuznetsova, Marc Zimmermann, Matthias Hüser, Xinrui Lyu, Martin Faltys, and Gunnar Rätsch. Hirid-icu-benchmark–a comprehensive machine learning benchmark on high-resolution icu data. arXiv preprint arXiv:2111.08536, 2021.
Lin et al. [2017] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
Leng et al. [2022] Zhaoqi Leng, Mingxing Tan, Chenxi Liu, Ekin Dogus Cubuk, Xiaojie Shi, Shuyang Cheng, and Dragomir Anguelov. Polyloss: A polynomial expansion perspective of classification loss functions. CoRR, abs/2204.12511, 2022. doi: 10.48550/arXiv.2204.12511. URL https://doi.org/10.48550/arXiv.2204.12511.
Roy et al. [2022] Subhrajit Roy, Diana Mincu, Lev Proleev, Negar Rostamzadeh, Chintan Ghate, Natalie Harris, Christina Chen, Jessica Schrouff, Nenad Tomašev, Fletcher Lee Hartsell, et al. Disability prediction in multiple sclerosis using performance outcome measures and demographic data. In Conference on Health, Inference, and Learning, pages 375–396. PMLR, 2022.
Tomašev et al. [2021] Nenad Tomašev, Natalie Harris, Sebastien Baur, Anne Mottram, Xavier Glorot, Jack W Rae, Michal Zielinski, Harry Askham, Andre Saraiva, Valerio Magliulo, et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nature Protocols, 16(6):2765–2787, 2021.
Roy et al. [2021] Subhrajit Roy, Diana Mincu, Eric Loreaux, Anne Mottram, Ivan Protsyuk, Natalie Harris, Yuan Xue, Jessica Schrouff, Hugh Montgomery, Alistair Connell, et al. Multitask prediction of organ dysfunction in the intensive care unit using sequential subnetwork routing. Journal of the American Medical Informatics Association, 28(9):1936–1946, 2021.
Müller et al. [2019] Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. When does label smoothing help? In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 4696–4705, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/f1748d6b0fd9d439f71450117eba2725-Abstract.html.
Lukasik et al. [2020] Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, and Sanjiv Kumar. Does label smoothing mitigate label noise? In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 6448–6458. PMLR, 2020. URL http://proceedings.mlr.press/v119/lukasik20a.html.
Li et al. [2020] Weizhi Li, Gautam Dasarathy, and Visar Berisha. Regularization via structural label smoothing. In Silvia Chiappa and Roberto Calandra, editors, The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pages 1453–1463. PMLR, 2020. URL http://proceedings.mlr.press/v108/li20e.html.
Meister et al. [2020] Clara Meister, Elizabeth Salesky, and Ryan Cotterell. Generalized entropy regularization or: There’s nothing special about label smoothing. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 6870–6886. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.615. URL https://doi.org/10.18653/v1/2020.acl-main.615.
Lienen and Hüllermeier [2021] Julian Lienen and Eyke Hüllermeier. From label smoothing to label relaxation. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 8583–8591. AAAI Press, 2021. URL https://ojs.aaai.org/index.php/AAAI/article/view/17041.
Bellamy et al. [2020] David Bellamy, Leo Celi, and Andrew L Beam. Evaluating progress on machine learning for longitudinal electronic healthcare data. arXiv preprint arXiv:2010.01149, 2020.
Harutyunyan et al. [2019] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific data, 6(1):1–18, 2019.
Johnson et al. [2016] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
Chung et al. [2014] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. URL http://arxiv.org/abs/1412.3555.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Saito and Rehmsmeier [2015] Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3):e0118432, 2015.
Student [1908] Student. The probable error of a mean. Biometrika, pages 1–25, 1908.
Rath and Hughes [2022] Preetish Rath and Michael Hughes. Optimizing early warning classifiers to control false alarms via a minimum precision constraint. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 4895–4914. PMLR, 28–30 Mar 2022. URL https://proceedings.mlr.press/v151/rath22a.html.
Yeung et al. [2022] Michael Yeung, Evis Sala, Carola-Bibiane Schönlieb, and Leonardo Rundo. Unified focal loss: Generalising dice and cross entropy-based losses to handle class imbalanced medical image segmentation. Computerized Medical Imaging and Graphics, 95:102026, 2022.
Richards [1959] FJ Richards. A flexible growth function for empirical use. Journal of experimental Botany, 10(2):290–301, 1959.
Collett [2015] David Collett. Modelling survival data in medical research. CRC press, 2015.
Jarrett et al. [2019] Daniel Jarrett, Jinsung Yoon, and Mihaela van der Schaar. Dynamic prediction in clinical survival analysis using temporal convolutional networks. IEEE journal of biomedical and health informatics, 24(2):424–436, 2019.
Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
NVIDIA et al. [2020] NVIDIA, Péter Vingelmann, and Frank H.P. Fitzek. Cuda, release: 10.2.89, 2020. URL https://developer.nvidia.com/cuda-toolkit.
Chetlur et al. [2014] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
gin-config Team [2019] The gin-config Team. gin-config python packaged. https://github.com/google/gin-config, 2019.

Appendix A Theoretical details

A.1 Multi-horizon prediction: proof of Proposition 1

Equivalency between MHP and TLS objectives.

Recalling the formalism of multi-horizon prediction outlined in Section 3.3, true labels and model predictions can be rewritten as $y_{i, t} = [y_{i, t}^{h_{1}}, \dots, y_{i, t}^{h}, \dots, y_{i, t}^{h_{H}}]$ and $\hat{y}_{i, t} = [\hat{y}_{i, t}^{h_{1}}, \dots, \hat{y}_{i, t}^{h}, \dots, \hat{y}_{i, t}^{h_{H}}]$ , where $H$ is the number of horizons considered. The training objective for patient $i$ becomes:

L^{M H P} (y_{i, t}, \hat{y}_{i, t}) = - \frac{1}{H} \sum_{k = 1}^{H} [y_{i, t}^{h_{k}} \log (\hat{y}_{i, t}^{h_{k}}) + (1 - y_{i, t}^{h_{k}}) \log (1 - \hat{y}_{i, t}^{h_{k}})]

Assuming that the predictions $\hat{y}_{i, t}^{h_{k}}$ are equal for all $k$ , the objective can be rewritten as follows:

L^{M H P} (y_{i, t}, \hat{y}_{i, t}) = - [\log (\hat{y}_{i, t}) \frac{1}{H} \sum_{k = 1}^{H} y_{i, t}^{h_{k}} + \log (1 - \hat{y}_{i, t}) \frac{1}{H} \sum_{k = 1}^{H} (1 - y_{i, t}^{h_{k}})]

with $\hat{y}_{i, t}$ being the common prediction shared across all horizons. This equation can now be viewed as a temporal label smoothing objective with smoothed labels $q^{s t e p} (1 | i, t) = \frac{1}{H} \sum_{k = 1}^{H} y_{i, t}^{h_{k}}$ :

L^{M H P} (y_{i, t}, \hat{y}_{i, t}) = - [\log (\hat{y}_{i, t}) \cdot q^{s t e p} (1 | i, t) + \log (1 - \hat{y}_{i, t}) \cdot (1 - q^{s t e p} (1 | i, t))]
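As a quick numerical sanity check of this identity, the following minimal Python sketch compares the average of per-horizon cross-entropy terms under a shared prediction with a single cross-entropy term computed against the averaged label. The function names and the example values are hypothetical and chosen purely for illustration.

```python
import math

def bce(label, pred):
    """Binary cross-entropy for one (possibly soft) label and one prediction."""
    return -(label * math.log(pred) + (1.0 - label) * math.log(1.0 - pred))

def mhp_loss_shared(labels, pred):
    """Average BCE over horizons when every horizon shares the prediction `pred`."""
    return sum(bce(y, pred) for y in labels) / len(labels)

def tls_loss(labels, pred):
    """BCE against the smoothed label q = mean of the per-horizon labels."""
    q = sum(labels) / len(labels)
    return bce(q, pred)

# Hypothetical example: H = 4 ascending horizons, event within the last two.
labels = [0, 0, 1, 1]
pred = 0.3
assert math.isclose(mhp_loss_shared(labels, pred), tls_loss(labels, pred))
```

The equality holds for any shared prediction in (0, 1), since the cross-entropy is linear in the label.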

Smoothing parametrization.

Next, we aim to recover the explicit form of $q^{s t e p} (1 | i, t)$ . Without loss of generality, we assume that horizons $\{h_{k}\}_{k}$ are in ascending order. The temporal dependency between samples, formalized in Equation 2, results in the following relationship between labels at horizons $h_{u}$ and $h_{v}$ :

	$v \leq u \text{ and } y_{i, t}^{h_{v}} = 1 \implies y_{i, t}^{h_{u}} = 1$		(7)
	$v \geq u \text{ and } y_{i, t}^{h_{v}} = 0 \implies y_{i, t}^{h_{u}} = 0$		(8)

Thanks to the above property, we can determine $q^{s t e p} (1 | i, t)$ by studying three cases of multi-horizon labels, illustrated in Figure 8. For notational simplicity, we define $d_{e} (i, t) = t_{e} (i, t) - t$ .

Figure 8: Label values for multi-horizon prediction, and conversion to smoothed labels $q^{s t e p} (1 | t)$ .

Case 1: $d_{e} (i, t) \leq h_{1}$ .
The label definition in Equation 1 implies that $y_{i, t}^{h_{1}} = 1$ if $d_{e} (i, t) \leq h_{1}$ . As $h_{1}$ is the smallest horizon, following Equation 7, we have $y_{i, t}^{h_{c}} = 1, \forall c \in ⟦ 1, H ⟧$ . We can rewrite the objective as:

	$L^{M H P} (y_{i, t}, \hat{y}_{i, t})$	$= - \log (\hat{y}_{i, t})$
		$= - [q^{s t e p} (1 | i, t) \log (\hat{y}_{i, t}) + (1 - q^{s t e p} (1 | i, t)) \log (1 - \hat{y}_{i, t})]$

where $q^{s t e p} (1 | i, t) = 1$ .

Case 2: $d_{e} (i, t) > h_{H}$ .
Similarly, if $d_{e} (i, t) > h_{H}$ , then $y_{i, t}^{h_{H}} = 0$ which implies $y_{i, t}^{h_{c}} = 0, \forall c \in ⟦ 1, H ⟧$ from Equation 8. The objective can be rewritten as:

	$L^{M H P} (y_{i, t}, \hat{y}_{i, t})$	$= - \log (1 - \hat{y}_{i, t})$
		$= - [q^{s t e p} (1 | i, t) \log (\hat{y}_{i, t}) + (1 - q^{s t e p} (1 | i, t)) \log (1 - \hat{y}_{i, t})]$

where $q^{s t e p} (1 | i, t) = 0$ .

Case 3: $\exists k \in ⟦ 1, H - 1 ⟧ \text{ s.t. } h_{k} < d_{e} (i, t) \leq h_{k + 1}$ .
Following the same reasoning as in the first two cases, we now have a specific index $k$ which separates negative and positive labels. We have $y_{i, t}^{h_{c}} = 0, \forall c \in ⟦ 1, k ⟧$ and $y_{i, t}^{h_{c}} = 1, \forall c \in ⟦ k + 1, H ⟧$ . This allows us to rewrite the objective as follows:

	$L^{M H P} (y_{i, t}, \hat{y}_{i, t})$	$= - [\frac{H - k}{H} \log (\hat{y}_{i, t}) + \frac{k}{H} \log (1 - \hat{y}_{i, t})]$
		$= - [q^{s t e p} (1 | i, t) \log (\hat{y}_{i, t}) + (1 - q^{s t e p} (1 | i, t)) \log (1 - \hat{y}_{i, t})]$

where

q^{s t e p} (1 | i, t) = \frac{H - k}{H} .

Defining a new smoothing parametrization $α^{s t e p}$ such that $q^{s t e p} (1 | i, t) = 1 - α^{s t e p} (i, t)$ , we obtain:

α^{s t e p} (i, t) = \begin{cases} \frac{k}{H} & \text{if } h_{k} < d_{e} (i, t) \leq h_{k + 1}, \; k \in ⟦ 1, H - 1 ⟧ \\ 0 & \text{if } d_{e} (i, t) \leq h_{1} \\ 1 & \text{if } d_{e} (i, t) > h_{H} \end{cases}

Thus, $\forall d_{e} (i, t) > 0$ , we find that $L_{i}^{M H P} = L_{i}^{T L S}$ when smoothed labels are defined as $q^{s t e p} (1 | i, t) = 1 - α^{s t e p} (i, t)$ . This concludes our proof. ∎
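The three cases can be checked with a short Python sketch. It builds the multi-horizon labels for a given time-to-event $d_{e}$, using the convention from Case 1 that a label is positive when $d_{e} \leq h_{k}$, and compares their mean with $1 - α^{s t e p}$. The horizon values and function names are assumptions made for illustration only.

```python
def multi_horizon_labels(d_e, horizons):
    """Labels y^{h_k}: 1 if the event occurs within horizon h_k, i.e. d_e <= h_k."""
    return [1 if d_e <= h else 0 for h in horizons]

def alpha_step(d_e, horizons):
    """Step smoothing strength: fraction of horizons strictly shorter than d_e."""
    k = sum(1 for h in horizons if h < d_e)
    return k / len(horizons)

horizons = [2, 4, 6, 8]  # hypothetical ascending horizons, in hours
for d_e in [1, 2, 3, 6, 7, 8, 9]:
    labels = multi_horizon_labels(d_e, horizons)
    q_step = sum(labels) / len(labels)
    # The smoothed label is the complement of the step smoothing strength.
    assert abs(q_step - (1 - alpha_step(d_e, horizons))) < 1e-12
```

Counting the horizons strictly below $d_{e}$ covers all three cases at once: the count is 0 in Case 1, $H$ in Case 2 and $k$ in Case 3.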

A.2 Temporal label smoothing functions

Motivated by prior work [Tomašev et al., 2019, Cox, 1972], we compare the performance of various smoothing functions $α (i, t)$ . All proposed parametrizations are continuous and monotonically decreasing functions which satisfy boundary conditions $α (i, t_{e} (i, t) - 2 h) = 1$ and $α (i, t_{e} (i, t)) = 0$ . As shown in Table 4, we find that exponential label smoothing performs best or as well as the alternatives across all tasks and metrics. Performance as a function of hyperparameter setting is shown in Figure 10. All model and hyperparameter selection was carried out on the validation set, including the final choice of parametrization function.

Task	Circulatory Failure		Decompensation		Respiratory Failure
Method	AUPRC	Recall	AUPRC	Recall	AUPRC	Recall
$α^{s t e p}$	39.3 $\pm$ 0.2	29.4 $\pm$ 0.8	35.2 $\pm$ 0.3	29.2 $\pm$ 0.4	60.5 $\pm$ 0.1	77.4 $\pm$ 0.5
$α^{s h i f t}$	40.1 $\pm$ 0.3	31.8 $\pm$ 0.6	34.5 $\pm$ 0.4	28.2 $\pm$ 0.5	60.5 $\pm$ 0.2	77.3 $\pm$ 0.5
$α^{l i n e a r}$	39.4 $\pm$ 0.3	29.7 $\pm$ 0.8	35.1 $\pm$ 0.4	29.2 $\pm$ 0.6	60.3 $\pm$ 0.3	77.0 $\pm$ 0.6
$α^{s i g m o i d}$	39.4 $\pm$ 0.3	29.7 $\pm$ 0.8	34.9 $\pm$ 0.4	28.8 $\pm$ 0.5	60.6 $\pm$ 0.2	77.3 $\pm$ 0.5
$α^{c o n c a v e}$	39.4 $\pm$ 0.3	29.7 $\pm$ 0.8	35.1 $\pm$ 0.4	29.2 $\pm$ 0.6	60.3 $\pm$ 0.3	77.0 $\pm$ 0.6
$α^{e x p}$	40.6 $\pm$ 0.3	32.3 $\pm$ 0.7	35.5 $\pm$ 0.3	29.3 $\pm$ 0.4	60.4 $\pm$ 0.2	77.0 $\pm$ 0.3

Table 4: Performance of different smoothing functions on early prediction tasks. Recall is reported at 50% precision.

Shifted boundary labels.

Shifting the prediction horizon or label boundary in training can be viewed as a form of temporal label smoothing, in which class labels are inverted within a prediction window of interest. This defines the following smoothing parametrization $α^{s h i f t} (i, t)$ :

α^{s h i f t} (i, t) = \mathbb{1} [t_{e} (i, t) - t \geq h_{s h i f t}]

(9)

where $h_{s h i f t}$ is a hyperparameter controlling the horizon of the smoothed labels ( $h_{s h i f t} = h$ corresponds to cross-entropy training). The strength of this smoothing function is illustrated in Figure 8(a).

Figure 10 outlines the performance of this alternative smoothing parametrization as a function of $h_{s h i f t}$ . For both decompensation and respiratory failure, shifting the label boundary closer to the event time decreases performance. On circulatory failure, performance does improve over traditional cross-entropy training as the label horizon is brought closer to the event of interest, which can be interpreted as an inductive bias similar to that induced by the exponential smoothing function.

Linear label smoothing.

The most straightforward extension to the step function $α^{s t e p}$ described in Section 3.3 is a linear label smoothing corresponding to the case $H \to + \infty$ .
Our parametrization $α^{l i n e a r} (i, t)$ is thus defined as follows:

α^{l i n e a r} (i, t) = \begin{cases} \frac{t_{e} (i, t) - t}{2 h} & \text{if } t_{e} (i, t) - t < 2 h \\ 1 & \text{if } t_{e} (i, t) - t \geq 2 h \end{cases}

(10)

We illustrate the impact of the number of steps $H$ in Figure 8(b).
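To make the limit $H \to + \infty$ concrete, the sketch below compares the step function with the linear parametrization for a large number of horizons. It assumes horizons evenly spaced over $(0, 2 h]$, which is our reading of the setup rather than a stated detail, and all names and values are illustrative.

```python
def alpha_linear(d_e, h):
    """Linear smoothing strength for time-to-event d_e and prediction horizon h."""
    return d_e / (2 * h) if d_e < 2 * h else 1.0

def alpha_step_uniform(d_e, h, num_horizons):
    """Step smoothing with horizons evenly spaced on (0, 2h] (an assumption)."""
    horizons = [2 * h * (k + 1) / num_horizons for k in range(num_horizons)]
    return sum(1 for h_k in horizons if h_k < d_e) / num_horizons

h = 12.0  # hypothetical prediction horizon, in hours
for d_e in [0.5, 6.0, 12.0, 23.9, 30.0]:
    gap = abs(alpha_step_uniform(d_e, h, 10_000) - alpha_linear(d_e, h))
    # The step function differs from the linear one by roughly 1 / H at most.
    assert gap <= 2 / 10_000
```

Under this spacing, the gap between the two parametrizations shrinks at rate $1 / H$.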

Sigmoidal label smoothing.

Another natural direction to explore is to smooth labels starting from the true distribution, a single step function at $t = t_{e} (i, t) - h$ . This can be achieved by defining $α (i, t)$ as a generalized logistic function [Richards, 1959]:

α^{s i g m o i d} (i, t) = \begin{cases} 1 - \frac{K - A}{1 + e^{\frac{t_{e} (i, t) - t - d}{γ}}} - A & \text{if } t_{e} (i, t) - t < 2 h \\ 1 & \text{if } t_{e} (i, t) - t \geq 2 h \end{cases}

(11)

where $K$ , $A$ and $d$ are three constants fixed by imposing the boundary conditions at $t = t_{e} (i, t) - 2 h$ and $t = t_{e} (i, t)$ , as well as the midpoint condition $α (i, t_{e} (i, t) - h) = \frac{1}{2}$ . This yields:

	$K$	$= - A e^{\frac{2 h - d}{γ}}$
	$A$	$= \frac{e^{\frac{- d}{γ}} + 1}{e^{\frac{- d}{γ}} - e^{\frac{2 h - d}{γ}}}$
	$d$	$= h$

As shown in Figure 8(d), $γ$ controls the smoothing strength, interpolating between the true distribution $δ_{y_{i} = 1}$ as $γ \to 0$ and $q^{l i n e a r}$ when $γ \to + \infty$ .
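The constants above can be verified numerically. The following sketch evaluates the sigmoidal parametrization and checks that it vanishes at the event time, equals $\frac{1}{2}$ at $t_{e} (i, t) - t = h$ and approaches 1 at $t_{e} (i, t) - t = 2 h$. The values of $h$ and $γ$ are arbitrary illustrative choices, and very small $γ$ would overflow the exponential in this naive form.

```python
import math

def sigmoid_constants(h, gamma):
    """Constants (K, A, d) obtained from the boundary conditions."""
    d = h
    e_d = math.exp(-d / gamma)
    e_2h = math.exp((2 * h - d) / gamma)
    A = (e_d + 1) / (e_d - e_2h)
    K = -A * e_2h
    return K, A, d

def alpha_sigmoid(d_e, h, gamma):
    """Sigmoidal smoothing strength for time-to-event d_e = t_e - t."""
    if d_e >= 2 * h:
        return 1.0
    K, A, d = sigmoid_constants(h, gamma)
    return 1 - (K - A) / (1 + math.exp((d_e - d) / gamma)) - A

h, gamma = 12.0, 3.0  # hypothetical horizon (hours) and smoothing strength
assert math.isclose(alpha_sigmoid(0.0, h, gamma), 0.0, abs_tol=1e-12)
assert math.isclose(alpha_sigmoid(h, h, gamma), 0.5)
assert math.isclose(alpha_sigmoid(2 * h - 1e-9, h, gamma), 1.0, abs_tol=1e-6)
```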

Exponential label smoothing.

The smoothing function we find to perform best is an exponential decay. This idea is motivated by survival analysis, where patient survival probability can be modeled as the exponential decay of a cumulative hazard function [Cox, 1972, Collett, 2015]. As introduced in Section 3.2, our exponential smoothing function $α^{e x p} (i, t)$ is defined as follows:

α^{e x p} (i, t) = \begin{cases} 1 - e^{- γ (t_{e} (i, t) - t - d)} - A & \text{if } t_{e} (i, t) - t < 2 h \\ 1 & \text{if } t_{e} (i, t) - t \geq 2 h \end{cases}

(12)

where parameters $\{d, A\}$ are set to satisfy boundary conditions:

	$A$	$= - e^{- γ (2 h - d)}$
	$d$	$= - \frac{1}{γ} ln (1 - e^{- γ 2 h})$

Here, $γ$ also controls the smoothing strength between $q^{l i n e a r}$ when $γ \to 0$ and $q (t) = 0 \forall t < t_{e}$ when $γ \to + \infty$ .
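As an illustration, the sketch below computes $d$ and $A$ from the expressions above and checks the two boundary conditions, together with the near-linear behaviour for small $γ$. The function names and the values of $h$ and $γ$ are hypothetical, and this is a plain scalar sketch rather than a training-ready implementation.

```python
import math

def exp_constants(h, gamma):
    """Constants (d, A) set by alpha(t_e) = 0 and alpha(t_e - 2h) = 1."""
    d = -math.log(1 - math.exp(-gamma * 2 * h)) / gamma
    A = -math.exp(-gamma * (2 * h - d))
    return d, A

def alpha_exp(d_e, h, gamma):
    """Exponential smoothing strength for time-to-event d_e = t_e - t."""
    if d_e >= 2 * h:
        return 1.0
    d, A = exp_constants(h, gamma)
    return 1 - math.exp(-gamma * (d_e - d)) - A

h, gamma = 12.0, 0.2  # hypothetical horizon (hours) and smoothing strength
assert math.isclose(alpha_exp(0.0, h, gamma), 0.0, abs_tol=1e-12)
assert math.isclose(alpha_exp(2 * h - 1e-9, h, gamma), 1.0, abs_tol=1e-6)
# For small gamma the function is close to linear: alpha(h) is near 1/2.
assert abs(alpha_exp(h, h, 1e-4) - 0.5) < 1e-3
```

The smoothed label for a sample is then $1 - α^{e x p}$, which decays towards 0 as the time-to-event grows.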

Overall, despite $α^{s i g m o i d}$ and $α^{s h i f t}$ achieving good results on respiratory and circulatory failure respectively, $α^{e x p}$ statistically outperforms these smoothing parametrizations across all tasks on validation metrics. An interesting avenue for further work would be to combine exponential smoothing with the boundary shift approach, effectively changing $(h_{m i n}, h_{m a x})$ , which were fixed to $(0, 2 h)$ in our work for fair comparison to multi-horizon prediction.

Concave exponential label smoothing.

Finally, to mirror the behavior of the exponential smoothing function away from linear interpolation and investigate its effect on performance, we designed the following concave smoothing function $α^{c o n c a v e}$ :

α^{c o n c a v e} (i, t) = \begin{cases} e^{- γ (d - t_{e} (i, t) + t)} - A & \text{if } t_{e} (i, t) - t < 2 h \\ 1 & \text{if } t_{e} (i, t) - t \geq 2 h \end{cases}

(13)

Parameters $\{d, A\}$ are identical to the convex smoothing function parameters, set to satisfy boundary conditions. The strength of this concave smoothing function is illustrated in Figure 8(c).

No performance gains were obtained through temporal label smoothing with a concave function, as shown in Figure 10. This smoothing function effectively penalizes false positives more heavily than false negatives, which is less suited to our tasks of interest (in contrast to the convex $α^{e x p}$ ). As a result, the best-performing concave parametrization is consistently obtained with the lowest value of $γ$ , which is closest to a linear function.

Comparison to survival analysis.

Survival analysis comprises statistical methods concerned with predicting the probability of a certain event taking place over time [Collett, 2015]. In our formalism outlined in Section 3.1, the corresponding task is to regress the time of the next event, $t_{e} (i, t)$ , based on patient information accumulated up to time $t$ . To recover early event prediction, a threshold on the hazard model can then be applied to determine whether an event will happen within our horizon of interest $h$ . Modelling constraints imposed in survival analysis improve time-to-event prediction performance over traditional regression methods, which supports our approach of leveraging the temporal structure of our comparable task. Interestingly, recent work in survival modelling has addressed dynamic prediction through multi-horizon prediction [Jarrett et al., 2019].

Still, distinctions must be highlighted between our adverse event prediction problem and the typical experimental setup for survival analysis: in our case, multiple events can occur over the course of a patient’s stay, with unknown patient states during and immediately after event occurrence. This results in complex, informative censoring patterns and challenges common assumptions in survival analysis, which can therefore not be directly applied to our task.

A.3 Baseline objective functions

In this section, we clarify the mathematical formalism behind our baselines to facilitate comparison to temporal label smoothing. All baselines explored effectively propose a modification of the cross-entropy objective often used for binary classification tasks, $L_{i} = L^{C E} (y_{i}, \hat{y}_{i})$ .

Balanced cross-entropy.

To facilitate learning from highly imbalanced datasets, balanced cross-entropy relies on reweighting samples based on their class prevalence, as follows:

L^{C E} = \frac{1}{N} \sum_{i = 1}^{N} ω_{y_{i}} L (y_{i}, \hat{y}_{i})

(14)

where $C$ is the number of classes, $ω_{y_{i}} = \frac{1}{C \cdot b (y_{i})}$ and $b (c)$ defines the prevalence of class $c$ such that $\sum_{c} b (c) = 1$ . Regular cross-entropy corresponds to the case where $b (c) = \frac{1}{C}$ for all classes. In the binary setting, $b (1)$ can be treated as a hyperparameter determining the contribution of the minority class to the loss.
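A minimal sketch of this reweighting for the binary case is given below. The prevalence values, function names and example predictions are hypothetical, and the per-sample loss is taken to be the standard binary cross-entropy.

```python
import math

def class_weights(prevalence):
    """Weights w_c = 1 / (C * b(c)) from class prevalences b(c) summing to 1."""
    C = len(prevalence)
    return [1.0 / (C * b) for b in prevalence]

def balanced_cross_entropy(labels, preds, prevalence):
    """Mean of per-sample binary cross-entropy terms, reweighted by class."""
    w = class_weights(prevalence)
    total = 0.0
    for y, p in zip(labels, preds):
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += w[y] * loss
    return total / len(labels)

# With uniform prevalence b(c) = 1 / C, all weights equal 1 (regular cross-entropy).
assert class_weights([0.5, 0.5]) == [1.0, 1.0]
# With a rare positive class (hypothetical 5% prevalence), positives are up-weighted.
w_neg, w_pos = class_weights([0.95, 0.05])
assert w_pos > w_neg
```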

Focal loss.

Denoting our output prediction as $\hat{y}_{i} = p_{θ} (y_{i} = 1)$ , the focal loss objective for binary classification of target $y_{i}$ is a variant of the balanced cross-entropy loss:

L^{f o c a l} (y_{i}, \hat{y}_{i}) = - ω_{1} (1 - \hat{y}_{i})^{ζ} y_{i} \log (\hat{y}_{i}) - ω_{0} \hat{y}_{i}^{ζ} (1 - y_{i}) \log (1 - \hat{y}_{i})

where $ω_{y_{i}}$ is a balancing weight for class $y_{i}$ and $ζ$ is the focal loss weight.
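For a single sample, the focal loss can be sketched as follows. The default weights and the example values are illustrative assumptions. Setting $ζ = 0$ with unit weights recovers the usual cross-entropy, while larger $ζ$ down-weights samples that are already well classified.

```python
import math

def focal_loss(y, y_hat, zeta, w1=1.0, w0=1.0):
    """Binary focal loss for one hard label y in {0, 1} and prediction y_hat in (0, 1)."""
    pos = -w1 * (1.0 - y_hat) ** zeta * y * math.log(y_hat)
    neg = -w0 * y_hat ** zeta * (1 - y) * math.log(1.0 - y_hat)
    return pos + neg

# zeta = 0 with unit weights reduces to binary cross-entropy.
assert math.isclose(focal_loss(1, 0.9, 0.0), -math.log(0.9))
# A confident correct prediction contributes less than an uncertain one.
assert focal_loss(1, 0.9, 2.0) < focal_loss(1, 0.6, 2.0)
```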