Oreum Industries Internal Project, 2024Q3
000_Overview.ipynb¶
Oreum Case Studies - Lung Cancer Survival Regression `oreum_cs_lung`¶
Demonstrate Survival Regression Modelling using Bayesian inference and a Bayesian workflow, specifically using the `pymc` & `arviz` ecosystem.
Here we report a brief overview of a full case study `oreum_cs_lung` in which we demonstrate an E2E workflow for survival regression models of increasing sophistication.
This overview is for quick discussion purposes only, and ideally should accompany a deeper technical walkthrough of the case study in a long-form style. There we evaluate the behaviour and performance of the models throughout the workflows, including several state-of-the-art methods unavailable to conventional max-likelihood / machine-learning models.
Oreum Industries: Technical Resources
We use a complicated, real-world dataset: the NCCTG Lung Cancer Dataset `lung` from the `survival` R package, via `statsmodels`.
This dataset requires multiple advanced modelling methods to handle and mitigate bad / messy data.
This case study also inspired us to create two original, novel contributions to the `pymc` project:
- A detailed worked example on handling ordinals in `pymc-examples`: GLM-ordinal-features
- A detailed worked example on handling missing data in `pymc-examples`: GLM-missing-values-in-covariates
See comprehensive explanations and deep technical demonstrations of various survival models, including Accelerated Failure Time and Piecewise Regression models, in our public project Oreum Survival.
Preamble¶
What is Survival Regression?¶
See comprehensive explanations and deep technical demonstrations in our public project Oreum Survival
Essentially, we seek to create principled models that provide explanatory inference and predictions of the Survival Function ^S(t) and Expected Time-to-Event ^Et with quantified uncertainty to support real-world decision-making.
Illustrative Survival Function S(t) and Expected Time-to-Event Et¶
Taken from notebook `oreum_cs_lung` `300_ModelA_E2E.ipynb` and shown here just to give the reader a feel for the outputs and the problem-space. All estimates are made with quantified uncertainty, which reveals (rather than hides) the inherent noise in real-world data and processes.
We gain a massive advantage by using a Bayesian Framework¶
We specifically use Bayesian Inference rather than Frequentist Max-Likelihood methods for many reasons, including:
Desirable Trait ↓ | Bayesian Inference | Frequentist Max-Likelihood
---|---|---
General formulation | Bayes' Rule: $\underbrace{P(\hat{H} \mid D)}_{\text{posterior}} = \dfrac{\overbrace{P(D \mid H)}^{\text{likelihood}} \cdot \overbrace{P(H)}^{\text{prior}}}{\underbrace{P(D)}_{\text{evidence}}}$ | MLE: $\hat{H}_{\text{MLE}} \propto \underset{H}{\arg\max}\, P(D \mid H)$
Principled model structure represents hypothesis about the data-generating process | **Very strong** Can build bespoke, arbitrary, and hierarchical structures of parameters to map to the real-world data-generating process. | **Weak** Can only state structure under strict, limited assumptions of model statistical validity.
Model parameters and their initial values represent domain expert knowledge | **Very strong** Marginal prior distributions represent the real-world probability of parameter values before any data is seen. | **Very weak** No concept of priors. Lack of a joint probability distribution can lead to discontinuities in parameter values.
Robust parameter fitting process | **Strong** Estimate the full joint posterior probability mass distribution of the parameters - more stable and representative of the expectation for the parameter values. Sampling can be computationally expensive. | **Weak** Estimate a single-point maximum a-posteriori likelihood (density) of the parameters - this can sit far outside the probability mass and so is prone to overfitting, and is only correct in the limit of infinite data. But the optimization method can be computationally cheap.
Fitted parameters have meaningful summary statistics for inference | **Very strong** Full marginal probability distributions can be interpreted exactly as probabilities. | **Weak** Point estimates only have meaningful summary statistics under strict, limited assumptions of model statistical validity.
Robust model evaluation process | **Strong** Use the entire dataset, evaluate via Leave-One-Out Cross-Validation (the best theoretically possible). | **Weak** Cross-validation is rarely seen in practice, and even when used, is rarely better than 5-fold CV. The simplistic method can be computationally cheap.
Predictions made with quantified variance | **Very strong** Predictions are made using full posterior probability distributions, so predictions have full empirical probability distributions. | **Weak** Predictions using point estimates can be bootstrapped, but the predictions only have an interpretation under strict, limited assumptions of model validity.
Handle imbalanced, high-cardinality & hierarchical factor features | **Very strong** Can introduce partial-pooling to automatically balance factors through hierarchical priors. | **Weak** Difficult to introduce partial-pooling (aka mixed random effects) without affecting the strict, limited assumptions of model validity.
Handle skewed / multimodal / extreme-value target variable | **Very strong** Represent the model likelihood as any arbitrary probability distribution, including mixture (compound) functions, e.g. a zero-inflated Weibull. | **Weak** Represent the model likelihood with a usually very limited set of distributions. Very difficult to create mixture (compound) functions.
Handle small datasets | **Very strong** The Bayesian concept assumes that there is a probable range of values for each parameter, and that we evidence our prior on any amount of data (even very small counts). | **Very weak** The Frequentist concept assumes that there is a single true value for each parameter and that we only discover that value in the limit (of infinite observations).
Automatically impute missing data | **Very strong** Establish a prior for each datapoint, evidence on the available data within the context of the model, to automatically impute missing values. | **Very weak** No inherent method. Usually impute as a pre-processing step with weak, non-modelled methods.
Practical Implementations of Bayesian Inference¶
We briefly referenced Bayes' Rule above, which is a useful mnemonic when discussing Bayesian Inference, but the crux of putting these advanced statistical techniques into practice is estimating the evidence P(D), i.e. the probability of observing the data that we use to evidence the model:
$$\underbrace{P(\hat{H} \mid D)}_{\text{posterior}} = \frac{\overbrace{P(D \mid H)}^{\text{likelihood}} \cdot \overbrace{P(H)}^{\text{prior}}}{\underbrace{P(D)}_{\text{evidence}}} \quad \text{...where: } P(D) \sim \int_{\Theta} P(D, \theta)\, \mathrm{d}\theta$$
This joint probability P(D,θ) of data D and parameters θ requires an almost impossible-to-solve integral over parameter-space Θ. Rather than attempt to calculate that integral, we do something that sounds far more difficult, but given modern computing capabilities is actually practical.
We use a Bleeding-edge MCMC Toolkit for Bayesian Inference: pymc & arviz¶
We use Markov Chain Monte-Carlo (MCMC) sampling to take a series of ergodic, partly-reversible, partly-randomised samples of model parameters θ. At each step we compute the ratio of log-likelihoods logP(D|H) between the current position θp0 and a proposed "sampled" position θp in parameter space, and accept or reject the proposal according to that ratio, so as to explore the parameter space in proportion to its posterior probability.
This results in a posterior estimate P(^θ|D):
$$P(\hat{\theta} \mid D) \sim \frac{\overbrace{P(D \mid \theta^{p})}^{\text{likelihood @ proposal}} \cdot \overbrace{P(\theta^{p})}^{\text{prior @ proposal}}}{\underbrace{P(D \mid \theta^{p0})}_{\text{likelihood @ current}} \cdot \underbrace{P(\theta^{p0})}_{\text{prior @ current}}}$$
This is the heart of MCMC sampling: for detailed practical explanations see Betancourt, 2021 and Wiecki, 2015.
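To make the accept / reject ratio concrete, here is a minimal, hypothetical random-walk Metropolis sketch on a toy 1D posterior (a Normal likelihood with a Normal prior). It is illustrative only; in practice we rely on the far more efficient NUTS sampler in `pymc`.

```python
import numpy as np

# toy data for a 1D posterior: Normal(theta, 1) likelihood, Normal(0, 1) prior
rng = np.random.default_rng(0)
data = rng.normal(1.5, 1.0, size=50)

def log_post(theta):
    log_prior = -0.5 * theta**2                      # Normal(0, 1) prior
    log_like = -0.5 * np.sum((data - theta) ** 2)    # Normal(theta, 1) likelihood
    return log_prior + log_like

theta = 0.0                                          # current position theta_p0
samples = []
for _ in range(5000):
    proposal = theta + rng.normal(0.0, 0.5)          # proposed position theta_p
    # log of the ratio (likelihood * prior) @ proposal vs @ current
    log_ratio = log_post(proposal) - log_post(theta)
    if np.log(rng.uniform()) < log_ratio:            # accept with prob min(1, ratio)
        theta = proposal
    samples.append(theta)

print(f"posterior mean ~ {np.mean(samples[1000:]):.2f}")  # close to data.mean()
```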
We use the bleeding-edge `pymc` and `arviz` Python packages to provide the full Bayesian toolkit that we require, including advanced sampling, probabilistic programming, statistical inference, model evaluation and comparison, and more.
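For orientation, a minimal sketch of the toolkit in action (a deliberately trivial model, not the case study's): specify a model, sample with NUTS via `pm.sample`, and summarise the posterior with `az.summary`.

```python
import arviz as az
import numpy as np
import pymc as pm

# toy observed data for a trivial model, purely to show the specify / sample / summarise loop
y = np.random.default_rng(0).normal(1.5, 1.0, size=50)

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)          # prior
    pm.Normal("y", mu=mu, sigma=1.0, observed=y)     # likelihood
    idata = pm.sample(1000, tune=1000, chains=4)     # NUTS sampler by default

print(az.summary(idata, var_names=["mu"]))           # posterior mean, HDI, R-hat, ESS
```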
1. The Dataset and Problem-Space¶
1.1 Dataset¶
For illustration: table of the dataset (post-cleaning)¶
See `oreum_cs_lung` `100_Curate_ExtractClean.ipynb` for details.
pid | duration | death | age_at_start | calories_consumed_amt | ecog_physician_cat | inst_id | karno_patient_amt | karno_physician_amt | male | weight_loss_prior_amt
---|---|---|---|---|---|---|---|---|---|---
p000 | 306 | True | 74 | 1175.0 | c1 | i3 | 100.0 | 90.0 | True | NaN
p001 | 455 | True | 68 | 1225.0 | c0 | i3 | 90.0 | 90.0 | True | 15.0
p002 | 1010 | False | 56 | NaN | c0 | i3 | 90.0 | 90.0 | True | 15.0
... | | | | | | | | | |
p225 | 105 | False | 75 | 1025.0 | c2 | i32 | 70.0 | 60.0 | False | 5.0
p226 | 174 | False | 66 | 1075.0 | c1 | i6 | 100.0 | 90.0 | True | 1.0
p227 | 177 | False | 58 | 1060.0 | c1 | i22 | 90.0 | 80.0 | False | 0.0
'Shape: (228, 10), Memsize 0.0 MB'
Explanation of the features for each of the 228 observations, post-cleaning (which includes cleaning dirty values, forcing correct datatypes, renaming for clarity, etc.):
Endogenous (target) features captured at the end of the study period:
- `duration`: Time-to-event observed during the study period (days) (can be censored)
- `death`: Death observed during the study period

Exogenous (predictor) features captured at the start of the study period:
- `age_at_start`: Age in years
- `calories_consumed_amt`: Calories consumed at meals
- `ecog_physician_cat`: ECOG score as rated by the physician, c0 to c4
- `inst_id`: Institution identifier code
- `karno_patient_amt`: Karnofsky score as rated by the patient {0, ..., 100}
- `karno_physician_amt`: Karnofsky score as rated by the physician {0, ..., 100}
- `male`: Sex of patient is male (True/False)
- `weight_loss_prior_amt`: Weight loss in the last six months (pounds)
The medical details behind this data are explained and linked further in the dataset page
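For reference, a minimal sketch of pulling the raw dataset via `statsmodels` (assuming the Rdatasets mirror exposes it under the `survival` package as `lung`); the cleaning, renaming and type-forcing shown above is done in `100_Curate_ExtractClean.ipynb`.

```python
import statsmodels.api as sm

# fetch the raw NCCTG lung cancer dataset from the R `survival` package via Rdatasets
lung_raw = sm.datasets.get_rdataset("lung", "survival").data

print(lung_raw.shape)    # expect (228, 10) before cleaning / renaming
print(lung_raw.head())   # raw column names: inst, time, status, age, sex, ...
```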
1.2. Time to Event t and Event Density π¶
The 228 individuals have an observed "lifetime" aka time-to-event during the study, measured in days. For each individual this starts when they enter the study (not necessarily all on the same day), and ends either at death (`death=True`) or when the study ends (right-censored, `death=False`).
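A tiny illustrative slice (values copied from the table above) showing how `duration` and `death` together encode right-censoring:

```python
import pandas as pd

# death=False means the lifetime is right-censored at study end,
# not that the individual survived indefinitely
df = pd.DataFrame(
    {"duration": [306, 455, 1010, 105], "death": [True, True, False, False]},
    index=["p000", "p001", "p002", "p225"],
)
df["censored"] = ~df["death"]
print(df)
print(f"right-censored proportion in this slice: {df['censored'].mean():.0%}")
```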
Lifetimes (Censored Time-to-Event)¶
Empirical event density π¶
1.3 Modelled Survival Function ^S(t) and Expected Time-to-Event ^Et¶
We seek to estimate π and thus the Expected Time-to-Event ^Et
As a side-effect of the mathematical relationships, we also get a Survival Function ^S(t) which can be a more familiar representation for newcomers to understand.
The model produces these estimates with quantified uncertainty per individual, but we can substitute a simplified dataset to effectively roll these up and look at the differences between groups according to an exogenous feature.
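To make the relationship concrete, a small sketch under the Weibull parameterisation used in §2.1, with illustrative fixed values of α and γ (the model actually yields full posterior draws of these per individual):

```python
import numpy as np
from scipy.special import gamma as gamma_fn

# S(t) = exp(-(gamma * t)^alpha),   E[t] = (1 / gamma) * Gamma(1 + 1/alpha)
alpha, gamma_ = 1.3, 1.0 / 400.0           # illustrative values only

t = np.linspace(0.0, 1000.0, 6)            # days
s_t = np.exp(-((gamma_ * t) ** alpha))     # survival function at each t
e_t = (1.0 / gamma_) * gamma_fn(1.0 + 1.0 / alpha)   # expected time-to-event

print(np.round(s_t, 2))
print(f"E[t] ~ {e_t:.0f} days")
```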
For example, forecasted Expected Time-to-Event ^Et and Survival Function ^S(t) for male vs female:¶
2. The Modelling Work¶
2.1 Architectures¶
We use an Accelerated Failure Time (AFT) model, specifically a censored Weibull-distribution on the Event Density π.
The AFT base model is explained in great detail with examples in our public reference project `oreum_survival` `001_BayesianSurvival_AFT.ipynb`.
As noted above, the purpose of this case study is to demonstrate an E2E workflow for models of increasing sophistication. We evaluate the behaviour and performance of the models throughout the workflows, including several state-of-the-art methods unavailable to conventional max-likelihood / machine learning models.
For illustration: math and plate notation diagram of the most basic model (Model A0)¶
See `oreum_cs_lung` `300_ModelA_E2E.ipynb` for details and the full workflow. See `oreum_cs_lung` `src/model/lung.py` for the code.
Snippet of the general math for the censored Weibull likelihood and the linear-submodel for regression onto features:
$$
\begin{aligned}
\sigma_{\beta} &\sim \text{Gamma}(\alpha=1, \beta=2) \\
\beta &\sim \text{Normal}(\mu=0, \sigma=\sigma_{\beta}) \\
\alpha &\sim \exp(\beta^{T}\mathbf{x}) \\
\gamma &\sim \text{Gamma}(\alpha=1, \beta=200) \\
\hat{\pi}(t_{i}) &\sim \left[\alpha \gamma (\gamma t_{i})^{\alpha-1}\right]^{d_{i}} \cdot \exp\!\left(-(\gamma t_{i})^{\alpha}\right) \;\sim\; \text{Weibull}(t \mid \alpha, \gamma)
\end{aligned}
$$
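To make this math concrete, here is a minimal, hypothetical `pymc` sketch of a censored Weibull likelihood with a linear submodel on α, using `pm.CustomDist` as noted in §2.3. The data and prior choices are toy stand-ins; the real implementation lives in `src/model/lung.py`.

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

# toy stand-in data - the real design matrix is built in the case study notebooks
rng = np.random.default_rng(42)
n = 50
X = np.c_[np.ones(n), rng.integers(0, 2, n), rng.normal(size=n)]  # 1 + male + age (z-scored)
t_obs = rng.weibull(1.2, size=n) * 300                            # durations (days)
death = rng.integers(0, 2, size=n)                                # 1 = event, 0 = right-censored

def weibull_censored_logp(value, alpha, gamma, death):
    """log f(t) for observed events, log S(t) for right-censored observations."""
    log_f = (pt.log(alpha) + pt.log(gamma)
             + (alpha - 1.0) * pt.log(gamma * value) - (gamma * value) ** alpha)
    log_s = -((gamma * value) ** alpha)
    return pt.switch(pt.eq(death, 1), log_f, log_s)

with pm.Model() as model_a0_sketch:
    sigma_beta = pm.Gamma("sigma_beta", alpha=1.0, beta=2.0)
    beta = pm.Normal("beta", mu=0.0, sigma=sigma_beta, shape=X.shape[1])
    alpha = pm.Deterministic("alpha", pt.exp(pt.dot(X, beta)))    # linear submodel on alpha
    gamma = pm.Gamma("gamma", alpha=1.0, beta=200.0)
    pm.CustomDist("t", alpha, gamma, death,
                  logp=weibull_censored_logp, observed=t_obs)
    idata = pm.sample(500, tune=500, chains=2)
```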
Plate notation of simplest model ModelA0¶
2.2 Bayesian Workflow¶
Throughout the case study we employ a reasonably complete Bayesian Workflow (see Gelman et al, 2020 and Betancourt, 2020) using a cyclical process of model evaluation and improvement.
2.3 The Model Variants and Technical Evaluation¶
Generally, we want to use as many observations and features from the dataset as possible, although obviously we have to mitigate overfitting (we use L2 regularization, i.e. zero-centred Normal priors on the coefficients) and work within resource constraints.
In general we aim to create models of increasing sophistication and ability to handle the nature of the dataset. In particular, in this dataset we encounter numerics, booleans, categoricals, ordinals and missing values!
In this particular case study we ended up with 5 different model variants:
Model A¶
- Establish the core model architecture for Accelerated-Failure-Time (AFT), including:
  - Weibull likelihood on event density
  - Event censoring via `CustomDist`
  - Linear submodel on the α param
- In particular, create 3 sub-variants using the (complete) numeric and boolean features available: `male_t_true`, `age_at_start`
- Linear component: `1 + male_t_true + age_at_start + age_at_start^2`
Model B¶
- Extend Model A to include (incomplete / missing-value) numeric features
- We use a sophisticated technique of hierarchical auto-imputation to fill in missing values of `calories_consumed_amt`, `weight_loss_prior_amt` within the model itself (not a pre-processing step!) - see the sketch after this list
- Linear component: `1 + male_t_true + age_at_start + age_at_start^2 + calories_consumed_amt + weight_loss_prior_amt`
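A minimal, hypothetical sketch of the auto-imputation mechanic (not the full Model B): in `pymc`, passing a masked array as `observed` makes the sampler treat each missing entry as a latent variable drawn from its (here hierarchical) prior, so imputation happens jointly with the rest of the model. Names and prior choices below are illustrative.

```python
import numpy as np
import pymc as pm

# toy covariate with gaps, standing in for a z-scored calories_consumed_amt
x_raw = np.array([0.3, -1.2, np.nan, 0.8, np.nan, -0.1])
x_obs = np.ma.masked_invalid(x_raw)   # mask the NaNs

with pm.Model() as model_b_sketch:
    # hierarchical prior over the covariate itself
    x_mu = pm.Normal("x_mv_mu", mu=0.0, sigma=1.0)
    x_sigma = pm.InverseGamma("x_mv_sigma", alpha=5.0, beta=4.0)
    # masked observed => pymc creates one latent RV per missing entry and
    # samples (auto-imputes) it alongside every other parameter
    x_mv = pm.Normal("x_mv", mu=x_mu, sigma=x_sigma, observed=x_obs)
    # ... x_mv then feeds the linear submodel on alpha exactly as in Model A
    idata = pm.sample(500, tune=500, chains=2)
```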
Model C¶
- Extend Model B to include (complete) categorical features
- We use a sophisticated technique with a hierarchical prior (aka mixed random effects) to include the categorical feature `inst_id` - see the partial-pooling sketch after this list
- Interestingly, we see the model performance decrease, likely because the cardinality is too high given the small dataset. In the real world this would encourage us to seek more information from the data source, learn latent groupings within `inst_id`, and apply further hierarchies in the model
- Linear component: `0 + inst_id + male_t_true + age_at_start + age_at_start^2 + calories_consumed_amt + weight_loss_prior_amt`
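A minimal, hypothetical sketch of the partial-pooling idea used for `inst_id` (a non-centred hierarchical prior; names and prior choices are illustrative, not the case study's exact specification):

```python
import numpy as np
import pymc as pm

# toy integer-encoded institution index, standing in for inst_id
inst_idx = np.array([0, 1, 1, 2, 0, 2, 1])
n_inst = 3

with pm.Model() as model_c_sketch:
    # pooled scale shared by all institutions: shrinks sparse levels toward zero
    sigma_inst = pm.Gamma("sigma_beta_inst", alpha=2.0, beta=4.0)
    # non-centred parameterisation for better sampling geometry
    beta_inst_raw = pm.Normal("beta_inst_raw", mu=0.0, sigma=1.0, shape=n_inst)
    beta_inst = pm.Deterministic("beta_inst", beta_inst_raw * sigma_inst)
    # per-observation contribution to the linear submodel on alpha
    eta_inst = beta_inst[inst_idx]
```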
Model D¶
- Extend Model C to include (complete) ordinal values
- We use a sophisticated technique with a Dirichlet allocator to include the ordinal categorical features `ecog_physician_cat`, `karno_physician_cat`, `karno_patient_cat` - see the sketch after this list
- These are present due to misuse of a subjective value that has been mapped to a metric scale; see also `pymc-examples` `GLM-ordinal-features.ipynb` for detailed discussion and worked examples
- Linear component: `0 + inst_id + ecog_physician_cat + karno_physician_cat + karno_patient_cat + male + age_at_start + np.power(age_at_start,2) + calories_consumed_amt + weight_loss_prior_amt + death`
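A minimal, hypothetical sketch of the Dirichlet-allocator (monotonic ordinal) technique for one ordinal feature, in the spirit of the `pymc-examples` GLM-ordinal-features notebook. Names and priors are illustrative.

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

# toy integer-encoded ordinal levels, standing in for ecog_physician_cat (c0..c4)
K = 5
level_idx = np.array([0, 1, 1, 2, 0, 4, 3])

with pm.Model() as model_d_sketch:
    # Dirichlet allocates the K-1 gaps between adjacent ordinal levels
    chi = pm.Dirichlet("chi_ecog", a=np.ones(K - 1))
    # cumulative sum yields a monotonic mapping of levels into [0, 1]
    nu = pm.Deterministic("nu_ecog_physician_cat",
                          pt.concatenate([pt.zeros(1), pt.cumsum(chi)]))
    beta_ord = pm.Normal("beta_ecog_physician_cat", mu=0.0, sigma=1.0)
    # per-observation contribution to the linear submodel on alpha
    eta_ord = beta_ord * nu[level_idx]
```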
Model E¶
- Simplify by removing `inst_id`, to make use of the innovations in Model D and remove the performance drop of Model C
- We compare Model A0 vs Model E on an in-sample dataset to eyeball the results
For illustration: plate notation diagram of the most complicated model in this case study (Model D)¶
See `oreum_cs_lung` `303_ModelD_E2E.ipynb` for details and the full workflow, and note we actually pull this back to a simpler Model E which achieves the best results.
In-sample Model Evaluation¶
We create a variety of models according to the nature of the data and compare their performance using advanced statistical reasoning.
At the end we can make a quantified evaluation and see that Model E performs the best for this dataset.
See `oreum_cs_lung` `304_ModelE_E2E.ipynb` for details of this in-sample LOO-PIT evaluation method.
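For orientation, a minimal, hypothetical sketch of the kind of model comparison and LOO-PIT calibration check that `arviz` provides. Two deliberately tiny toy models stand in for the real A0..E variants.

```python
import arviz as az
import numpy as np
import pymc as pm

# toy data and two tiny competing models, standing in for the real model variants
rng = np.random.default_rng(1)
y_obs = rng.normal(1.0, 2.0, size=100)

def fit(free_sigma):
    with pm.Model():
        mu = pm.Normal("mu", 0.0, 5.0)
        sigma = pm.HalfNormal("sigma", 5.0) if free_sigma else 1.0
        pm.Normal("y", mu, sigma, observed=y_obs)
        idata = pm.sample(500, tune=500, chains=2,
                          idata_kwargs={"log_likelihood": True})
        pm.sample_posterior_predictive(idata, extend_inferencedata=True)
    return idata

idata_simple, idata_flex = fit(False), fit(True)

# rank models by expected log pointwise predictive density (PSIS-LOO-CV)
print(az.compare({"fixed_sigma": idata_simple, "free_sigma": idata_flex}, ic="loo"))

# LOO-PIT: a well-calibrated model yields an approximately uniform PIT histogram
az.plot_loo_pit(idata_flex, y="y")
```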
3. Using the Model Outputs¶
3.1 Inference via the linear coefficients¶
We used a 2-parameter form of the Weibull with a linear regression onto $\alpha \sim \exp(\beta^{T}\mathbf{x})$.
When:
- α<1, hazard decreases over time e.g. early mortality
- α=1, hazard is constant λ∼γ and the model reduces to the Exponential model
- α>1, hazard increases over time e.g. later mortality
So we can view the relative coefficient values to make inferences about the correlation (not causation) of features with changes in α
3.1.1 Interpret effect of Simple Linear Coefficients¶
First let's view the posterior values of beta_*
to infer how the numeric values affect α
Observe:
- `beta: intercept`: E ∼ 0.053, 0 ∈ HDI94, mild effect α>1, i.e. overall increasing hazard fn λ over duration (later mortality)
- `beta: male_t_true`: E ∼ −0.15, 0 ∈ HDI94, 0 ∉ HDI80, appreciable effect, earlier mortality for males
- `beta: age_at_start`: E ∼ −0.053, 0 ∈ HDI94, minimal effect, slightly earlier mortality for higher ages
- `beta: age_at_start_2_power`: E ∼ −0.12, 0 ∈ HDI94, mild effect, earlier mortality for higher ages
- `beta_mv: calories_consumed_amt`: E ∼ 0.21, 0 ∈ HDI94, 0 ∉ HDI80, stronger effect, later mortality with higher calories
- `beta_mv: weight_loss_prior_amt`: E ∼ 0.015, 0 ∈ HDI94, minimal effect, slightly later mortality with higher prior weight loss
- `beta_ecog_physician_cat`: E ∼ −0.012, 0 ∈ HDI94, minimal effect, slightly earlier mortality with higher ECOG physician score
- `beta_karno_physician_cat`: E ∼ 0.23, 0 ∈ HDI94, appreciable effect, later mortality with higher Karnofsky physician score
- `beta_karno_patient_cat`: E ∼ 0.17, 0 ∈ HDI94, appreciable effect, later mortality with higher Karnofsky patient score
We won't attempt to interpret the coefficients much further because we're not physicians and this isn't our data / medical study.
However, we can note that `beta: male_t_true`, `beta: age_at_start_2_power`, `beta_mv: calories_consumed_amt`, `beta_karno_physician_cat`, `beta_karno_patient_cat` all have appreciable correlating effects with mortality in this model - and would be good candidate features for further exploration, model finessing, and further medical study.
Let's view those `beta` coefficients in a forestplot to gain a better understanding of the relative effects we just noted.
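A minimal sketch of the forestplot call itself. The draws below are purely illustrative stand-ins built with `az.from_dict` (centred on the values noted above); the real plot is made from the fitted Model E `InferenceData` in `304_ModelE_E2E.ipynb`.

```python
import arviz as az
import numpy as np

# purely illustrative posterior draws with shape (chain, draw)
rng = np.random.default_rng(0)
posterior = {
    "beta: male_t_true": rng.normal(-0.15, 0.10, size=(2, 500)),
    "beta: age_at_start_2_power": rng.normal(-0.12, 0.10, size=(2, 500)),
    "beta_karno_physician_cat": rng.normal(0.23, 0.14, size=(2, 500)),
    "beta_karno_patient_cat": rng.normal(0.17, 0.14, size=(2, 500)),
}
idata = az.from_dict(posterior=posterior)

# forestplot of marginal posteriors with 94% HDIs, chains combined
az.plot_forest(idata, combined=True, hdi_prob=0.94)
```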
Observe:
This is the same info we saw above, but let's highlight the biggest movers:
- `beta: male_t_true`: E ∼ −0.15, 0 ∈ HDI94, 0 ∉ HDI80, appreciable effect, earlier mortality for males
- `beta: age_at_start_2_power`: E ∼ −0.12, 0 ∈ HDI94, mild effect, earlier mortality for higher ages
- `beta_karno_physician_cat`: E ∼ 0.23, 0 ∈ HDI94, appreciable effect, later mortality with higher Karnofsky physician score
- `beta_karno_patient_cat`: E ∼ 0.17, 0 ∈ HDI94, appreciable effect, later mortality with higher Karnofsky patient score
3.1.2 Interpret effects of ordinal values¶
We can also view the posterior values of `nu_*` to infer how the ordinal categorical values in `beta_ecog_physician_cat`, `beta_karno_physician_cat`, `beta_karno_patient_cat` affect α
Observe:
- `nu_ecog_physician_cat`: as `c0 -> c4` there's barely any impact, except for widening uncertainty. As we see in `100_Curate_ExtractClean.ipynb` §3.2, the upper values `c3` & `c4` are almost never observed, so this huge uncertainty is expected; however, there's no appreciable difference within `c0`, `c1`, `c2`, suggesting it's not a worthwhile metric for our model usage
- `nu_karno_physician_cat`: as `c0 -> c100` there's a fairly linear increase on α (later mortality). The near-linear response suggests this ordinal metric is being used quite reliably
- `nu_karno_patient_cat`: as `c0 -> c100` there's a fairly linear increase on α (later mortality). The near-linear response suggests this ordinal metric is being used quite reliably, but the effect is weaker than for `nu_karno_physician_cat`, which makes sense if we assume that the physician is a better judge of health
For more detail and discussion on ordinals, see our original novel contribution to `pymc-examples` in GLM-ordinal-features.
3.1.3 Interpret auto-imputed missing values¶
We can also view the posterior values of `x_mv_mu *` to infer how the model has auto-imputed hierarchical expected values for the missing data in features `calories_consumed_amt` and `weight_loss_prior_amt`.
For brevity we won't look into the auto-imputed posterior values `x_mv` for individual observations, and refer the reader to `304_ModelE_E2E.ipynb` for full detail.
Observe:
- `calories_consumed_amt`: E ∼ 0.0034, 0 ∈ HDI94
- `weight_loss_prior_amt`: E ∼ −0.0004, 0 ∈ HDI94
- In both cases we z-scored the data prior to modelling, so these hierarchical values of effectively 0 are exactly what we would expect to see if the missing values are Missing At Random (MAR). This supports our choice to include these features without fear that we are introducing systemic issues into the data
For more detail and discussion on handling missing data, see our original novel contribution to `pymc-examples` in GLM-missing-values-in-covariates.
3.2 Prediction of Expected Time-to-Event ^Et for individual observations¶
We also engineered our model to be able to make predictions on out-of-sample (aka previously unseen) data for Expected Time-to-Event ^Et (which formally appears in the model as ^π(t)).
This is not a technical evaluation method - the proper technical model evaluation is discussed above in §2.3 - but here we can eyeball the predictions on a subset of data vs the true values, and get a feel for what we can do with the predictive outputs.
Each prediction is of course a distribution over a range of values, quantifying the uncertainty in the prediction. We can use the mean as the expected value, and the distribution as a measure of the uncertainty. We can test this distribution against the true data and comment on the model calibration.
Assuming a well-calibrated model, we can use the distribution as an exceedance curve wherein we choose to use the predicted value at e.g. the 90th percentile, so that e.g. 90% of the time the true value is below our predicted value. This is critical in risk evaluation etc.
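For example, a minimal sketch of how an exceedance threshold falls out of posterior predictive draws (toy draws here; in practice these come from the model's posterior predictive samples):

```python
import numpy as np

# toy posterior predictive draws of time-to-event for one individual (days)
rng = np.random.default_rng(7)
t_hat = rng.weibull(1.3, size=4000) * 300

e_t = t_hat.mean()               # expected time-to-event
p90 = np.percentile(t_hat, 90)   # exceedance threshold: the true value should fall
                                 # below this ~90% of the time if well-calibrated
print(f"E[t] ~ {e_t:.0f} days, 90th percentile ~ {p90:.0f} days")
```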
Observe:
- This is a random subset of the full dataset, which itself is very small and noisy, so we shouldn't expect perfection
- Nonetheless, we see useful predictions where the HDI94 has always captured the true value, and in most cases the much smaller HDI50 has captured the correct value
- The ability of our principled model to make useful predictions on out-of-sample data is incredibly powerful
Next Steps¶
Now the interested reader should dig into the full case study Notebooks in project `oreum_cs_lung`.
There we demonstrate the full E2E workflow for models of increasing sophistication, including several state-of-the-art methods unavailable to conventional max-likelihood / machine-learning models.
100_Curate_ExtractClean.ipynb
200_EDA_Survival.ipynb
300_ModelA_E2E.ipynb
301_ModelB_E2E.ipynb
302_ModelC_E2E.ipynb
303_ModelD_E2E.ipynb
304_ModelE_E2E.ipynb
Oreum Industries © 2024