Oreum Industries Reference Project, 2024Q3
000_Intro.ipynb¶
Oreum Reference - Copula Regression `oreum_copula`¶
Demonstrate Bayesian Copula Regression Modelling using Bayesian inference and a Bayesian workflow, specifically using the `pymc` & `arviz` ecosystem.
This Intro can also be used for verbal presentation and discussion purposes, ideally followed by a deeper technical walkthrough of the project in a long-form style. Because this project is a reference, it contains a huge amount of detail which we do not attempt to summarise fully here.
The interested reader should refer to the project notebooks where we evaluate the behaviour and performance of the models throughout the workflows, including several state-of-the-art methods unavailable to conventional max-likelihood / machine-learning models.
Oreum Industries: Technical Resources
What is Copula Regression?¶
We seek to create principled models that provide explanatory inference and predictions of Marginal distributions $M$ that are jointly coupled by a Latent Copula $C$, using quantified uncertainty to support real-world decision-making.
Motivation:
- A classic use-case for this model architecture (in the 2-dimensional setting) is insurance claims aka incurred loss
- We decompose the dollar value of claims into two marginal distributions: the `frequency` and `severity` of `expected loss cost`, because these measures are intuitive and can behave differently, with a (highly important) degree of covariance $\Sigma$
- If we use a naive model that doesn't account for the covariance between `frequency` and `severity`, then the model predictions for `expected loss cost` can be hugely wrong!
Quick Aside on decomposition of claims `frequency` and `severity`¶
We can create different decompositions for different purposes, and according to the data available. A useful one shown here is the ratio of losses per unit TIV, to generalise to policies of different TIV
$$ \begin{aligned} frq_{i} &= \frac{claim\_ct_{i}}{TIV_{i}} \ , & sev_{i} &= \frac{incurred\_total_{i}}{claim\_ct_{i}} \\ \\ \mathbb{E}_{\text{loss} \ i} &= frq_{i} * sev_{i} = \frac{incurred\_total_{i}}{TIV_{i}} \\ \end{aligned} $$
where:
- Each policy $i \in n$ (the dataset of all policies) can have its own (policy-level) frequency ($frq_{i} \geq 0$) and severity ($sev_{i} \geq 0$) of claim (and thus policy-level $\mathbb{E}_{\text{loss}\ i} \geq 0$)
- Note $frq$ and $sev$ tend to be zero-augmented distributions (where no loss is experienced): this is a very important aspect to include in real-world model architectures
- $claim\_ct_{i} \geq 0$ is the count of claims incurred for policy $i$
- $TIV_{i} \gt 0$ is the Total Insured Value (TIV) for policy $i$
- $incurred\_total_{i} \geq 0$ is the total incurred losses for policy $i$
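To make the decomposition concrete, here is a minimal sketch in Python (pandas / numpy), assuming a hypothetical policy-level table with columns `claim_ct`, `tiv`, `incurred_total`; all column names and values are illustrative only:

```python
# Minimal sketch of the decomposition above, using a hypothetical policy-level
# table with columns claim_ct, tiv, incurred_total (names & values illustrative)
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "claim_ct": [0, 2, 1],
    "tiv": [1_000_000.0, 500_000.0, 750_000.0],
    "incurred_total": [0.0, 40_000.0, 12_000.0],
})

# frq_i = claim_ct_i / TIV_i  (claims per unit TIV, zero-augmented)
df["frq"] = df["claim_ct"] / df["tiv"]

# sev_i = incurred_total_i / claim_ct_i  (guard the zero-claim policies)
df["sev"] = (df["incurred_total"] / df["claim_ct"].replace(0, np.nan)).fillna(0.0)

# E_loss_i = frq_i * sev_i == incurred_total_i / TIV_i  (where claim_ct_i > 0)
df["eloss"] = df["frq"] * df["sev"]
```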
Plug:
Oreum Industries is releasing a forecasting product Vulcan for DUA pricing & portfolio management that takes 3 simple tables (policy, premium_transactions, claims_transactions) and outputs probabilistic forecasts of Expected Loss Cost at policy-level
At the heart of the solution is a holistic, principled, open-box Bayesian copula model (for covarying claims frq-sev) that runs quickly via API to integrate within modern data processing pipelines and dashboards
We incorporate underwriter expertise; create hierarchical generalised linear regression on business-relevant parameters with high cardinality; allow for zero-claims inflation; auto-impute missing data; and more generally allow for small noisy datasets typical of DUA business.
Back to this presentation's focus on the copula function¶
Demonstration:
- In this notebook:
- We create a small synthetic dataset of observations of two marginals $M_{0}, M_{1}$ which have covariance $\Sigma$, and also (because we can) a version of the marginals $M_{0x}, M_{1x}$ without covariance
- We compare the resulting values of the joint product $y = M_{0} * M_{1}$ vs $y_{x} = M_{0x} * M_{1x}$ and see that the impact of ignoring the covariance is substantial.
- In the rest of the reference guide:
- We create a series of principled copula models using advanced architectures and Bayesian inference to fit to the data and estimate the covariance on $M_{0}, M_{1}$
- The first model is naive and ignores the covariance, the final model is very sophisticated and estimates the covariance
- We demonstrate a substantial 33 percentage-point improvement in model accuracy when using a copula-based model
- This correct estimation would likely make the difference between profitable pricing / accurate reserving and greatly loss-making business over a portfolio.
General project approach
The emphasis in this project is to build a variety of models of increasing sophistication and demonstrate their usage.
We strike a balance between building up concepts & methods vs practical application & worked examples in a `pymc`-based Bayesian workflow.
We don't focus on specific analysis of the dataset, nor try to infer too much. The dataset is simply a good substrate on which to learn and demonstrate the variety of model architectures used herein.
We evaluate the behaviour and performance of the models throughout the workflows, including several state-of-the-art methods unavailable to conventional max-likelihood / machine-learning models
This series of Notebooks covers:
- `000_Intro.ipynb`: Orientation and fundamental concepts
- `100_ModelA0.ipynb`: Core (naive) architecture: Create priors, marginal likelihoods, but no copula
- `101_ModelA1.ipynb`: Partial architecture (extends ModelA0): Include Gaussian copula (w/ Jacobian adjustment), and several technical innovations to let `pymc` work with the transformations
- `102_ModelA2.ipynb`: Full architecture (extends ModelA1): Include Jacobian Adjustment on transformed observations
In this Notebook
We dive straight into orientation and fundamental concepts with a small synthetic copula dataset, and then go on to demonstrate the theory and usage of models of increasing sophistication.
Contents¶
Preamble: Why Bayes?¶
We gain massive advantage by using a Bayesian Framework¶
We specifically use Bayesian Inference rather than Frequentist Max-Likelihood methods for many reasons, including:
Desirable Trait $\downarrow$ | Bayesian Inference | Frequentist Max-Likelihood
---|---|---
General formulation $\rightarrow$ | Bayes' Rule $$\underbrace{P(\hat{\mathcal{H}}\|D)}_{\text{posterior}} = \frac{\overbrace{P(D\|\mathcal{H})}^{\text{likelihood}} \cdot \overbrace{P(\mathcal{H})}^{\text{prior}}}{\underbrace{P(D)}_{\text{evidence}}}$$ | MLE $$\hat{\mathcal{H}}^{\text{MLE}} \propto \arg\max_{\mathcal{H}} P(D\|\mathcal{H})$$
Principled model structure represents hypothesis about the data-generating process | Very strong: Can build bespoke arbitrary and hierarchical structures of parameters to map to the real-world data-generating process. | Weak: Can only state structure under strict limited assumptions of model statistical validity.
Model parameters and their initial values represent domain expert knowledge | Very strong: Marginal prior distributions represent real-world probability of parameter values before any data is seen. | Very weak: No concept of priors. Lack of joint probability distribution can lead to discontinuities in parameter values.
Robust parameter fitting process | Strong: Estimate full joint posterior probability mass distribution for parameters - more stable and representative of the expectation for the parameter values. Sampling can be a computationally expensive process. | Weak: Estimate single-point max-a-posteriori-likelihood (density) of parameters - this can be far outside the probability mass and so is prone to overfitting and only correct in the limit of infinite data. But the optimization method can be computationally cheap.
Fitted parameters have meaningful summary statistics for inference | Very strong: Full marginal probability distributions can be interpreted exactly as probabilities. | Weak: Point estimates only have meaningful summary statistics under strict limited assumptions of model statistical validity.
continues ...
... continued
Desirable Trait | Bayesian Inference | Frequentist Max-Likelihood
---|---|---
Robust model evaluation process | Strong: Use entire dataset, evaluate via Leave-One-Out Cross Validation (best theoretically possible). | Weak: Cross-validation rarely seen in practice; even if used, rarely better than 5-fold CV. Simplistic method can be computationally cheap.
Predictions made with quantified variance | Very strong: Predictions made using full posterior probability distributions, so predictions have full empirical probability distributions. | Weak: Predictions using point estimates can be bootstrapped, but predictions only have interpretation under strict limited assumptions of model validity.
Handle imbalanced, high cardinality & hierarchical factor features | Very strong: Can introduce partial-pooling to automatically balance factors through hierarchical priors. | Weak: Difficult to introduce partial-pooling (aka mixed random effects) without affecting strict limited assumptions of model validity.
Handle skewed / multimodal / extreme value target variable | Very strong: Represent the model likelihood as any arbitrary probability distribution, including mixture (compound) functions e.g. a zero-inflated Weibull. | Weak: Represent model likelihood with a usually very limited set of distributions. Very difficult to create mixture compound functions.
Handle small datasets | Very strong: Bayesian concept assumes that there is a probable range of values for each parameter, and that we evidence our prior on any amount of data (even very small counts). | Very weak: Frequentist concept assumes that there is a single true value for each parameter and that we only discover that value in the limit (of infinite observations).
Automatically impute missing data | Very strong: Establish a prior for each datapoint, evidence on the available data within the context of the model, to automatically impute missing values. | Very weak: No inherent method. Usually impute as a pre-processing step with weak non-modelled methods.
Practical Implementations of Bayesian Inference¶
We briefly referenced Bayes' Rule above, which is a useful mnemonic when discussing Bayesian Inference, but in practice the crux of applying these advanced statistical techniques is estimating the evidence $P(D)$, i.e. the probability of observing the data that we use to evidence the model:
$$ \begin{aligned} \underbrace{P(\hat{\mathcal{H}}|D)}_{\text{posterior}} &= \frac{\overbrace{P(D|\mathcal{H})}^{\text{likelihood}} \cdot \overbrace{P(\mathcal{H})}^{\text{prior}}}{\underbrace{P(D)}_{\text{evidence}}} \\ \\ \text{...where:} \\ \\ P(D) &\sim \int_{\Theta} P(D, \theta)\ d\theta \\ \end{aligned} $$
This joint probability $P(D, \theta)$ of data $D$ and parameters $\theta$ requires an almost impossible-to-solve integral over parameter-space $\Theta$. Rather than attempt to calculate that integral, we do something that sounds far more difficult, but given modern computing capabilities is actually practical.
We use a Bleeding-edge MCMC Toolkit for Bayesian Inference: `pymc` & `arviz`¶
We use Markov Chain Monte-Carlo (MCMC) sampling to take a series of ergodic, partly-reversible, partly-randomised samples of model parameters $\theta$, and at each step compute the ratio of log-posterior densities (likelihood $\times$ prior) between the current position $\theta_{p0}$ and a proposed "sampled" position $\theta_{p}$ in parameter space, accepting or rejecting the proposal so that the chain explores the parameter space in proportion to the posterior density.
This results in a posterior estimate $P(\hat{\theta}|D)$:
$$ \begin{aligned} P(\hat{\theta}|D) &\sim \frac{\overbrace{P(D|\theta_{p})}^{\text{likelihood @ proposal}} \cdot \overbrace{P(\theta_{p})}^{\text{prior @ proposal}}} {\underbrace{P(D|\theta_{p0})}_{\text{likelihood @ current}} \cdot \underbrace{P(\theta_{p0})}_{\text{prior @ current}}} \end{aligned} $$
This is the heart of MCMC sampling: for detailed practical explanations see Betancourt, 2021, Carroll, 2019, and Wiecki, 2015
We use the bleeding-edge `pymc` and `arviz` Python packages to provide the full Bayesian toolkit that we require, including advanced sampling, probabilistic programming, statistical inference, model evaluation and comparison, and more.
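As a flavour of that toolkit, the following is a minimal, illustrative sketch of the `pymc` / `arviz` workflow on a toy LogNormal model (not the copula architecture developed in this project); the priors and data here are arbitrary:

```python
# Minimal sketch of the pymc / arviz workflow (illustrative toy model): define
# priors & likelihood, sample the posterior with MCMC (NUTS), summarise with arviz
import arviz as az
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)
y_obs = rng.lognormal(mean=0.5, sigma=0.3, size=50)  # toy observed data

with pm.Model() as mdl:
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)              # prior P(H)
    sigma = pm.InverseGamma("sigma", alpha=5.0, beta=4.0)
    _ = pm.LogNormal("yhat", mu=mu, sigma=sigma, observed=y_obs)  # likelihood P(D|H)
    idata = pm.sample(draws=1000, tune=1000, chains=4, random_seed=42)

print(az.summary(idata, var_names=["mu", "sigma"]))  # posterior P(H|D) summaries & diagnostics
```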
1. Orientation: Copula Functions and Their Behaviour¶
1.1 Create Synthetic Copula Dataset¶
We can learn a lot by creating a synthetic copula dataset using a "forward-pass":
Start with random data $C$ sampled from the PDF of a Latent Copula function $\square_{\mathfrak{C}}$ with 2-dimensions $$(C_{0}, C_{1}) \sim \square_{\mathfrak{C}}$$
Transform each dimension of the coupled data through the CDF of the copula function $\Phi_{\mathfrak{C}}$ to yield data distributed according to a Latent Uniform distribution $U$ $$(U_{0}, U_{1}) = \Phi_{\mathfrak{C}}(C_{0}, C_{1})$$
Transform each dimension of now-uniform data through the Inverse CDF of our chosen marginal distribution $\Phi^{-1}_{\mathfrak{M}}$ to yield data distributed according to "observed" Marginal distribution(s) $M$ $$(M_{0}, M_{1}) = \Phi^{-1}_{\mathfrak{M}}(U_{0}, U_{1})$$
In the following slides we'll plot the distributions and describe the transformations. Also see project class `synthetic.create_copula.CopulaBuilder` for details
Note we create $60$ observations split into 2 sets: $50$ for `train` (in-sample) and $10$ for `holdout` (out-of-sample)
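For orientation, here is a minimal sketch of this forward-pass using `numpy` and `scipy`, assuming a Gaussian latent copula with $\rho = -0.7$ and LogNormal marginals; the parameter values are illustrative, and the project's actual implementation lives in `synthetic.create_copula.CopulaBuilder`:

```python
# Sketch of the synthetic "forward-pass": latent copula -> Uniform -> observed marginals
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 60                                        # 50 train + 10 holdout
c_cov = np.array([[1.0, -0.7], [-0.7, 1.0]])  # latent copula covariance Sigma

# 1. sample the latent copula  (C0, C1) ~ MvNormal(0, Sigma)
c = rng.multivariate_normal(mean=np.zeros(2), cov=c_cov, size=n)

# 2. push each dimension through the Normal CDF -> latent Uniform U (still correlated)
u = stats.norm.cdf(c)

# 3. push U through the LogNormal inverse CDF -> "observed" marginals M0, M1
m = stats.lognorm.ppf(u, s=0.5, scale=np.exp(1.0))  # s = sigma, scale = exp(mu)
m0, m1 = m[:, 0], m[:, 1]

print(np.corrcoef(m0, m1)[0, 1])  # the correlation persists through the transforms
```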
1.2 Visualise the Synthetic Observations¶
1.2.1 View the Latent Copula (an MvN)¶
In this forward-pass to create the synthetic data, we firstly create $50$ observations of a 2-dimensional Multivariate Normal distribution with covariance $\Sigma$
$$ \begin{aligned} (C_{0}, C_{1}) &\sim \square_{\mathfrak{C}} \\ &\sim \text{MultivariateNormal}(\mu, \Sigma, \text{shape}=2) \end{aligned} $$
This forms our Latent Copula (a Gaussian), and this is where we could get creative and use any number of alternative copula functions from the literature (e.g. Clayton, Frank, Gumbel, etc) or even create our own: the copula marginals don't have to be the same distribution
Observe:
- Note the standard `Normal(0,1)` scaling on the marginals
- Note the empirically-observed correlation $\rho \approx -0.7$ as defined in `c_cov`
1.2.2 View the Uniform-Transformed Marginals¶
In this forward-pass to create the synthetic data, next we pass each dimension of the Latent Copula $C$ through the CDF of its own function $\Phi_{\mathfrak{C}}$ to get a Latent Uniform distribution $U$
$$ \begin{aligned} (U_{0}, U_{1}) &= \Phi_{\mathfrak{C}}(C_{0}, C_{1}) \\ &= \text{NormalCDF}(C_{0}, C_{1}) \end{aligned} $$
Regardless of the latent copula, this intermediate step will result in 2 Uniform marginals (which still have covariance)
Observe:
- Now the marginals are uniform, but the correlation remains
1.2.3 View the Observed Marginals `m0`, `m1` (post transformation)¶
In this forward-pass to create the synthetic data, next we pass each dimension of the Latent Uniform $U$ through the Inverse CDF of the marginal distribution $\Phi^{-1}_{\mathfrak{M}}$ to get the Marginal distribution(s) in $M$
$$ \begin{aligned} (M_{0}, M_{1}) &= \Phi^{-1}_{\mathfrak{M}}(U_{0}, U_{1}) \\ &= \text{LogNormalInvCDF}(U_{0}, U_{1}) \end{aligned} $$
The marginal distribution(s) $M$ can be anything. In practice we tend to use right-tailed distributions in the Exponential family, here a LogNormal. We can, of course, use different distributions on each marginal - there's no constraint to be the same - but we use the same ones here. This is the data that we would observe in the real-world dataset:
Observe
- Marginals now have unique long-tail distributions, and the correlation remains
1.2.4 View the Marginals $Mx$ if they were synthesized without a Copula¶
In project class `synthetic.create_copula.CopulaBuilder` we also synthesize uncorrelated observations using the same transformation and final marginals $M$, so that we can visually compare the different effects.
$$ \begin{aligned} (C^{\chi}_{0}, C^{\chi}_{1}) &\sim \text{Normal}(\mu, \sigma, \text{shape}=2) \\ (U^{\chi}_{0}, U^{\chi}_{1}) &= \text{NormalCDF}(C^{\chi}_{0}, C^{\chi}_{1}) \\ (M^{\chi}_{0}, M^{\chi}_{1}) &= \text{LogNormalInvCDF}(U^{\chi}_{0}, U^{\chi}_{1}) \\ \end{aligned} $$
Uncorrelated marginals $M^{\chi}$ individually look the same as $M$. We have to look at the joint distribution to see the difference
Observe
- Spherical joint distribution, no correlation between our marginals here
1.2.5 Overplot Marginals Correlated ($M$) vs Uncorrelated ($M^{\chi}$) to Highlight the Differences¶
Observe
- The marginals look almost identical, but the joint distribution is very different
- We might say "so what?" because we can always jointplot our marginals $M$ and see that there is correlation
- The huge impact is that these lead to a very different joint product $y$...
1.3 Compare the Impact on Joint Product $y$¶
If we build a model $\mathcal{H}$ that fits to marginals $M$, but does not account for the correlation, $\mathcal{H}$ will behave as if we fit it on uncorrelated marginals $M^{\chi}$. The predictions for $M$ and $M^{\chi}$ won't look too different at the marginal level, but the joint products $y = M_{0} \cdot M_{1}$ vs $y^{\chi} = M^{\chi}_{0} \cdot M^{\chi}_{1}$ can become very different:
Observe:
- This custom diagnostic combination plot shows:
- Pointplot (left): The bootstrapped sums $\sum_{i}y_{i}$ vs $\sum_{i}y^{\chi}_{i}$
- Boxplot (center): The individual values $y_{i}$ vs $y^{\chi}_{i}$
- Countplot (right): The counts of observations $i$
- The (bootstrapped) sum of `y_uncorr` ($\mu \approx 700$) is almost always much higher than for `y_corr` ($\mu \approx 500$)
- This shows that if our model $\mathcal{H}$ were to estimate marginals correctly but ignore the covariance, it would erroneously mis-estimate the joint distribution total value $y$. Here that mistake is to overestimate.
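A minimal sketch of this kind of bootstrapped-sum comparison is below; the generation parameters (and hence the printed values) are illustrative stand-ins, so they will not match the notebook's $\approx 700$ vs $\approx 500$ figures:

```python
# Sketch of a bootstrapped-sum comparison between joint products of correlated
# vs uncorrelated marginals (illustrative parameters only)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50

def make_y(rho: float) -> np.ndarray:
    """Forward-pass (Gaussian copula, corr=rho, LogNormal marginals), return M0 * M1."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    c = rng.multivariate_normal(np.zeros(2), cov, size=n)
    m = stats.lognorm.ppf(stats.norm.cdf(c), s=0.5, scale=np.exp(1.0))
    return m[:, 0] * m[:, 1]

def boot_sum(y: np.ndarray, nboot: int = 1000) -> np.ndarray:
    """Resample y with replacement nboot times and return the distribution of sums."""
    idx = rng.integers(0, len(y), size=(nboot, len(y)))
    return y[idx].sum(axis=1)

y_corr, y_uncorr = make_y(-0.7), make_y(0.0)
print(f"mean bootstrapped sum, correlated:   {boot_sum(y_corr).mean():.1f}")
print(f"mean bootstrapped sum, uncorrelated: {boot_sum(y_uncorr).mean():.1f}")
```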
View the overestimate $\delta = \sum_{i}y^{\chi}_{i} - \sum_{i}y_{i}$¶
Let's view the bootstrapped overestimate `delta = y_uncorr - y_corr`
Observe:
If we imagine this to be a portfolio of $50$ policies, and the value of interest is an Expected Loss Cost $y = \mathbb{E}_{\text{loss}}$, and the units are dollars, then:
- If we were to use a model that ignores covariance, we might get a portfolio estimate of $\mathbb{E}_{\text{loss}} \approx 200$ dollars higher than if we were to use a better model that handles covariance with a copula function
- This overestimate is a substantial $\frac{700}{500} \approx +40\%$ and would likely make the difference between profitable pricing / accurate reserving and greatly loss-making business over the portfolio.
2. Brief Technical Summary: The Copula Model designed in this Project¶
Again, this Intro is for verbal presentation and discussion purposes only - ideally followed by a deeper technical walkthrough of the project in a long-form style. Because this project is a reference, it contains a huge amount of detail which we do not attempt to summarise fully here.
The interested reader should refer to the project notebooks, where we state the architecture in full and evaluate the behaviour and performance of the models in a consistent Bayesian workflow, including several state-of-the-art methods unavailable to conventional max-likelihood / machine-learning models.
Here we can highlight a very tangible impact of our results of using a Copula model (`ModelA2`) vs a Naive model (`ModelA0`)
2.1 Brief Orientation on Model Workflow and Architecture¶
General Approach¶
- We create a synthetic dataset with 60 observations: these have exogenous values on 2 marginals $M_{0}$, $M_{1}$
- We create 3 models of increasing sophistication to estimate $\hat{M_{0}}$, $\hat{M_{1}}$ and thus the joint product $\hat{y} = \hat{M_{0}} \cdot \hat{M_{1}}$
- The simplest naive model (`ModelA0`) does not include a copula function, and the most sophisticated model `ModelA2` does
...
- We define a training set of 50 random observations, fit the models, and view the forecasted predictions on a holdout set of 10 observations
- We fully evaluate the models in the project notebooks using a variety of sophisticated techniques including In-sample Prior & Posterior Retrodictive ECDF plots, LOO-PIT calculations & plots, and more conventional coverage, RMSE and R2 calculations. This forecast on the holdout is not a formal model evaluation
- However for discussion and elucidation we can plot the bootstrapped sum of the actual values $\sum{y}_{\text{holdout}}$ and compare to the posterior predictions $\sum{\hat{y}}_{\text{holdout}}$ of the two models
General Architecture¶
In contrast to the "forward-pass" that we use to synthesize the data, for the model we must of course start with the only data that we have (the observed marginals $M$) and work in a "backwards-pass" toward the copula.
Define 2 marginal distributions $(\mathfrak{M}_{0}, \mathfrak{M}_{1})$ (here for convenience we use the same family (Lognormals) for each, so we will represent as simply $\mathfrak{M}$). Each marginal is parameterised by a linear submodel $\beta^{T}\mathbf{x}_{ij}$ to allow linear regression onto selected features $j$ $$\mathfrak{M} = \text{LogNormal}(\mu=\beta^{T}\mathbf{x}_{ij}, \sigma)$$
Transform each dimension of the observed marginal data $M$ through the CDF of the marginal distribution(s) to yield data distributed according to a Latent Uniform distribution $U$ $$(U_{0}, U_{1}) = \Phi_{\mathfrak{M}}(M_{0}, M_{1})$$
Transform each dimension of now-uniform data $U$ through the Inverse CDF of the copula distribution(s) $(\mathfrak{C}_{0}, \mathfrak{C}_{1})$ (here for convenience we use the same family (Normal aka a Gaussian Copula) for each, so we will represent as simply $\mathfrak{C}$) $$(C_{0}, C_{1}) = \Phi^{-1}_{\mathfrak{C}}(U_{0}, U_{1})$$
Evidence the transformed data $C$ against the copula function $\log \mathcal{L}\ \mathfrak{C}$
For stability and correctness, we also evidence at the marginals and apply a Jacobian adjustment to the double-transformed data.
Importantly, and unlike other model specifications in the Bayesian literature, we preserve the full posterior distribution(s) all the way through the model specification, without ever having to collapse to point estimates
Math Specification¶
Marginals $M_{k}, k \in \{0, 1\}$ both $\text{LogNormal}$, evidence likelihood:
$$ \begin{aligned} \beta_{M_{k}}^{j0} &\sim \text{Normal}(\mu, \sigma) \\ \sigma_{M_{k}} &\sim \text{InverseGamma}(\alpha, \beta) \\ \mathcal{L}\ \hat{M_{k}} &\sim \text{LogNormal}(\beta_{M_{k}}^{T}\mathbf{x}_{i}^{jk}, \sigma_{M_{k}}) \end{aligned} $$
Transform observed $M_{k}$ to Latent $\text{Uniform}$ $U_{k}$ via their assumed CDFs:
$$ \begin{aligned} U_{k} &= \Phi_{\mathfrak{M}_{k}}(M_{k}) \end{aligned} $$
Transform Latent $\text{Uniform}$ $U_{k}$ to Latent $\text{Normal}$ $C_{k}$ via the Normal Inverse CDF:
$$ \begin{aligned} C_{k} &= \Phi^{-1}_{\text{Normal}(0,1)}(U_{k}) \end{aligned} $$
Create a latent covariance structure $\Sigma$
$$ \begin{aligned} L &\sim \text{LKJCholesky}(2), \; R \sim \text{LKJCorr}(2) \\ \sigma &\sim \text{InverseGamma}(\alpha, \beta) \\ \Sigma &= LL^{T} = \text{diag}(\sigma) \cdot R \cdot \text{diag}(\sigma) \end{aligned} $$
Evaluate likelihood $\hat{C}$ of transformed marginals using MVNormal:
$$ \begin{aligned} \mathcal{L}\ \hat{C} &\sim \text{MvNormal}(\mu=0, \Sigma, \text{observed}=C_{k}) \end{aligned} $$
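To ground the specification, here is a highly simplified sketch of this architecture in `pymc` (intercept-only linear submodels, illustrative priors, and the Jacobian adjustment omitted for brevity); it assumes the synthetic marginals `m0`, `m1` from the sketch in §1.1, and the real specification lives in `models.copula.ModelA2` / `102_ModelA2.ipynb`:

```python
# Simplified sketch of the copula architecture above in pymc: LogNormal marginal
# likelihoods, M -> U -> C transforms, LKJ Cholesky covariance, MvNormal copula
# likelihood via a Potential. Priors are illustrative; no Jacobian adjustment.
import numpy as np
import pymc as pm

m_obs = np.column_stack([m0, m1])  # assumes m0, m1 from the forward-pass sketch in §1.1

with pm.Model() as mdl:
    # marginal likelihoods: LogNormal with (here intercept-only) linear submodels on mu
    b0 = pm.Normal("beta_m", mu=0.0, sigma=1.0, shape=2)
    sigma_m = pm.InverseGamma("sigma_m", alpha=5.0, beta=4.0, shape=2)
    _ = pm.LogNormal("mhat", mu=b0, sigma=sigma_m, observed=m_obs)

    # transform observed M -> latent Uniform U via the LogNormal CDF
    z = (pm.math.log(m_obs) - b0) / sigma_m
    u = 0.5 * (1.0 + pm.math.erf(z / np.sqrt(2.0)))

    # transform U -> latent Normal C via the Normal inverse CDF
    c = np.sqrt(2.0) * pm.math.erfinv(2.0 * u - 1.0)

    # latent covariance structure Sigma via an LKJ Cholesky prior
    chol, corr, stds = pm.LKJCholeskyCov(
        "lkjcc", n=2, eta=2.0, sd_dist=pm.InverseGamma.dist(alpha=5.0, beta=4.0),
        compute_corr=True,
    )

    # evidence the double-transformed data against the MvNormal copula via a Potential
    cop_logp = pm.logp(pm.MvNormal.dist(mu=np.zeros(2), chol=chol), c)
    pm.Potential("copula_loglike", cop_logp.sum())

    idata = pm.sample(draws=1000, tune=1000, chains=4, random_seed=42)
```

Note the copula term enters via `pm.Potential` on the double-transformed data; this is a simplified stand-in for the project's approach, which additionally applies the Jacobian adjustment discussed above.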
Plate Notation¶
Refer to Notebook `102_ModelA2.ipynb` and project class `models.copula.ModelA2` for full details.
This advanced, fully Bayesian architecture allows for:
- Regression via linear submodels $\beta^{T}\mathbf{x}$ on the marginals of `mhat`
- Efficient covariance via an $\text{LKJCholeskyCovariance}$ structure
- A natural transformation from `M -> U -> C` including Jacobian Adjustment
2.2 Compare Estimated $\hat{y}$: `ModelA0` vs `ModelA2`¶
`ModelA0`¶
This model sets a baseline for performance: it uses the same marginals but does not have a copula. This is "the best that one could do" with a naive non-copula architecture, and the performance / results are analogous to $M^{\chi}$ that we discussed in $\S1$
Observe:
- Now we can clearly see the impact: although the in-sample model fit was acceptable, the combined value $y$ is way off, because this model ignores copula correlation between the marginals
- The mean of the prediction $\overline{\hat{y}}$ is $\mu \approx 13.4$, which is very different from (and sits outside of) the bootstrapped actual data $\overline{y}$ at $\mu \approx 9.6$
- Comparing means we have a $\frac{13.4}{9.6} \approx +40\%$ overestimate!
- We do see that the PPC distribution envelops the bootstrapped actual data, which is promising, and means the model wouldn't necessarily be wrong to use, but there is clearly room to improve!
`ModelA2`¶
This is our most advanced model with a copula architecture: the performance / results are analogous to $M$ that we discussed in $\S1$
Observe:
- Now we can clearly see the impact: the Jacobian adjustment has allowed `mdla2` to estimate a much more precise and accurate value for $\hat{y}$
- The mean of the prediction $\overline{\hat{y}}$ is $\mu \approx 10.4$, which maps nicely onto the bootstrapped actual data $\overline{y}$ at $\mu \approx 9.6$
- Comparing means, we get $\frac{10.4}{9.6} \approx +7\%$ overestimate
- This is substantially better than `mdla0`, and also meaningfully improves on `mdla1`
`ModelA2` vs `ModelA0`¶
In the above, we see a reduction in the mean overestimate of $y$ from $40\%$ down to $7\%$: a 33 percentage point drop and $83\%$ relative improvement!
This is a huge difference on this very small and simple dataset, and found only by correctly modelling the covariance using a copula and a sophisticated model architecture.
2.3 Supercharged Predictions with Quantified Uncertainty: Exceedance Curve¶
Now we pull back to demonstrate the power and utility of using a Bayesian model in the first place: because the predicted output values individually & jointly have quantified uncertainty aka empirical probability. We can sum these and create an Exceedance Curve (1 - ECDF)
We can read this curve horizontally to determine the value $y$ at a particular probability $P$:
- $P_{@0.50}\ \overline{\hat{y}} \geq 10.0$, (aka $\overline{\hat{y}} \approx 10.0$ @ 1-in-2), and much closer to the actual data
- $P_{@0.05}\ \overline{\hat{y}} \geq 14.8$, (aka $\overline{\hat{y}} \approx 14.8$ @ 1-in-20)
Or we can read this curve vertically to determine the empirical probability $P$ of achieving a particular value $y$:
- $P(\overline{\hat{y}} > 15) \approx 0.05$: i.e. if we're worried about $\overline{\hat{y}} > 15$, we might be less concerned because the probability is approx 1-in-20
- $P(\overline{\hat{y}} > 17) \approx 0.01$: i.e. if we're worried about $\overline{\hat{y}} > 17$, we might be sanguine because the probability is approx 1-in-100
This appears substantially tighter and more robust than `mdla0` and `mdla1`.
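A minimal sketch of how such an exceedance curve can be computed from posterior-predictive sums is below, assuming `yhat_sums` is a 1-D array of (bootstrapped) posterior-predictive sums $\sum\hat{y}$ over the holdout set; the toy stand-in values here are illustrative only:

```python
# Minimal sketch of an exceedance curve (1 - ECDF) over posterior-predictive sums
import numpy as np

def exceedance(samples: np.ndarray):
    """Return (sorted values v, P(sum >= v)), i.e. the survival function 1 - ECDF."""
    v = np.sort(samples)
    p = 1.0 - np.arange(1, len(v) + 1) / len(v)
    return v, p

yhat_sums = np.random.default_rng(42).normal(10.4, 2.0, size=4000)  # toy stand-in
v, p = exceedance(yhat_sums)

# read horizontally: value at a given exceedance probability
print("value at P>=0.50:", np.interp(0.50, p[::-1], v[::-1]))
print("value at P>=0.05:", np.interp(0.05, p[::-1], v[::-1]))

# read vertically: empirical probability of exceeding a given value
print("P(sum > 15) ~", (yhat_sums > 15).mean())
```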
Next Steps¶
Now the interested reader should dig into the Notebooks in project reference `oreum_copula`
There we demonstrate the full E2E workflow for models of increasing sophistication, including several state-of-the-art methods unavailable to conventional max-likelihood / machine-learning models.
- `100_ModelA0.ipynb`: Core (naive) architecture: Create priors, marginal likelihoods, but no copula
- `101_ModelA1.ipynb`: Partial architecture (extends ModelA0): Include Gaussian copula (w/ Jacobian adjustment), and several technical innovations to let `pymc` work with the transformations
- `102_ModelA2.ipynb`: Full architecture (extends ModelA1): Include Jacobian Adjustment on transformed observations
Oreum Industries © 2024