qumphy.uq_metrics module

File: qumphy/uq_metrics.py
Project: QUMPHY
Contact: vivek.desai@npl.co.uk
Gitlab: https://gitlab.com/qumphy
Description: Evaluation metrics for assessing model calibration for classification and regression tasks.

qumphy.uq_metrics.adaptive_calibration_error(predictions, targets, *args, range_number=15, threshold=0.0)[source]

Calculate the Adaptive Calibration Error (ACE) for the predicted class probabilities. This metric will work for both the binary and multiclass classification cases, with the option to apply thresholding as done in the original paper.

References:

The ACE is introduced in the following paper: <https://arxiv.org/pdf/1904.01685>. This implementation is heavily inspired by the more general calibration error functions in: <https://github.com/JeremyNixon/uncertainty-metrics-1/tree/master>. We present a simplified version here, focused on the ACE alone.

Parameters:

predictions: np.ndarray: Model output probabilities with shape (n, m) for classification. Note m >= 2 i.e. probabilities given for all classes.
targets: np.ndarray: Target class labels with shape (n, 1)
*args:: Used to catch any unused arguments passed from the main UQ evaluation notebook.
range_number: int, optional: Number of ranges to be used (note: ranges are equal frequency bins in the mean confidence axis), by default 15
threshold: float, optional: Threshold value below which probabilities are ignored, by default 0.0 for the ACE i.e. no thresholding.

Returns:

:

ACE: float: The ACE metric value.
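The equal-frequency ranging described above can be illustrated with a minimal sketch. It is not the module's implementation: the name `ace_sketch` is hypothetical, and the choices to average the per-range gaps over all classes and to keep probabilities greater than or equal to the threshold are assumptions.

```python
import numpy as np

def ace_sketch(probs, labels, range_number=15, threshold=0.0):
    """Illustrative ACE: per-class, equal-frequency ranges (assumed aggregation)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels).reshape(-1)
    gaps = []
    for c in range(probs.shape[1]):
        conf = probs[:, c]
        hit = (labels == c).astype(float)
        keep = conf >= threshold          # assumed thresholding rule
        conf, hit = conf[keep], hit[keep]
        # Sort by confidence and cut into ranges holding (almost) equal counts.
        order = np.argsort(conf, kind="stable")
        for idx in np.array_split(order, range_number):
            if idx.size:
                gaps.append(abs(hit[idx].mean() - conf[idx].mean()))
    return float(np.mean(gaps))

# Usage: four samples, two classes, two ranges per class.
probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
labels = np.array([[0], [1], [1], [1]])
print(ace_sketch(probs, labels, range_number=2))
```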

qumphy.uq_metrics.continuous_ranked_probability_score(predictions, targets, uncertainties, converter='Gaussian', distributions=None, is_var=True)[source]

Calculate the Continuous Ranked Probability Score (CRPS) from the given model predictions, uncertainties, and corresponding ground truths. The option is given to calculate the CRPS either using the analytical expression given the model predictions parameterising a Gaussian, or using numerical integration. The numerical integration method for a distribution estimated with KDE yields errors of <1% compared to the analytical expression for a Gaussian.

Return type:: float

References:

The CRPS and its analytical expression for a univariate Gaussian are given in Gneiting et al., 2007: <https://sites.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf>.

Parameters:

predictions (np.ndarray):: An array of the model predictions with shape (n, 1).
targets (np.ndarray):: Array of target values with shape (n, 1).
uncertainties (np.ndarray):: Array of the total uncertainties associated with the model predictions from a chosen UQ method, with shape (n, 1).
converter (str, optional):: Specify whether the converter used was “Gaussian” or “KDE”; evaluates CRPS using numerical integration of CDF if “KDE”, otherwise uses analytical expression for Gaussian. Defaults to “Gaussian”.
distributions (List[distribution objects], optional):: Option to include the list of distributions as a direct input - needed if input is from KDE conversion of prediction interval to distribution. Defaults to None.
is_var (bool, optional):: Boolean parameter that sets whether to convert the uncertainties from variances to standard deviations. Defaults to True.

Returns:

:

CRPS (float):: The average CRPS value across the dataset.
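For the Gaussian case, the closed-form expression is CRPS = sigma * [z * (2 * Phi(z) - 1) + 2 * phi(z) - 1 / sqrt(pi)], with z = (y - mu) / sigma. The sketch below evaluates it with NumPy; the name `gaussian_crps_sketch` is hypothetical, and the code only illustrates the formula rather than reproducing the module's function (it does not cover the KDE branch).

```python
import math
import numpy as np

def gaussian_crps_sketch(predictions, targets, uncertainties, is_var=True):
    """Mean closed-form CRPS for Gaussian predictive distributions."""
    mu = np.asarray(predictions, dtype=float).reshape(-1)
    y = np.asarray(targets, dtype=float).reshape(-1)
    sigma = np.asarray(uncertainties, dtype=float).reshape(-1)
    if is_var:
        sigma = np.sqrt(sigma)            # variances -> standard deviations
    z = (y - mu) / sigma
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    crps = sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))
    return float(np.mean(crps))

# Usage: the score grows as the target moves away from the prediction.
print(gaussian_crps_sketch([[0.0]], [[0.0]], [[1.0]]))
print(gaussian_crps_sketch([[0.0]], [[3.0]], [[1.0]]))
```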

qumphy.uq_metrics.coverage_calibration_error(predictions, targets, uncertainties, filepath=None, is_var=True, save_fig=False, number_of_intervals=10)[source]

Calculate the confidence interval-based metric for model predictions and their corresponding uncertainties. If uncertainties are given as variances, they are converted to standard deviations.

Parameters:

predictions: np.ndarray: Array of model predictions with shape (n, 1)
targets: np.ndarray: Array of target values with shape (n, 1)
uncertainties: np.ndarray: Array of total uncertainties (aleatoric + epistemic) with shape (n, 1)
filepath: str, optional: Path to directory to save the coverage calibration curve, by default None.
is_var: bool, optional: Boolean parameter that determines whether to convert the uncertainties from variances to standard deviations, by default True
save_fig: bool, optional: Boolean parameter that sets whether to save the plot to the filepath, by default False.
number_of_intervals: int, optional: The number of confidence intervals to evaluate the calibration error on, by default 10 (range 0 to 1 in steps of 0.1)

Returns:

:

calibration_error: float: The confidence interval-based calibration error metric
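One way to picture a confidence interval-based calibration error is sketched below. It assumes central Gaussian intervals and aggregates with the mean absolute gap between nominal and observed coverage; both the aggregation and the name `coverage_calibration_sketch` are assumptions, and the module may compute the metric differently.

```python
import math
import numpy as np

def coverage_calibration_sketch(predictions, targets, uncertainties,
                                is_var=True, number_of_intervals=10):
    """Mean |observed - nominal| coverage over central Gaussian intervals."""
    mu = np.asarray(predictions, dtype=float).reshape(-1)
    y = np.asarray(targets, dtype=float).reshape(-1)
    sigma = np.asarray(uncertainties, dtype=float).reshape(-1)
    if is_var:
        sigma = np.sqrt(sigma)
    z = np.abs(y - mu) / sigma
    # Smallest central confidence level whose interval still contains the target.
    needed = np.vectorize(math.erf)(z / math.sqrt(2.0))
    levels = np.linspace(0.0, 1.0, number_of_intervals + 1)
    observed = np.array([(needed <= p).mean() for p in levels])
    return float(np.mean(np.abs(observed - levels)))

# Usage with synthetic, well-specified Gaussian noise.
rng = np.random.default_rng(0)
mu = np.zeros((1000, 1))
y = rng.normal(0.0, 2.0, size=(1000, 1))
var = np.full((1000, 1), 4.0)
print(coverage_calibration_sketch(mu, y, var))
```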

qumphy.uq_metrics.entropy(p)[source]

Calculate the normalised entropy of the probabilities array.

Return type:: ndarray

Parameters:

p (np.ndarray):
Probabilities from classifier.

Returns:

:

ents (np.ndarray):: Entropy values for each sample in the probability array.
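A minimal sketch of a normalised entropy, assuming normalisation by the logarithm of the number of classes so that values lie in [0, 1]. The name `normalised_entropy_sketch` and the clipping constant are illustrative choices, not taken from the module.

```python
import numpy as np

def normalised_entropy_sketch(p, eps=1e-12):
    """Shannon entropy per row, divided by log(k) for k classes."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1) / np.log(p.shape[1])

# Usage: a uniform row is maximally uncertain, a one-hot row is not.
print(normalised_entropy_sketch([[0.5, 0.5], [1.0, 0.0]]))
```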

qumphy.uq_metrics.expected_calibration_error(predictions, targets, filepath=None, save_fig=False, bin_number=15)[source]

Calculate the Expected Calibration Error (ECE) for the predicted class model probabilities for classification tasks.

References:

The script below is heavily based on the following implementations of the ECE: <https://medium.com/towards-data-science/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d>

and

<https://github.com/gpleiss/temperature_scaling/blob/master/temperature_scaling.py>.

Alternative packages that calculate an ECE value exist, such as the netcal package. Depending on the implementation of binning in each package, the ECE values differ by small amounts (~1e-3). In the QUMPHY evaluation framework, the quoted ECE value is determined using this function.

Parameters:

predictions: np.ndarray: Model output probabilities with shape (n, k) for k-class classification.
targets: np.ndarray: Target class labels with shape (n, 1).
filepath: str, optional: The filepath for the directory to save the reliability diagram to, by default None.
save_fig: bool, optional: Boolean to determine whether to save the reliability diagram to the given filepath, by default False.
bin_number: int, optional: Number of bins to be used (note: bins are equal width in the mean confidence axis), by default 15.

Returns:

:

ECE: float: ECE value for the input data
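The equal-width binning can be sketched as follows. This is an illustration under assumptions, not the module's code: the name `ece_sketch` is hypothetical, and the right-closed bin convention is one common choice (bin edge handling is one reason packages report slightly different values).

```python
import numpy as np

def ece_sketch(probs, labels, bin_number=15):
    """Top-label ECE with equal-width, right-closed confidence bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels).reshape(-1)
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, bin_number + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # Weight each bin's |accuracy - confidence| gap by its share of samples.
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)

# Usage: confident but always wrong predictions give a large ECE.
print(ece_sketch(np.array([[0.9, 0.1], [0.9, 0.1]]), np.array([[1], [1]])))
```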

qumphy.uq_metrics.expected_cumulative_calibration_errors(predictions, targets, filepath, save_fig=True)[source]

Calculate the ECCE-MAD (aka Kolmogorov-Smirnov statistic) and ECCE-R (aka Kuiper statistic) metrics. The saved plot shows the deviation from zero of the cumulative errors.

References:

The ECCE metrics are introduced in the following: <https://arxiv.org/pdf/2205.09680>. This implementation is from the cumulative function in: <https://github.com/facebookresearch/ecevecce/blob/main/codes/calibration.py>.

The alterations to the cumulative function simply involve editing the format of the inputs to the function to be consistent with the other metrics. In addition, the normalised ECCE_MAD and ECCE_R metrics are given, as these are easier to compare with the target limits of convergence under the null hypothesis of perfect calibration.

Parameters:

predictions: np.ndarray: Model output probabilities with shape (n, 2) for binary classification
targets: np.ndarray: Target class labels with shape (n, 1)
filepath: str: Path to directory to save the plot of the cumulative calibration errors.
save_fig: bool: Boolean to determine whether to save the plot of the cumulative calibration errors to the given filepath.

Returns:

:

norm_ECCE-MAD: float: The maximum absolute deviation from zero statistic (Kolmogorov-Smirnov) for the cumulative differences between the confidences and outcomes
norm_ECCE-R: float: The range of deviation from zero statistic (Kuiper) for the cumulative differences between the confidences and outcomes
statistical_significance_scale: float: Normalisation factor that is used to assess statistically significant deviations from the asymptotic expected values (under large sample size n)
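The cumulative-difference idea can be sketched as below. The name `ecce_sketch` is hypothetical, and the normalisation by sqrt(sum(s * (1 - s))) / n (the standard deviation of the final cumulative difference under perfect calibration) is an assumption about how the scale is defined; the referenced implementation should be consulted for the exact form.

```python
import numpy as np

def ecce_sketch(probs, labels):
    """Normalised ECCE-MAD and ECCE-R from cumulative (outcome - score) sums."""
    s = np.asarray(probs, dtype=float)[:, 1]       # score for the positive class
    r = np.asarray(labels, dtype=float).reshape(-1)
    order = np.argsort(s, kind="stable")
    s, r = s[order], r[order]
    n = s.size
    # Cumulative differences, starting from zero.
    cum = np.concatenate([[0.0], np.cumsum(r - s) / n])
    scale = np.sqrt(np.sum(s * (1.0 - s))) / n     # assumed normalisation
    ecce_mad = np.abs(cum).max()                   # Kolmogorov-Smirnov style
    ecce_r = cum.max() - cum.min()                 # Kuiper style
    return float(ecce_mad / scale), float(ecce_r / scale), float(scale)

# Usage on a tiny binary example.
probs = np.array([[0.5, 0.5]] * 4)
labels = np.array([[1], [0], [1], [0]])
print(ecce_sketch(probs, labels))
```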

qumphy.uq_metrics.expected_normalised_calibration_error(predictions, targets, uncertainties, filepath=None, is_var=False, save_fig=False, bin_number=15)[source]

Calculate the Expected Normalised Calibration Error (ENCE) for the model predictions and the corresponding total uncertainties.

Parameters:

predictions: np.ndarray: An array of the model predictions with shape (n, 1)
targets: np.ndarray: Array of target values with shape (n, 1)
uncertainties: np.ndarray: Array of the total uncertainties associated with the model predictions from a chosen UQ method, with shape (n, 1)
filepath: str, optional: Path of directory to save the reliability diagram, by default None.
is_var: bool, optional: Boolean parameter that sets whether to convert the uncertainties from variances to standard deviations, by default False
save_fig: bool, optional: Boolean parameter that sets whether to save the plot to the filepath, by default False.
bin_number: int, optional: Chosen number of bins for calculating ENCE (note: the bins are defined to have an equal number of samples per bin), by default 15

Returns:

:

ENCE: float: The value of the ENCE calculated for the given inputs
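A minimal sketch of the ENCE, assuming the usual definition: samples are sorted by predicted uncertainty, split into equal-count bins, and the relative gap between the root mean variance (RMV) and the root mean squared error (RMSE) is averaged over bins. The name `ence_sketch` is hypothetical and the code is not the module's implementation.

```python
import numpy as np

def ence_sketch(predictions, targets, uncertainties, is_var=False, bin_number=15):
    """Mean over equal-count bins of |RMV - RMSE| / RMV."""
    mu = np.asarray(predictions, dtype=float).reshape(-1)
    y = np.asarray(targets, dtype=float).reshape(-1)
    sigma = np.asarray(uncertainties, dtype=float).reshape(-1)
    if is_var:
        sigma = np.sqrt(sigma)
    order = np.argsort(sigma, kind="stable")
    terms = []
    for idx in np.array_split(order, bin_number):
        if idx.size:
            rmv = np.sqrt(np.mean(sigma[idx] ** 2))
            rmse = np.sqrt(np.mean((y[idx] - mu[idx]) ** 2))
            terms.append(abs(rmv - rmse) / rmv)
    return float(np.mean(terms))

# Usage: uncertainties twice as large as the errors give a relative gap of 0.5.
mu = np.zeros((4, 1))
y = np.array([[1.0], [-1.0], [1.0], [-1.0]])
print(ence_sketch(mu, y, np.full((4, 1), 2.0), bin_number=2))
```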

qumphy.uq_metrics.flatten_list_of_batches(output_list, concat_axis=-1)[source]

Function to flatten model output arrays before they are used to calculate the calibration error metrics.

Parameters:

output_list: list: List to be flattened into 1D np.ndarray
concat_axis: int, optional: Axis on which to perform the flattening, by default -1 (last axis)

Returns:

:

flattened_array: np.ndarray: Flattened array

qumphy.uq_metrics.negative_log_likelihood(predictions, targets, *args)[source]

Compute the negative log likelihood metric for the predicted probabilities for k-class classification. Values are clipped to avoid divide by zero errors when taking the base-2 logarithm.

Return type:: float

Parameters:

predictions: np.ndarray: Model output probabilities with shape (n, k) for k-class classification.
targets: np.ndarray: Target class labels with shape (n, 1).
*args:: Used to capture other arguments passed to the function that are not necessary e.g. filepath for reliability diagram.

Returns:

:: nll (float): Calculated NLL value.
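The clipping and base-2 logarithm described above can be sketched as follows. Averaging over samples, the clipping constant, and the name `nll_sketch` are assumptions made for illustration.

```python
import numpy as np

def nll_sketch(probs, labels, eps=1e-12):
    """Mean negative base-2 log probability assigned to the true class."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels).reshape(-1).astype(int)
    p_true = probs[np.arange(labels.size), labels]
    p_true = np.clip(p_true, eps, 1.0)   # avoid log2(0)
    return float(-np.mean(np.log2(p_true)))

# Usage: true-class probabilities 0.5 and 0.25 cost 1 and 2 bits respectively.
print(nll_sketch([[0.5, 0.5], [0.25, 0.75]], [[0], [0]]))
```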

qumphy.uq_metrics.picp(predictions, targets, confidence_level, is_interval=True, uncertainties=None, is_var=True)[source]

Calculate the Prediction Interval Coverage Probability (PICP) for predictions given either as intervals or point-predictions, for a specified confidence level. If inputs are given as point-predictions, uncertainty (as variance/standard deviations) arrays must be given, from which intervals are derived, assuming a Gaussian distribution.

The Mean Prediction Interval Width (MPIW) is also returned. Note this is an unbounded non-negative metric.

Return type:: tuple

Parameters:

predictions: np.ndarray
Array of model predictions with shape either (n, 2) for intervals or (n, 1) for point predictions

targets: np.ndarray
Array of target values with shape (n, 1)

confidence_level: float
The confidence level of the prediction intervals. If point-predictions are given, intervals will be created assuming Gaussian distribution at this confidence level.

is_interval: bool
Boolean flag to set whether predictions are given as intervals or point-predictions. Defaults to True.

uncertainties: np.ndarray
Array of model uncertainties with shape (n, 1). Defaults to None, must be present if is_interval is False.

is_var: bool
Boolean flag indicating whether the uncertainties are given as variances or standard deviations. Defaults to True (variances).

Returns:

:

picp: float
The prediction interval coverage probability of the model predictions. Aim is for the PICP to match the confidence level.

MPIW: float
The Mean Prediction Interval Width of the model predictions.
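Both quantities can be sketched in a few lines. The name `picp_sketch` is hypothetical; for point predictions the sketch assumes a central (two-sided, symmetric) Gaussian interval, which is one reading of the description above.

```python
from statistics import NormalDist
import numpy as np

def picp_sketch(predictions, targets, confidence_level,
                is_interval=True, uncertainties=None, is_var=True):
    """Return (PICP, MPIW) for intervals or Gaussian point predictions."""
    predictions = np.asarray(predictions, dtype=float)
    y = np.asarray(targets, dtype=float).reshape(-1)
    if is_interval:
        lower, upper = predictions[:, 0], predictions[:, 1]
    else:
        if uncertainties is None:
            raise ValueError("uncertainties are required for point predictions")
        sigma = np.asarray(uncertainties, dtype=float).reshape(-1)
        if is_var:
            sigma = np.sqrt(sigma)
        # Two-sided Gaussian quantile for the requested confidence level.
        z = NormalDist().inv_cdf(0.5 + confidence_level / 2.0)
        mu = predictions.reshape(-1)
        lower, upper = mu - z * sigma, mu + z * sigma
    covered = (y >= lower) & (y <= upper)
    return float(covered.mean()), float(np.mean(upper - lower))

# Usage with explicit intervals: two of three targets are covered.
intervals = np.array([[0.0, 2.0], [1.0, 3.0], [0.0, 1.0]])
print(picp_sketch(intervals, np.array([[1.0], [4.0], [0.5]]), 0.9))
```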

qumphy.uq_metrics.set_coverage(prediction_sets, targets)[source]

Calculate the prediction set coverage for model outputs given from either basic conformal prediction, or sets created using top-k selection from model probabilities. Note: top-k selection from model probabilities has no coverage guarantee, in contrast to conformal methods.

Parameters:

prediction_sets (np.ndarray(List)):: Array of lists containing the predicted classes.
targets (np.ndarray):: The ground truth class labels.

Returns:

:: float: The empirical frequency of the ground truth class labels in the set.
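Empirical set coverage is the fraction of samples whose true label appears in its prediction set. A minimal sketch, with the hypothetical name `set_coverage_sketch`:

```python
import numpy as np

def set_coverage_sketch(prediction_sets, targets):
    """Fraction of samples whose true label is inside its prediction set."""
    targets = np.asarray(targets).reshape(-1)
    hits = [t in s for s, t in zip(prediction_sets, targets)]
    return float(np.mean(hits))

# Usage: the second sample's label (0) is missing from its set [2].
sets = [[0, 1], [2], [1]]
print(set_coverage_sketch(sets, np.array([[1], [0], [1]])))
```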

qumphy.uq_metrics.smooth_expected_calibration_error(predictions, targets, filepath, save_fig=True)[source]

Calculate the smooth ECE value (smECE) for the predicted class probabilities for the classification task.

Return type:: float

References:

Details on the relplot python package can be found in the following repository: <https://github.com/apple/ml-calibration/tree/main>. The accompanying paper for the smooth ECE metric, and the smooth reliability diagrams can be found here: <https://arxiv.org/pdf/2309.12236>.

Parameters:

predictions: np.ndarray: Model output probabilities with shape (n, k) for k class classification.
targets: np.ndarray: Target class labels with shape (n, 1).
filepath: str: Path to directory to save the reliability diagram.
save_fig: bool: Boolean to determine whether to save the reliability diagram to the given filepath.

Returns:

:

smECE: float: The calculated smECE value using the relplot python package.

qumphy.uq_metrics.split_into_classes(predictions, targets)[source]

Helper function to split classification predictions into predictions conditioned on the ground truth class label.

Return type:: tuple

Parameters:

predictions: np.ndarray
Model output probabilities with shape (n, k) for k-class classification.

targets: np.ndarray
Target class labels with shape (n, 1).

Returns:

:

tuple: Predictions per-class for k classes, with corresponding ground truth class labels for each class. Size of tuple is dependent on number of classes.

qumphy.uq_metrics.uncertainty_calibration_error(predictions, targets, filepath=None, save_fig=False, prediction_entropies=None, bin_number=15)[source]

Calculate the Uncertainty Calibration Error (UCE) for model predictions and target labels for classification. An array of entropies of shape (n, 1) can optionally be provided; if None, the entropies are calculated from the probabilities.

References:

The UCE was introduced in the following paper: <https://arxiv.org/pdf/1909.13550>. This implementation is based on: <https://github.com/mlaves/bayesian-temperature-scaling/blob/master/uce.py>.

Parameters:

predictions: np.ndarray: Model output probabilities with shape (n, k) for k-class classification.
targets: np.ndarray: Target class labels with shape (n, 1).
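filepath: str, optional: Path to directory to save the reliability diagram, by default None.
save_fig: bool, optional: Boolean to determine whether to save the reliability diagram to the given filepath, by default False.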
prediction_entropies: np.ndarray, optional: The entropies for the predictions, by default None; if None, then the entropies are calculated from the prediction probabilities.
bin_number: int, optional: Number of bins to be used (note: bins are equal width in the normalised entropy axis), by default 15.

Returns:

:

UCE: float: The UCE for the model predictions and their corresponding entropies.
xs: np.ndarray: Array of the mean normalised predicted entropies, used for plotting the reliability diagram.
ys: np.ndarray: Array of the empirical inaccuracy per bin, used for plotting the reliability diagram.
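The UCE mirrors the ECE but bins on normalised entropy and compares it with the error rate. The sketch below assumes normalisation by log(k) and weighting of bins by sample share; the name `uce_sketch` is hypothetical, and the sketch returns only the scalar, not the plotting arrays.

```python
import numpy as np

def uce_sketch(probs, labels, bin_number=15, eps=1e-12):
    """Weighted mean of |error rate - mean normalised entropy| per bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels).reshape(-1)
    p = np.clip(probs, eps, 1.0)
    ent = -(p * np.log(p)).sum(axis=1) / np.log(probs.shape[1])
    wrong = (probs.argmax(axis=1) != labels).astype(float)
    # Equal-width bins on [0, 1]; digitize against the interior edges.
    edges = np.linspace(0.0, 1.0, bin_number + 1)
    bins = np.digitize(ent, edges[1:-1])
    uce = 0.0
    for b in range(bin_number):
        in_bin = bins == b
        if in_bin.any():
            uce += in_bin.mean() * abs(wrong[in_bin].mean() - ent[in_bin].mean())
    return float(uce)

# Usage: maximally uncertain predictions that are wrong half of the time.
print(uce_sketch(np.array([[0.5, 0.5], [0.5, 0.5]]), np.array([[0], [1]])))
```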

qumphy.uq_metrics.variation_calibration_error(predictions, targets, filepath, save_fig=True, bin_number=15)[source]

Compute the Variation Calibration Error (VCE) metric, which extends the ECE to a measure of distributional calibration for multiclass problems. The measure of variation employed for this metric is the Shannon entropy. Bins can be defined with either equal width or equal frequency.

Return type:: Tuple[float, ndarray, ndarray]

Parameters:

predictions (np.ndarray):: Prediction array with shape (n, k) for n samples and k classes.
targets (np.ndarray):: Target class labels with shape (n, 1).
save_fig (bool, default = True):: Flag to save the reliability diagram.
filepath (str):: Filepath to which the reliability diagram is saved.
bin_number (int, default = 15):: Number of bins used to compute metric and plot reliability diagram.

Returns:

:

VCE (float):: VCE metric value.

xs (np.ndarray):: Mean entropy of predictions array for plotting the reliability diagram.
ys (np.ndarray):: Entropy of the rank distribution array for plotting the reliability diagram.

qumphy.uq_metrics.z_variance_error(predictions, targets, uncertainties, is_var=True, bin_number=15)[source]

Calculate the Z-variance error (ZVE) for the model predictions and corresponding total uncertainties.

Parameters:

predictions: np.ndarray: An array of the model predictions with shape (n, 1)
targets: np.ndarray: Array of target values with shape (n, 1)
uncertainties: np.ndarray: Array of the total uncertainties associated with the model predictions from a chosen UQ method, with shape (n, 1)
is_var: bool, optional: Boolean parameter that sets whether to convert the uncertainties from variances to standard deviations, by default True
bin_number: int, optional: Chosen number of bins for calculating ZVE (note: the bins are defined to have an equal number of samples per bin), by default 15

Returns:

:

ZVE: float: The ZVE value for the input predictions and uncertainties
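A sketch of a Z-variance error is given below. It assumes one published form of the metric: z-scores z = (y - mu) / sigma are grouped into equal-count bins ordered by uncertainty, and the absolute log of each bin's z-score variance (which should be 1 under calibration) is averaged. The module may aggregate differently, and the name `zve_sketch` is hypothetical.

```python
import numpy as np

def zve_sketch(predictions, targets, uncertainties, is_var=True, bin_number=15):
    """Mean over equal-count bins of |ln Var(z)| (assumed aggregation)."""
    mu = np.asarray(predictions, dtype=float).reshape(-1)
    y = np.asarray(targets, dtype=float).reshape(-1)
    sigma = np.asarray(uncertainties, dtype=float).reshape(-1)
    if is_var:
        sigma = np.sqrt(sigma)
    z = (y - mu) / sigma
    order = np.argsort(sigma, kind="stable")
    terms = [
        abs(np.log(np.var(z[idx])))
        for idx in np.array_split(order, bin_number)
        if idx.size > 1                      # a variance needs at least two samples
    ]
    return float(np.mean(terms))

# Usage with synthetic data whose noise matches the stated variances.
rng = np.random.default_rng(0)
var = rng.uniform(0.5, 2.0, size=(3000, 1))
y = rng.normal(0.0, np.sqrt(var))
print(zve_sketch(np.zeros((3000, 1)), y, var, bin_number=5))
```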