
Confidence intervals for evaluation in machine learning <!-- omit in toc -->

This repository provides a simple implementation of the bootstrapping approach to compute confidence intervals for evaluation in machine learning. In this document, we first show how to install and use the toolkit and then provide a brief tutorial on the subject.

Table of contents <!-- omit in toc -->

Installation

```bash
pip install confidence_intervals
```

Alternatively, if you need to change the code, you can clone this repository and edit at will.

Basic usage

Below is a quick way to test the code. For more details on usage, see the notebook.

```python
# Import the main function
from confidence_intervals import evaluate_with_conf_int

# Define the metric of interest (could be a custom method)
from sklearn.metrics import accuracy_score

# Create a toy dataset for this example (to be replaced with your actual data)
from confidence_intervals.utils import create_data
decisions, labels, conditions = create_data(200, 200, 20)

# Run the function. In this case, the samples are represented by the categorical decisions
# made by the system which, along with the labels, is all that is needed to compute the metric.
samples = decisions
evaluate_with_conf_int(samples, accuracy_score, labels, conditions, num_bootstraps=1000, alpha=5)
```

The code above produces the following output:

```
(0.855, (0.7938131968651883, 0.9126023142471228))
```

The first number is the metric value on the full dataset. The tuple that follows gives the lower and upper bounds of the confidence interval.
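As noted in the notes further below, for a metric like accuracy, passing categorical decisions together with labels is equivalent to passing per-sample 0/1 correctness values, whose plain mean is the metric itself. A self-contained sketch of this equivalence (plain NumPy, independent of the toolkit; all names here are illustrative):

```python
import numpy as np

# Hand-written accuracy following the metric(labels, samples) calling convention
def accuracy(labels, decisions):
    return np.mean(np.asarray(labels) == np.asarray(decisions))

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
# Toy decisions that agree with the labels about 85% of the time
decisions = np.where(rng.random(200) < 0.85, labels, 1 - labels)

# Metric computed from decisions and labels ...
acc_from_decisions = accuracy(labels, decisions)

# ... equals the mean of per-sample 0/1 correctness values, with no labels needed
acc_from_losses = (decisions == labels).mean()
```

Both quantities are identical, which is why the second "Emotion classification" row in the table below needs no label column.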

The notebook includes more examples on how this function may be used and how to plot the resulting confidence intervals.

Input data

The arguments to the evaluate_with_conf_int function are the following.

Required arguments

  • Samples: an array with, for each sample, the value needed to compute the metric. This will generally be the system's output (scores or decisions). However, for metrics that are simple averages of some per-sample loss, these values can simply be the per-sample losses (more on this below).
  • Metric: the metric to be used for assessing performance. The function will be called internally as metric([labels], samples, [samples2]), where the two arguments in brackets are optional (if they are None, they are excluded from the call).

Optional arguments

  • Labels: an array with, for each sample, the label or any other information (in addition to the value in the samples array) needed to compute the metric. Default=None.
  • Conditions: an array of integers indicating the condition of each sample (e.g., the speaker identity). This argument can be None if the samples can be considered iid. If conditions is not None, all samples with the same condition are sampled together when doing bootstrapping. Default=None.
  • num_bootstraps: the number of bootstrap sets to be created. Default=1000.
  • alpha: the level of the interval. The confidence interval will be computed between the alpha/2 and 100-alpha/2 percentiles. Default=5.
  • samples2: a second array of samples, for metrics that require an additional input.

The metric, samples, and labels can be as simple or as complex as your task requires. The table below shows some examples on how the different inputs may be defined.
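For instance, a weighted-average metric such as the weighted average WER in the table below could be written as follows (a sketch in plain NumPy; the function name is illustrative, and the signature follows the metric(labels, samples) calling convention described above, with the per-sample word counts passed as the labels):

```python
import numpy as np

def weighted_avg_wer(num_words, per_sample_wer):
    # Called internally as metric(labels, samples): here the "labels" are the
    # per-sample word counts and the "samples" are the per-sample WER values.
    # The weighted average equals total word errors divided by total words.
    return np.average(per_sample_wer, weights=num_words)

# Illustrative per-sample values for three utterances
wers = np.array([0.10, 0.50, 0.20])
words = np.array([100, 10, 40])

print(weighted_avg_wer(words, wers))  # (10 + 5 + 8) / 150
```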

<center>

| Task | Metric | Sample | Label | Condition |
|------|--------|--------|-------|-----------|
| Emotion classification | Accuracy | System's decision | Emotion label | Speaker |
| Emotion classification | Accuracy | 0 or 1 (correct/incorrect decision) | - | Speaker |
| Speaker verification | EER | System's score for trial | Target/Impostor | See comment below |
| Automatic speech recognition | Av. WER | Per-sample WER | - | Speaker |
| Automatic speech recognition | Weighted Av. WER | Per-sample WER | Num words | Speaker |
| Diarization | Weighted Av. DER | Per-sample DER | Num speech frames | Speaker |

</center>

Some notes:

  • For metrics that are averages of some per-sample loss, such as accuracy or the average WER, each sample can be represented directly by its per-sample loss: the metric is then simply the average of the losses over the samples, and the label is not needed.
  • For the weighted average WER and DER metrics often used for ASR and diarization, where the weights are given by the number of words or the number of speech frames in each sample, respectively, the label field can be used to provide that quantity for each sample, so that the metric can be computed from the individual WER or DER values and the weights (see the example in the notebook).
  • While for speech tasks the speaker is the most common correlation-inducing factor, other factors may exist, like the recording session (if more than one sample is generated in a session) or the original waveform (if samples are waveform chunks extracted from longer waveforms).
  • In speaker verification, bootstrapping by condition is harder than for other tasks because both sides in a trial (the enrollment and the test side) have their own condition. The code in this repository cannot handle this particular case. Instead, joint bootstrapping is needed. Please see the code in the DCA-PLDA github repository (compute_performance_with_confidence_intervals method) for an example of how to do joint bootstrapping for speaker verification.
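To make the role of the conditions argument concrete, here is a minimal sketch of condition-level resampling (plain NumPy, independent of this toolkit; the function name is illustrative). Conditions, not individual samples, are drawn with replacement, so all samples sharing a condition are kept or dropped together:

```python
import numpy as np

def bootstrap_by_condition(conditions, rng):
    """Return sample indices for one bootstrap set in which all samples
    sharing a condition (e.g., a speaker) are sampled together."""
    unique = np.unique(conditions)
    # Draw conditions, not samples, with replacement
    drawn = rng.choice(unique, size=len(unique), replace=True)
    # Collect every sample belonging to each drawn condition
    return np.concatenate([np.flatnonzero(conditions == c) for c in drawn])

rng = np.random.default_rng(0)
conditions = np.array([0, 0, 1, 1, 1, 2])  # 3 speakers, 6 samples
idx = bootstrap_by_condition(conditions, rng)
```

A speaker that is drawn twice contributes all of its samples twice, which is what preserves the within-speaker correlations in the bootstrap distribution.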

Tutorial

The goal of evaluation in machine learning is to predict the performance a given system or method will have in practice. Here, we use the word "system" to refer to a frozen model, with all its stages, parameters, and hyperparameters fixed. In contrast, we use the word "method" to refer to an approach which will eventually be instantiated into a system after training any trainable parameters.

Evaluation in machine learning is done by computing some performance metric of choice --one that is relevant to our application-- on a test dataset, assuming that this data is representative of the use-case scenario. Now, the metric value we obtain after this process will depend on a number of random factors: the data we happen to have in our hands for training, developing and testing the systems, and the random seeds we may have used during system training. Any change in these random factors will result in a change in performance which, in turn, might change our conclusions on which system or method is best or how well or badly they will perform in practice. So, it is essential to take these random factors into account when trying to derive scientific or practical conclusions from empirical machine learning results.

Below, we describe two of the most common evaluation scenarios: 1) evaluation of systems, 2) evaluation of methods [Dietterich, 1998]. Further, we describe how the bootstrapping technique can be used to compute confidence intervals. This approach has an advantage over other methods commonly used for computing statistical significance or confidence intervals: it makes no assumptions about the distribution of the data. Further, it can properly handle datasets where the samples are not iid. For example, if the test dataset is composed of samples from different speakers, each contributing several samples, the speaker identity will introduce correlations between the samples. Ignoring these correlations when computing confidence intervals would result in intervals that are narrower than they should be. Bootstrapping can be easily adapted to take these correlation-inducing factors into account.

Evaluation of systems

Perhaps the simplest evaluation scenario is one where we have a number of different systems already trained and we want to know how they will perform when deployed. Our goal in this scenario is to predict as best as possible how each system will perform on yet-unseen data.

To obtain an estimate of the performance that a system of interest will have on future data, we need a set of data that is representative of the one we will eventually see during deployment. We can then run that data through the system and compute its performance. Now, say that we have two systems, A and B, and system A turns out to be better than B by 5%. We might then wonder: does this really mean that A will be better than B in practice? Or, in other words, if we were to change the test dataset to a new set of samples from the same domain, would system A still be better? This question can be addressed by estimating the variability that the metric has as a function of the test dataset, which can be done with the bootstrapping approach.
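One way to operationalize this question (a self-contained sketch in plain NumPy, not part of this toolkit's API) is to resample the same test set for both systems and look at the bootstrap distribution of the difference between their metrics; if the resulting interval excludes zero, the observed ranking is unlikely to be an accident of the test data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical per-sample 0/1 correctness for two systems, A slightly better
correct_a = (rng.random(n) < 0.90).astype(float)
correct_b = (rng.random(n) < 0.85).astype(float)

diffs = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)  # same resample applied to both systems
    diffs.append(correct_a[idx].mean() - correct_b[idx].mean())

# 95% percentile interval on the accuracy difference A - B
low, high = np.percentile(diffs, [2.5, 97.5])
```

Using the same bootstrap indices for both systems is essential: it captures the fact that both metrics are computed on the same test samples.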

The bootstrap approach for computing confidence intervals for system evaluation <!-- omit in toc -->

The basic steps to compute confidence intervals based on the bootstrapping approach for assessing the effect of the test data on a system's performance are as follows. Given a dataset with $N$ samples:

  1. Repeat the two steps below $B$ times.
    • Sample the test dataset with replacement to get $N$ samples. The new dataset will be of the same size as the original, but will have some missing and some repeated samples.
    • Compute the metric of interest on this bootstrap set.
  2. Compute the confidence interval from the $B$ resulting metric values, taking the range between their alpha/2 and 100-alpha/2 percentiles.
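The steps above can be sketched in plain NumPy (independent of this toolkit; the function name is illustrative), here for a metric that is a simple mean of per-sample values, such as accuracy on iid samples:

```python
import numpy as np

def bootstrap_ci(per_sample_metric, num_bootstraps=1000, alpha=5, seed=0):
    """Percentile-bootstrap confidence interval for a mean-of-losses metric."""
    rng = np.random.default_rng(seed)
    n = len(per_sample_metric)
    values = np.empty(num_bootstraps)
    for b in range(num_bootstraps):
        idx = rng.integers(0, n, size=n)      # sample N indices with replacement
        values[b] = per_sample_metric[idx].mean()
    low, high = np.percentile(values, [alpha / 2, 100 - alpha / 2])
    return per_sample_metric.mean(), (low, high)

rng = np.random.default_rng(1)
correct = (rng.random(200) < 0.85).astype(float)  # toy 0/1 correctness values
center, (low, high) = bootstrap_ci(correct)
```

The returned pair has the same shape as the output of evaluate_with_conf_int shown earlier: the metric on the full dataset, followed by the interval bounds.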
