SkillAgentSearch skills...

ROCAnalysis.jl

Receiver Operating Characteristics and functions for evaluation probabilistic binary classifiers

Install / Use

/learn @davidavdav/ROCAnalysis.jl
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

ROCAnalysis.jl

CI

Install

Pkg.add("ROCAnalysis")

Introduction

Receiver Operating Characteristic Analysis functions for evaluation probabilistic binary classifiers. Allows efficient plotting of ROC curves, and many more things.

Please note there is an alternative implementation under a similar name, and support for ROC analysis also exists in MLBase.

Our implementation is more geared towards:

  • large amounts of data, with efficient ROC statistics calculation
  • Decision Cost Function analysis
  • Detection Error Trade-off (DET) analysis
  • ROC convex hull computation, analysis and EER-interpretation
  • Optimal Likelihood Ratio computation

The development roadmap is largely based on the functionality in a similar R package ROC.

Synopsis

An annotated jupyter notebook of the code below can be found here.

using ROCAnalysis
## Produce some well-calibrated log-likelihood-ratio scores for target and non-target class:
tar =  2 .+ 2randn(1000)
non = -2 .+ 2randn(100000)
## quick estimate of equal error rate, should be close to pnorm(-1) = 0.5 + 0.5erf(-1/√2)
eer(tar, non) 
## compute full ROC statistics
r = roc(tar, non)
## accurate computation of the equal error rate, using the convex hull
eerch(r)
## roc plot, we plot errors (false negatives against false positives) rather than hits vs. false alarms.
using Plots ## or perhaps another plotting package
plot(r)
## The "Detection Error Tradeoff" plot, this should give a more/less straight line
detplot(r)
## compute the Area Under the ROC, should be close to 0.078
auc(r)
## define a decision cost function by its parameter p_tar=0.01, Cfa=1, Cmiss=10 (NIST SRE 2008 setting)
d = DCF(0.01, 1, 10)
## `actual costs' using a threshold of scores at -plo(d) (minus prior log odds)
plo(d)
dcf(tar, non, d=d)
## Or use a different threshold, e.g., zero
dcf(tar, non, d=d, thres=0)
## `minimal costs' using an optimal threshold
mindcf(r, d=d)
## define an array of DCFs, and compute the decision costs for these, using a threshold at -plo
d = DCF([0.001, 0.01, 0.1, 0.5, 0.9, 0.99, 0.999], 1, 1)
dcf(tar, non, d=d)
## same, but normalized to costs of decisions based on the prior alone
dcf(tar, non, d=d, norm=true)
## prior log odds, thre crucial combination of cost parameters, log(p_tar / (1-p_tar) Cmiss / Cfa)
plo(d)
## now scan the Bayes error rate (similar to the dcf above) for a range of prior log odds, and plot
## This is known as the Applied Probability of Error plot
apeplot(r)
## The area under the red curve (actual error rates), the cost of the log-likelihood ratio
cllr(tar, non)
## The area under the green curve (minimum errors), the cost of the optimal log-likelihood-ratio
mincllr(tar, non)
## Similar to APE, but normalized---a Normalized Bayes Error plot
nbeplot(r)
## Make an `LLR' plot: score-to-optimal-LLR mapping, r.θ, vs. r.llr
llrplot(r)

Receiver Operating Characteristic

A binary classifier maps an input x to one of two classes, A and B. Internally, every classifier ends up producing some form of a scalar metric s, which can be thresholded to produce a decision.

There are two possible "senses" of this internal scalar:

  • higher values of s indicate higher probability of A
  • higher values of s indicate higher probability of B

There are of course also many different interpretations of the classes A and B. For instance, in biometrics B could mean "same individual" and A "different individual". The corresponding senses of s then have an interpretation

  • score-like: a higher value means a better match
  • distance-like: a higher value means a larger difference.

Because we want to focus in this package on a probabilistic interpretation of the scalar s, we take the "score-like" interpretation of s, i.e., higher values of s correspond to a higher likelihood of the class-of-interest to be associated to the input of the classifier. If your classifier is, in fact, a distance metric d, you could work with s = -d or s = 1/d or any other strictly decreasing function. Alternatively, you can swap around the label of the class of interest.

As the name suggests, a classifier is supposed to make decisions. Decisions can be thesholded against a fixed θ such that:

  • if s > θ, decide class B
  • if s < θ, decide class A

For evaluating the classifier, we need a set of supervised trials, i.e., for each scalar score s we need a label indicating the true class of the trial that led to score s. Because there are two classes, two types of errors can be made:

  • false positives: s > θ while in fact the true label is A
  • false negatives: s < θ while in fact the true label is B.

The Receiver Operating Characteristic (ROC) is a graph that shows how the fractions of the false positives and false negatives change with varying θ, for a fixed set of scores s. In ROCAnalysis, the structure of type Roc captures the essential information in a pre-processed way such that other quantities can be derived efficiently.

Because I come from automatic speaker recognition, I tend to use the following terminology for the classes

  • target, the higher scores, a.k.a. same source, true client, ...
  • non-target, the lower scores, a.k.a. different source, impostor, ...

Error rates

In this package, we focus at analysing everything in terms of error rates. Traditionally, researchers have used ROC curves where one of the axes is used to describe the complement of an error rate. Specifically, one often sees a true positive rate versus a false positive rate, where the true positive rate is the complement of the false negative rate. There is no real objective way to judge whether one analysis is any better than the other, usually the choice is largely dependent on traditions in the area of the research you are in.

There are many different names of the error rates in different scientific disciplines. Because I come from the area of automatic speaker recognition, the current terminology is

  • Probability of False Alarm: a.k.a. false alarm rate, false positive rate, false accept rate, false match rate, Type II error, 1 - true negative rate, 1 - specificity
  • Probability of Miss:: a.k.a. miss rate, false negative rate, false reject rate, false non-match rate, Type I error, 1 - true positive rate, 1 - sensitivity, 1 - recall, 1 - verification rate, 1 - hit rate, 1 - genuine acceptance rate

We foresee that the naming of things becomes a bit more flexible in future releases of this package.

DET plot

A detection error trade-off plot (Martin et. al., 1997) is exactly the same as a ROC plot in the error domain (i.e., miss rate vs false alarm rate), but the axes have been warped according to Φ⁻¹(x), the inverse of the cumulative normal distribution. In R, this function is known as qnorm(x), in Julia base this is √2 * erfinv(2x - 1). This type of plot has interesting properties

  • If the distributions of target and non-target scores are both Normal, then the DET-curve is a straight line. In practice, many detection problems give rise to more-or-less straight DET curves, and this suggests that there exists a strictly increasing warping function that can make the score distributions (more) Normal.
  • Towards better performance (lower error rates), the resolution of the graph is higher. This makes it more easy to have multiple systems / performance characteristics over a smaller or wider range of performance in the same graph, and still be able to tell these apart.
  • Conventionally, the ranges of the axes are chosen 0.1%--50%. This makes it possible to immediately assess the overall performance based on the absolute position of the line in the graph if you have seen more DET plots in your life.
  • The slope of the (straight) line corresponds to the ratio of the σ parameters of the underlying Normal score distributions, namely that of the non-target scores divided by that of the target scores. Often, highly discriminative classifiers show very flat curves, indicating that that target scores have a much larger variance than the non-target scores.
  • The origin of this type of plot lies in psychophysics, where graph paper with lines according to this warping was referred to as double probability paper. The diagonal y = x in a DET plot corresponds linearly to a quantity known as d' (d-prime) from psychophysics, ranging from 0 at 50% error to about 6 at 0.1% error.

Discrete and continuous scores

There is an essential difference between discrete score (classes) and continuous scores. For the former, trials with the same scores must be grouped before the probabilities of false alarm and miss are computed. This results in ROC and DET plots that can have line elements that are not solely horizontal or vertical. This is contrary to the latter case if we assume that no two scores are (coincidentally) the same, which leads to only horizontal and vertical line segments. This ROCAnalysis package makes sure that the occurrence of identical scores is treated correctly by sorting target scores before identical non-target scores, and by treating trials with scores at the threshold always as errors.

Plot optimisation

For larget trial sets, it is very likely that in the extremes of the score distributions there is very little overlap. This will results in many consecutive horizontal or vertical line segments in the plot. This ROCAnalysis package integrates th

Related Skills

View on GitHub
GitHub Stars33
CategoryDevelopment
Updated11mo ago
Forks10

Languages

Jupyter Notebook

Security Score

67/100

Audited on Apr 23, 2025

No findings