Significativity Indices for Agreement Values
Casagrande, Alberto; Girometti, Rossano; Pagliarini, Roberto
2025-01-01
Abstract
Agreement measures, such as Cohen’s kappa or intraclass correlation, gauge the matching between two or more classifiers. They are used in a wide range of contexts, from medicine, where they evaluate the effectiveness of medical treatments and clinical trials, to artificial intelligence, where they can quantify the approximation due to the reduction of a classifier. The consistency of different classifiers with a gold standard can be compared simply by using the order induced by their agreement measure with respect to the gold standard itself. Nevertheless, labelling an approach as good or bad exclusively on the basis of an agreement value requires a scale or a significativity index. Some quality scales have been proposed in the literature for Cohen’s kappa, but they are mainly naïve, and their boundaries are arbitrary. We propose a general approach to evaluate the significance of any agreement value between two classifiers as the probability of randomly choosing a confusion matrix, built over the same data set, with a lower agreement value. This measure, named significativity, gauges the relevance of the observed agreement value rather than replacing the agreement measure used to compute it. This work introduces two significativity indices: one dealing with finite data sets, the other handling classification probability distributions. The manuscript also addresses the computational challenges of evaluating such indices and proposes efficient algorithms for their evaluation.
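To make the idea concrete, the Python sketch below estimates a significativity-style index by plain Monte Carlo: it computes Cohen’s kappa for an observed confusion matrix and then the fraction of confusion matrices, drawn uniformly among the k×k non-negative integer matrices with the same total number of samples, whose kappa is lower. The uniform-sampling model, the choice of Cohen’s kappa as the agreement measure, and all function names are illustrative assumptions; the paper’s precise definitions and its efficient evaluation algorithms are not reproduced here.

```python
import random
import numpy as np

def cohen_kappa(M):
    """Cohen's kappa for a k x k confusion matrix M (rows: rater A, cols: rater B)."""
    M = np.asarray(M, dtype=float)
    n = M.sum()
    po = np.trace(M) / n                          # observed agreement
    pe = (M.sum(axis=1) @ M.sum(axis=0)) / n**2   # chance agreement from the margins
    if pe == 1.0:
        return 1.0  # degenerate case (all mass in one diagonal cell); convention only
    return (po - pe) / (1.0 - pe)

def random_confusion_matrix(n, k, rng):
    """Uniformly sample a k x k non-negative integer matrix with entries summing to n,
    via the stars-and-bars bijection with compositions of n into k*k parts."""
    m = k * k
    bars = sorted(rng.sample(range(n + m - 1), m - 1))
    parts, prev = [], -1
    for b in bars + [n + m - 1]:
        parts.append(b - prev - 1)
        prev = b
    return np.array(parts).reshape(k, k)

def significativity(M, samples=100_000, seed=0):
    """Estimate the probability that a uniformly drawn confusion matrix on the same
    data-set size has a lower agreement value than the observed one (hypothetical
    helper, not the paper's algorithm)."""
    M = np.asarray(M)
    n, k = int(M.sum()), M.shape[0]
    observed = cohen_kappa(M)
    rng = random.Random(seed)
    lower = sum(cohen_kappa(random_confusion_matrix(n, k, rng)) < observed
                for _ in range(samples))
    return lower / samples
```

For a strongly agreeing matrix such as `significativity([[40, 5], [5, 50]])`, one would expect an estimate near 1, since most uniformly drawn matrices of the same size have a much lower kappa; the exact figure depends on the sampling assumptions, the sample count, and the seed.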
| File | Type | License | Size | Format |
|---|---|---|---|---|
| s11222-025-10728-1.pdf (open access) | Publisher's version (PDF) | Creative Commons | 933.37 kB | Adobe PDF |


