Significativity Indices for Agreement Values

Casagrande, Alberto; Girometti, Rossano; Pagliarini, Roberto
2025-01-01

Abstract

Agreement measures, such as Cohen’s kappa or intraclass correlation, gauge the matching between two or more classifiers. They are used in a wide range of contexts, from medicine, where they evaluate the effectiveness of medical treatments and clinical trials, to artificial intelligence, where they can quantify the approximation introduced by reducing a classifier. The consistency of different classifiers with a gold standard can be compared simply by using the order induced by their agreement values with respect to the gold standard itself. Nevertheless, labelling an approach as good or bad exclusively on the basis of an agreement value requires a scale or a significativity index. Some quality scales have been proposed in the literature for Cohen’s kappa, but they are mainly naïve and their boundaries are arbitrary. We propose a general approach to evaluating the significance of any agreement value between two classifiers as the probability of randomly choosing a confusion matrix, built over the same data set, with a lower agreement value. This measure, named significativity, gauges the relevance of the observed agreement value rather than replacing the agreement measure used to compute it. This work introduces two significativity indices: one dealing with finite data sets, the other handling classification probability distributions. The manuscript also addresses the computational challenges of evaluating such indices and proposes efficient algorithms for their evaluation.
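
The finite-data-set index described in the abstract admits a straightforward Monte Carlo approximation. The sketch below is illustrative only (it is not the paper's algorithm, and all function names are assumptions): it estimates the significativity of Cohen's kappa for an observed confusion matrix by uniformly sampling confusion matrices with the same total count, via a stars-and-bars construction, and counting how many yield a lower kappa. Uniform sampling over all matrices with the same total is just one possible randomization; other choices of distribution lead to different indices.

import random

def cohens_kappa(matrix):
    # Cohen's kappa for a square confusion matrix given as a list of count rows.
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    p_o = sum(matrix[i][i] for i in range(k)) / n
    rows = [sum(matrix[i]) for i in range(k)]
    cols = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(rows[i] * cols[i] for i in range(k)) / n ** 2
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0  # degenerate single-class case

def random_confusion_matrix(n, k, rng):
    # Uniformly sample a k x k matrix of non-negative integers summing to n
    # (stars-and-bars: choose k*k - 1 cut points among n + k*k - 1 positions).
    m = k * k
    cuts = sorted(rng.sample(range(1, n + m), m - 1))
    parts = [b - a - 1 for a, b in zip([0] + cuts, cuts + [n + m])]
    return [parts[i * k:(i + 1) * k] for i in range(k)]

def significativity_mc(observed, samples=20_000, seed=0):
    # Monte Carlo estimate of the probability that a random confusion matrix
    # with the same total count has a lower kappa than the observed one.
    # Illustrative sketch: the paper proposes dedicated, more efficient algorithms.
    rng = random.Random(seed)
    n, k = sum(sum(row) for row in observed), len(observed)
    target = cohens_kappa(observed)
    lower = sum(cohens_kappa(random_confusion_matrix(n, k, rng)) < target
                for _ in range(samples))
    return lower / samples

observed = [[40, 10], [5, 45]]          # two raters, 100 cases
print(cohens_kappa(observed))           # observed agreement value
print(significativity_mc(observed))     # estimated significativity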
Files in this record:

File: s11222-025-10728-1.pdf
Access: open access
Type: publisher's version (PDF)
License: Creative Commons
Size: 933.37 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11390/1314144
Citations
  • PubMed Central: not available
  • Scopus: 0
  • Web of Science: 0