On the effect of relevance scales in crowdsourcing relevance assessments for Information Retrieval evaluation

Roitero K.; Maddalena E.; Mizzaro S.
2021

Abstract

Relevance is a key concept in information retrieval and is widely used to evaluate search systems with test collections. We present a comprehensive study of the effect of the choice of relevance scale on the evaluation of information retrieval systems. Our work analyzes and compares four crowdsourced scales (2-level, 4-level, and 100-level ordinal scales, and a magnitude estimation scale) and two expert-labeled datasets (on 2- and 4-level ordinal scales). We compare the scales in terms of internal and external agreement and their effect on IR evaluation, for both system effectiveness and topic ease, and we discuss the effect of these scales and datasets on the perception of relevance levels by assessors. Our analyses show that: crowdsourced judgment distributions are consistent across scales, both overall and at the per-topic level; on all scales crowdsourced judgments agree with the expert judgments, and overall the crowd assessors are able to express reliable relevance judgments; all scales lead to a similar level of external agreement with the ground truth, while the internal agreement among crowd workers is higher for fine-grained scales; more fine-grained scales consistently lead to higher correlation values for both system ranking and topic ease; finally, the considered scales lead to different perceived distances between relevance levels.

Use this identifier to cite or link to this document: http://hdl.handle.net/11390/1209252