Size and source matter: Understanding inconsistencies in test collection-based evaluation

MIZZARO, Stefano;
2014-01-01

Abstract

Past work showed that retrieval results can differ significantly between test collections, even when one collection contains only a subset of the documents in the other. However, the experimental methodology of that work made it hard to determine the cause of the inconsistencies. Using a novel methodology that eliminates the problems caused by an uneven distribution of relevant documents, we confirm that whether a statistically significant improvement is observed between two IR systems can be strongly influenced by the choice of documents in the test collection. We investigate two possible causes of this problem: collection size and document source. Our results show that both have a strong influence on how a test collection ranks one retrieval system relative to another. This is of particular interest when constructing test collections, as we show that using different subsets of a collection produces differing evaluation results.
Year: 2014
ISBN: 978-145032598-1

Use this identifier to cite or link to this document: https://hdl.handle.net/11390/1108801