Analyzing AI Evaluation Benchmarks Through Information Retrieval and Network Science
Simeoni, Gaia; Soprano, Michael; Lunardi, Riccardo; Roitero, Kevin; Mizzaro, Stefano
2026-01-01
Abstract
Many analyses have been performed on Information Retrieval (IR) evaluation benchmarks. Benchmarking also plays a central role in evaluating the capabilities of Large Language Models (LLMs). In this paper, we apply an IR approach to LLM evaluation. Adapting a method developed for TREC test collections, we analyze LLM benchmark results through the lens of network science. We construct a bipartite graph between models and benchmark questions and apply Kleinberg's HITS algorithm to uncover latent structure in the evaluation data. In this framework, model hubness quantifies a model's tendency to perform well on easy questions, while question hubness captures a question's ability to discriminate between more and less effective models. We conduct experiments on seven multiple-choice QA benchmarks with a pool of 34 LLMs. Through this IR-inspired approach, we show that the ranking of models on leaderboards is strongly influenced by subsets of easy questions.
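The setup described in the abstract can be sketched with a short HITS power iteration over a model-question incidence matrix. This is a minimal illustration, not the paper's implementation: the toy matrix, the choice of models as one side and questions as the other, and the score names are all assumptions made here for concreteness.

```python
import numpy as np

# Hypothetical toy data: rows = models, columns = benchmark questions;
# entry 1 means the model answered the question correctly.
A = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

def hits_bipartite(A, iters=100):
    """Power iteration for HITS on a model-question bipartite graph.

    Returns (model_scores, question_scores), each L2-normalized.
    """
    m_scores = np.ones(A.shape[0])
    for _ in range(iters):
        # A question scores high when it is answered by high-scoring models.
        q_scores = A.T @ m_scores
        q_scores /= np.linalg.norm(q_scores)
        # A model scores high when it answers high-scoring questions.
        m_scores = A @ q_scores
        m_scores /= np.linalg.norm(m_scores)
    return m_scores, q_scores

model_scores, question_scores = hits_bipartite(A)
```

In this toy example the question answered by all three models ends up with the highest question score, matching the abstract's reading of question scores as driven by easy, widely answered items.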


