Size and source matter: Understanding inconsistencies in test collection-based evaluation