Towards building a standard dataset for Arabic keyphrase extraction evaluation

Helmy, Muhammad; Basaldella, Marco; Maddalena, Eddy; Mizzaro, Stefano; Demartini, Gianluca

doi:10.1109/IALP.2016.7875927

Keyphrases are short phrases that best represent a document's content. They can be useful in a variety of applications, including document summarization and retrieval models. In this paper, we introduce the first dataset of keyphrases for an Arabic document collection, obtained by means of crowdsourcing. We experimentally evaluate different strategies for aggregating crowdsourced answers and validate their performance against expert annotations to assess the quality of our dataset. We report our experimental results, the dataset features, some lessons learned, and ideas for future work.
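
The abstract does not spell out which aggregation strategies were compared, so the following is only an illustrative sketch of one common option: keep every keyphrase proposed by at least a given number of workers, then score the result against an expert-provided set with precision, recall, and F1. The function names, the `min_votes` threshold, and the sample data are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def aggregate_by_votes(worker_answers, min_votes=2):
    """Keep each keyphrase proposed by at least `min_votes` workers.

    Hypothetical vote-threshold strategy, not necessarily one used in the paper.
    """
    counts = Counter()
    for answers in worker_answers:
        # Normalise and deduplicate so each worker votes once per phrase.
        # (Lowercasing only affects cased scripts; it leaves Arabic text unchanged.)
        counts.update({a.strip().lower() for a in answers})
    return {phrase for phrase, votes in counts.items() if votes >= min_votes}


def precision_recall_f1(predicted, gold):
    """Compare an aggregated keyphrase set with an expert (gold) set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up sample data: three workers annotating one document.
workers = [
    ["machine learning", "crowdsourcing"],
    ["Crowdsourcing", "keyphrase extraction"],
    ["keyphrase extraction", "crowdsourcing", "dataset"],
]
expert = {"crowdsourcing", "keyphrase extraction", "arabic"}

aggregated = aggregate_by_votes(workers, min_votes=2)
precision, recall, f1 = precision_recall_f1(aggregated, expert)
```

Raising `min_votes` typically trades recall for precision, which is one way different aggregation settings could be compared against the same expert annotations.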