The main purpose of this thesis is to analyze and propose new improvements in the field of Automatic Keyphrase Extraction, i.e., the field of automatically detecting the key concepts in a document. We will discuss, in particular, supervised machine learning algorithms for keyphrase extraction, by first identifying their shortcomings and then proposing new techniques which exploit contextual information to overcome them. Keyphrase extraction requires that the key concepts, or \emph{keyphrases}, appear verbatim in the body of the document. We will identify the fact that current algorithms do not use contextual information when detecting keyphrases as one of the main shortcomings of supervised keyphrase extraction. Instead, statistical and positional cues, like the frequency of the candidate keyphrase or its first appearance in the document, are mainly used to determine if a phrase appearing in a document is a keyphrase or not. For this reason, we will prove that a supervised keyphrase extraction algorithm, by using only statistical and positional features, is actually able to extract good keyphrases from documents written in languages that it has never seen. The algorithm will be trained over a common dataset for the English language, a purpose-collected dataset for the Arabic language, and evaluated on the Italian, Romanian and Portuguese languages as well. This result is then used as a starting point to develop new algorithms that use contextual information to increase the performance in automatic keyphrase extraction. The first algorithm that we present uses new linguistics features based on anaphora resolution, which is a field of natural language processing that exploits the relations between elements of the discourse as, e.g., pronouns. We evaluate several supervised AKE pipelines based on these features on the well-known SEMEVAL 2010 dataset, and we show that the performance increases when we add such features to a model that employs statistical and positional knowledge only. Finally, we investigate the possibilities offered by the field of Deep Learning, by proposing six different deep neural networks that perform automatic keyphrase extraction. Such networks are based on bidirectional long-short term memory networks, or on convolutional neural networks, or on a combination of both of them, and on a neural language model which creates a vector representation of each word of the document. These networks are able to learn new features using the the whole document when extracting keyphrases, and they have the advantage of not needing a corpus after being trained to extract keyphrases from new documents. We show that with deep learning based architectures we are able to outperform several other keyphrase extraction algorithms, both supervised and not supervised, used in literature and that the best performances are obtained when we build an additional neural representation of the input document and we append it to the neural language model. Both the anaphora-based and the deep-learning based approaches show that using contextual information, the performance in supervised algorithms for automatic keyphrase extraction improves. In fact, in the methods presented in this thesis, the algorithms which obtained the best performance are the ones receiving more contextual information, both about the relations of the potential keyphrase with other parts of the document, as in the anaphora based approach, and in the shape of a neural representation of the input document, as in the deep learning approach. In contrast, the approach of using statistical and positional knowledge only allows the building of language agnostic keyphrase extraction algorithms, at the cost of decreased precision and recall.

Advances in Automatic Keyphrase Extraction / Marco Basaldella - : . , 2018 Feb 26. ((30. ciclo, Anno Accademico 2016/2017.

### Advances in Automatic Keyphrase Extraction

#### Abstract

The main purpose of this thesis is to analyze and propose new improvements in the field of Automatic Keyphrase Extraction, i.e., the field of automatically detecting the key concepts in a document. We will discuss, in particular, supervised machine learning algorithms for keyphrase extraction, by first identifying their shortcomings and then proposing new techniques which exploit contextual information to overcome them. Keyphrase extraction requires that the key concepts, or \emph{keyphrases}, appear verbatim in the body of the document. We will identify the fact that current algorithms do not use contextual information when detecting keyphrases as one of the main shortcomings of supervised keyphrase extraction. Instead, statistical and positional cues, like the frequency of the candidate keyphrase or its first appearance in the document, are mainly used to determine if a phrase appearing in a document is a keyphrase or not. For this reason, we will prove that a supervised keyphrase extraction algorithm, by using only statistical and positional features, is actually able to extract good keyphrases from documents written in languages that it has never seen. The algorithm will be trained over a common dataset for the English language, a purpose-collected dataset for the Arabic language, and evaluated on the Italian, Romanian and Portuguese languages as well. This result is then used as a starting point to develop new algorithms that use contextual information to increase the performance in automatic keyphrase extraction. The first algorithm that we present uses new linguistics features based on anaphora resolution, which is a field of natural language processing that exploits the relations between elements of the discourse as, e.g., pronouns. We evaluate several supervised AKE pipelines based on these features on the well-known SEMEVAL 2010 dataset, and we show that the performance increases when we add such features to a model that employs statistical and positional knowledge only. Finally, we investigate the possibilities offered by the field of Deep Learning, by proposing six different deep neural networks that perform automatic keyphrase extraction. Such networks are based on bidirectional long-short term memory networks, or on convolutional neural networks, or on a combination of both of them, and on a neural language model which creates a vector representation of each word of the document. These networks are able to learn new features using the the whole document when extracting keyphrases, and they have the advantage of not needing a corpus after being trained to extract keyphrases from new documents. We show that with deep learning based architectures we are able to outperform several other keyphrase extraction algorithms, both supervised and not supervised, used in literature and that the best performances are obtained when we build an additional neural representation of the input document and we append it to the neural language model. Both the anaphora-based and the deep-learning based approaches show that using contextual information, the performance in supervised algorithms for automatic keyphrase extraction improves. In fact, in the methods presented in this thesis, the algorithms which obtained the best performance are the ones receiving more contextual information, both about the relations of the potential keyphrase with other parts of the document, as in the anaphora based approach, and in the shape of a neural representation of the input document, as in the deep learning approach. In contrast, the approach of using statistical and positional knowledge only allows the building of language agnostic keyphrase extraction algorithms, at the cost of decreased precision and recall.
##### Scheda breve Scheda completa Scheda completa (DC)
automatic; keyphrase; extraction; machine; learning
automatic; keyphrase; extraction; machine; learning
Advances in Automatic Keyphrase Extraction / Marco Basaldella - : . , 2018 Feb 26. ((30. ciclo, Anno Accademico 2016/2017.
File in questo prodotto:
File
Basaldella_tesi definitiva.pdf

accesso aperto

Descrizione: tesi di dottorato
Dimensione 2.89 MB
Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11390/1143008