Semantics for vision-and-language understanding

IRIS

Semantics for vision-and-language understanding / Alex Falcon, 2023 Mar 13. 35th cycle, Academic Year 2021/2022.

FALCON, ALEX

2023-03-13

Abstract

Recent advances in Artificial Intelligence have led to breakthroughs in many heterogeneous fields, such as protein structure prediction and self-driving cars. These results are obtained with Machine Learning techniques, which automatically learn, from the available annotated examples, a mathematical model capable of solving the task. One of its sub-fields, Deep Learning, brought further improvements by making it possible to also compute an informative and non-redundant representation of each example through the same learning process. To solve the task under analysis, the model needs to overcome the generalization gap: it must work well both on the training data and on examples that are drawn from the same distribution but never observed at training time. Several heuristics are often used to overcome this gap, such as introducing inductive biases when modeling the data or using regularization techniques; however, a popular approach is to collect and annotate more examples in the hope that they cover cases not previously observed. In particular, recent state-of-the-art solutions use hundreds of millions or even billions of annotated examples, and the underlying trend seems to imply that collecting and annotating ever more examples should be the predominant way to overcome the generalization gap. However, in many fields, e.g. medicine, it is difficult to collect such a large number of examples, and producing high-quality annotations is even more arduous and costly. During my Ph.D. and in this thesis, I designed and proposed several solutions that address the generalization gap in three different domains by leveraging semantic aspects of the available data.
In particular, the first part of the thesis presents techniques that create new annotations for the data under analysis: data augmentation techniques, which compute variations of the annotations by means of semantics-preserving transformations, and transfer learning, which in this thesis is used to automatically generate textual descriptions for a set of images. In the second part of the thesis, the gap is reduced by customizing the training objective based on the semantics of the annotations. First, a problem is shifted from the commonly used single-task setting to a multi-task learning setting by designing an additional task; then, two variations of a standard loss function are proposed that introduce semantic knowledge into the training process.

Defense date: 13 Mar 2023
Keywords: video and language; deep learning; multimedia
Citation: Semantics for vision-and-language understanding / Alex Falcon, 2023 Mar 13. 35th cycle, Academic Year 2021/2022.
Appears in types: 8.1 Doctoral Thesis

Files in this record:

File	Size	Format
PhD_thesis-4.pdf (open access; description: thesis, post-minor-revision version; license: Creative Commons)	14.27 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11390/1252364
