Multimodal feature fusion for concreteness estimation

Incitti F. (first author): Writing – Original Draft Preparation
Snidaro L. (second author): Writing – Review & Editing
2022-01-01

Abstract

In recent years, the idea of fusing diverse types of information has often been employed to solve various Deep Learning tasks. Whether for an NLP problem or a Machine Vision one, the concept of using multiple inputs of the same type has been the basis of many studies. In NLP, combinations of different word embeddings have already been tried, yielding improvements on the most common benchmarks. Here we explore the combination not only of different types of input, but also of different data modalities. This is done by fusing two popular word embeddings, namely ELMo and BERT, with additional inputs that embed a visual description of the analysed text. In this way, two modalities (textual and visual) are jointly employed to solve a textual problem, a concreteness estimation task. Multimodal feature fusion is explored here through several techniques: input redundancy, concatenation, averaging, dimensionality reduction, and augmentation. By combining these techniques it is possible to generate different vector representations; the goal is to understand which feature fusion techniques yield the most accurate embeddings.
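To make the listed fusion operations concrete, here is a minimal sketch (not the authors' code) of how such vectors could be combined, using NumPy and scikit-learn. The vector sizes are standard for BERT (768) and ELMo (1024), while the 512-d visual vector, the zero-padding choice for averaging, and PCA as the dimensionality-reduction method are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
bert = rng.standard_normal(768)    # stand-in for a BERT word embedding
elmo = rng.standard_normal(1024)   # stand-in for an ELMo word embedding
visual = rng.standard_normal(512)  # stand-in for a visual feature vector (assumed size)

# Concatenation: stack the modalities into one long vector (768+1024+512 = 2304-d).
fused_concat = np.concatenate([bert, elmo, visual])

# Averaging: zero-pad the shorter vectors to a common length, then take the
# element-wise mean (zero-padding is one simple choice; a learned projection
# would be another).
dim = max(len(bert), len(elmo), len(visual))
pad = lambda v: np.pad(v, (0, dim - len(v)))
fused_avg = np.mean([pad(bert), pad(elmo), pad(visual)], axis=0)

# Input redundancy / augmentation: repeating a modality inside the fused
# vector gives it more weight in downstream layers.
fused_redundant = np.concatenate([bert, bert, visual])

# Dimensionality reduction: PCA fitted over a batch of fused vectors
# compresses the representation (here to 256-d, an arbitrary target).
batch = rng.standard_normal((500, fused_concat.shape[0]))  # fake corpus batch
reduced = PCA(n_components=256).fit_transform(batch)

print(fused_concat.shape, fused_avg.shape, fused_redundant.shape, reduced.shape)
```

In a real pipeline the random stand-ins would be replaced by embeddings extracted from pretrained ELMo/BERT models and a visual encoder, and the PCA would be fitted on the training vocabulary before transforming test words.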
Year: 2022
ISBN: 978-1-7377497-2-1
File attached to this record:
Fusion22__Multimodal_feature_fusion_for_concreteness_estimation__Camera_Ready_.pdf
License: non-public (not available; copy on request)
Size: 849.32 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11390/1232106
Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: 0