
Audiovisual active speaker localization and enhancement for multirotor micro aerial vehicles

Daniele Salvati, Carlo Drioli, Andrea Gulli, Gian Luca Foresti, Federico Fontana, Giovanni Ferrin
2019-01-01

Abstract

We address the problem of localizing a speaker and enhancing their voice using audiovisual sensors installed on a multirotor micro aerial vehicle (MAV). Acoustic-only localization and signal enhancement through beamforming techniques are especially challenging in these conditions, due to the nature and intensity of the disturbances originating from the electric motors and the propellers. We propose a solution in which an efficient beamforming-based algorithm for both localization and enhancement of the source is paired with video-based human face detection. The video processing front-end detects human silhouettes and provides estimates of the directions of arrival (DOAs) at the array. When the acoustic localization front-end detects speech activity originating from one of the candidate directions estimated by the visual component, the acoustic source localization is refined and the recorded signal is enhanced through acoustic beamforming. The proposed algorithm was tested on a MAV equipped with a compact uniform linear array (ULA) of four microphones. A set of scenes featuring two human subjects in the field of view, speaking one at a time, is analyzed with this method. Experimental results obtained in stable hovering conditions are illustrated, and the localization and signal enhancement performance is analyzed.
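The abstract describes steering an acoustic beamformer toward a DOA supplied by the visual front-end. The paper does not specify the beamformer here, so the following is only a minimal sketch of a generic far-field delay-and-sum beamformer for a four-microphone ULA; the function name, the broadside-at-90° angle convention, and the frequency-domain fractional-delay implementation are my assumptions, not details taken from the paper.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, doa_deg, fs, c=343.0):
    """Align a ULA toward a far-field source at doa_deg and average.

    signals:       (n_mics, n_samples) array of time-aligned recordings.
    mic_positions: (n_mics,) positions along the array axis in meters.
    doa_deg:       direction of arrival; 90 degrees = broadside (assumed
                   convention for this sketch).
    """
    n_mics, n_samples = signals.shape
    theta = np.deg2rad(doa_deg)
    # Per-microphone plane-wave delays relative to the array origin (s).
    delays = mic_positions * np.cos(theta) / c
    # Compensate the (fractional) delays via phase shifts in the
    # frequency domain, then sum coherently across microphones.
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    shifts = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = np.fft.irfft(spectra * shifts, n=n_samples, axis=1)
    return aligned.mean(axis=0)
```

Steered to the true DOA, the channels add in phase and the target is preserved, while signals from other directions (e.g. rotor noise) add incoherently and are attenuated; in practice the paper's localization step would supply `doa_deg` from the refined acoustic search around the face-detection estimates.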
Year: 2019
ISBN: 978-3-939296-15-7
Files in this record:
official.pdf (open access)
Description: Main article
Type: Post-print
License: Creative Commons
Size: 1.34 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11390/1185243
Citations
  • Scopus: 2