Audiovisual active speaker localization and enhancement for multirotor micro aerial vehicles
Daniele Salvati; Carlo Drioli; Andrea Gulli; Gian Luca Foresti; Federico Fontana; Giovanni Ferrin
2019-01-01
Abstract
We address the problem of localizing a speaker and enhancing their voice using audiovisual sensors installed on a multirotor micro aerial vehicle (MAV). Acoustic-only localization and signal enhancement through beamforming techniques are especially challenging in these conditions, due to the nature and intensity of the disturbances originating from the electric motors and the propellers. We propose a solution in which an efficient beamforming-based algorithm for both localization and enhancement of the source is paired with video-based human face detection. The video processing front-end detects human silhouettes and provides an estimate of the directions of arrival (DOAs) at the array. When the acoustic localization front-end detects speech activity originating from one of the candidate directions estimated by the visual component, the acoustic source localization is refined and the recorded signal is enhanced through acoustic beamforming. The proposed algorithm was tested on a MAV equipped with a compact uniform linear array (ULA) of four microphones. A set of scenes featuring two human subjects in the field of view, speaking one at a time, is analyzed with this method. Experimental results obtained in stable hovering conditions are illustrated, and the localization and signal enhancement performance is analyzed.
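The abstract does not detail the beamformer itself, so the following is only a minimal illustrative sketch of the general technique it names: a frequency-domain delay-and-sum beamformer steered toward a given DOA for a four-microphone ULA. All function names, parameters, and array geometry here are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def delay_and_sum(frames, mic_positions, doa_deg, fs, c=343.0):
    """Frequency-domain delay-and-sum beamformer for a linear array.

    frames        : (n_mics, n_samples) time-domain multichannel snapshot
    mic_positions : (n_mics,) microphone positions along the array axis [m]
    doa_deg       : direction of arrival relative to broadside [degrees]
    fs            : sampling rate [Hz]
    c             : speed of sound [m/s]
    """
    n_mics, n_samples = frames.shape
    # Far-field propagation delay of each microphone for the given DOA.
    delays = mic_positions * np.sin(np.deg2rad(doa_deg)) / c
    spec = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    # Phase terms that compensate the per-channel delays.
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    # Align the channels in phase, then average them coherently.
    return np.fft.irfft(np.mean(spec * steering, axis=0), n=n_samples)
```

Steering the same routine over a grid of candidate angles and picking the angle of maximum output power yields a basic steered-response-power DOA scan, which is one common way to refine a coarse visual DOA estimate acoustically.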
File | Description | Type | License | Size | Format
---|---|---|---|---|---
official.pdf (open access) | Main article | Post-print | Creative Commons | 1.34 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.