Exploiting CNNs for Improving Acoustic Source Localization in Noisy and Reverberant Conditions

This paper discusses the application of convolutional neural networks (CNNs) to minimum variance distortionless response localization schemes. We investigate the direction of arrival estimation problem in noisy and reverberant conditions using a uniform linear array (ULA). CNNs are used to process the multichannel data from the ULA and to improve the data fusion scheme performed in the steered response power computation. CNNs improve the incoherent frequency fusion of the narrowband response power by weighting the components, reducing the deleterious effects of those components affected by artifacts due to noise and reverberation. The use of CNNs avoids the need to previously encode the multichannel data into selected acoustic cues, with the advantage of exploiting their ability to recognize geometrical pattern similarity. Experiments with both simulated and real acoustic data demonstrate the superior localization performance of the proposed SRP beamformer with respect to other state-of-the-art techniques.


I. INTRODUCTION
MULTICHANNEL audio processing techniques have been broadly investigated in teleconferencing systems, audio surveillance, autonomous robots, and human-computer interaction, and they have a central role in a number of applications related to the acoustic analysis and speech technology area. Within the research on acoustic sensor arrays, the spatial localization of acoustic sources and active speakers has certainly received large attention, and baseline techniques are now available that offer appreciable performance in a wide range of real-world conditions, including indoor/outdoor scenarios, reverberant and noisy environments, and near-field/far-field monitoring [1]-[8]. In general, localization can be performed by indirect or direct methods. The indirect (two-step) approach computes a set of time differences of arrival (TDOAs) using measurements across various combinations of microphones [9], [10], and then estimates the source position using geometric considerations [11], [12]. Direct methods are based on steered response power (SRP) beamformers [1], [13], [14], on subspace algorithms [15]-[17], or on maximum-likelihood estimators [18]-[20].
Most of the aforementioned methods can be designed to act selectively on a limited frequency range (narrowband beamformer), while their broadband version can be obtained by appropriately fusing the narrowband components. In this paper, we propose an SRP scheme which employs convolutional neural networks (CNNs) [21], [22] to refine the frequency-domain multichannel fusion operation of the minimum variance distortionless response (MVDR) beamformer [23] by learning how to weight the narrowband components appropriately. It is shown that this approach improves the localization of acoustic sources and speakers in noisy and reverberant conditions. The novel convolutional scheme also makes it possible to better investigate the structure of acoustic cues from the multichannel spectral densities. In the SRP-weighted MVDR (SRP-WMVDR) schemes proposed previously in [24]-[26], these computations relied on a preprocessing stage in which the multichannel acoustic input was transformed into a set of cues serving as input to the machine learning component. The idea of selecting or weighting the MVDR components was proposed in [24], where a radial basis function network (RBFN) was used as a narrowband frequency component classifier, using the marginal distribution of the narrowband components as input. The approach in [24] was extended in [25], [26], in which a support vector machine (SVM) learning component was used. This scheme, which used a different set of input features based on marginal distributions of the acoustic data, proved to outperform the RBFN-based one.
We extend here the hybrid beamforming-plus-machine learning approach to the use of CNNs, whose principal advantage is to avoid the explicit selection and computation of a set of acoustic cues, since these are effectively computed by the convolutional layers of the network. With respect to previous research in the field, the paper addresses for the first time the exploitation of convolutional features in the context of multichannel audio processing for acoustic source localization purposes. We study the integration of CNNs in the signal processing chain on which the acoustic localization problem is based and present a new algorithm, referred to in the following as SRP-WMVDR-CNN.
The algorithm is presented in two variants, the first based on a classification CNN and the second based on a regression CNN. In the classification-oriented scheme, the CNN is trained to classify the narrowband SRPs into two classes: constructively contributing SRPs vs. disruptively contributing ones. In the information fusion step, which sums up the contribution of each narrowband SRP, the disruptive components are discarded. In the regression-oriented scheme, the CNN is trained to provide the weighting coefficients of an improved SRP fusion function, which weights the contribution of each narrowband SRP while adding it to the sum of contributions. As for the acoustic setting, we consider the far-field direction of arrival (DOA) estimation problem of a single source in noisy and reverberant conditions, using a uniform linear array (ULA).
Applications of this scenario include videoconferencing systems [27], in which the estimation of the sound coordinates can be used to automatically steer a video camera towards an active speaker; human-computer interaction systems [28], in which localization and beamforming are used to enhance the signal and improve audio recognition; or even multimedia interactive systems for the performing arts, in which acoustic source localization can be integrated into digital musical interfaces and used for performance control [29]. The method can in principle be extended to a multiple-source scenario, which would require improving the routine devoted to peak searching in the acoustic response power.

A. Conventional Methods for DOA Estimation
The DOA estimation problem concerns the processing of acoustic data collected by a microphone array with the aim of obtaining information on the direction from which the acoustic source signal originates. To date, the methods for DOA estimation can be broadly classified into two classes: TDOA-based indirect methods and direct methods. The indirect methods first estimate the time differences of the acoustic wavefront arrivals between microphone pairs and then the DOA using geometric considerations [10], [30]-[33]. The generalized cross-correlation (GCC) [9] is considered a baseline practical method for TDOA estimation, but improved versions are often used in practice. The multichannel cross-correlation coefficient (MCCC) [34], for example, is based on TDOA estimates obtained by the GCC, paired with a prediction of the spatial error to provide a more robust estimate of the DOA. Direct methods, on the other hand, estimate the DOA of an acoustic source in a single step by exploiting some power density function representing the spatially relevant information distribution, and they are in general considered more robust under noisy and reverberant conditions than the TDOA-based methods. SRP localization involves computing the output power of a beamformer steered towards each DOA of interest. The conventional SRP is performed with the delay-and-sum beamformer [35], which consists of synchronizing the array signals to steer the array in a certain direction and summing the signals to estimate the power of the spatial filter. The SRP phase transform (SRP-PHAT) [13] is a widely used filtered SRP beamformer. The PHAT filter [9] assigns equal importance to each frequency by dividing the spectrum by its magnitude. The SRP-PHAT can be efficiently computed by the global coherent field (GCF) [36] approach, which coherently sums the GCC-PHAT of the microphone pairs for each possible point of interest.
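As an illustration of the TDOA step underlying the indirect methods, the following sketch estimates the delay between two channels with GCC-PHAT (function names and parameters are our own, not from the paper):

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """Estimate the TDOA (in seconds) of x2 relative to x1 via GCC-PHAT.

    The cross-power spectrum is normalized by its magnitude (the PHAT
    weighting), so every frequency bin contributes with equal weight to
    the correlation peak.
    """
    n = len(x1) + len(x2)                    # zero-pad to avoid circular wrap
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = np.conj(X1) * X2
    cross /= np.abs(cross) + 1e-12           # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2 + 1]))  # center lag 0
    lag = int(np.argmax(np.abs(cc))) - n // 2
    return lag / fs

# Toy check: the second channel is the first delayed by 5 samples.
fs = 48000
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
y = np.concatenate((np.zeros(5), x[:-5]))
tdoa = gcc_phat(x, y, fs)
```

With noise-free signals the peak lands exactly on the integer sample delay; in practice, interpolation around the peak is often used for sub-sample resolution.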
Among conventional beamformers, the MVDR filter [23] is a well-known data-dependent beamformer that provides better resolution than the conventional beamformer. Both MVDR and SRP localization have been described as maximum-likelihood problems in [18]-[20]. Yet another class of high-resolution methods is based on subspace analysis and decomposition. The multiple signal classification (MUSIC) method [15] exploits the subspace orthogonality property to build the spatial spectrum and localize the source DOAs. The estimation of signal parameters via rotational invariance techniques (ESPRIT) is also based on subspace decomposition, exploiting the rotational invariance [16], [37].

B. Machine Learning Methods for Multichannel Processing
For many decades, machine learning and neural network methods have been successfully employed in a wide range of speech and audio processing applications, such as automatic speech recognition (ASR) [38]-[41], audio forensics [42], music information retrieval [43], [44], and sound classification [45]. However, their use for the improvement or the new design of multichannel localization schemes has been explored only recently [25], [26], [46], [47]. Moreover, following the computational and performance advances brought by recent developments in deep neural network (DNN) research, their use is now being investigated in a variety of acoustic and speech oriented applications involving multichannel processing, including in a few cases the specific problem of acoustic source localization. To date, the application of DNNs to multichannel processing problems has focused principally on ASR [28], [48], speech enhancement [49], acoustic source separation [50], and acoustic source localization [51]. In [28], a DNN-based feature enhancement method using multichannel inputs is proposed for robust ASR. The multichannel information is used in the pre-enhanced spectral features, which are obtained by DOA-constrained independent component analysis. In [49], multichannel speech enhancement is addressed, and beamforming-based enhancement is achieved by time-frequency (T-F) masking. The algorithm combines single- and multi-microphone processing, in which a DNN is trained to map the spectral features to a T-F mask, which is used in turn to calculate the noise covariance matrix and the steering vectors related to the speaker position. The steering vectors are then used to enhance the speech signal coming from the speaker position through an MVDR beamformer. Based on these steps, the method iterates masking and beamforming, and its application to ASR shows improved performance over state-of-the-art recognizers.
Note that T-F masking beamforming has been previously addressed by supervised and unsupervised machine learning methods [46], [47]. In [46], a mask is obtained by an unsupervised spatial vector clustering. A speech spectral model based on a complex Gaussian mixture model is designed to estimate the T-F masks and the steering vectors related to the speaker position.
While in the aforementioned cases source localization is subordinate to other signal processing tasks, such as ASR or speech enhancement, the research in [51] specifically addresses the localization problem of a single sound source. This approach is based on a discriminative machine learning method that computes the location estimator in the frequency domain, in which a DNN encodes the steering vectors by applying the orthogonality principle used in the MUSIC method [15]. The eigenvectors of the power spectral density matrices are treated as the input vector by constructing directional image activators, whose relationships with the source DOAs are in turn learned by a DNN. Unfortunately, the authors state that their DNN-based method proved ineffective in noisy and reverberant conditions and did not yield significant localization performance improvements.
Recently, we have discussed a scheme which employs a machine learning component to refine the multichannel fusion scheme and improve the localization of acoustic sources and speakers in near-field noisy and reverberant conditions [24], [25] and in far-field noisy conditions [26]. These investigations underline the importance of the way in which the broadband fusion of narrowband components is performed, and the usefulness of knowing which components contribute constructively to the localization and which do not. These computation schemes rely, however, on a preprocessing stage in which the multichannel acoustic input is transformed into a set of cues serving as input to the machine learning component (i.e., the skewness, the kurtosis, the crest factor, and the marginal distribution of the acoustic input data were used). We extend here the hybrid beamforming-plus-machine learning approach to the use of CNNs, whose principal advantage is to exploit their ability to recognize geometrical pattern similarity and to avoid the explicit selection and computation of a set of acoustic cues, since these are effectively computed in the CNN layers.

III. ACOUSTIC SOURCE LOCALIZATION BASED ON CNNS
The signal processing pipeline structure is illustrated in Fig. 1. The middle part of the scheme describes the acoustics-based processing steps, including the short-time Fourier transform (STFT) of the multichannel input s_m(t) (m = 1, 2, . . . , M, where M is the number of microphones), the frequency bin-dependent narrowband SRP P(f, θ) (where f is the frequency bin and θ is the DOA), and the enhanced fusion step used to build the final broadband acoustic map P_CNN(θ) by exploiting the weighting information provided by the CNN output. The input to the CNNs is provided by the narrowband SRP components. The lower part of the scheme contains a convolutional layer and a pooling layer, followed by an output layer, i.e., a classification or regression fully connected NN layer. The upper processing path in Fig. 1 shows the SVM-based scheme proposed in [25], [26], used here only for comparison. Note that the upper SVM processing path is sketched with dashed lines to emphasize that it is not part of the presently proposed algorithm and is not used in conjunction with the lower CNN-based processing path.

A. Acoustic Localization Elements
Beamforming methods search for the maximum of the SRP functions computed from the output of the sensor array. A straightforward calculation can be achieved through a delay-and-sum procedure in the time domain [35]. However, for computational efficiency the broadband SRP is typically computed in the frequency domain by calculating the power spectral density (PSD) matrix and the narrowband SRP on each frequency bin, and by finally fusing these narrowband responses. The PSD at frequency f for the looking direction θ can be written as

P(f, θ) = w^H(f, θ) Φ(f) w(f, θ),

where w(f, θ) is the weighting and steering vector, s(f) = [s_1(f), . . . , s_M(f)]^T is the sensor array output in the frequency domain, H denotes the conjugate transpose, and Φ(f) = E[s(f)s^H(f)] is the symmetric and positive definite PSD matrix (E[·] denotes here the mathematical expectation). Throughout this paper, we will make use of the specific class of MVDR beamformers [23], whose narrowband PSD has the following form:

P(f, θ) = 1 / (a^H(f, θ) Φ^{-1}(f) a(f, θ)),

where a(f, θ) is the steering vector, i.e., the set of phase delays affecting a plane wave when it reaches each sensor in the array.
In the far-field, the array steering vector is defined for the ULA as

a(f, θ) = [1, e^{-j2πfτ(θ)/L}, . . . , e^{-j2πf(M-1)τ(θ)/L}]^T,

where L is the size of the DTFT, j is the imaginary unit, and (n - 1)τ(θ) is the time difference of arrival (TDOA), in samples, between the n-th and the reference microphone. The relationship between the TDOA τ(θ) and the DOA θ is given by

τ(θ) = f_s d cos(θ)/c,

where c is the speed of sound, d is the inter-microphone distance of the ULA, and f_s is the sampling frequency. The fusion of these narrowband PSDs to obtain the SRP-MVDR beamformer is then computed as the sum of all the frequency bin components, i.e.,

P(θ) = Σ_f P(f, θ).
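A minimal numerical sketch of the narrowband MVDR response for a ULA follows (our own illustrative code, not the paper's implementation; it assumes the far-field model τ(θ) = d cos(θ)/c in seconds, a frequency-bin convention f·f_s/L for the physical frequency, and a small diagonal loading for robustness):

```python
import numpy as np

def mvdr_srp(S, f_bin, L, fs, d, c=343.0, n_doa=181, loading=1e-3):
    """Narrowband MVDR response P(f, theta) for a ULA.

    S : (M, K) matrix of STFT snapshots at frequency bin f_bin; the K
        snapshots are averaged to estimate the PSD matrix.
    Returns the candidate DOAs in [0, pi] and the MVDR power at each.
    """
    M, K = S.shape
    Phi = (S @ S.conj().T) / K                            # PSD matrix estimate
    Phi += loading * np.trace(Phi).real / M * np.eye(M)   # diagonal loading
    Phi_inv = np.linalg.inv(Phi)
    thetas = np.linspace(0.0, np.pi, n_doa)
    tau = d * np.cos(thetas) / c                          # far-field TDOA model
    n = np.arange(M)[:, None]
    A = np.exp(-2j * np.pi * (f_bin * fs / L) * n * tau[None, :])  # steering
    denom = np.einsum('mt,mn,nt->t', A.conj(), Phi_inv, A).real
    return thetas, 1.0 / denom                            # 1 / (a^H Phi^-1 a)

# Synthetic check: 8-mic ULA (d = 0.15 m), source at 60 degrees,
# bin 40 (937.5 Hz at fs = 48 kHz, L = 2048, below spatial aliasing).
M, K, fs, L, d = 8, 32, 48000, 2048, 0.15
tau0 = d * np.cos(np.pi / 3) / 343.0
a0 = np.exp(-2j * np.pi * (40 * fs / L) * np.arange(M) * tau0)
rng = np.random.default_rng(0)
g = rng.standard_normal(K) + 1j * rng.standard_normal(K)
noise = 0.05 * (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K)))
thetas, P = mvdr_srp(a0[:, None] * g[None, :] + noise, 40, L, fs, d)
theta_hat_deg = np.degrees(thetas[int(np.argmax(P))])
```

The peak of the MVDR power sits at the steering vector matching the source direction, up to the finite-snapshot estimation error of the PSD matrix.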
Usually, however, some sort of normalization is applied to the components before the fusion, since normalization has the beneficial effect of increasing the spatial resolution of the beamformer. Examples of such beamformers are the delay-and-sum SRP phase transform (SRP-PHAT) [13], in which the normalization is achieved by discarding the magnitude and keeping only the phase of the PSD matrix, and the SRP normalized MVDR (SRP-NMVDR) [52], in which each PSD component is normalized by the maximum value of the PSDs for that frequency with respect to all possible DOAs. The normalization is known to improve the spatial resolution of the beamformer; however, it also emphasizes the noise at those frequency components with low signal-to-noise ratio (SNR), causing localization errors and performance degradation.
To avoid using the disruptive information provided by such components, especially for localization in reverberant and noisy environments, the narrowband SRP components are weighted in the fusion process [24]-[26]. In order to improve the localization performance, we further extend this weighting concept here, and define the SRP output as a weighted sum of narrowband components:

P_CNN(θ) = Σ_f γ(f) P(f, θ),

where γ(f) is the weight assigned by the learning component to the frequency bin f. The DOA estimate of the acoustic source is computed by a maximum search on the P_CNN(θ) function, i.e.,

θ̂ = argmax_θ P_CNN(θ).
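The weighted fusion and the final maximum search can be sketched as follows (illustrative code with our own names; in the proposed method the weights γ(f) would come from the CNN output):

```python
import numpy as np

def fuse_and_locate(P, gamma, thetas):
    """Weighted broadband fusion of narrowband SRPs.

    P      : (F, N) matrix, P[f, n] is the narrowband response at
             frequency bin f and candidate DOA thetas[n]
    gamma  : (F,) per-frequency weights from the learning component
    Returns the fused map P_CNN(theta) and the argmax DOA estimate.
    """
    p_cnn = gamma @ P          # P_CNN(theta) = sum_f gamma(f) * P(f, theta)
    return p_cnn, thetas[int(np.argmax(p_cnn))]

# Two clean components peak at 10 degrees; one corrupted component
# places a large spurious peak at 30 degrees.
thetas = np.arange(41.0)
P = np.zeros((3, 41))
P[0, 10] = P[1, 10] = 1.0
P[2, 30] = 3.0
_, doa_unweighted = fuse_and_locate(P, np.array([1.0, 1.0, 1.0]), thetas)
_, doa_weighted = fuse_and_locate(P, np.array([1.0, 1.0, 0.0]), thetas)
```

Zeroing the disruptive component recovers the correct peak, which is exactly the effect the learned weighting is meant to achieve.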
In the new scheme based on CNNs, the input features to the classification or regression layer are computed by the convolutional components, thus avoiding the problem of searching for the best feature class, and providing a new class of features especially suited for the specific task. Moreover, in the regression-CNN configuration, the information provided by a given frequency component is weighted in order to prevent the use of incorrect information, taking into account the errors of the machine learning component, particularly when the training and testing conditions differ considerably.

B. CNN-Based Component
The overall structure of the CNN component is made of a convolution-pooling hidden layer, followed by a fully connected layer. The input to the CNN is provided by the low-level narrowband normalized SRP P(f, θ), which is encoded as a b/w image V, exploiting the ability of CNNs to recognize geometrical similarity patterns without being affected by their position nor by small distortions of their shapes [21], [22]. For a given inter-microphone distance d, the set of distinct discrete TDOA values of τ(θ) will have cardinality T = 2⌊d f_s/c⌋ + 1, where ⌊·⌋ denotes the floor function that maps a real number to the largest previous integer and f_s is the sampling frequency. Therefore, the input matrix V will have dimension T × T, and its element v_ij, i, j = 1, 2, . . . , T, will be set to 255 if j = ⌊P(f, θ_i)T⌋, and to 0 otherwise. This operation transforms the mono-dimensional output power of the ULA into a two-dimensional input, encoding an image-like representation of the SRP function, thus emphasizing the shape-oriented nature of the processing which occurs in the subsequent CNN layers. Note that the shape of the SRPs could not be identified from the mono-dimensional input alone. Fig. 2 shows some examples of the input V, each one representing a narrowband SRP at a different frequency, for a ULA with an inter-microphone distance of 0.15 m and a sampling frequency of 48 kHz. The frequencies were chosen arbitrarily among those classified positively (upper plots) and those classified negatively (lower plots) for that frame.
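The image encoding can be sketched as follows (our own reading of the quantization step: the normalized response value of each DOA selects the single active column of the corresponding row):

```python
import numpy as np

def srp_to_image(p):
    """Encode a normalized narrowband SRP as a T x T black/white image.

    p : length-T array of narrowband SRP values normalized to [0, 1],
        one per discrete DOA.  Row i gets a single white pixel (255)
        in the column obtained by quantizing p[i].
    """
    T = len(p)
    V = np.zeros((T, T), dtype=np.uint8)
    for i in range(T):
        j = min(T - 1, int(np.floor(p[i] * T)))   # quantize value to a column
        V[i, j] = 255
    return V

# d = 0.15 m and fs = 48 kHz give T = 2*floor(0.15*48000/343) + 1 = 41.
V = srp_to_image(np.linspace(0.0, 1.0, 41))
```

Each row then carries exactly one white pixel, so the image traces the SRP curve as a shape the convolutional layers can match.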
This raw input data undergoes a filtering and activation detection step operated through the convolutional layer kernel W, as

F = σ(W ∗ V + b),

where W is a trained kernel, ∗ denotes the two-dimensional convolution, b is a bias parameter, and σ(·) is the activation function. We use here the rectified linear unit (ReLU) [53] to generate the output of the convolutional layer. The bias guarantees that every node has a trainable constant offset. The kernels are computed through a stochastic gradient descent method [54], which minimizes a loss function measuring the discrepancy between the CNN prediction and the target: the loss function is the cross entropy [55] for classification and the mean squared error for regression. Next, the pooling layer operates a dimensionality reduction through an averaging or maximizing operation along the two dimensions of the feature maps. In this work, we adopt a max-pooling layer [56]. The output of the convolution-pooling layer is then used as the input of the final fully connected layer, in which each neuron is connected to all neurons of the previous layer. The CNN must be trained with a supervised procedure, based on a set of known target DOAs θ_t. This step is achieved by computing the contribution of each frequency component to the global localization error. If θ̂(f) is the source DOA estimate based only on the component related to frequency f, the contribution of this frequency to the localization error is

e(f) = |θ̂(f) - θ_t|.

The localization error is then used to build the output training values of the CNN model, as follows: 1) Classifier-based Configuration: In the classifier-based configuration, the last fully connected layer combines the convolutional features to classify the input as 0 or 1. The activation function used in the fully connected classification layer is the softmax function [57]. The classifier is trained to remove those narrowband components which contribute negatively to the localization.
Namely, considering the i-th input V_i, the i-th training set output γ_i^c(f) is set as

γ_i^c(f) = 1 if e_i(f) ≤ η, and γ_i^c(f) = 0 otherwise,

where η is a given threshold.

2) Regression-Based Configuration:
In the regression-based configuration, the output variable is continuous in the range [0, 1], and the i-th training set output γ_i^r(f) is set as

γ_i^r(f) = (1 - e_i(f)/η)^2 if e_i(f) ≤ η, and γ_i^r(f) = 0 otherwise. (11)

Hence, the contribution of a positively contributing narrowband SRP is weighted as a quadratic function of the narrowband localization error. The fully connected regression layer is trained by minimizing the mean squared error.
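The construction of the two training targets can be sketched jointly (illustrative code; the quadratic regression weighting below is one plausible realization of the quadratic decay described in the text, not necessarily the paper's exact formula):

```python
def narrowband_targets(theta_hat_f, theta_true, eta=5.0):
    """Training targets for one narrowband component.

    e_f     : localization error of the single-frequency DOA estimate
    gamma_c : binary classification target (keep = 1 / discard = 0)
    gamma_r : regression target in [0, 1], decaying quadratically with
              the narrowband error and vanishing beyond the threshold
    """
    e_f = abs(theta_hat_f - theta_true)
    gamma_c = 1 if e_f <= eta else 0
    gamma_r = (1.0 - e_f / eta) ** 2 if e_f <= eta else 0.0
    return gamma_c, gamma_r
```

A component whose single-frequency estimate errs by 2 degrees (with η = 5) is kept and weighted 0.36; one erring by 10 degrees is discarded entirely.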
The choice of the parameter η is crucial for good training. In general, we aim at selecting a value that allows a balanced number of positively and negatively contributing maps over the whole training set. A very small value has the effect of providing a small number of positively contributing maps. On the other hand, large values of η may have the effect of allowing some disruptive narrowband components to take part in the fusion [25], [26]. In [25], it was demonstrated that a value in the range 0.3-0.6 m is a good choice for the near-field. In [26], a value of 3 degrees was successfully used for a far-field noisy condition. In this work, we have empirically found that a value of 5 degrees provides satisfactory results for the far-field noisy and reverberant case.

IV. EXPERIMENTS AND RESULTS
In this section, the performance of the CNN-based localization schemes (classification and regression) is assessed by addressing a 2D source localization task in the far-field scenario (DOA estimation). The multichannel noisy and reverberant acoustic data used in the first experimental setup were obtained by numerical simulation of the room acoustics, whereas the data used in the second experimental setup are actual multichannel recordings of an acoustic source located in reverberant environments. The performance of the proposed SRP-WMVDR-CNN methods is assessed in terms of the localization accuracy rate (AR), for a threshold error of 5 degrees, and the root mean square error (RMSE), and compared with the SRP-WMVDR-SVM [25], [26], the SRP-NMVDR [52], and the SRP-PHAT [13]. In the SRP-WMVDR-SVM beamformer scheme, the weighting factors of the narrowband MVDR response are estimated with an SVM supervised model defined as

γ(f) = sgn( Σ_{i=1}^{Q} α_i γ_i ψ(ς_i, ς(f)) + b ),

where Q is the training sample size, ψ(ς_i, ς(f)) is the inner-product kernel for the i-th training sample input ς_i and the sample input ς(f) for the narrowband PSD at frequency f, γ_i is the i-th target value, taking values in {1, -1}, α_i ≥ 0, and b is a real constant. The parameters α_i are found as usual by solving a convex quadratic programming (dual maximization) problem. The skewness of the normalized narrowband PSDs is taken as input to the classifier. The radial basis function kernel was adopted for the SRP-WMVDR-SVM by setting λ = 1 and σ = 1, chosen by cross-validation in accordance with [25], [26]. The sampling frequency was 48 kHz and the window size L was 2048 samples. We set f_min and f_max to 50 Hz and 15000 Hz, respectively. A diagonal loading regularization [58] was used for the narrowband MVDR filter to improve the robustness of the SRP. The PSD matrix is estimated by averaging 10 snapshots in all methods. The inter-microphone distance of the ULA is 0.15 m, resulting in an angular discretization of N = 41 samples.
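For reference, the decision function of the SVM comparison baseline can be sketched as follows (illustrative code with our own variable names; the support vectors, multipliers, and bias would come from the trained quadratic program):

```python
import numpy as np

def svm_decision(x, support, alphas, targets, b=0.0, sigma=1.0):
    """SVM decision value with an RBF (radial basis function) kernel.

    g(x) = sum_i alpha_i * gamma_i * k(x_i, x) + b,  gamma_i in {+1, -1};
    the narrowband component is kept when g(x) is positive.
    """
    k = np.exp(-np.sum((support - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return float((alphas * targets) @ k + b)

# Two support vectors of opposite class: inputs near [0] score positive,
# inputs near [2] score negative.
support = np.array([[0.0], [2.0]])
alphas = np.array([1.0, 1.0])
targets = np.array([1.0, -1.0])
```

The sign of g(x) realizes the binary keep/discard decision used for the narrowband components in the SVM-based scheme.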
Hence, the input matrix V has dimension 41 × 41. In our CNN configuration, we used 20 convolutional kernels with a size of 5 × 5, since this simple structure balances recognition accuracy against overfitting. We adopted a max-pooling layer of size 2 × 2; thus, the feature size is reduced by a factor of four. The parameter η was set to 5 degrees, since we empirically found that it provides satisfactory results for the far-field noisy and reverberant case [25], [26]. The CNN and SVM were implemented using the Matlab R2017a Neural Network Toolbox and Statistics and Machine Learning Toolbox. We used our own implementation for the MVDR filter. We investigate the generalization properties of the proposed method with respect to three characteristics: 1) the source position (training and testing positions are different in all experiments); 2) the acoustic source nature (training is performed with a USASI signal [59], whereas testing is performed with speech, impulsive, or narrowband signals); 3) the environment characteristics (training and testing are performed in the same room and in different rooms, evaluating the localization performance under different noise and reverberation conditions).
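The shape bookkeeping of this configuration can be verified with a minimal forward-pass sketch of the convolution-pooling stage (plain numpy, for illustration only; the paper's Matlab implementation is not reproduced here):

```python
import numpy as np

def conv_relu_maxpool(V, kernels, bias, pool=2):
    """Forward pass of the convolution-pooling stage (minimal sketch).

    V       : (T, T) input image
    kernels : (C, k, k) kernels (here 20 kernels of size 5 x 5)
    Valid 2-D correlation + ReLU, then non-overlapping max pooling.
    """
    C, k, _ = kernels.shape
    out = V.shape[0] - k + 1
    feat = np.zeros((C, out, out))
    for c in range(C):
        for i in range(out):
            for j in range(out):
                feat[c, i, j] = np.sum(V[i:i+k, j:j+k] * kernels[c]) + bias[c]
    feat = np.maximum(feat, 0.0)                       # ReLU activation
    p = out // pool                                    # pooled spatial size
    pooled = feat[:, :p*pool, :p*pool].reshape(C, p, pool, p, pool).max(axis=(2, 4))
    return pooled

# A 41x41 input with 20 kernels of size 5x5 yields 37x37 feature maps,
# reduced to 18x18 by 2x2 max pooling.
pooled = conv_relu_maxpool(np.ones((41, 41)), np.ones((20, 5, 5)), np.zeros(20))
```

The 20 pooled 18 × 18 maps are then flattened and fed to the fully connected classification or regression layer.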

A. Simulations
Simulations of reverberant environments were obtained with the image-source method (ISM) [60], implemented using the improved algorithm reported in [61]. The simulations were conducted with different SNR levels, obtained by adding mutually independent white Gaussian noise to each channel.
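The per-channel noise injection can be sketched as follows (our own helper, not from the paper; it scales white Gaussian noise to hit a target SNR):

```python
import numpy as np

def add_noise(x, snr_db, rng):
    """Add white Gaussian noise to a signal at a prescribed SNR (dB)."""
    p_sig = float(np.mean(x ** 2))
    p_noise = p_sig / (10.0 ** (snr_db / 10.0))      # noise power for target SNR
    return x + rng.standard_normal(x.shape) * np.sqrt(p_noise)

# Check that the empirical SNR matches the requested one.
rng = np.random.default_rng(1)
clean = rng.standard_normal(200000)
noisy = add_noise(clean, 20.0, rng)
measured_snr = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Drawing an independent noise realization per channel makes the noise mutually independent across the array, as in the simulations.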
In the first set of simulations, a ULA of 8 microphones was used. A localization task in a room of 8 m × 4 m × 3 m was considered. The sources and microphones were considered omnidirectional. The room setup is shown in Fig. 3, in which we can see the 24 source positions used in the training phase. The training was performed using a USASI signal with an RT60 of 0.6 s and an SNR of 20 dB. The same training setup was used for all the tests. Table I shows the results in different noisy and reverberant conditions. As we can observe, the classification CNN provides in general the best performance, although in some noisy conditions (-20 dB) the regression CNN outperforms it. In such situations, the binary classification error may discard some useful information and include incorrect information, while the regression output allows a weighting of the narrowband components, resulting in more robust localization accuracy. Both proposed CNN methods perform better than all of the other algorithms and generalize well in all noisy and reverberant conditions with respect to the source position (training and test positions are different) and to the acoustic source nature (training is performed with USASI noise and testing is performed with speech, impulsive, and narrowband signals). We can also note that the SVM-based classifier becomes ineffective in low-noise conditions and in some narrowband source cases, confirming the results in [25] and [26]. Next, we can observe that the localization performance of the CNN methods in terms of AR and RMSE is better with the narrowband signal than that of the SRP-NMVDR, since most of the spectrum of the signal is affected by noise. This performance difference is reduced with the speech and impulsive signals. Both classification and regression CNN-based methods demonstrate good robustness when reverberation is increased, as we can note in Table I.
When RT 60 is increased, in general the performance gap between CNN and NMVDR increases.
Next, to assess the generalization characteristics with respect to room dimensions, the system trained on data from this room geometry was tested on a set of acoustic data obtained with room dimensions of 7 m × 11 m × 4 m, in which the 8-microphone ULA is positioned. Table II shows the results for the speech signal. We can see that the regression CNN provides the best performance in this case. This result suggests that, when the training and testing rooms are different, the classification CNN performance is affected by a larger number of narrowband SRP classification errors, since the room response has changed. In this case, the regression CNN seems to be less sensitive to room geometry differences and provides better localization performance.
In the second set of simulations, a small ULA of 3 microphones was used. The room setup is shown in Fig. 5. We consider 20 training positions, resulting in 13640 input matrices V. The training was performed using a USASI signal with an RT60 of 0.3 s and an SNR of 20 dB. Two localization tests were performed, one using the same room configuration as the training (5 m × 4 m × 3 m) and one using a different room configuration with a size of 6.37 m × 2.98 m × 3.6 m. We consider 50 random source positions (different from the training positions) at a distance from the array in the range 1-3 m. The results with a speech signal are reported in Table III. As we can observe, the simulation confirms the effectiveness of the CNN-based SRP methods. Specifically, the classification output performs better when the room setup is the same as in training, while the regression output provides better localization accuracy in a different room setup. In this case, the SVM does not provide any improvement, and the classification CNN tends to provide lower performance when the SNR decreases.
Next, we evaluated the localization performance using a speech signal corrupted by two different types of noise: babble noise (i.e., background noise originating from a large number of simultaneously talking people, as typically observed in a cocktail party) and a diffuse noise field [62]. The room setup used for this evaluation is depicted in Fig. 3. The RT60 was set to 0.3 s and the SNR was set to 20 dB. Table IV shows that the classification and regression SRP-WMVDR-CNN methods perform better than the other SRP-based methods.
Afterwards, we compared the CNN-based approach with the DNN method described in [51], in which the noise eigenvectors of the power spectral density matrices are used as input to a neural network that, in turn, outputs an estimate of the DOA. The DNN structure is composed of a directional image activator layer, a partially integrated layer, and an integrated layer. The DNN was trained using the same setup used for the CNN, shown in Fig. 3. Table V shows the results for two noisy and reverberant conditions, when the training and testing data are organized as in the first experiment. As we can observe, both CNN-based approaches outperform the DNN-based scheme.
We are firmly convinced that the effectiveness of the proposed method lies in the hybrid nature of the processing scheme. To compare this solution with a simpler one, in which the localization is based solely on the convolutional neural network component, we implemented an end-to-end scheme in which the phase of the STFT is encoded directly as the input to the CNN, while the CNN output directly encodes the DOA value. Thus, a regression configuration was assumed. The system overview of the CNN-based end-to-end configuration is shown in Fig. 6. We performed a simulation with a speech signal and a ULA of 8 microphones in the room setup depicted in Fig. 3, with RT60 = 0.3 s and SNR = 20 dB. The localization performance is AR = 25.78% and RMSE = 11.488 degrees. When training the model using training data and conditions comparable to those used for the proposed schemes, the performance of the model was extremely poor. Apparently, with this processing chain the localization-related information is so overwhelmed by unrelated information that the amount of training data and training time required to reach the same performance as the hybrid scheme would both be much larger. The proposed CNN methods improve the localization performance in noisy and reverberant conditions compared to other state-of-the-art methods. Specifically, the SRP-WMVDR-CNN methods prove to generalize effectively with respect to the source position, the acoustic source nature, and the environment characteristics. For the latter, the classifier-based configuration performs better than the regression-based configuration when the training and the test environments are the same.
On the other hand, the regression-based configuration proved to be more robust in our tests when the environment characteristics used in the test procedure were different from those used during training.

B. Analysis of the Convolutional Layer Features
To gain further insight into the features learned by the CNNs, we report the high-level features at the output of the fully connected layer. Figs. 7 and 8 show the feature maps that strongly activate the two channels for the 8-microphone and 3-microphone ULA, respectively. It can be seen that the features are characterized by a similar pattern. We then report the average recognition accuracy of positively contributing maps (1-value classification) and negatively contributing maps (0-value classification) for the classification CNN and the SVM using skewness. The results shown in Table VI confirm the better recognition accuracy of the classification CNN.
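For reference, the skewness feature used by the SVM baseline is the third standardized moment of the narrowband SRP sampled over the candidate DOAs. The following numpy sketch is only illustrative (the function name and the synthetic SRP curves are not from the paper's implementation); it shows why skewness discriminates between source-dominated and noise-dominated components:

```python
import numpy as np

def srp_skewness(srp):
    """Third standardized moment (skewness) of a narrowband SRP
    function sampled over candidate DOAs: a peaked, source-dominated
    SRP is strongly right-skewed, while a flat, noise-dominated one
    has skewness near zero."""
    srp = np.asarray(srp, dtype=float)
    sigma = srp.std()
    if sigma == 0.0:
        return 0.0
    return float(np.mean(((srp - srp.mean()) / sigma) ** 3))

doas = np.linspace(-90.0, 90.0, 181)  # candidate DOAs in degrees
peaked = np.exp(-0.5 * ((doas - 10.7) / 3.0) ** 2)  # sharp lobe near a source
flat = 1.0 + 0.01 * (doas / 90.0)                   # no dominant lobe

print(srp_skewness(peaked))  # large positive value
print(srp_skewness(flat))    # close to zero
```

A sharp lobe concentrates a few large values above a low floor, yielding strong positive skewness; a nearly uniform SRP yields skewness near zero, which is the separation the SVM exploits.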
To further compare the effectiveness of the convolutional layer features with other specific features, such as the skewness mentioned in this section, we also report in Table VII the Fisher's discriminant ratio (FDR) average values for the two choices, the average being taken over all positions for the 3-microphone and the 8-microphone cases. The FDR is defined as the ratio of the between-class scatter matrix to the within-class scatter matrix, and can be employed to quantify the discriminatory power of individual features between classes [63]. A common problem in the computation of the FDR is the high dimensionality of the features with respect to the observation data, which leads to poorly conditioned within-class scatter matrices. To address this issue, we perform a preprocessing dimensionality reduction based on principal component analysis, which is a commonly accepted solution [64], [65]. The reported numerical results confirm the substantial increase in the discriminatory power of convolutional features compared to skewness. Last, we further investigated how the CNN-based method improves robustness to noise, using an ad-hoc low-pass random noise signal as source, with energy up to 1 kHz, corrupted by white uncorrelated noise. The uncorrelated noise energy was gradually increased so that the related SNR spanned the range [30 dB, −30 dB]. We considered an ULA of 8 microphones in the room setup depicted in Fig. 3 with RT 60 = 0.3 s, and a source impinging on the array with θs = 10.72 degrees. Fig. 9 shows the SRP function for two specific frequencies: 500 Hz, falling in the range of the source signal spectrum, and 2500 Hz, falling in the range where only uncorrelated noise is present.
As we can see, when the SNR level is 30 dB, 0 dB, and −15 dB, the source DOA is correctly estimated for the 500 Hz frequency and the SRP functions are correctly classified as 1. When the SNR level is −30 dB, the SRP is classified as 0 and, hence, removed in the fusion process, since it no longer contributes to correctly localizing the source. For the 2500 Hz case, in all noise conditions the SRPs are classified as 0, and the noise components were removed in the classification scheme or removed/attenuated in the regression one.
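The fusion behavior described above amounts to a weighted incoherent sum of the narrowband response powers, P(θ) = Σ_k w_k P_k(θ), with the CNN supplying binary weights in the classification configuration and continuous weights in the regression one. The following numpy sketch uses synthetic narrowband SRPs (all names and values are illustrative, not the paper's data) to show how rejecting corrupted components repairs the broadband estimate:

```python
import numpy as np

def fused_srp(narrowband_srps, weights):
    """Broadband SRP as a weighted incoherent sum over frequency,
    P(theta) = sum_k w_k * P_k(theta). Binary weights (0/1) discard
    corrupted narrowband components; continuous weights attenuate
    them."""
    return np.tensordot(weights, narrowband_srps, axes=1)

doas = np.linspace(-90.0, 90.0, 181)  # candidate DOAs in degrees
clean = np.exp(-0.5 * ((doas - 10.0) / 4.0) ** 2)    # peak at the true DOA
corrupt = np.exp(-0.5 * ((doas + 40.0) / 4.0) ** 2)  # spurious noise peak
srps = np.stack([clean, corrupt, corrupt])           # 3 narrowband SRPs

unweighted = fused_srp(srps, np.ones(3))               # plain incoherent sum
weighted = fused_srp(srps, np.array([1.0, 0.0, 0.0]))  # corrupted bins rejected

print(doas[np.argmax(unweighted)], doas[np.argmax(weighted)])
```

With unit weights the two spurious components dominate and the broadband peak lands at −40 degrees; rejecting them restores the peak at the true 10 degree DOA.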

C. Real Data
The experiments based on multichannel recorded data were performed in an office room of 6.37 m × 2.98 m × 3.6 m with an RT 60 of 0.6 s and in a conference room of 16 m × 7 m × 3 m with an RT 60 of 0.9 s.
In the first experiment, an ULA of 3 microphones was used in the office room. We performed the localization with the training configuration used in the 3-microphone simulation. The room setup is shown in Fig. 10. Both the microphones and the source were positioned at a height of 0.9 m from the floor. A speech signal of 25 s duration from a male speaker was reproduced with a loudspeaker at the positions depicted in Fig. 10. We estimated an average SNR of 15 dB at the microphones. The SNR was computed by comparing the average speech energy with the average noise energy (the latter estimated from signal fragments where the speaker is not active). The results are reported in Table IX. We can note that only the regression CNN-based method improves the performance compared to SRP-NMVDR, while both binary classification schemes (classification CNN and SVM) fail in this case. This confirms that generalization of the classification CNN is more difficult in far-field reverberant conditions, in which the reflection components have a larger impact on the SRP computation than in the near-field condition [25]. This result can be compared to that reported in Table III for the same room noise and reverberation conditions (the lower part). We can note that the real-data results have a greater RMSE with a better AR, due to the limited number of positions used in this experiment.
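The SNR estimation procedure described above can be sketched as follows, assuming a recorded signal and a boolean speech-activity mask are available (both the mask and the toy signal below are illustrative placeholders, not the actual recordings):

```python
import numpy as np

def estimate_snr_db(signal, speech_active):
    """Average SNR in dB: noise power is estimated from fragments
    where the speaker is not active, and speech power from the active
    fragments after subtracting the noise floor."""
    signal = np.asarray(signal, dtype=float)
    noise_power = np.mean(signal[~speech_active] ** 2)
    active_power = np.mean(signal[speech_active] ** 2)
    speech_power = max(active_power - noise_power, 1e-12)
    return 10.0 * np.log10(speech_power / noise_power)

# Toy signal: stationary background noise plus a "speech" burst
rng = np.random.default_rng(1)
n = 32000
noise = 0.01 * rng.standard_normal(n)
active = np.zeros(n, dtype=bool)
active[8000:24000] = True
speech = np.zeros(n)
speech[active] = 0.06 * rng.standard_normal(active.sum())

print(estimate_snr_db(noise + speech, active))
```

On this toy signal the estimate comes out in the neighborhood of 15 dB, the same order of magnitude as the average SNR reported for the office recordings.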
In the second experiment, an ULA of 8 microphones was used in the conference room. We performed the localization with the training configuration used in the 8-microphone simulation. The room setup is shown in Fig. 11. Both the microphones and the source were positioned at a height of 1.7 m from the floor. Three sessions were recorded using short sentences uttered by two male speakers and one female speaker, standing at the different positions depicted in Fig. 11. The results reported in Table X confirm the good performance of the regression CNN-based method. In this experiment, the classification configuration also performs better than SRP-NMVDR. This is due to the better noise robustness of the 8-microphone ULA compared to the small 3-microphone ULA.

V. CONCLUSIONS
A WMVDR beamformer based on CNN deep learning has been presented. It improves the localization accuracy in a single-source scenario without point-source interferences. The results show that CNNs improve the incoherent frequency fusion of the narrowband response power by weighting the components in such a way as to reduce the deleterious effects of those components affected by artifacts due to noise and reverberation. The use of CNNs avoids the necessity of previously encoding the multichannel data into selected acoustic cues. We implemented the CNNs in two versions, one with a classification output layer and the other with a regression output layer. Our experiments demonstrate that the CNN is robust to noise and reverberation in comparison to the state-of-the-art. Specifically, the classification CNN performs better when the training and test setups are the same (i.e., same room and array position). On the other hand, the regression CNN provides better localization accuracy when training and test data refer to different acoustic conditions, thanks to its robustness against the classification errors that may occur in that case. The proposed method has been compared to two other approaches based on a neural network component: an end-to-end CNN scheme, and a DNN model proposed in the literature. In both cases, the proposed method provided superior performance. We attribute this to the hybrid nature of the processing scheme, in which the CNN component is integrated with a simple but effective information fusion model rooted in acoustic principles.
A number of issues remain to be investigated and will be the subject of future work. In the present study, frequency components in the training set were labeled as positive or negative using a frequency-independent, empirically selected threshold. This approach might be improved by investigating whether different components are more or less relevant to the localization depending on their frequency, and whether this has any perceptual basis. Moreover, in future refinements of this class of signal processing paths, the machine learning components might be trained both to improve the fusion model (as done in the present case) and to recognize spectral/temporal characteristics of the acoustic sources, distinguishing for example between speech, music, or ecological sounds. In that case, they could be successfully used for effective multisource localization, or might be trained to distinguish between actual and image sources in reverberant environments.