MLSP 2020

IEEE International Workshop on

September 21–24, 2020 Aalto University, Espoo, Finland (virtual conference)

Tuesday, September 22, 2020


Lecture Session 2: Acoustic Signal Processing using Machine Learning

Chair: Sharon Gannot, Bar-Ilan University


A Bayesian hierarchical mixture of Gaussian model for multi-speaker DOA estimation and separation
Yaron Laufer, Sharon Gannot

In this paper we propose a fully Bayesian hierarchical model for multi-speaker direction of arrival (DoA) estimation and separation in noisy environments, utilizing the W-disjoint orthogonality property of the speech sources. Our probabilistic approach employs a mixture of Gaussians formulation with centroids associated with a grid of candidate speakers’ DoAs. The hierarchical Bayesian model is established by attributing priors to the various parameters. We then derive a variational Expectation-Maximization algorithm that estimates the DoAs by selecting the most probable candidates, and separates the speakers using a variant of the multichannel Wiener filter that takes into account the responsibility of each candidate in describing the received data. The proposed algorithm is evaluated using real room impulse responses from a freely available database, in terms of both DoA estimation accuracy and separation scores. It is shown that the proposed method outperforms competing methods.


Determined audio source separation with multichannel star generative adversarial network
Li Li, Hirokazu Kameoka, Shoji Makino

This paper proposes a multichannel source separation approach, which uses a star generative adversarial network (StarGAN) to model power spectrograms of sources. Various studies have shown the significant contributions of a precise source model to the performance improvement in audio source separation, which indicates the importance of developing a better source model. In this paper, we explore the potential of StarGAN for modeling source spectrograms and investigate the effectiveness of the StarGAN source model in determined multichannel source separation by incorporating it into a frequency-domain independent component analysis (ICA) framework. The experimental results revealed that the proposed StarGAN-based method outperformed conventional methods, which employ non-negative matrix factorization (NMF) or a variational autoencoder (VAE) for source spectrogram modeling.


Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding
Ali Aroudi, Marc Delcroix, Tomohiro Nakatani, Keisuke Kinoshita, Shoko Araki, Simon Doclo

The performance of speech enhancement algorithms in a multi-speaker scenario depends on correctly identifying the target speaker to be enhanced. Auditory attention decoding (AAD) methods make it possible to identify the target speaker to whom the listener is attending from single-trial EEG recordings. In this paper we propose a cognitive-driven multi-microphone speech enhancement system, which combines a neural-network-based mask estimator, weighted minimum power distortionless response convolutional beamformers and AAD. The proposed system makes it possible to enhance the attended speaker and jointly suppress reverberation, the interfering speaker and ambient noise. To control the suppression of the interfering speaker, we also propose an extension incorporating an interference suppression constraint. The experimental results show that the proposed system outperforms the state-of-the-art cognitive-driven speech enhancement systems in reverberant and noisy conditions.


Feature projection-based unsupervised domain adaptation for acoustic scene classification
Alessandro I. Mezza, Emanuel Habets, Meinard Müller, Augusto Sarti

The mismatch between the data distributions of training and test data acquired under different recording conditions and using different devices is known to severely impair the performance of Acoustic Scene Classification (ASC) systems. To address this issue, we propose an unsupervised domain adaptation method for ASC based on the projection of spectro-temporal features extracted from both the source and target domain onto the principal subspace spanned by the eigenvectors of the sample covariance matrix of source-domain training data. Using the TUT Urban Acoustic Scenes 2018 Mobile Development dataset we show that the proposed method outperforms state-of-the-art unsupervised domain adaptation techniques when applied jointly with a convolutional ASC model and can also be practically employed as a feature extraction procedure for shallower artificial neural networks.
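The projection step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice to centre both domains with the source-domain mean, and the number of retained components are all assumptions made for the example.

```python
import numpy as np


def fit_source_subspace(source_features, n_components):
    """Estimate the principal subspace of source-domain features.

    source_features has shape (n_samples, n_features). Returns the source
    mean and the top eigenvectors of the sample covariance matrix.
    """
    mean = source_features.mean(axis=0)
    centered = source_features - mean
    cov = centered.T @ centered / (len(source_features) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order]


def project(features, mean, basis):
    """Project features from either domain onto the source subspace.

    Centring with the source mean is an assumption of this sketch.
    """
    return (features - mean) @ basis
```

The same `mean` and `basis` would be applied to both source and target features, so no target labels are needed at any point.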


AutoClip: Adaptive gradient clipping for source separation networks
Prem Seetharaman, Gordon Wichern, Bryan Pardo, Jonathan LeRoux

Clipping the gradient is a known approach to improving gradient descent, but requires hand selection of a clipping threshold hyperparameter. We present AutoClip, a simple method for automatically and adaptively choosing a gradient clipping threshold, based on the history of gradient norms observed during training. Experimental results show that applying AutoClip results in improved generalization performance for audio source separation networks. Observation of the training dynamics of a separation network trained with and without AutoClip shows that AutoClip guides optimization into smoother parts of the loss landscape. AutoClip is very simple to implement and can be integrated readily into a variety of applications across multiple domains.
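A history-based clipping rule of the kind described here might look like the sketch below. The abstract does not state how the threshold is derived from the history, so using a percentile of all past gradient norms is an assumption of this example, as are the class name `AdaptiveClipper` and the default percentile.

```python
import math


def percentile(values, p):
    """Linear-interpolation percentile of a non-empty list, p in [0, 100]."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (p / 100.0) * (len(ordered) - 1)
    lo = math.floor(rank)
    hi = math.ceil(rank)
    frac = rank - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


class AdaptiveClipper:
    """Clips gradients to a percentile of the gradient norms seen so far."""

    def __init__(self, clip_percentile=10.0):
        self.clip_percentile = clip_percentile  # hypothetical default
        self.history = []

    def clip(self, grad):
        """Record the norm of grad and return a (possibly rescaled) copy."""
        norm = math.sqrt(sum(g * g for g in grad))
        self.history.append(norm)
        threshold = percentile(self.history, self.clip_percentile)
        if norm > threshold and norm > 0.0:
            scale = threshold / norm
            return [g * scale for g in grad]
        return list(grad)
```

In a training loop, `clip` would be called once per step on the flattened gradient before the optimizer update, so the threshold adapts as training progresses.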


Semi-supervised source localization with deep generative modeling
Michael Bianco, Sharon Gannot, Peter Gerstoft

We propose a semi-supervised localization approach based on deep generative modeling with variational autoencoders (VAE). Localization in reverberant environments remains a challenge, which machine learning (ML) has shown promise in addressing. Even with large data volumes, the number of labels available for supervised learning in reverberant environments is usually small. We address this issue by performing semi-supervised learning (SSL) with convolutional VAEs. The VAE is trained to generate the phase of relative transfer functions (RTFs), in parallel with a DOA classifier, on both labeled and unlabeled RTF samples. The VAE-SSL approach is compared with SRP-PHAT and fully-supervised CNNs. We find that VAE-SSL can outperform both SRP-PHAT and CNN in label-limited scenarios.

Lecture Session 3: Bayesian Learning and Modelling

Chair: Michael Riis, Technical University of Denmark


Scalable Gaussian process for extreme classification
Akash Kumar Dhaka, Michael Andersen, Pablo Moreno, Aki Vehtari

We address the limitations of Gaussian processes for multiclass classification in the setting where both the number of classes and the number of observations are very large. We propose a scalable approximate inference framework by combining the inducing points method with variational approximations of the likelihood that have been recently proposed in the literature. This leads to a tractable lower bound on the marginal likelihood that decomposes into a sum over both data points and class labels, and hence, is amenable to doubly stochastic optimization. To overcome memory issues when dealing with large datasets, we resort to amortized inference, which coupled with subsampling over classes reduces the computational load with sparse updates as well as the memory footprint without a significant loss in performance. We demonstrate empirically that the proposed algorithm leads to superior performance in terms of test accuracy, and improved detection of tail labels.


On the effectiveness of two-step learning for latent-variable models
Cem Subakan, Maxime Gasse, Laurent Charlin

Latent-variable generative models offer a principled solution for modeling and sampling from complex probability distributions. Implementing a joint training objective with a complex prior, however, can be a tedious task, as one is typically required to derive and code a specific cost function for each new type of prior distribution. In this work, we propose a general framework for learning latent variable generative models in a two-step fashion. In the first step of the framework, we train an autoencoder, and in the second step we fit a prior model on the resulting latent distribution. This two-step approach offers a convenient alternative to joint training, as it allows for a straightforward combination of existing models without the hassle of deriving new cost functions or coding the joint training objectives. We demonstrate that two-step learning results in performance similar to joint training, and in some cases even results in more accurate modeling.
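The two steps can be illustrated with deliberately simple stand-ins. In the sketch below, a closed-form linear autoencoder (computed via SVD) replaces the neural autoencoder, and a single Gaussian replaces the prior model; both substitutions, and all function names, are assumptions chosen to keep the example short, not the models used in the paper.

```python
import numpy as np


def train_linear_autoencoder(data, latent_dim):
    """Step 1: fit an autoencoder (here a linear one, solved via SVD)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    weights = vt[:latent_dim].T  # shape (n_features, latent_dim)

    def encode(x):
        return (x - mean) @ weights

    def decode(z):
        return z @ weights.T + mean

    return encode, decode


def fit_gaussian_prior(latents):
    """Step 2: fit a prior (here a single Gaussian) to the latent codes."""
    prior_mean = latents.mean(axis=0)
    prior_cov = np.atleast_2d(np.cov(latents, rowvar=False))
    return prior_mean, prior_cov


def sample_from_model(decode, prior_mean, prior_cov, n_samples, rng):
    """Generate data by sampling the fitted prior and decoding."""
    z = rng.multivariate_normal(prior_mean, prior_cov, size=n_samples)
    return decode(z)
```

Because the prior is fitted after the autoencoder is frozen, swapping in a different prior only changes `fit_gaussian_prior` and the sampling call; no joint objective has to be derived.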


Simultaneous intent prediction and state estimation using an intent-driven intrinsic coordinate model
Jiaming Liang, Bashar Ahmad, Simon Godsill

The motion of an object (e.g. ship, jet, pedestrian, bird, drone, etc.) is usually governed by premeditated actions as per an underlying intent, for instance reaching a destination. In this paper, we introduce a novel intent-driven dynamical model based on a continuous-time intrinsic coordinate model. By combining this model with particle filtering, a seamless approach for jointly predicting the destination and estimating the state of a highly manoeuvrable object is developed. We examine the proposed inference technique using real data with different measurement models to demonstrate its efficacy. In particular, we show that the introduced approach can be a flexible and competitive alternative, in terms of prediction and estimation performance, to other existing methods for various measurement models including nonlinear ones.


Multinomial sampling for hierarchical change-point detection
Lorena Romero-Medrano, Pablo Moreno-Munoz, Antonio Artés Rodríguez

Bayesian change-point detection, together with latent variable models, allows segmentation of high-dimensional time-series. We assume that change-points lie on a lower-dimensional manifold where we aim to infer subsets of discrete latent variables. For this model, full inference is computationally infeasible and pseudo-observations based on point-estimates are used instead. However, if the estimation is not certain enough, change-point detection is degraded. To circumvent this problem, we propose a multinomial sampling methodology that improves the detection rate and reduces the delay while keeping complexity stable and inference analytically tractable. Our experiments show results that outperform the baseline method and we also provide an example oriented to a human behavior study.


Robust learning via ensemble density propagation in deep neural networks
Giuseppina Carannante, Dimah Dera, Ghulam Rasool, Nidhal Bouaynaya, Lyudmila Mihaylova

Learning in uncertain, noisy, or adversarial environments is a challenging task for deep neural networks (DNNs). We propose a new theoretically grounded and efficient approach for robust learning building on Bayesian analysis and Variational Inference. We formulate the problem of density propagation through layers of a Bayesian DNN and solve it using an Ensemble Density Propagation (EnDP) scheme. The EnDP approach allows us to propagate moments of the variational probability distribution across layers of a Bayesian DNN enabling the estimation of the mean and covariance of the predictive distribution at the output of the model. Our experiments using MNIST and CIFAR-10 datasets show a significant improvement in the robustness of the trained models to random noise and adversarial attacks.


Self-compression in Bayesian neural networks
Giuseppina Carannante, Dimah Dera, Ghulam Rasool, Nidhal Bouaynaya

Machine learning models have achieved human-level performance on various tasks. This success comes at a high cost of computation and storage overhead, which makes machine learning algorithms difficult to deploy on edge devices. Typically, one has to partially sacrifice accuracy in favor of an increased performance quantified in terms of reduced memory usage and energy consumption. Current methods compress the networks by reducing the precision of the parameters or by eliminating redundant ones. In this paper, we propose a new insight into network compression through the Bayesian framework. We show that Bayesian neural networks automatically discover redundancy in model parameters, thus enabling self-compression, which is linked to the propagation of uncertainties through the layers of the network. Our experimental results show that the network architecture can be successfully compressed by deleting parameters identified by the network itself while retaining the same level of accuracy.

Poster Session 2: Neural Networks and Applications 2

Chair: Arno Solin, Aalto University


Revealing perceptible backdoors in DNNs, without the training set, via the maximum achievable misclassification fraction statistic
Zhen Xiang, David Miller, Hang Wang, George Kesidis

Recently, a backdoor data poisoning attack was proposed, which adds mislabeled examples to the training set, with an embedded backdoor pattern, aiming to have the classifier learn to classify to a target class whenever the backdoor pattern is present in a test sample. We address post-training detection of innocuous perceptible backdoors in DNN image classifiers, wherein the defender does not have access to the poisoned training set. This problem is challenging because without the poisoned training set, we have no hint about the actual backdoor pattern used during training. We identify two properties of perceptible backdoor patterns – spatial invariance and robustness – based upon which we propose a novel detector using the maximum achievable misclassification fraction (MAMF) statistic. We detect whether the trained DNN has been backdoor-attacked and infer the source and target classes. Our detector outperforms other existing detectors experimentally.


A general framework for ensemble distribution distillation
Jakob Lindqvist, Amanda E. C. Olmin, Fredrik Lindsten, Lennart Svensson

Ensembles of neural networks have been shown to give better predictive performance and more reliable uncertainty estimates than individual networks. Additionally, ensembles allow the uncertainty to be decomposed into aleatoric (data) and epistemic (model) components, giving a more complete picture of the predictive uncertainty. Ensemble distillation is the process of compressing an ensemble into a single model, often resulting in a leaner model that still outperforms the individual ensemble members. Unfortunately, standard distillation erases the natural uncertainty decomposition of the ensemble. We present a general framework for distilling both regression and classification ensembles in a way that preserves the decomposition. We demonstrate the desired behaviour of our framework and show that its predictive performance is on par with standard distillation.
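For a classification ensemble, one common way to compute the aleatoric/epistemic split mentioned above is the entropy-based decomposition sketched below. The abstract does not say which uncertainty measure the authors use, so treating total uncertainty as the entropy of the averaged prediction is an assumption of this example, as are the function names.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(member_probs):
    """Split ensemble uncertainty for a single input.

    member_probs holds one class-probability list per ensemble member.
    Returns (total, aleatoric, epistemic), where
      total     = entropy of the averaged prediction,
      aleatoric = average entropy of the individual members,
      epistemic = total - aleatoric (disagreement between members).
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean_probs = [
        sum(member[c] for member in member_probs) / n_members
        for c in range(n_classes)
    ]
    total = entropy(mean_probs)
    aleatoric = sum(entropy(member) for member in member_probs) / n_members
    return total, aleatoric, total - aleatoric
```

Members that agree on an uncertain prediction yield purely aleatoric uncertainty, while confident members that disagree yield purely epistemic uncertainty; averaging the members into one prediction keeps only the total.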


Wavelet-based convolutional neural network design with an application to dual-energy CBCT pre-spectral-decomposition filtering
Luis Albert Zavala Mondragon, Peter H. N. de With, Danny Ruijters, Peter van de Haar, Fons van der Sommen

Convolutional Neural Networks (CNNs) are reshaping signal processing and computer vision by providing data-driven solutions for inverse problems such as noise reduction. However, their relationship with established signal processing methods is sometimes unclear, and their development does not fully exploit the existing knowledge. In this paper, rather than improving existing CNNs with wavelet transformations as explored earlier, we improve the wavelet shrinkage approach to noise reduction with a data-driven solution. The resulting CNN has clear encoding, decoding and processing paths. As an application, we perform noise reduction in Dual-Energy Cone-Beam CT. The obtained results were compared to those of a UNet-like architecture and reveal better noise-free images without aliasing artifacts. This indicates that our architecture preserves the information contained in the images well because it explicitly exploits the underlying signal representation.


Quadratic mutual information regularization in real-time deep CNN models
Maria Tzelepi, Anastasios Tefas

In this paper, regularized lightweight deep convolutional neural network models, capable of operating effectively in real time on devices with restricted computational power for high-resolution video input, are proposed. Furthermore, a novel regularization method motivated by the Quadratic Mutual Information is proposed in order to improve the generalization ability of the utilized models. Extensive experiments on various binary classification problems involved in autonomous systems are performed, indicating the effectiveness of the proposed models as well as of the proposed regularizer.


Motion pattern recognition in 4D point clouds
Dariush Salami, Sameera Palipana, Manila Kodali, Stephan Sigg

We address an actively discussed problem in signal processing, recognizing patterns from spatial data in motion. In particular, we suggest a neural network architecture to recognize motion patterns from 4D point clouds. We demonstrate the feasibility of our approach with point cloud datasets of hand gestures. The architecture, PointGest, directly feeds on unprocessed timelines of point cloud data without any need for voxelization or projection. The model is resilient to noise in the input point cloud through abstraction to lower-density representations, especially for regions of high density. We evaluate the architecture on a benchmark dataset with ten gestures. PointGest achieves an accuracy of 98.8%, outperforming five state-of-the-art point cloud classification models.


Improving deep reinforcement learning for financial trading using neural network distillation
Avraam Tsantekidis, Nikolaos Passalis, Anastasios Tefas

Deep Reinforcement Learning (RL) is increasingly used for developing financial trading agents for a wide range of tasks. However, optimizing deep RL agents is known to be notoriously difficult and unstable, hindering the performance of financial trading agents. In this work, we propose a novel method for training deep RL agents, leading to better performing and more efficient RL agents. The proposed method works by first training a large and complex deep RL agent and then transferring the knowledge into a smaller and more efficient agent using neural network distillation. The ability of the proposed method to significantly improve deep RL for financial trading is demonstrated using experiments on a time series dataset consisting of Foreign Exchange (FOREX) trading pairs prices.
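The knowledge-transfer step can be sketched with the standard temperature-softened distillation loss. The abstract does not specify which distillation objective is used for the trading agents, so the KL divergence between softened teacher and student action distributions, the temperature value, and the function names below are all assumptions of this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher policy to the student policy."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher, student)
        if t > 0.0
    )
```

Minimizing this loss over states visited during training pushes the smaller agent's action preferences toward those of the larger one; a higher temperature exposes more of the teacher's relative preferences among non-greedy actions.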


Quaternion neural networks for 3D sound source localization in reverberant environments
Michela Ricciardi Celsi, Simone Scardapane, Danilo Comminiello

Localization of sound sources in 3D sound fields is an extremely challenging task, especially when the environments are reverberant and involve multiple sources. In this work, we propose a deep neural network to analyze audio signals recorded by 3D microphones and localize sound sources in a spatial sound field. In particular, we consider first-order Ambisonics microphones to capture 3D acoustic signals and represent them by spherical harmonic decomposition in the quaternion domain. Moreover, to improve the localization performance, we use quaternion input features derived from the acoustic intensity, which is strictly related to the direction of arrival (DOA) of a sound source. The proposed network architecture involves both quaternion-valued convolutional and recurrent layers. Results show that the proposed method is able to exploit the quaternion-valued representation of ambisonic signals and to improve the localization performance with respect to existing methods.


Graph-adaptive activation functions for graph neural networks
Bianca Iancu, Luana Ruiz, Alejandro Ribeiro, Elvin Isufi

Activation functions are crucial in graph neural networks (GNNs) as they allow capturing the relationship between the input graph data and their representations. We propose activation functions for GNNs that adapt to the graph, and are also distributable. To incorporate the feature-topology coupling, nonlinearized nodal features are combined with trainable parameters in a form akin to graph convolutions. This leads to a graph-adaptive trainable nonlinear component of the GNN that can be implemented directly or via kernel transformations, thus, enriching the class of functions to represent the network data. We show permutation equivariance is always preserved and prove the graph-adaptive max nonlinearities are Lipschitz stable to input perturbations. Numerical experiments with source localization, finite-time consensus, distributed regression, and recommender systems confirm our findings and show improved performance compared with pointwise and state-of-the-art localized nonlinearities.


Deep convolutional neural network-based inverse filtering approach for speech de-reverberation
Hanwook Chung, Vikrant Tomar, Benoit Champagne

In this paper, we introduce a spectral-domain inverse filtering approach for single-channel speech de-reverberation using deep convolutional neural network (CNN). The main goal is to better handle realistic reverberant conditions where the room impulse response (RIR) filter is longer than the short-time Fourier transform (STFT) analysis window. To this end, we consider the convolutive transfer function (CTF) model for the reverberant speech signal. In the proposed framework, the CNN architecture is trained to directly estimate the inverse filter of the CTF model. Among various choices for the CNN structure, we consider the U-net which consists of a fully-convolutional auto-encoder network with skip-connections. Experimental results show that the proposed method provides better de-reverberation performance than the prevalent benchmark algorithms under various reverberation conditions.


A unified approach for target direction finding based on convolutional neural networks
Chong Wang, Wei Liu, Mengdi Jiang

A convolutional neural network (CNN)-based approach for target direction finding, with the thinned coprime array (TCA) as an example, is proposed. The ResNeXt network is adopted as the backbone network with a multi-label classification modification to find directions of an unknown number of targets. Unlike the traditional wisdom, where an additional co-array operation is needed for underdetermined direction finding (the number of sources is larger than the number of physical sensors), in the proposed approach, it is shown that the same network with raw data as its input can deal with both the overdetermined and underdetermined cases, although using the covariance matrix of the data can reduce the complexity of the whole training process at the cost of a loss in performance.


Frequency domain-based perceptual loss for super resolution
Shane D Sims

We introduce Frequency Domain Perceptual Loss (FDPL), a loss function for single image super resolution (SR). Unlike previous loss functions used to train SR models, which are all calculated in the pixel (spatial) domain, FDPL is computed in the frequency domain. By working in the frequency domain we can encourage a given model to learn a mapping that prioritizes those frequencies most related to human perception. While the goal of FDPL is not to maximize the Peak Signal to Noise Ratio (PSNR), we found that there is a correlation between decreasing FDPL and increasing PSNR. Training a model with FDPL results in a higher average PSNR (30.94), compared to the same model trained with pixel loss (30.59), as measured on the Set5 image dataset. We also show that our method achieves higher qualitative results, which is the goal of a perceptual loss function. However, it is not clear whether the improved perceptual quality is due to the slightly higher PSNR or the perceptual nature of FDPL.
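The general idea of a frequency-weighted loss can be sketched as below. This is not the FDPL definition: the abstract does not specify the transform or the weighting, so the 2D FFT, the hypothetical `frequency_weights` function that favours low spatial frequencies, and its constant are assumptions chosen only to make the idea concrete.

```python
import numpy as np


def frequency_weights(height, width):
    """Hypothetical weights that emphasise low spatial frequencies."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return 1.0 / (1.0 + 10.0 * radius)


def frequency_domain_loss(prediction, target):
    """Weighted mean squared error between the 2D spectra of two images."""
    weights = frequency_weights(*prediction.shape)
    diff = np.fft.fft2(prediction) - np.fft.fft2(target)
    return float(np.mean(weights * np.abs(diff) ** 2))
```

With these weights, an error of a given energy costs more when it sits at low frequencies than at high ones, which is how a frequency-domain loss can steer a model toward the bands judged most important; an unweighted version would reduce to a scaled pixel-domain squared error by Parseval's theorem.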