MLSP 2020

IEEE International Workshop on Machine Learning for Signal Processing

September 21–24, 2020 Aalto University, Espoo, Finland (virtual conference)

Monday, September 21, 2020


Lecture Session 1: Neural Networks and Applications 1

Chair: Andreas Hauptmann, University of Oulu


DCGAN for the synthesis of multivariate multifractal textures: How do we know it works?
Vincent Mauduit, Patrice Abry, Roberto Leonarduzzi, Stephane Roux, Emmanuel Quemener

Deep Learning is nowadays widely used for several tasks in image processing. Notably, it has been used massively for image synthesis, though mostly for images with strong geometrical content. Focused on a race for better performance via more complex architectures, research on Deep Learning has, however, left behind the critical issue of assessing quantitatively and reproducibly the quality of the synthesized images, notably in the case of pure textures. The present work studies the ability of Deep Convolutional Generative Adversarial Networks to synthesize multivariate textures characterized by rich multiscale multivariate statistics (multifractals). The focus is thus on quantifying the quality of the synthesized textures, on assessing the reproducibility of the learning procedure, and on studying the impact of loss functions and of training dataset sizes, rather than on proposing yet another architecture.


Low-light environment neural surveillance
Michael L Potter, Henry Gridley, Noah Lichtenstein, Kevin Hines, John Nguyen, Jacob Walsh

We design and implement an end-to-end system for real-time crime detection in low-light environments. Unlike Closed-Circuit Television, which operates reactively, the Low-Light Environment Neural Surveillance system provides real-time crime alerts. The system uses a low-light video feed processed in real time by an optical-flow network, spatial and temporal networks, and a Support Vector Machine to identify shootings, assaults, and thefts. We create a low-light action-recognition dataset, LENS-4, which will be publicly available. An IoT infrastructure set up via Amazon Web Services interprets messages from the local board hosting the camera for action recognition and parses the results in the cloud to relay messages. The system achieves 71.5% accuracy at 20 FPS. The user interface is a mobile app that allows local authorities to receive notifications and to view a video of the crime scene. Citizens have a public app that enables law enforcement to push crime alerts based on user proximity.


QISTA-Net: DNN architecture to solve \(\ell_q\)-norm minimization problem
Gang-Xuan Lin, Chun-Shien Lu

In this paper, we reformulate the non-convex \(\ell_q\)-norm minimization problem with \(q\in(0,1)\) into a 2-step problem, which consists of one convex and one non-convex subproblem, and propose a novel iterative algorithm called QISTA (\(\ell_q\)-ISTA) to solve the \(\ell_q\)-problem. By taking advantage of DNNs in accelerating optimization algorithms, we also design a DNN architecture associated with QISTA, called QISTA-Net, which is then further sped up as QISTA-Net\(^+\) using the momentum from all previous layers. Extensive experimental comparisons demonstrate that the proposed methods yield better reconstruction quality than state-of-the-art \(\ell_1\)-norm optimization (plus learning) algorithms, even if the original sparse signal is noisy.
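
For orientation, the classical ISTA iteration for the convex \(\ell_1\) problem, which QISTA generalizes to \(\ell_q\), can be sketched as follows. This is not the authors' QISTA or QISTA-Net: the function names, the step size `step`, the weight `lam`, and the plain-list matrix representation are illustrative assumptions.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(A, y, lam, step, n_iter=500):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by iterative shrinkage.

    A is a list of rows; step should not exceed 1 / (largest eigenvalue of A^T A).
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(n_iter):
        # Residual r = A x - y.
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient of the quadratic term: g = A^T r.
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by soft-thresholding.
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x
```

With an identity matrix, the iteration reduces to soft-thresholding the measurements, which makes a convenient sanity check.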


End-to-end learning for retrospective change-point estimation
Corinne Jones, Zaid Harchaoui

We propose an approach to retrospective change-point estimation that includes learning feature representations from data. The feature representations are specified within a differentiable programming framework, that is, as parameterized mappings amenable to automatic differentiation. The proposed method uses these feature representations in a penalized least-squares objective into which known change-point labels can be incorporated. We propose to minimize the objective using an alternating optimization procedure. We present numerical illustrations on synthetic and real data showing that learning feature representations can result in more accurate estimation of change-point locations.
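
As a point of reference, the least-squares criterion for a single change point on fixed (not learned) scalar features can be sketched as below. This is a simplified baseline under stated assumptions, not the proposed method: there is no feature learning, no label supervision, and no penalty term (the penalty matters when the number of change points is unknown). The function names are hypothetical.

```python
def sse(segment):
    """Sum of squared deviations from the segment mean."""
    mean = sum(segment) / len(segment)
    return sum((v - mean) ** 2 for v in segment)


def best_single_change_point(x):
    """Return the index t (1 <= t < len(x)) minimizing SSE(x[:t]) + SSE(x[t:])."""
    best_t, best_cost = None, float("inf")
    for t in range(1, len(x)):
        cost = sse(x[:t]) + sse(x[t:])
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```

The returned index is the first sample of the second segment.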


Robust classification using hidden Markov models and mixtures of normalizing flows
Anubhab Ghosh, Antoine Honore, Dong Liu, Gustav Eje Henter, Saikat Chatterjee

We test the robustness of a maximum-likelihood (ML) based classifier when the observed sequential data is corrupted by noise. The hypothesis is that a generative model that combines the state transitions of a hidden Markov model (HMM) with neural network based probability distributions for the hidden states of the HMM can provide robust classification performance. The combined model is called the normalizing-flow mixture model based HMM (NMM-HMM). It can be trained using a combination of expectation-maximization and backpropagation. We verify the improved robustness of NMM-HMM classifiers in an application to speech recognition.
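
The maximum-likelihood classification rule can be illustrated with a small sketch: one HMM per class, sequence likelihood from the forward algorithm, and the class with the highest likelihood wins. As a loudly stated simplification, the emission model here is a discrete probability table rather than a mixture of normalizing flows, and the function names and the `models` dictionary layout are hypothetical.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Forward algorithm for a discrete-emission HMM, with per-step scaling."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    total = sum(alpha)
    log_lik = math.log(total)
    alpha = [a / total for a in alpha]
    for o in obs[1:]:
        # Propagate through the transition matrix, then weight by the emission.
        alpha = [emit[s][o] * sum(alpha[r] * trans[r][s] for r in range(n_states))
                 for s in range(n_states)]
        total = sum(alpha)
        log_lik += math.log(total)
        alpha = [a / total for a in alpha]
    return log_lik


def classify(obs, models):
    """Return the label whose HMM (start, trans, emit) gives the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))
```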


Regularizing neural networks by stochastically training layer ensembles
Alex Labach, Shahrokh Valaee

Dropout and similar stochastic neural network regularization methods are often interpreted as implicitly averaging over a large ensemble of models. We propose STE (stochastically trained ensemble) layers, which enhance the averaging properties of such methods by training an ensemble of weight matrices with stochastic regularization while explicitly averaging outputs. This provides stronger regularization with no additional computational cost at test time. We show consistent improvement on various image classification tasks using standard network topologies.

Poster Session 1: Applications in Multimedia and Biomedical Signal Processing

Chair: Lassi Roininen, Lappeenranta University of Technology


ModeNet: Mode selection network for learned video coding
Théo Ladune, Pierrick Philippe, Wassim Hamidouche, Lu Zhang, Olivier Deforges

In this paper, a mode selection network (ModeNet) is proposed to enhance deep learning-based video compression. Inspired by traditional video coding, ModeNet's purpose is to enable competition among several coding modes. The proposed ModeNet learns and conveys a pixel-wise partitioning of the frame, used to assign each pixel to the most suitable coding mode. ModeNet is trained alongside the different coding modes to minimize a rate-distortion cost. It is a flexible component that can be generalized to other systems to allow competition between different coding tools. The benefit of ModeNet is studied on a P-frame coding task, where it is used to design a method for coding a frame given its prediction. ModeNet-based systems achieve compelling performance when evaluated under the Challenge on Learned Image Compression 2020 (CLIC20) P-frame coding track conditions.
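
The classical rate-distortion competition that inspires this design can be sketched as a per-pixel choice of the mode minimizing D + lambda * R. ModeNet itself learns the partitioning with a network; the exhaustive search below, the per-pixel cost tables, and the name `select_modes` are illustrative assumptions only.

```python
def select_modes(distortion, rate, lam):
    """For each pixel, pick the mode index minimizing distortion + lam * rate.

    distortion[m][p] and rate[m][p] hold the cost of coding pixel p with mode m.
    """
    n_modes = len(distortion)
    n_pixels = len(distortion[0])
    choice = []
    for p in range(n_pixels):
        costs = [distortion[m][p] + lam * rate[m][p] for m in range(n_modes)]
        choice.append(min(range(n_modes), key=costs.__getitem__))
    return choice
```

Raising `lam` favors cheaper modes at the expense of distortion, which is the usual trade-off controlled by the Lagrange multiplier.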


Modeling phone call durations via switching Poisson processes with applications in mental health
Pablo Bonilla-Escribano, David Ramírez, Antonio Artés Rodríguez

This work models phone call durations via switching Poisson point processes. This kind of process is composed of two intertwined intensity functions: one models the start of a call, whereas the other models when the call ends. Thus, the call duration is obtained from the inverse of the intensity function of finishing a call. Additionally, to model the circadian rhythm present in human behavior, we use a (positive) truncated Fourier series as the parametric form of the intensities. Finally, the maximum likelihood estimates of the intensity functions are obtained using a trust region method, and the performance is evaluated on synthetic and real data, showing good results.
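
To make the likelihood concrete, the sketch below evaluates the log-likelihood of a single inhomogeneous Poisson process whose intensity is a truncated Fourier series with a 24-hour period. It is not the switching model of the paper: positivity of the intensity is assumed rather than enforced, the trust-region optimizer is not shown, and all names and the grid size are hypothetical.

```python
import math


def fourier_intensity(t, a0, coeffs, period=24.0):
    """Truncated Fourier series a0 + sum_k (a_k cos + b_k sin) with the given period."""
    value = a0
    for k, (a_k, b_k) in enumerate(coeffs, start=1):
        angle = 2.0 * math.pi * k * t / period
        value += a_k * math.cos(angle) + b_k * math.sin(angle)
    return value


def poisson_log_likelihood(events, a0, coeffs, t_end, period=24.0, n_grid=2000):
    """Sum of log-intensities at the event times minus the integrated intensity on [0, t_end]."""
    log_term = sum(math.log(fourier_intensity(t, a0, coeffs, period)) for t in events)
    # Trapezoidal rule for the integral of the intensity.
    h = t_end / n_grid
    integral = 0.0
    for i in range(n_grid):
        left = fourier_intensity(i * h, a0, coeffs, period)
        right = fourier_intensity((i + 1) * h, a0, coeffs, period)
        integral += 0.5 * (left + right) * h
    return log_term - integral
```

For a constant intensity, the expected waiting time until the next event is the reciprocal of the rate, which is the sense in which a duration follows from the inverse of an intensity.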


Blind audio source separation using two expectation-maximization algorithms
Aviad Eisenberg, Boaz Schwartz, Sharon Gannot

The problem of multi-microphone blind audio source separation in a noisy environment is addressed. The estimation of the acoustic signals and the associated parameters is carried out using the expectation-maximization algorithm. Two separation algorithms are developed, using either a deterministic representation or a stochastic Gaussian distribution for modelling the speech signals. Under the deterministic model, the speech sources are estimated in the M-step by applying multiple minimum variance distortionless response (MVDR) beamformers in parallel, while under the stochastic model, the speech signals are estimated in the E-step by applying multiple multichannel Wiener filters (MCWF) in parallel. In the simulation study, we generated a large dataset of microphone signals by convolving speech signals that have overlapping activity patterns with measured acoustic impulse responses. It is shown that the proposed methods outperform a baseline method in terms of speech quality and intelligibility.


A multi-patch aggregated aesthetic rating system based on eye fixation
Yung Yuan Tseng, Tien-Ruey Hsiang

Because individuals have their own aesthetics, evaluations of the same photograph may differ, so it is essential to develop an assessment system that approximates the aesthetics of the general public. This paper uses the large-scale Aesthetic Visual Analysis (AVA) dataset to help establish an aesthetic assessment system. In addition to a machine perspective, we further propose a human-like eye fixation method that enables machines to learn from a human perspective when analyzing aesthetics from data, and that can serve as a reference for future machine learning systems that learn abstract features from a human perspective. Our system selects the areas that attract the most viewer attention as patches and uses the whole image to represent the overall layout; both are then analyzed by the multi-patch aggregated aesthetic model. The performance improvements are validated by the linear correlation coefficient and the mean square error.


Pulse ID: The case for robustness of ECG as a biometric identifier
Vishnu Chandrashekhar, Prerna Singh, Mihir Paralkar, Ozan K. Tonguz

Electrocardiogram (ECG) signals are known to encode unique signatures based on the geometrical characteristics of the heart. Due to other advantages, such as continuity and accessibility (now via smartwatch technology), ECG could make for a robust biometric ID system. We show that single-node ECG measurements through an Apple Watch would suffice to identify an individual. Apart from the Apple Watch ECG data, we have also analyzed two other ECG datasets from PhysioNet to test the robustness of our methods in two situations: high volume (across a large number of individuals) and high variability (across different states of activity). We have also compared multiple classifier models in combination with different feature sets to identify the best-performing combination. We observed Equal Error Rate (EER) values that were consistently below 3%. Our results show that ECG is very effective and robust as a biometric identifier.
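
For readers unfamiliar with the metric, the Equal Error Rate is the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be approximated from match scores follows; the score convention (higher means more likely genuine), the threshold sweep over observed scores, and the function name are assumptions, not the authors' evaluation code.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping thresholds over all observed scores."""
    best_gap, eer = float("inf"), None
    for thr in sorted(set(genuine) | set(impostor)):
        # Impostor scores at or above the threshold are falsely accepted.
        far = sum(s >= thr for s in impostor) / len(impostor)
        # Genuine scores below the threshold are falsely rejected.
        frr = sum(s < thr for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```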


F0 estimation using blind source separation for analyzing Noh singing
Atsuki Tamoto, Katunobu Itou

The purpose of this study is to extract singing melody from mixed sounds related to Noh performances. Noh sounds include singing, accompaniments, and other elements. For analyzing Noh singing, we need singing solos, but they are hard to collect since there are only a few sources of solo passages. Therefore, we focus on the extraction of singing melody from mixtures of accompaniments and singing. In this paper, we demonstrate that source separation can be introduced as an efficient preprocessing step for Noh singing melody extraction. In addition, we compare melody extraction based on a convolutional neural network (CNN) approach with Melodia, a plug-in for melody extraction which is particularly accurate in the presence of music with wide fluctuations in pitch. We also demonstrate that CNN-based melody estimation can be efficiently trained using singing after source separation.


Lumen & media segmentation of IVUS images via ellipse fitting using a wavelet-decomposed subband CNN
Pavel Sinha, Ioannis Psaromiligkos, Zeljko Zilic

We propose an automatic segmentation method for both lumen and media in IntraVascular UltraSound (IVUS) images using a deep convolutional neural network (CNN). In contrast to previous approaches, which broadly fall under the category of labeling each pixel as lumen, media, or background, we propose a CNN that is structurally regularized via wavelet-based subband decomposition and that directly predicts the two ellipses that best represent the lumen and media segments. The proposed architecture significantly reduces computational complexity and offers better performance compared to recent techniques in the literature. We evaluated our network on the publicly available IVUS-Challenge-2011 dataset using two performance metrics, namely Jaccard Measure (JM) and Hausdorff Distance (HD). The evaluation results show that our proposed network outperforms the state-of-the-art lumen and media segmentation methods by a maximum of 8% in JM (Lumen) and nearly 33% in HD (Media).
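
The two evaluation metrics can be sketched on sets of pixel coordinates as follows. This is a generic illustration, assuming non-empty sets of `(row, col)` tuples; the challenge's exact protocol (for example, computing HD on contours) may differ, and the function names are hypothetical.

```python
import math


def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two pixel sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(p, q):
        # Largest distance from a point in p to its nearest neighbour in q.
        return max(min(math.dist(u, v) for v in q) for u in p)
    return max(directed(a, b), directed(b, a))
```

A higher JM and a lower HD both indicate better agreement between a predicted region and the reference.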


Towards an explainable mortality prediction model
Jacob R Epifano, Ghulam Rasool, Ravi Ramachandran, Sharad Patel

Influence functions are analytical tools from robust statistics that can help interpret the decisions of black-box machine learning models. Influence functions can be used to attribute changes in the loss function to small perturbations in the input features. Current work on influence functions is limited to the features available before the last layer of deep neural networks (DNNs). We extend the influence function approximation to DNNs by computing gradients in an end-to-end manner and relate changes in the loss function to individual input features using an efficient algorithm. We propose an accurate mortality prediction neural network and show the effectiveness of extended influence functions on the eICU dataset. The features chosen by the proposed extended influence functions were more similar to those selected by human experts than those chosen by other traditional methods.


Convolutional recurrent neural network based direction of arrival estimation method using two microphones for hearing studies
Abdullah Kucuk, Issa Panahi

This work proposes a convolutional recurrent neural network (CRNN) based direction of arrival (DOA) angle estimation method, implemented on an Android smartphone for hearing aid applications. The proposed app provides a 'visual' indication of the direction of a talker on the smartphone screen to improve the hearing of people with hearing disorders. We use the real and imaginary parts of the short-time Fourier transform (STFT) as the feature set for the proposed CRNN architecture for DOA angle estimation. Real smartphone recordings are used to assess the performance of the proposed method. The accuracy of the proposed method reaches 87.33% for unseen (untrained) environments. This work also presents real-time inference of the proposed method, which runs on an Android smartphone using only its two built-in microphones and no additional component or external hardware. The real-time implementation also demonstrates the generalization and robustness of the proposed CRNN based model.
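
The feature construction described above can be sketched for one microphone channel as follows: each windowed frame is transformed with a DFT, and the real and imaginary parts are concatenated. The frame length, hop size, Hann window, naive DFT, and function name are illustrative assumptions and not the authors' settings; a two-microphone system would apply this per channel.

```python
import cmath
import math


def stft_real_imag(signal, frame_len=8, hop=4):
    """Per-frame features: real parts, then imaginary parts, of the one-sided DFT."""
    # Periodic Hann window.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(n_bins)
        ]
        features.append([c.real for c in spectrum] + [c.imag for c in spectrum])
    return features
```

Keeping real and imaginary parts, rather than magnitudes only, preserves the phase information that carries inter-microphone delay cues.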


Double JPEG compression detection for distinguishable blocks in images compressed with same quantization matrix
Abhinav Narayan, Vinay Verma, Nitin Khanna

Detection of compression history is a crucial step in verifying the authenticity of a JPEG image. Previous approaches for double compression detection with the same quantization matrix are designed for full-sized images or large patches. In this paper, we propose a novel deep learning based approach that utilizes spatial and frequency domain information from the error blocks obtained from multiple compression stages and uses a multi-column CNN architecture to classify distinguishable blocks of size 8x8. Three successive error blocks are obtained from the given JPEG block and its repeated compression by taking the difference between inverse discrete cosine transform (DCT) of de-quantized DCT coefficients and the reconstructed blocks. On average, the performance gain of the proposed approach over the baseline method in terms of TPR, TNR, and balanced accuracy is 4.04%, 1.6%, and 2.8%, respectively. We also show the applicability of the method for unseen quality factors.
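
One compression stage and its error block can be sketched as below: the error is the difference between the inverse DCT of the de-quantized coefficients and the rounded, clipped reconstruction. This is a simplified illustration, not the authors' pipeline; the quantization matrix `q`, the level shift by 128, the rounding convention, and all function names are assumptions.

```python
import math

N = 8
# Orthonormal 8-point DCT-II basis, C[k][n].
C = [[(math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
      * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
     for k in range(N)]


def matmul(a, b):
    """Product of two N x N matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)] for i in range(N)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def jpeg_error_block(block, q):
    """Run one compression stage on an 8x8 block; return (reconstruction, error block)."""
    shifted = [[v - 128.0 for v in row] for row in block]
    coeffs = matmul(matmul(C, shifted), transpose(C))             # forward 2-D DCT
    dequant = [[round(coeffs[i][j] / q[i][j]) * q[i][j] for j in range(N)]
               for i in range(N)]                                 # quantize, de-quantize
    idct = [[v + 128.0 for v in row]
            for row in matmul(matmul(transpose(C), dequant), C)]  # inverse 2-D DCT
    recon = [[min(255, max(0, round(v))) for v in row] for row in idct]
    error = [[idct[i][j] - recon[i][j] for j in range(N)] for i in range(N)]
    return recon, error
```

Feeding the returned reconstruction back into `jpeg_error_block` with the same `q` yields the error block of the next compression stage, so successive error blocks can be collected by repeating the call.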