Automatic Speech Recognition (ASR) has historically been a driving force behind many machine learning (ML) techniques, including the ubiquitously used hidden Markov model, discriminative learning, structured sequence learning, Bayesian learning, and adaptive learning. Moreover, ML can and occasionally does use ASR as a large-scale, realistic application to rigorously test the effectiveness of a given technique, and to inspire new problems arising from the inherently sequential and dynamic nature of speech. On the other hand, even though ASR is available commercially for some applications, it remains largely an unsolved problem: for almost all applications, the performance of ASR is not on par with human performance. New insights from modern ML methodology show great promise to advance the state of the art in ASR technology. This article provides readers with an overview of modern ML techniques as utilized in current ASR research and systems, and as relevant to future ASR research. The intent is to foster greater cross-pollination between the ML and ASR communities than has occurred in the past. The article is organized according to the major ML paradigms that are either popular already or have potential for making significant contributions to ASR technology. The paradigms presented and elaborated in this overview include: generative and discriminative learning; supervised, unsupervised, semi-supervised, and active learning; adaptive and multi-task learning; and Bayesian learning. These learning paradigms are motivated and discussed in the context of ASR technology and applications. We finally present and analyze recent developments in deep learning and learning with sparse representations, focusing on their direct relevance to advancing ASR technology.
Eliminating the negative effects of non-stationary environmental noise is a fundamental research topic in automatic speech recognition and still remains an important challenge. Data-driven supervised approaches, including those based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, given sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in developing environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.
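To make the vector-to-supervector progression above concrete, here is a minimal Python sketch of the classic GMM supervector construction: a universal background model (UBM) is MAP-adapted to one speaker's features and the adapted means are stacked into a fixed-length representation. The relevance factor and all data shapes are illustrative assumptions, not a prescription from the paper.

```python
# Minimal sketch of a GMM supervector: MAP-adapt UBM means to one speaker's
# features, then stack the adapted means. Shapes and the relevance factor
# r are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(ubm: GaussianMixture, feats: np.ndarray, r: float = 16.0) -> np.ndarray:
    """MAP-adapt the UBM means to `feats` (frames x dims) and stack them."""
    post = ubm.predict_proba(feats)              # frame-level posteriors, (T, C)
    n_c = post.sum(axis=0)                       # soft counts per component, (C,)
    f_c = post.T @ feats                         # first-order statistics, (C, D)
    alpha = (n_c / (n_c + r))[:, None]           # data-vs-prior adaptation weights
    ex = f_c / np.maximum(n_c[:, None], 1e-8)    # per-component data means
    adapted = alpha * ex + (1.0 - alpha) * ubm.means_
    return adapted.ravel()                       # supervector, (C*D,)

# Usage: fit the UBM on pooled background data, then adapt per speaker.
rng = np.random.default_rng(0)
background = rng.standard_normal((2000, 13))     # stand-in for pooled MFCC frames
ubm = GaussianMixture(n_components=8, covariance_type="diag").fit(background)
sv = gmm_supervector(ubm, rng.standard_normal((300, 13)))
print(sv.shape)                                  # (104,) = 8 components x 13 dims
```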
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the recent advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of statistical techniques for automatically detecting and exploiting patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features. With large amounts of training data, ML can discover models that describe complex acoustic phenomena such as human speech and reverberation. ML in acoustics is rapidly developing, with impressive results and significant future promise. We first introduce ML, then highlight ML developments in five acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, seismic exploration, and environmental sounds in everyday scenes.
This paper presents an overview of a state-of-the-art text-independent speaker verification system. First, an introduction proposes a modular scheme of the training and test phases of a speaker verification system. Then, the speech parameterization most commonly used in speaker verification, namely cepstral analysis, is detailed. Gaussian mixture modeling, which is the speaker modeling technique used in most systems, is then explained. A few speaker modeling alternatives, namely, neural networks and support vector machines, are mentioned. Normalization of scores is then explained, as this is a very important step to deal with real-world data. The evaluation of a speaker verification system is then detailed, and the detection error trade-off (DET) curve is explained. Several extensions of speaker verification are then enumerated, including speaker tracking and segmentation by speakers. Then, some applications of speaker verification are proposed, including on-site applications, remote applications, applications relative to structuring audio information, and games. Issues concerning the forensic area are then recalled, as we believe it is very important to inform people about the actual performance and limitations of speaker verification systems. This paper concludes by giving a few research trends in speaker verification for the next couple of years.
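Two of the steps detailed above, score normalization and DET-curve evaluation, are simple enough to sketch directly. The following is a minimal illustration assuming Z-norm (impostor-based normalization) and probit-scaled DET axes; the function names and clipping constant are our own.

```python
# Minimal sketch of Z-norm score normalization and DET-curve points.
# The probit (Gaussian deviate) scaling follows the standard DET convention.
import numpy as np
from scipy.stats import norm

def z_norm(score: float, impostor_scores: np.ndarray) -> float:
    """Normalize a trial score by impostor statistics for this model."""
    return (score - impostor_scores.mean()) / impostor_scores.std()

def det_points(target: np.ndarray, nontarget: np.ndarray):
    """Miss/false-alarm probabilities over all thresholds, on probit axes."""
    thr = np.sort(np.concatenate([target, nontarget]))
    p_miss = np.array([(target < t).mean() for t in thr])
    p_fa = np.array([(nontarget >= t).mean() for t in thr])
    clip = lambda p: np.clip(p, 1e-6, 1 - 1e-6)   # keep ppf finite
    return norm.ppf(clip(p_miss)), norm.ppf(clip(p_fa))
```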
This paper presents a novel speech feature enhancement technique based on a probabilistic, nonlinear acoustic environment model that effectively incorporates the phase relationship (hence phase sensitive) between the clean speech and the corrupting noise in the acoustic distortion process. The core of the enhancement algorithm is the MMSE (minimum mean square error) estimator for the log Mel power spectra of clean speech based on the phase-sensitive environment model, using highly efficient single-point, second-order Taylor series expansion to approximate the joint probability of clean and noisy speech modeled as a multivariate Gaussian. Since a noise estimate is required by the MMSE estimator, a high-quality, sequential noise estimation algorithm is also developed and presented. Both the noise estimation and speech feature enhancement algorithms are evaluated on the Aurora2 task of connected digit recognition. Noise-robust speech recognition results demonstrate that the new acoustic environment model which takes into account the relative phase in speech and noise mixing is superior to the earlier environment model which discards the phase under otherwise identical experimental conditions. The results also show that the sequential MAP (maximum a posteriori) learning for noise estimation is better than the sequential ML (maximum likelihood) learning, both evaluated under the identical phase-sensitive MMSE enhancement condition. Index Terms: noise estimation, noise-robust ASR, phase-sensitive acoustic environment model, sequential algorithm, speech feature enhancement.
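For readers unfamiliar with the distortion model, a standard way to write the phase-sensitive relationship in the log Mel power domain is sketched below (our notation; the paper's formulation may differ in detail):

```latex
% x, n, y: log Mel power spectra of clean speech, noise, and noisy speech;
% \alpha: cosine of the phase angle between the speech and noise spectra.
y = x + \log\!\left(1 + e^{\,n - x} + 2\alpha\, e^{\,(n - x)/2}\right)
```

Setting \alpha = 0 recovers the earlier phase-insensitive model; the MMSE estimator approximates E[x | y] via a single-point, second-order Taylor expansion of this nonlinearity.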
In this invited paper, my overview material on the same topic as presented in the plenary overview session of APSIPA-2011 and the tutorial material presented in the same conference [1] are expanded and updated to include more recent developments in deep learning. The previous and the updated materials cover both theory and applications, and analyze its future directions. The goal of this tutorial survey is to introduce the emerging area of deep learning or hierarchical learning to the APSIPA community. Deep learning refers to a class of machine learning techniques, developed largely since 2006, where many stages of non-linear information processing in hierarchical architectures are exploited for pattern classification and for feature learning. In the more recent literature, it is also connected to representation learning, which involves a hierarchy of features or concepts where higher-level concepts are defined from lower-level ones and where the same lower-level concepts help to define higher-level ones. In this tutorial survey, a brief history of deep learning research is discussed first. Then, a classificatory scheme is developed to analyze and summarize major work reported in the recent deep learning literature. Using this scheme, I provide a taxonomy-oriented survey on the existing deep architectures and algorithms in the literature, and categorize them into three classes: generative, discriminative, and hybrid. Three representative deep architectures, one from each of the three classes, are presented in more detail: deep autoencoders; deep stacking networks, with their generalization to the temporal domain (recurrent networks); and deep neural networks pretrained with deep belief networks. Next, selected applications of deep learning are reviewed in broad areas of signal and information processing including audio/speech, image/vision, multimodality, language modeling, natural language processing, and information retrieval. Finally, future directions of deep learning are discussed and analyzed.
Voice transformation (VT) aims to change one or more aspects of a speech signal while preserving linguistic information. Voice conversion (VC), a subset of VT, specifically aims to change a source speaker's speech in such a way that the generated output is perceived as a sentence uttered by a target speaker. Despite many years of research, VC systems still exhibit deficiencies in accurately mimicking a target speaker spectrally and prosodically, and simultaneously maintaining high speech quality. In this work we provide an overview of real-world applications, extensively study existing systems proposed in the literature, and discuss remaining challenges.
This review gives a general overview of techniques used in statistical parametric speech synthesis. One instance of these techniques, called hidden Markov model (HMM)-based speech synthesis, has recently been demonstrated to be very effective in synthesizing acceptable speech. This review also contrasts these techniques with the more conventional technique of unit-selection synthesis that has dominated speech synthesis over the last decade. The advantages and drawbacks of statistical parametric synthesis are highlighted and we identify where we expect key developments to appear in the immediate future.
In contrast to the conventional minimum mean square error (MMSE)-based noise reduction techniques, we propose a supervised method to enhance speech by means of finding a mapping function between noisy and clean speech signals based on deep neural networks (DNNs). In order to be able to handle a wide range of additive noises in real-world situations, a large training set that encompasses many possible combinations of speech and noise types is first designed. A DNN architecture is then employed as a nonlinear regression function to ensure a powerful modeling capability. Several techniques have also been proposed to improve the DNN-based speech enhancement system, including global variance equalization to alleviate the over-smoothing problem of the regression model, and the dropout and noise-aware training strategies to further improve the generalization capability of DNNs to unseen noise conditions. Experimental results demonstrate that the proposed framework can achieve significant improvements in both objective and subjective measures over the conventional MMSE-based technique. It is also interesting to observe that the proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general. Furthermore, the resulting DNN model, trained with artificially synthesized data, is also effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods. Index Terms: deep neural networks (DNNs), dropout, global variance equalization, noise-aware training, noise reduction, non-stationary noise, speech enhancement.
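A minimal sketch of this regression setup is given below, assuming log-power-spectrum features, an MLP with dropout, and a simple per-dimension global variance equalization; layer sizes and the GV step are our own illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch: an MLP with dropout regresses noisy log-power spectra onto
# clean targets; global variance equalization rescales outputs to counter
# over-smoothing. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class EnhancementDNN(nn.Module):
    def __init__(self, dim: int = 257, hidden: int = 2048, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, dim),            # linear output head for regression
        )

    def forward(self, noisy_logspec: torch.Tensor) -> torch.Tensor:
        return self.net(noisy_logspec)

def gv_equalize(est: torch.Tensor, clean_std: torch.Tensor) -> torch.Tensor:
    """Rescale each dimension so its variance matches clean-speech statistics."""
    mu = est.mean(dim=0, keepdim=True)
    return mu + (est - mu) * (clean_std / est.std(dim=0, keepdim=True).clamp_min(1e-8))

model = EnhancementDNN()
loss_fn = nn.MSELoss()                         # the MMSE-style regression loss
noisy = torch.randn(32, 257)                   # stand-ins for log-power frames
clean = torch.randn(32, 257)
loss = loss_fn(model(noisy), clean)
loss.backward()
```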
Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers situated in challenging noisy environments recorded using a 6-channel tablet-based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques and perform a number of new experiments in order to separately assess the impact of different noise environments, different numbers and positions of microphones, or simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects the ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.
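Since MVDR beamforming is singled out above as the exception, a minimal numpy sketch of the classic per-frequency MVDR weight computation may help; the diagonal-loading regularization and variable names are our own assumptions.

```python
# Minimal sketch of MVDR weights: minimize noise power subject to a
# distortionless response toward the steering vector d.
import numpy as np

def mvdr_weights(noise_cov: np.ndarray, d: np.ndarray, reg: float = 1e-6) -> np.ndarray:
    """w = R^{-1} d / (d^H R^{-1} d), with diagonal loading for stability."""
    R = noise_cov + reg * (np.trace(noise_cov).real / len(d)) * np.eye(len(d))
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# Applied independently per frequency bin: enhanced_bin = w.conj() @ mic_signals_bin
```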
Recent studies have shown that deep neural networks (DNNs) perform significantly better than shallow networks and Gaussian mixture models (GMMs) on large vocabulary speech recognition tasks. In this paper, we argue that the improved accuracy achieved by the DNNs is the result of their ability to extract discriminative internal representations that are robust to the many sources of variability in speech signals. We show that these representations become increasingly insensitive to small perturbations in the input with increasing network depth, which leads to better speech recognition performance with deeper networks. We also show that DNNs cannot extrapolate to test samples that are substantially different from the training examples. If the training data are sufficiently representative, however, internal features learned by the DNN are relatively stable with respect to speaker differences, bandwidth differences, and environment distortion. This enables DNN-based recognizers to perform as well or better than state-of-the-art systems based on GMMs or shallow networks without the need for explicit model adaptation or feature normalization.
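The depth-wise insensitivity analysis described above can be sketched in a few lines: pass an input and a slightly perturbed copy through a deep network and track the relative change of each layer's activations, which the paper reports shrinks with depth. The network below is a stand-in, not the authors' model.

```python
# Minimal sketch: measure how much each hidden layer's activations change
# under a small input perturbation. The toy network is an assumption.
import torch
import torch.nn as nn

layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.Sigmoid()) for _ in range(6)]
)

x = torch.randn(128, 64)
x_pert = x + 0.01 * torch.randn_like(x)            # small input perturbation
h, h_pert = x, x_pert
with torch.no_grad():
    for i, layer in enumerate(layers):
        h, h_pert = layer(h), layer(h_pert)
        rel = ((h - h_pert).norm() / h.norm()).item()  # layer-wise relative change
        print(f"layer {i}: {rel:.4f}")
```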
This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech. The main advantage of this approach is its flexibility in changing speaker identities, emotions, and speaking styles. This paper also discusses the relation between the HMM-based approach and the more conventional unit-selection approach that has dominated over the last decades. Finally, advanced techniques for future developments are described. Index Terms: text-to-speech synthesis, hidden Markov model, HMM-based speech synthesis, statistical parametric speech synthesis, HTS.
This paper presents a deep speech enhancement method that exploits the high potential of residual connections in a wide neural network architecture, termed a wide residual network. It builds on one-dimensional convolutions computed along the time domain, which is a powerful way to process context-dependent representations such as sequences of speech features. We find the residual mechanism extremely useful for the enhancement task, since the signal always has a linear shortcut and the nonlinear path enhances it in several steps by adding or subtracting corrections. The enhancement capability of the proposal is assessed by objective quality metrics and by the performance of a speech recognition system. The evaluation is carried out within the framework of the REVERB Challenge dataset, including simulated and real samples of reverberated and noisy speech signals. The results show that the enhanced speech is successful both for the enhancement task, in terms of intelligibility, and for the speech recognition system. A DNN model trained with artificially synthesized reverberated data is able to handle far-field reverberated speech from real scenarios. Moreover, thanks to the residual connections, the method is able to enhance low-noise signals, which is usually a strong handicap for traditional enhancement methods.
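A minimal sketch of the wide residual building block described above, assuming our own channel and kernel sizes: 1-D convolutions run along time, and the identity shortcut lets the signal pass linearly while the nonlinear branch adds or subtracts a correction.

```python
# Minimal sketch of a wide residual block with 1-D convolutions over time.
# Channel count and kernel size are illustrative assumptions.
import torch
import torch.nn as nn

class WideResBlock1d(nn.Module):
    def __init__(self, channels: int = 256, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        self.branch = nn.Sequential(
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, feats, time)
        return x + self.branch(x)                         # linear shortcut + correction

block = WideResBlock1d()
y = block(torch.randn(4, 256, 100))   # 100 feature frames per utterance
```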
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium-vocabulary transaction processing tasks in relatively controlled environments. However, for ASR to approach human levels of performance and for speech to become a truly pervasive user interface, we need novel, nontraditional approaches that have the potential of yielding dramatic ASR improvements. Visual speech is one such source for making large improvements in high-noise environments, with the potential of channel and task independence. It is not affected by the acoustic environment and noise, and it possibly contains the greatest amount of complementary information to the acoustic signal. In this workshop, our goal was to advance the state of the art in ASR by demonstrating the use of visual information in addition to the traditional audio for large vocabulary continuous speech recognition (LVCSR). Starting with an appropriate audiovisual database, collected and provided by IBM, we demonstrated for the first time that LVCSR performance can be improved by the use of visual information in the clean-audio case. Specifically, by conducting audio lattice rescoring experiments, we showed a 7% relative word error rate (WER) reduction in that condition. Furthermore, for the harder problem of speech contaminated by "speech babble" noise at 10 dB SNR, we demonstrated that recognition performance can be improved by 27% relative WER reduction, compared to an equivalent audio-only recognizer matched to the noise environment. We believe that this paves the way to seriously address the challenge of speech recognition in high-noise environments and to potentially achieve human levels of performance. In this report, we detail a number of approaches and experiments conducted during the summer workshop in the areas of visual feature extraction, hidden Markov model based visual-only recognition, and audiovisual information fusion. The latter was our main concentration: in the workshop, a number of feature fusion as well as decision fusion techniques for audiovisual ASR were explored and compared.
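Of the fusion strategies compared in the workshop, decision fusion has a particularly compact core: stream log-likelihoods are combined with an exponent weight before rescoring. A minimal sketch follows; the weight value is an illustrative assumption, not the workshop's tuned setting.

```python
# Minimal sketch of audiovisual decision fusion via stream weighting.
import numpy as np

def fuse_log_likelihoods(ll_audio: np.ndarray, ll_visual: np.ndarray,
                         lam: float = 0.7) -> np.ndarray:
    """Weighted log-likelihood combination per hypothesis; lam near 1 trusts
    the audio stream, lam near 0 trusts the visual stream."""
    return lam * ll_audio + (1.0 - lam) * ll_visual
```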
Significant progress has been made in the field of presentation attack detection (PAD) for automatic speaker verification (ASV) over the past few years. This includes the development of new speech corpora and standard evaluation protocols, as well as advances in front-end feature extraction and back-end classifiers. The use of standard databases and evaluation protocols has, for the first time, enabled meaningful benchmarking of different PAD solutions. This chapter summarizes that progress, with a focus on research completed in the last three years. It outlines the outcomes and lessons learned from the two ASVspoof challenges, the first community-led benchmarking efforts. These show that ASV PAD remains an unsolved problem and that further attention is needed to develop generalized PAD solutions with the potential to detect diverse and previously unseen spoofing attacks.
A unified view of the area of sparse signal processing is presented in tutorial form by bringing together various fields in which the property of sparsity has been successfully exploited. For each of these fields, various algorithms and techniques, which have been developed to leverage sparsity, are described succinctly. The common potential benefits of significant reduction in sampling rate and processing manipulations through sparse signal processing are revealed. The key application domains of sparse signal processing are sampling, coding, spectral estimation, array processing, component analysis, and multipath channel estimation. In terms of the sampling process and reconstruction algorithms, linkages are made with random sampling, compressed sensing, and rate of innovation. The redundancy introduced by channel coding in finite and real Galois fields is then related to over-sampling with similar reconstruction algorithms. The error locator polynomial (ELP) and iterative methods are shown to work quite effectively for both sampling and coding applications. The methods of Prony, Pisarenko, and MUltiple SIgnal Classification (MUSIC) are next shown to be targeted at analyzing signals with sparse frequency domain representations. Specifically, the relations of the approach of Prony to an annihilating filter in rate of innovation and ELP in coding are emphasized; the Pisarenko and MUSIC methods are further improvements of the Prony method under noisy environments. The iterative methods developed for sampling and coding applications are shown to be powerful tools in spectral estimation. Such narrowband spectral estimation is then related to multi-source location and direction of arrival estimation in array processing. Sparsity in unobservable source signals is also shown to facilitate source separation in sparse component analysis; the algorithms developed in this area such as linear programming and matching pursuit are also widely used in compressed sensing. Finally, the multipath channel estimation problem is shown to have a sparse formulation; algorithms similar to sampling and coding are used to estimate typical multicarrier communication channels.
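Since matching pursuit is named above as a workhorse shared by sparse component analysis and compressed sensing, here is a minimal numpy sketch of the greedy algorithm: repeatedly select the dictionary atom most correlated with the residual and subtract its contribution.

```python
# Minimal sketch of matching pursuit over a unit-norm dictionary.
import numpy as np

def matching_pursuit(D: np.ndarray, y: np.ndarray, n_iters: int = 10):
    """D: (dim, atoms) with unit-norm columns; y: (dim,) signal to decompose."""
    coef = np.zeros(D.shape[1])
    residual = y.copy()
    for _ in range(n_iters):
        corr = D.T @ residual              # correlation with every atom
        k = np.argmax(np.abs(corr))        # best-matching atom
        coef[k] += corr[k]                 # accumulate its coefficient
        residual -= corr[k] * D[:, k]      # remove its contribution
    return coef, residual
```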
Although great progress has been made in automatic speech recognition (ASR), significant performance degradation still exists in noisy environments. Our previous work has demonstrated the superior noise robustness of very deep convolutional neural networks (VDCNN). Based on our work on very deep CNNs, this paper proposes a more advanced model referred to as the very deep convolutional residual network (VDCRN). This new model incorporates batch normalization and residual learning, showing more robustness than previous VDCNNs. Then, to alleviate the mismatch between the training and testing conditions, model adaptation and adaptive training are developed and compared for the new VDCRN. This work focuses on factor aware training (FAT) and cluster adaptive training (CAT). For factor aware training, a unified framework is explored. For cluster adaptive training, two schemes are first explored to construct the bases in the canonical model; furthermore a factorized version of CAT is designed to address multiple non-speech variabilities in one model. Finally, a complete multi-pass system is proposed to achieve the best system performance in the noisy scenarios. The proposed new approaches are evaluated on three different tasks: Aurora4 (simulated data with additive noise and channel distortion), CHiME4 (both simulated and real data with additive noise and reverberation) and the AMI meeting transcription task (real data with significant reverberation). The evaluation includes not only different noisy conditions, but also covers both simulated and real noisy data. The experiments show that the new VDCRN is more robust, and the adaptation on this model can further significantly reduce the word error rate. The proposed best architecture obtains consistent and very large improvements on all tasks compared to the baseline VDCNN or LSTM. Particularly, on Aurora4 a new milestone 5.67% WER is achieved by only improving acoustic modeling.
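The cluster adaptive training (CAT) scheme mentioned above has a compact core that can be sketched directly: the canonical model keeps several weight bases, and a per-condition interpolation vector combines them into one effective weight matrix. Shapes and names below are illustrative assumptions.

```python
# Minimal sketch of CAT-style interpolation of canonical-model bases.
import torch

def cat_interpolate(bases: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
    """bases: (K, out, in) cluster weight bases; lam: (K,) condition weights."""
    return torch.einsum("k,koi->oi", lam, bases)

bases = torch.randn(4, 512, 512)             # K = 4 canonical bases
lam = torch.softmax(torch.randn(4), dim=0)   # estimated per noise condition
W_eff = cat_interpolate(bases, lam)          # effective layer weights, (512, 512)
```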
Recently, a new acoustic model based on deep neural networks (DNN) has been introduced. While the DNN has generated significant improvements over GMM-based systems on several tasks, there has been no evaluation of the robustness of such systems to environmental distortion. In this paper, we investigate the noise robustness of DNN-based acoustic models and find that they can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation. This performance can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training. When combined with the recently proposed dropout training technique, a 7.5% relative improvement over the previously best published result on this task is achieved using only a single decoding pass and no additional decoding complexity compared to a standard DNN.
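A minimal sketch of the noise-aware input construction described above, assuming the noise estimate is the mean of the first few frames of each utterance (taken to be speech-free); the frame count is an illustrative choice.

```python
# Minimal sketch of noise-aware training inputs: append a per-utterance
# noise estimate to every frame's feature vector.
import numpy as np

def noise_aware_input(frames: np.ndarray, n_noise_frames: int = 10) -> np.ndarray:
    """frames: (T, D) log Mel features -> (T, 2D) with appended noise estimate."""
    noise_est = frames[:n_noise_frames].mean(axis=0)            # (D,)
    return np.hstack([frames, np.tile(noise_est, (len(frames), 1))])
```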