This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate on advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with a discussion of future directions.
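To make the vector-to-supervector idea concrete, here is a minimal sketch, assuming NumPy, a diagonal-covariance UBM, and illustrative array names (none of which come from the paper itself), of how the MAP-adapted component means of a GMM are stacked into a single high-dimensional supervector:

```python
import numpy as np

def gmm_mean_supervector(adapted_means, ubm_means, ubm_covars):
    """Stack the MAP-adapted GMM component means into one long
    'supervector'.  Normalizing by the UBM standard deviation is a
    common convention (e.g. for SVM kernels) but optional.

    adapted_means : (C, D) speaker-adapted component means
    ubm_means     : (C, D) UBM component means
    ubm_covars    : (C, D) diagonal UBM covariances
    """
    normalized = (adapted_means - ubm_means) / np.sqrt(ubm_covars)
    return normalized.reshape(-1)   # shape (C * D,)

# e.g. a 512-component GMM over 39-dim features gives a 19968-dim supervector
```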
New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustic distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid, consistent, and common mathematical foundation for noise-robust ASR, which is lacking at present. This article is intended to fill this gap and to provide a thorough overview of modern noise-robust techniques for ASR developed over the past 30 years. We emphasize methods that are proven to be successful and that are likely to sustain or expand their future applicability. We distill key insights from our comprehensive overview in this field and take a fresh look at a few old problems, which nevertheless are still highly relevant today. Specifically, we have analyzed and categorized a wide range of noise-robust techniques using five different criteria: 1) feature-domain vs. model-domain processing, 2) the use of prior knowledge about the acoustic environment distortion, 3) the use of explicit environment-distortion models, 4) deterministic vs. uncertainty processing, and 5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the testing stage. With this taxonomy-oriented review, we equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions in this field are also carefully analyzed.
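As a concrete instance of the first criterion above (feature-domain processing), the sketch below shows per-utterance cepstral mean and variance normalization, one of the oldest and simplest feature-domain robustness techniques; this is an illustrative example rather than a method singled out by the article:

```python
import numpy as np

def cepstral_mean_variance_normalization(cepstra):
    """Per-utterance CMVN: subtract the mean (stationary convolutive
    channel effects are additive in the cepstral domain, so this
    removes them) and scale to unit variance.

    cepstra : (T, D) array of cepstral feature vectors
    """
    mu = cepstra.mean(axis=0)
    sigma = cepstra.std(axis=0) + 1e-8   # guard against zero variance
    return (cepstra - mu) / sigma
```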
This paper presents an overview of a state-of-the-art text-independent speaker verification system. First, an introduction proposes a modular scheme of the training and test phases of a speaker verification system. Then, the speech parameterization most commonly used in speaker verification, namely cepstral analysis, is detailed. Gaussian mixture modeling, which is the speaker modeling technique used in most systems, is then explained. A few speaker modeling alternatives, namely neural networks and support vector machines, are mentioned. Normalization of scores is then explained, as this is a very important step to deal with real-world data. The evaluation of a speaker verification system is then detailed, and the detection error trade-off (DET) curve is explained. Several extensions of speaker verification are then enumerated, including speaker tracking and segmentation by speakers. Then, some applications of speaker verification are proposed, including on-site applications, remote applications, applications relative to structuring audio information, and games. Issues concerning the forensic area are then recalled, as we believe it is very important to inform people about the actual performance and limitations of speaker verification systems. This paper concludes by giving a few research trends in speaker verification for the next couple of years.
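For readers unfamiliar with cepstral analysis, the following is a minimal, self-contained sketch of mel-frequency cepstral coefficient (MFCC) extraction with NumPy/SciPy; the default parameter values are illustrative, not those of the reviewed system, and the input is assumed to be at least one frame long:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=0.025, frame_step=0.010,
         n_fft=512, n_filters=26, n_ceps=13):
    """Minimal MFCC pipeline: pre-emphasis, framing, Hamming window,
    power spectrum, mel filterbank, log compression, DCT."""
    signal = np.asarray(signal, dtype=float)
    # pre-emphasis boosts the high frequencies
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # split into overlapping frames and window them
    flen, fstep = int(sr * frame_len), int(sr * frame_step)
    n_frames = 1 + max(0, (len(signal) - flen) // fstep)
    frames = np.stack([signal[i * fstep: i * fstep + flen]
                       for i in range(n_frames)]) * np.hamming(flen)

    # power spectrum of each frame
    pspec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # triangular filters spaced uniformly on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # log filterbank energies, then DCT to decorrelate
    logmel = np.log(pspec @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```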
This paper introduces recent advances in speaker recognition technology. The first part discusses general topics and issues. The second part is devoted to a discussion of more specific topics of recent interest that have led to interesting new approaches and techniques. They include VQ- and ergodic-HMM-based text-independent recognition methods, a text-prompted recognition method, parameter/distance normalization and model adaptation techniques, and methods of updating models and a priori thresholds in speaker verification. Although many recent advances and successes have been achieved in speaker recognition, there are still many problems for which good solutions remain to be found. The last part of this paper describes 16 open questions about speaker recognition. The paper concludes with a short discussion assessing the current status and future possibilities.
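A minimal sketch of the VQ-based text-independent approach mentioned above, using SciPy's k-means (function names and the codebook size are illustrative): each speaker is represented by a codebook, and a test utterance is attributed to the speaker whose codebook yields the lowest average quantization distortion.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def train_codebook(features, size=64):
    """One VQ codebook per speaker: k-means centroids of the
    speaker's training vectors (features: (N, D) array)."""
    codebook, _ = kmeans2(features, size, minit='++')
    return codebook

def avg_distortion(features, codebook):
    """Average Euclidean distance of each vector to its nearest codeword."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(test_features, codebooks):
    """Closed-set VQ identification: minimum-distortion codebook wins.
    codebooks: dict mapping speaker id -> (size, D) codebook."""
    return min(codebooks,
               key=lambda spk: avg_distortion(test_features, codebooks[spk]))
```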
In this paper we describe the major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around the likelihood ratio test for verification, using simple but effective GMMs for likelihood functions, a universal background model (UBM) for alternative speaker representation, and a form of Bayesian adaptation to derive speaker models from the UBM. The development and use of a handset detector and score normalization to greatly improve verification performance are also described and discussed. Finally, representative performance benchmarks and system behavior experiments on NIST SRE corpora are presented.
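The core of the GMM-UBM recipe can be sketched in a few lines, here with scikit-learn's GaussianMixture standing in for the GMMs (the relevance factor and helper names are illustrative assumptions, not the paper's code): the verification score is the average per-frame log-likelihood ratio between the MAP-adapted target model and the UBM.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, features, relevance=16.0):
    """Relevance-MAP adaptation of the UBM means (means only, as in
    the classic recipe): new_mean = a * E[x] + (1 - a) * old_mean,
    with a = n / (n + relevance) computed from the soft counts n."""
    post = ubm.predict_proba(features)                      # (T, C)
    n = post.sum(axis=0)                                    # soft counts
    ex = (post.T @ features) / np.maximum(n[:, None], 1e-10)
    alpha = (n / (n + relevance))[:, None]
    return alpha * ex + (1.0 - alpha) * ubm.means_

def llr_score(features, target_gmm, ubm):
    """Average frame log-likelihood ratio; accept the claimed
    identity when the score exceeds a tuned threshold."""
    return np.mean(target_gmm.score_samples(features)
                   - ubm.score_samples(features))
```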
A tutorial on the design and development of automatic speaker-recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: to identify a particular person or to verify a person's claimed identity. Speech processing and the basic components of automatic speaker-recognition systems are shown and design tradeoffs are discussed. Then, a new automatic speaker-recognition system is given. This recognizer performs with 98.9% correct identification. Last, the performances of various systems are compared.
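The two operating modes reduce to two different decision rules, sketched below with a generic, hypothetical `score_fn` where higher means a better match (a sketch of the modes in general, not of the specific recognizer in the tutorial):

```python
def identify(test_features, models, score_fn):
    """Closed-set identification: return the enrolled speaker whose
    model best matches the test utterance."""
    return max(models, key=lambda spk: score_fn(test_features, models[spk]))

def verify(test_features, claimed_model, background_model, score_fn, threshold):
    """Verification: accept the claimed identity only if the score
    (relative to a background/impostor model) clears a threshold."""
    margin = (score_fn(test_features, claimed_model)
              - score_fn(test_features, background_model))
    return margin > threshold
```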
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium-vocabulary transaction processing tasks in relatively controlled environments. However, for ASR to approach human levels of performance and for speech to become a truly pervasive user interface, we need novel, nontraditional approaches that have the potential of yielding dramatic ASR improvements. Visual speech is one such source for making large improvements in high-noise environments, with the potential of channel and task independence. It is not affected by the acoustic environment and noise, and it possibly contains the greatest amount of complementary information to the acoustic signal. In this workshop, our goal was to advance the state of the art in ASR by demonstrating the use of visual information in addition to the traditional audio for large-vocabulary continuous speech recognition (LVCSR). Starting with an appropriate audiovisual database, collected and provided by IBM, we demonstrated for the first time that LVCSR performance can be improved by the use of visual information in the clean-audio case. Specifically, by conducting audio lattice rescoring experiments, we showed a 7% relative word error rate (WER) reduction in that condition. Furthermore, for the harder problem of speech contaminated by speech "babble" noise at 10 dB SNR, we demonstrated that recognition performance can be improved by 27% relative WER reduction, compared to an equivalent audio-only recognizer matched to the noise environment. We believe that this paves the way to seriously address the challenge of speech recognition in high-noise environments and to potentially achieve human levels of performance. In this report, we detail a number of approaches and experiments conducted during the summer workshop in the areas of visual feature extraction, hidden Markov model based visual-only recognition, and audiovisual information fusion. The latter was our main concentration: in the workshop, a number of feature fusion as well as decision fusion techniques for audiovisual ASR were explored and compared.
Voice transformation (VT) aims to change one or more aspects of a speech signal while preserving linguistic information. Voice conversion (VC), a subset of VT, specifically aims to change a source speaker's speech in such a way that the generated output is perceived as a sentence uttered by a target speaker. Despite many years of research, VC systems still exhibit deficiencies in accurately mimicking a target speaker spectrally and prosodically, and in simultaneously maintaining high speech quality. In this work we provide an overview of real-world applications, extensively study existing systems proposed in the literature, and discuss remaining challenges.
Significant progress has been made in the field of presentation attack detection (PAD) for automatic speaker verification (ASV) over the past few years. This includes the development of new speech corpora and standard evaluation protocols, as well as advances in front-end feature extraction and back-end classifiers. The use of standard databases and evaluation protocols has, for the first time, enabled meaningful benchmarking of different PAD solutions. This chapter summarizes that progress, with a focus on research completed in the past three years. It outlines the results and lessons learned from the two ASVspoof challenges, the first community-led benchmarking efforts. These show that ASV PAD remains an unsolved problem and that further attention is needed to develop generalized PAD solutions with the potential to detect diverse and previously unseen spoofing attacks.
Voice conversion, as considered in this paper, is defined as modifying the speech signal of one speaker (source speaker) so that it sounds as if it had been pronounced by a different speaker (target speaker). Our contribution includes the design of a new methodology for representing the relationship between two sets of spectral envelopes. The proposed method is based on the use of a Gaussian mixture model of the source speaker spectral envelopes. The conversion itself is represented by a continuous parametric function which takes into account the probabilistic classification provided by the mixture model. The parameters of the conversion function are estimated by least squares optimization on the training data. This conversion method is implemented in the context of the HNM (harmonic + noise model) system, which allows high-quality modifications of speech signals. Compared to earlier methods based on vector quantization, the proposed conversion scheme results in a much better match between the converted envelopes and the target envelopes. Evaluation by objective tests and formal listening tests shows that the proposed transform greatly improves the quality and naturalness of the converted speech signals compared with previously proposed conversion methods.
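The conversion function described above has the well-known form F(x) = Σ_i P(C_i|x)[ν_i + Γ_i Σ_i⁻¹(x − μ_i)], where μ_i, Σ_i come from the source GMM and ν_i, Γ_i are estimated by least squares on aligned training pairs. A minimal sketch of applying it, assuming ν and Γ have already been estimated and using scikit-learn's GaussianMixture (with full covariances) as the source model:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def convert(x_frames, gmm, nu, gamma):
    """Apply a GMM-regression conversion function:
        F(x) = sum_i P(C_i|x) [ nu_i + Gamma_i Sigma_i^{-1} (x - mu_i) ]

    x_frames : (T, D) source spectral envelope parameters
    gmm      : GaussianMixture fitted on source envelopes
               (covariance_type='full')
    nu       : (C, D) least-squares-estimated target mean terms
    gamma    : (C, D, D) least-squares-estimated transform matrices
    """
    post = gmm.predict_proba(x_frames)           # (T, C) = P(C_i | x_t)
    out = np.zeros_like(x_frames)
    for i in range(gmm.n_components):
        inv = np.linalg.inv(gmm.covariances_[i])
        # (x - mu_i) Sigma_i^{-1} Gamma_i^T, computed row-wise
        shift = (x_frames - gmm.means_[i]) @ inv @ gamma[i].T
        out += post[:, i:i + 1] * (nu[i] + shift)
    return out
```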
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of statistical techniques for automatically detecting and exploiting patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features. With large amounts of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is developing rapidly, with compelling results and significant promise for the future. We first introduce ML, then highlight ML developments in five acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, seismic exploration, and environmental sounds in everyday scenes.
Automatic Speech Recognition (ASR) has historically been a driving force behind many machine learning (ML) techniques, including the ubiquitously used hidden Markov model, discriminative learning, structured sequence learning, Bayesian learning, and adaptive learning. Moreover, ML can and occasionally does use ASR as a large-scale, realistic application to rigorously test the effectiveness of a given technique, and to inspire new problems arising from the inherently sequential and dynamic nature of speech. On the other hand, even though ASR is available commercially for some applications, it is largely an unsolved problem: for almost all applications, the performance of ASR is not on par with human performance. New insight from modern ML methodology shows great promise to advance the state-of-the-art in ASR technology. This article provides readers with an overview of modern ML techniques as utilized in current ASR research and systems, and as relevant to future ASR research. The intent is to foster further cross-pollination between the ML and ASR communities than has occurred in the past. The article is organized according to the major ML paradigms that are either popular already or have potential for making significant contributions to ASR technology. The paradigms presented and elaborated in this overview include: generative and discriminative learning; supervised, unsupervised, semi-supervised, and active learning; adaptive and multi-task learning; and Bayesian learning. These learning paradigms are motivated and discussed in the context of ASR technology and applications. We finally present and analyze recent developments in deep learning and learning with sparse representations, focusing on their direct relevance to advancing ASR technology.
Visual speech information from the speaker's mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability into the human computer interface. In this paper, we review the main components of audiovisual automatic speech recognition and present novel contributions in two main areas: first, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently, audiovisual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audiovisual speech asynchrony, and incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audiovisual speaker adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, although less so for visually challenging environments and large vocabulary tasks.
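As an illustration of decision fusion with modality reliability, the commonly used stream-exponent combination can be sketched as follows (the weight and function names are illustrative assumptions; in practice the audio weight is often lowered as the acoustic SNR drops):

```python
import numpy as np

def fuse_stream_scores(log_lik_audio, log_lik_video, lambda_audio):
    """Decision (late) fusion of audio and visual HMM scores with
    stream exponents:  log p = la * log p_a + (1 - la) * log p_v."""
    return lambda_audio * log_lik_audio + (1.0 - lambda_audio) * log_lik_video

def pick_hypothesis(audio_scores, video_scores, lambda_audio):
    """Rescore an n-best list of hypotheses (dicts mapping hypothesis
    -> stream log-likelihood) and keep the best fused one."""
    fused = {h: fuse_stream_scores(audio_scores[h], video_scores[h],
                                   lambda_audio)
             for h in audio_scores}
    return max(fused, key=fused.get)
```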
Although surveillance systems are becoming increasingly ubiquitous in our living environment, automated surveillance, currently based on video sensory modality and machine intelligence, most of the time lacks the robustness and reliability required in many real applications. To tackle this issue, audio sensory devices have been taken into account, both alone and in combination with video, giving birth, in the last decade, to a considerable amount of research. In this paper audio-based automated surveillance methods are organized into a comprehensive survey: a general taxonomy, inspired by the more widespread video surveillance field, is proposed in order to systematically describe the methods covering background subtraction, event classification, object tracking and situation analysis. For each of these tasks, all the significant works are reviewed, detailing their pros and cons and the context for which they have been proposed. Moreover, a specific section is devoted to audio features, discussing their expressiveness and their employment in the above described tasks. Differently from other surveys on audio processing and analysis, the present one is specifically targeted to automated surveillance, highlighting the target applications of each described method and providing the reader with tables and schemes useful for retrieving the most suitable algorithms for a specific requirement.
Eliminating the negative effects of non-stationary environmental noise is a fundamental research topic in automatic speech recognition and remains an important challenge. Data-driven supervised approaches, including those based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, given sufficient training, can alleviate the shortcomings of the unsupervised methods in a variety of real acoustic environments. In light of this, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutive degradation of speech, with the aim of providing guidance for those involved in developing environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front end and back end of speech recognition systems, as well as joint front-end and back-end training frameworks.
While biometric authentication has advanced significantly in recent years, evidence shows the technology can be susceptible to malicious spoofing attacks. The research community has responded with dedicated countermeasures which aim to detect and deflect such attacks. Even if the literature shows that they can be effective, the problem is far from being solved; biometric systems remain vulnerable to spoofing. Despite a growing momentum to develop spoofing countermeasures for automatic speaker verification, now that the technology has matured sufficiently to support mass deployment in an array of diverse applications, greater effort will be needed in the future to ensure adequate protection against spoofing. This article provides a survey of past work and identifies priority research directions for the future. We summarise previous studies involving impersonation, replay, speech synthesis and voice conversion spoofing attacks and more recent efforts to develop dedicated countermeasures. The survey shows that future research should address the lack of standard datasets and the over-fitting of existing countermeasures to specific, known spoofing attacks.
Since 2008, interview-style speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has lower signal-to-noise ratio, which necessitates robust voice activity detectors (VADs). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/non-speech segmentation in these files. To overcome these difficulties, this paper proposes using speech enhancement techniques as a pre-processing step for enhancing the reliability of energy-based and statistical-model-based VADs. A decision strategy is also proposed to overcome the undesirable effects caused by impulsive signals and sinusoidal background signals. The proposed VAD is compared with the ASR transcripts provided by NIST, the VAD in the ETSI-AMR Option 2 coder, a statistical-model (SM) based VAD, and a Gaussian mixture model (GMM) based VAD. Experimental results based on the NIST 2010 SRE dataset suggest that the proposed VAD outperforms these conventional ones whenever interview-style speech is involved. This study also demonstrates that (1) noise reduction is vital for energy-based VAD under low SNR; (2) the ASR transcripts and ETSI-AMR speech coder do not produce accurate speech and non-speech segmentations; and (3) spectral subtraction makes better use of background spectra than the likelihood-ratio tests in the SM-based VAD. The segmentation files produced by the proposed VAD can be found at http://bioinfo.eie.polyu.edu.hk/ssvad.
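A minimal sketch of the general idea of enhancement as a VAD pre-processing step (illustrative parameter values; not the exact system of the paper): spectral subtraction first, then an energy decision against the noise floor.

```python
import numpy as np

def spectral_subtraction(frames_mag, noise_mag, alpha=2.0, beta=0.01):
    """Subtract an (over-)estimate of the noise magnitude spectrum
    and floor the result, as in classic spectral subtraction.

    frames_mag : (T, F) magnitude spectra of the frames
    noise_mag  : (F,) estimated noise magnitude spectrum
    """
    clean = frames_mag - alpha * noise_mag
    return np.maximum(clean, beta * frames_mag)

def energy_vad(frames_mag, noise_mag, threshold_db=6.0):
    """Energy-based VAD on enhanced frames: a frame is speech when
    its energy exceeds the noise-floor energy by a dB margin."""
    enhanced = spectral_subtraction(frames_mag, noise_mag)
    frame_e = 10 * np.log10((enhanced ** 2).sum(axis=1) + 1e-10)
    noise_e = 10 * np.log10((noise_mag ** 2).sum() + 1e-10)
    return frame_e > noise_e + threshold_db   # boolean speech mask
```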
Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs. Probabilistic approaches can offer satisfactory solutions to source separation with a single channel, provided that the models of the sources accurately match the statistical properties of the mixed signals. However, it is not always possible to train such models. To overcome this problem, we propose to resort to an adaptation scheme for adjusting the source models with respect to the actual properties of the signals observed in the mix. In this paper, we introduce a general formalism for source model adaptation which is expressed in the framework of Bayesian models. Particular cases of the proposed approach are then investigated experimentally on the problem of separating voice from music in popular songs. The obtained results show that an adaptation scheme can consistently and significantly improve the separation performance in comparison with non-adapted models. Index terms: adaptive Wiener filtering, Bayesian model, expectation maximization (EM), Gaussian mixture model (GMM), maximum a posteriori (MAP), model adaptation, single-channel source separation, time-frequency masking.
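The adaptive Wiener filtering / time-frequency masking listed in the index terms boils down to a per-bin gain. A sketch, assuming the per-frame source power spectral densities have already been produced by the (adapted) source models, which is where the paper's contribution lies:

```python
import numpy as np

def wiener_mask(voice_psd, music_psd):
    """Wiener gain per time-frequency bin: ratio of the estimated
    source PSD to the estimated mixture PSD."""
    return voice_psd / (voice_psd + music_psd + 1e-10)

def separate(mixture_stft, voice_psd, music_psd):
    """Apply the mask to the complex mixture STFT to estimate the
    voice; the complementary mask estimates the music."""
    mask = wiener_mask(voice_psd, music_psd)
    return mask * mixture_stft, (1.0 - mask) * mixture_stft
```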
Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers situated in challenging noisy environments recorded using a 6-channel tablet-based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques and perform a number of new experiments in order to separately assess the impact of different noise environments, different numbers and positions of microphones, or simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects the ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.
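For reference, the MVDR beamformer singled out above computes, per frequency bin, the weights w = R_n⁻¹ d / (dᴴ R_n⁻¹ d), which minimize output noise power while passing the target direction undistorted. A minimal NumPy sketch (array shapes are illustrative):

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR weights for one frequency bin:
        w = R_n^{-1} d / (d^H R_n^{-1} d)

    noise_cov : (M, M) complex noise spatial covariance matrix
    steering  : (M,) complex steering vector toward the target
    """
    rinv_d = np.linalg.solve(noise_cov, steering)
    return rinv_d / (steering.conj() @ rinv_d)

def beamform(mics_stft, weights):
    """Apply the per-bin weights: mics_stft is (M, T) complex STFT
    values of the M microphones for that bin; returns (T,)."""
    return weights.conj() @ mics_stft
```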
This review gives a general overview of techniques used in statistical parametric speech synthesis. One instance of these techniques, called hidden Markov model (HMM)-based speech synthesis, has recently been demonstrated to be very effective in synthesizing acceptable speech. This review also contrasts these techniques with the more conventional technique of unit-selection synthesis that has dominated speech synthesis over the last decade. The advantages and drawbacks of statistical parametric synthesis are highlighted and we identify where we expect key developments to appear in the immediate future.