In this paper, we present a time-contrastive learning (TCL) based bottleneck (BN) feature extraction method for speech signals with an application to text-dependent (TD) speaker verification (SV). It is well known that speech signals exhibit quasi-stationary behavior only within short intervals, and the TCL method aims to exploit this temporal structure. More specifically, it trains deep neural networks (DNNs) to discriminate temporal events obtained by uniformly segmenting speech signals, in contrast to existing DNN-based BN feature extraction methods that train DNNs on labeled data to discriminate speakers, pass-phrases, phones, or a combination of them. In the context of speaker verification, speech data of fixed pass-phrases are used for TCL-BN training, and these pass-phrases are excluded from the SV trials, so the learned features can be considered generic. The method is evaluated on the RedDots Challenge 2016 database. Experimental results show that TCL-BN is superior to existing speaker- and pass-phrase-discriminant BN features and to the Mel-frequency cepstral coefficient (MFCC) feature for text-dependent speaker verification.
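The uniform-segmentation labelling at the heart of TCL can be sketched in a few lines (a minimal illustration; the segment count and granularity here are assumptions, not the paper's exact settings). Each frame of an utterance is tagged with the index of the uniform segment it falls in, and a DNN is then trained to predict these indices:

```python
import numpy as np

def tcl_labels(num_frames, num_segments):
    """Uniformly segment a frame sequence and assign each frame the
    index of its segment; these indices serve as the DNN's
    discrimination targets in time-contrastive learning."""
    # np.array_split handles the case where num_frames is not an
    # exact multiple of num_segments.
    return np.concatenate([
        np.full(len(chunk), seg_id)
        for seg_id, chunk in enumerate(
            np.array_split(np.arange(num_frames), num_segments))
    ])

labels = tcl_labels(num_frames=10, num_segments=3)
print(labels)  # [0 0 0 0 1 1 1 2 2 2]
```

No speaker, phone, or phrase labels appear anywhere in this labelling, which is what makes the resulting targets "unsupervised" in the sense the abstract describes.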
Recently, deep learning has been successfully used in speech recognition; however, it has not been carefully explored and widely accepted for speaker verification. To incorporate deep learning into speaker verification, this paper proposes novel approaches to extracting and using features from deep learning models for text-dependent speaker verification. In contrast to traditional short-term spectral features, such as MFCC or PLP, outputs from the hidden layers of various deep models are employed as deep features for text-dependent speaker verification. Four types of deep models are investigated: deep Restricted Boltzmann Machines, speech-discriminant deep neural networks (DNNs), speaker-discriminant DNNs, and multi-task joint-learned DNNs. Once deep features are extracted, they may be used within either the GMM-UBM framework or the identity vector (i-vector) framework. Joint linear discriminant analysis and probabilistic linear discriminant analysis are proposed as effective back-end classifiers for i-vector based deep features. These approaches were evaluated on the RSR2015 data corpus. Experiments showed that deep feature based methods obtain significant performance improvements over the traditional baselines, whether they are applied directly in the GMM-UBM system or used as identity vectors. The EER of the best system using the proposed identity vector is 0.10%, only one fifteenth of that of the GMM-UBM baseline.
The task of spoken pass-phrase verification is to decide whether a test utterance contains the same phrase as a given enrollment utterance. Among other applications, pass-phrase verification can complement an independent speaker verification subsystem in text-dependent speaker verification. It can also be used for liveness detection by verifying that the user is able to correctly respond to a randomly prompted phrase. In this paper, we build on our previous work on i-vector based text-dependent speaker verification, where we have shown that i-vectors extracted using phrase-specific hidden Markov models (HMMs) or deep neural network (DNN) based bottleneck (BN) features help to reject utterances with wrong pass-phrases. We apply the same i-vector extraction techniques to the standalone tasks of speaker-independent spoken pass-phrase classification and verification. Experiments on the RSR2015 and RedDots databases show that very simple scoring techniques applied to such i-vectors (e.g., cosine distance scoring) can provide results superior to those previously published on the same data.
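The cosine distance scoring mentioned above reduces to a dot product of length-normalized vectors (a generic illustration, not the authors' exact pipeline; in practice the score is compared against a threshold tuned on development data):

```python
import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    """Cosine similarity between an enrollment i-vector and a test
    i-vector; higher scores indicate a stronger match."""
    e = enroll_ivec / np.linalg.norm(enroll_ivec)
    t = test_ivec / np.linalg.norm(test_ivec)
    return float(np.dot(e, t))

# A trial is accepted when the score exceeds a tuned threshold.
score = cosine_score(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
```

Its appeal, as the abstract notes, is that no trained back-end model is needed at all: scoring is a single inner product.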
Text-independent speaker recognition using short utterances is a highly challenging task owing to the large intra-speaker variation and content mismatch between utterances. I-vector based systems have become the standard for speaker verification applications, but they are less effective with short utterances. This paper first compares two state-of-the-art universal background model training methods for i-vector modeling, using both full-length and short-utterance evaluation tasks: a Gaussian mixture model (GMM) based method and a deep neural network (DNN) based method. The results indicate that the i-vector/DNN system outperforms the i-vector/GMM system in various respects; however, the performance of both systems degrades significantly as the utterance duration decreases. To address this problem, we propose two nonlinear mapping methods that train DNN models to map the i-vectors extracted from short utterances to their corresponding long-utterance i-vectors. The mapped i-vectors can restore missing information and reduce the variance of the original short-utterance i-vectors. Both proposed methods model the joint representation of short- and long-utterance i-vectors using autoencoders. Experimental results on the NIST SRE 2010 dataset show that both methods provide significant improvements when a deep encoder with residual blocks is used and an additional phoneme vector is added, yielding up to a 28.43% relative improvement in equal error rate over the baseline system. When the best-validated models from SRE10 are further tested on the Speakers in the Wild dataset, the methods yield a 23.12% improvement under arbitrary-duration (1-5 s) short-utterance conditions.
In this paper we investigate the use of deep neural networks (DNNs) for a small footprint text-dependent speaker verification task. At the development stage, a DNN is trained to classify speakers at the frame level. During speaker enrollment, the trained DNN is used to extract speaker-specific features from the last hidden layer. The average of these speaker features, or d-vector, is taken as the speaker model. At the evaluation stage, a d-vector is extracted for each utterance and compared to the enrolled speaker model to make a verification decision. Experimental results show that the DNN-based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task. In addition, the DNN-based system is more robust to additive noise and outperforms the i-vector system at low false-rejection operating points. Finally, the combined system outperforms the i-vector system by 14% and 25% relative in equal error rate (EER) for clean and noisy conditions, respectively. Index Terms: deep neural networks, speaker verification.
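At its core, the d-vector construction described above is an average of frame-level activations followed by cosine scoring against the enrolled model; a minimal sketch (the threshold value here is purely illustrative):

```python
import numpy as np

def d_vector(frame_features):
    """Average frame-level hidden-layer activations (rows of a
    (T, D) matrix) into a single utterance-level d-vector."""
    return frame_features.mean(axis=0)

def verify(enrolled_model, test_d, threshold=0.5):
    """Cosine-score a test d-vector against the enrolled speaker
    model and apply a decision threshold."""
    score = np.dot(enrolled_model, test_d) / (
        np.linalg.norm(enrolled_model) * np.linalg.norm(test_d))
    return score >= threshold
```

During enrollment, the d-vectors of several enrollment utterances would themselves be averaged to form the speaker model the abstract refers to.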
A new type of end-to-end system for text-dependent speaker verification is presented in this paper. Previously, using phonetic/speaker-discriminative DNNs as feature extractors for speaker verification has shown promising results. The extracted frame-level (DNN bottleneck, posterior, or d-vector) features are equally weighted and aggregated to compute an utterance-level speaker representation (d-vector or i-vector). In this work we use speaker-discriminative CNNs to extract noise-robust frame-level features. These features are then combined to form an utterance-level speaker vector through an attention mechanism. The proposed attention model takes the speaker-discriminative information and the phonetic information to learn the weights. The whole system, including the CNN and the attention model, is jointly optimized using an end-to-end criterion. The training algorithm imitates exactly the evaluation process, directly mapping a test utterance and a few target speaker utterances into a single verification score. The algorithm can automatically select the most similar impostor for each target speaker to train the network. We demonstrate the effectiveness of the proposed end-to-end system on the Windows 10 "Hey Cortana" speaker verification task.
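The attention-based aggregation can be sketched as a softmax-weighted average of frame features (a simplified illustration; in the paper the attention scores are themselves produced by a learned model from speaker-discriminative and phonetic information, which is omitted here):

```python
import numpy as np

def attentive_pooling(frame_feats, att_scores):
    """Combine frame-level features (T, D) into one utterance-level
    vector using softmax-normalized attention weights, so informative
    frames contribute more than under uniform averaging."""
    w = np.exp(att_scores - att_scores.max())  # numerically stable softmax
    w = w / w.sum()
    return w @ frame_feats  # (T,) @ (T, D) -> (D,)
```

With all attention scores equal this degenerates exactly to the equal-weight averaging of the earlier d-vector/i-vector systems, which is the baseline behavior the attention model is meant to improve upon.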
In this paper, we propose an end-to-end short-utterance spoken language identification (SLD) approach based on long short-term memory (LSTM) neural networks, which is particularly suitable for SLD applications in intelligent vehicles. The features for LSTM learning are generated by a transfer learning approach: bottleneck features of a deep neural network (DNN) trained for acoustic-phonetic classification are used for LSTM training. To improve SLD accuracy on short utterances, a phase vocoder based time-scale modification (TSM) method is used to reduce and increase the speech rate of the test utterance. By concatenating the normal, rate-reduced, and rate-increased utterances, we can extend the length of the test utterance and thus improve the performance of the SLD system. Experimental results on the AP17-OLR database show that the proposed method can improve SLD performance, especially on short utterances with durations of 1 s and 3 s.
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.
In this study, we investigate an end-to-end text-independent speaker verification system. The architecture consists of a deep neural network that takes a variable length speech segment and maps it to a speaker embedding. The objective function separates same-speaker and different-speaker pairs, and is reused during verification. Similar systems have recently shown promise for text-dependent verification, but we believe that this is unexplored for the text-independent task. We show that given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error-rate (EER) and at low miss rates. Relative to the baseline, the end-to-end system reduces EER by 13% average and 29% pooled across test conditions. The fused system achieves a reduction of 32% average and 38% pooled.
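Since equal error rate (EER) is the headline metric in this abstract and throughout this collection, a small sketch of how it can be computed from target and impostor trial scores may be useful (a generic implementation, not the authors' evaluation code):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Scan candidate thresholds for the operating point where the
    false-rejection rate (FRR) and false-acceptance rate (FAR) are
    closest, and return their average as the EER."""
    best_frr, best_far = 1.0, 0.0
    for th in np.sort(np.concatenate([target_scores, impostor_scores])):
        frr = np.mean(target_scores < th)     # targets wrongly rejected
        far = np.mean(impostor_scores >= th)  # impostors wrongly accepted
        if abs(frr - far) < abs(best_frr - best_far):
            best_frr, best_far = frr, far
    return (best_frr + best_far) / 2.0
```

Reported relative EER reductions, such as the 13% and 29% figures above, compare this quantity between two systems on the same trial list.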
The recent application of deep neural networks (DNN) to speaker identification (SID) has resulted in significant improvements over current state-of-the-art on telephone speech. In this work, we report the same achievement in DNN-based SID performance on microphone speech. We consider two approaches to DNN-based SID: one that uses the DNN to extract features, and another that uses the DNN during feature modeling. Modeling is conducted using the DNN/i-vector framework, in which the traditional universal background model is replaced with a DNN. The recently proposed use of bottleneck features extracted from a DNN is also evaluated. Systems are first compared with a conventional universal background model (UBM) Gaussian mixture model (GMM) i-vector system on the clean conditions of the NIST 2012 speaker recognition evaluation corpus, where a lack of robustness to microphone speech is found. Several methods of DNN feature processing are then applied to bring significantly greater robustness to microphone speech. To direct future research, the DNN-based systems are also evaluated in the context of audio degradations including noise and reverberation.
It is suggested that algorithms capable of estimating and characterizing accent knowledge would provide valuable information in the development of more effective speech systems such as speech recognition, speaker identification, audio stream tagging in spoken document retrieval, channel monitoring, or voice conversion. Accent knowledge could be used for selection of alternative pronunciations in a lexicon, engage adaptation for acoustic modeling, or provide information for biasing a language model in large vocabulary speech recognition. In this paper, we propose a text-independent automatic accent classification system using phone-based models. Algorithm formulation begins with a series of experiments focused on capturing the spectral evolution information as potential accent sensitive cues. Alternative subspace representations using principal component analysis and linear discriminant analysis with projected trajectories are considered. Finally, an experimental study is performed to compare the spectral trajectory model framework to a traditional hidden Markov model recognition framework using an accent sensitive word corpus. System evaluation is performed using a corpus representing five English speaker groups with native American English, and English spoken with Mandarin Chinese, French, Thai, and Turkish accents for both male and female speakers.
In acoustic modeling, speaker adaptive training (SAT) has been a long-standing technique for the traditional Gaussian mixture models (GMMs). Acoustic models trained with SAT become independent of training speakers and generalize better to unseen testing speakers. This paper ports the idea of SAT to deep neural networks (DNNs), and proposes a framework to perform feature-space SAT for DNNs. Using i-vectors as speaker representations, our framework learns an adaptation neural network to derive speaker-normalized features. Speaker adaptive models are obtained by fine-tuning DNNs in such a feature space. This framework can be applied to various feature types and network structures, posing a very general SAT solution. In this work, we fully investigate how to build SAT-DNN models effectively and efficiently. First, we study the optimal configurations of SAT-DNNs for large-scale acoustic modeling tasks. Then, after presenting detailed comparisons between SAT-DNNs and the existing DNN adaptation methods, we propose to combine SAT-DNNs and model-space DNN adaptation during decoding. Finally, to accelerate learning of SAT-DNNs, a simple yet effective strategy, frame skipping, is employed to reduce the size of training data. Our experiments show that compared with a strong DNN baseline, the SAT-DNN model achieves 13.5% and 17.5% relative improvement on word error rates (WERs), without and with model-space adaptation applied respectively. Data reduction based on frame skipping results in 2× speed-up for SAT-DNN training, while causing negligible WER loss on the testing data. Index Terms: deep neural networks, speaker adaptive training, acoustic modeling.
We propose a novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for automatic speech recognition (ASR). Specifically, the DNN replaces the standard Gaussian mixture model (GMM) to produce frame alignments. The use of an ASR-DNN system in the speaker recognition pipeline is attractive as it integrates the information from speech content directly into the statistics, allowing the standard backends to remain unchanged. The improvement from the proposed framework over a state-of-the-art system is 30% relative at the equal error rate when evaluated on the telephone conditions of the 2012 NIST speaker recognition evaluation (SRE). The proposed framework is a successful way to efficiently leverage transcribed data for speaker recognition, thus opening up a wide spectrum of research directions.
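The core change, swapping the GMM for a DNN when accumulating the Baum-Welch sufficient statistics, can be sketched as follows (the posterior and feature matrices in the usage are toy placeholders; in the real pipeline the posteriors are the ASR DNN's per-frame senone outputs):

```python
import numpy as np

def sufficient_stats(posteriors, features):
    """Zeroth- and first-order Baum-Welch statistics, where the
    per-frame component posteriors come from a DNN's senone outputs
    instead of a GMM.
    posteriors: (T, C) frame-level component/senone posteriors
    features:   (T, D) acoustic feature frames
    """
    n = posteriors.sum(axis=0)   # (C,)   zeroth-order counts
    f = posteriors.T @ features  # (C, D) first-order sums
    return n, f
```

Because only the alignment step changes, the downstream i-vector extractor and back-end consume these statistics exactly as before, which is the "backends remain unchanged" property the abstract highlights.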
Unsupervised subword modeling aims to learn low-level representations of speech audio in a "zero-resource" setting: that is, without using transcriptions or other resources from the target language (such as text corpora or pronunciation dictionaries). A good representation should capture phonetic content and abstract away from other types of variability, such as speaker differences and channel noise. Previous work in this area has primarily focused on learning from target-language data only and has been evaluated only intrinsically. Here we directly compare multiple methods, including some that use only target-language speech data and some that use transcribed speech from other (non-target) languages, and we evaluate using two intrinsic measures as well as on downstream unsupervised word segmentation and clustering tasks. We find that combining two existing target-language-only methods yields better features than either method alone. However, even better results are obtained by extracting target-language bottleneck features using a model trained on other languages. Cross-lingual training using just one other language is enough to provide this benefit, but multilingual training helps even more. In addition to these results, which hold across both the intrinsic measures and the extrinsic tasks, we discuss the qualitative differences between the different types of learned features.
In this paper, we present a convolutional neural network (CNN) based speaker recognition model for extracting robust speaker embeddings. Embeddings can be extracted efficiently through a linear activation in the embedding layer. To understand how the speaker recognition model operates on text-independent input, we modify the structure to extract frame-level speaker embeddings from the hidden layers. We feed utterances from the TIMIT dataset to the trained network and use several proxy tasks to study the network's ability to represent speech input and differentiate voice identity. We find that the network is better at discriminating broad phonetic classes than individual phonemes. In particular, frame-level embeddings belonging to the same phonetic class are similar (based on cosine distance) for the same speaker. Frame-level representations also allow us to analyze the network at the frame level, opening the door to further analyses that could improve speaker recognition.
This paper investigates replacing i-vectors for text-independent speaker verification with embeddings extracted from a feed-forward deep neural network. Long-term speaker characteristics are captured in the network by a temporal pooling layer that aggregates over the input speech. This enables the network to be trained to discriminate between speakers from variable-length speech segments. After training, utterances are mapped directly to fixed-dimensional speaker embeddings and pairs of embeddings are scored using a PLDA-based backend. We compare performance with a traditional i-vector baseline on NIST SRE 2010 and 2016. We find that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions. Moreover, the two representations are complementary, and their fusion improves on the baseline at all operating points. Similar systems have recently shown promising results when trained on very large proprietary datasets, but to the best of our knowledge, these are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
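A common realization of such a temporal pooling layer concatenates the per-dimension mean and standard deviation over frames (a sketch of statistics pooling; the abstract does not state whether this paper's layer uses the mean alone or mean plus standard deviation, so treat the exact form as an assumption):

```python
import numpy as np

def stats_pooling(frame_feats):
    """Map variable-length frame-level features (T, D) to a fixed
    2*D vector by concatenating the per-dimension mean and standard
    deviation over time."""
    return np.concatenate([frame_feats.mean(axis=0),
                           frame_feats.std(axis=0)])
```

Because the output dimension is independent of the number of frames T, the layers above the pooling operate on a fixed-size vector, which is what lets the network map variable-length segments to fixed-dimensional speaker embeddings.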
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
In this paper, we propose a new differentiable neural network alignment mechanism for text-dependent speaker verification, which uses an alignment model to produce a supervector representation of an utterance. Unlike previous work using similar approaches, we do not extract the utterance embedding by averaging over the time dimension. Since the phonetic information is part of the identity in this verification task, our system replaces the average with an alignment model to keep the temporal structure of each phrase, which is relevant in this application. Moreover, we can apply a convolutional neural network as the front-end, and, because the alignment process is differentiable, we can train the whole network to produce a supervector for each utterance that is discriminative with respect to the speaker and the phrase simultaneously. As we show, the advantage of this choice is that the supervector encodes both phrase and speaker information, providing good performance in text-dependent speaker verification tasks. In this work, the verification process is performed using a basic similarity metric, chosen for its simplicity compared to other, more elaborate models that are commonly used. The new model, which uses alignment to generate supervectors, was tested on the RSR2015-Part I database for text-dependent speaker verification, providing competitive results compared to similar-sized networks that use averaging to extract embeddings.
Recent studies have shown that speaker patterns can be learned from very short speech segments (e.g., 0.3 seconds) by a carefully designed convolutional & time-delay deep neural network (CT-DNN) model. By enforcing the model to discriminate the speakers in the training data, frame-level speaker features can be derived from the last hidden layer. In spite of its good performance, a potential problem of the present model is that it involves a parametric classifier, i.e., the last affine layer, which may consume some discriminative knowledge, thus leading to 'information leak' for the feature learning. This paper presents a full-info training approach that discards the parametric classifier and enforces all the discriminative knowledge to be learned by the feature net. Our experiments on the Fisher database demonstrate that this new training scheme can produce more coherent features, leading to consistent and notable performance improvement on the speaker verification task. Index Terms: speaker recognition, deep neural networks, speaker feature learning.
A major advantage of statistical parametric speech synthesis (SPSS) over unit-selection speech synthesis is its adaptability and controllability in changing speaker characteristics and speaking style. Recently, several studies using deep neural networks (DNNs) as acoustic models for SPSS have shown promising results. However, the adaptability of DNNs in SPSS has not been systematically studied. In this paper, we conduct an experimental analysis of speaker adaptation for DNN-based speech synthesis at different levels. In particular, we augment a low-dimensional speaker-specific vector with linguistic features as input to represent speaker identity, perform model adaptation to scale the hidden activation weights, and perform a feature space transformation at the output layer to modify generated acoustic features. We systematically analyse the performance of each individual adaptation technique and that of their combinations. Experimental results confirm the adaptability of the DNN, and listening tests demonstrate that the DNN can achieve significantly better adaptation performance than the hidden Markov model (HMM) baseline in terms of naturalness and speaker similarity.