State-of-the-art speaker recognition systems comprise an x-vector (or i-vector) speaker embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) back-end. The effectiveness of these components relies on the availability of a large amount of labeled training data. In practice, the domain (e.g., language, demographics) in which a system is deployed differs from the domain in which it was trained. To close the gap caused by this domain mismatch, we propose an unsupervised PLDA adaptation algorithm that learns from a small amount of unlabeled in-domain data. The proposed method is inspired by prior work on a feature-based domain adaptation technique known as correlation alignment (CORAL). We refer to the model-based adaptation technique proposed in this paper as CORAL+. The efficacy of the proposed technique is experimentally validated on the recent NIST 2016 and 2018 Speaker Recognition Evaluation (SRE'16, SRE'18) datasets.
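The feature-based CORAL technique that inspired this work aligns second-order statistics between domains: source features are whitened with the source covariance and then re-colored with the target covariance. A minimal numpy sketch of that baseline (not the CORAL+ model-based variant, which adapts the PLDA parameters; the function names and the `eps` regularizer are our own):

```python
import numpy as np

def _mat_power(c, p):
    """Symmetric matrix power via eigendecomposition: V diag(w**p) V^T."""
    w, v = np.linalg.eigh(c)
    return (v * w ** p) @ v.T

def coral_transform(source, target, eps=1e-6):
    """Feature-based CORAL: whiten source-domain features with
    C_s^{-1/2}, then re-color with the target covariance C_t^{1/2},
    so the transformed source matches the target covariance.
    Classic CORAL aligns only second-order statistics; the mean
    is left at the source mean here."""
    d = source.shape[1]
    c_s = np.cov(source, rowvar=False) + eps * np.eye(d)
    c_t = np.cov(target, rowvar=False) + eps * np.eye(d)
    centered = source - source.mean(axis=0)
    return centered @ _mat_power(c_s, -0.5) @ _mat_power(c_t, 0.5) + source.mean(axis=0)
```

After the transform, the source data's covariance matches the target domain's, which is the property CORAL+ carries over into the PLDA model itself.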
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.
Traditional i-vector speaker recognition systems use a Gaussian mixture model (GMM) to collect sufficient statistics (SS). Recently, replacing this GMM with a deep neural network (DNN) has shown promising results. In this paper, we explore the use of DNNs to collect SS for the unsupervised domain adaptation task of the Domain Adaptation Challenge (DAC). We show that collecting SS with a DNN trained on out-of-domain data boosts the speaker recognition performance of an out-of-domain system by more than 25%. Moreover, we integrate the DNN in an unsupervised adaptation framework that uses agglomerative hierarchical clustering with a stopping criterion based on unsupervised calibration, and show that the initial gains of the out-of-domain system carry over to the final adapted system. Despite the fact that the DNN is trained on the out-of-domain data, the final adapted system produces a relative improvement of more than 30% with respect to the best published results on this task.
This paper presents an overview of a state-of-the-art text-independent speaker verification system. First, an introduction proposes a modular scheme of the training and test phases of a speaker verification system. Then, the most common speech parameterization used in speaker verification, namely, cepstral analysis, is detailed. Gaussian mixture modeling, which is the speaker modeling technique used in most systems, is then explained. A few speaker modeling alternatives, namely, neural networks and support vector machines, are mentioned. Normalization of scores is then explained, as this is a very important step to deal with real-world data. The evaluation of a speaker verification system is then detailed, and the detection error trade-off (DET) curve is explained. Several extensions of speaker verification are then enumerated, including speaker tracking and segmentation by speakers. Then, some applications of speaker verification are proposed, including on-site applications, remote applications, applications relative to structuring audio information, and games. Issues concerning the forensic area are then recalled, as we believe it is very important to inform people about the actual performance and limitations of speaker verification systems. This paper concludes by giving a few research trends in speaker verification for the next couple of years.
This paper investigates replacing i-vectors for text-independent speaker verification with embeddings extracted from a feed-forward deep neural network. Long-term speaker characteristics are captured in the network by a temporal pooling layer that aggregates over the input speech. This enables the network to be trained to discriminate between speakers from variable-length speech segments. After training, utterances are mapped directly to fixed-dimensional speaker embeddings and pairs of embeddings are scored using a PLDA-based backend. We compare performance with a traditional i-vector baseline on NIST SRE 2010 and 2016. We find that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions. Moreover, the two representations are complementary, and their fusion improves on the baseline at all operating points. Similar systems have recently shown promising results when trained on very large proprietary datasets, but to the best of our knowledge, these are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
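The temporal pooling layer described above is what frees the embedding from the utterance length: frame-level activations are aggregated over time into a fixed-dimensional vector. A minimal sketch of the common mean-plus-standard-deviation statistics pooling (the exact pooling choice and the function name are our assumptions, not taken from this abstract):

```python
import numpy as np

def stats_pool(frames, eps=1e-9):
    """Aggregate a (T, D) matrix of frame-level network activations
    over time into a fixed 2*D utterance-level vector by
    concatenating the per-dimension mean and standard deviation.
    The result no longer depends on the utterance length T."""
    mu = frames.mean(axis=0)
    sigma = np.sqrt(frames.var(axis=0) + eps)  # eps keeps sqrt well-behaved at zero variance
    return np.concatenate([mu, sigma])
```

Because the output dimension is independent of T, the network can be trained on variable-length segments and still emit embeddings that a PLDA back-end can score uniformly.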
Significant progress has been made in the field of presentation attack detection (PAD) for automatic speaker verification (ASV) over the past few years. This includes the development of new speech corpora and standard evaluation protocols, as well as advances in front-end feature extraction and back-end classifiers. The use of standard databases and evaluation protocols has, for the first time, enabled meaningful benchmarking of different PAD solutions. This chapter summarizes that progress, with a focus on research completed in the last three years. It outlines the results and lessons learned from the two ASVspoof challenges, the first community-led benchmarking efforts. These show that ASV PAD remains an unsolved problem and that further attention is required to develop generalized PAD solutions with the potential to detect diverse and previously unseen spoofing attacks.
Speaker verification (SV) systems that use deep neural network embeddings, so-called x-vector systems, have become popular due to their superior performance compared with i-vector systems. Fusing these systems provides improved performance, benefiting from the discriminatively trained x-vectors and the generative i-vectors, which capture different speaker characteristics. In this paper, we propose a novel method to combine the complementary information of i-vectors and x-vectors, which we refer to as generative x-vectors. Generative x-vectors utilize a transformation model learned from the i-vector and x-vector representations of background data. Canonical correlation analysis is applied to derive this transformation model, which is subsequently used to transform the standard x-vectors of enrollment and test segments into the corresponding generative x-vectors. SV experiments conducted on the NIST SRE 2010 dataset demonstrate that systems using generative x-vectors provide considerably better performance than the baseline i-vector and x-vector systems. Furthermore, generative x-vectors outperform the fusion of i-vector and x-vector systems for long-duration utterances, while yielding comparable results for short-duration utterances.
In 2016, the National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE) to foster research in robust text-independent speaker recognition, as well as measure performance of current state-of-the-art systems. Compared to previous NIST SREs, SRE16 introduced several new aspects including: an entirely online evaluation platform, a fixed training data condition, more variability in test segment duration (uniformly distributed between 10s and 60s), the use of non-English (Cantonese, Cebuano, Mandarin and Tagalog) conversational telephone speech (CTS) collected outside North America, and providing labeled and unlabeled development (a.k.a. validation) sets for system hyperparameter tuning and adaptation. The introduction of the new non-English CTS data made SRE16 more challenging due to domain/channel and language mismatches as compared to previous SREs. A total of 66 research organizations from industry and academia registered for SRE16, out of which 43 teams submitted 121 valid system outputs that produced scores. This paper presents an overview of the evaluation and analysis of system performance over all primary evaluation conditions. Initial results indicate that effective use of the development data was essential for the top performing systems, and that domain/channel, language, and duration mismatch had an adverse impact on system performance.
While biometric authentication has advanced significantly in recent years, evidence shows the technology can be susceptible to malicious spoofing attacks. The research community has responded with dedicated countermeasures which aim to detect and deflect such attacks. Even if the literature shows that they can be effective, the problem is far from being solved; biometric systems remain vulnerable to spoofing. Despite a growing momentum to develop spoofing countermeasures for automatic speaker verification, now that the technology has matured sufficiently to support mass deployment in an array of diverse applications, greater effort will be needed in the future to ensure adequate protection against spoofing. This article provides a survey of past work and identifies priority research directions for the future. We summarise previous studies involving impersonation, replay, speech synthesis and voice conversion spoofing attacks and more recent efforts to develop dedicated countermeasures. The survey shows that future research should address the lack of standard datasets and the over-fitting of existing countermeasures to specific, known spoofing attacks.
Learning speaker-specific features is vital in many applications like speaker recognition, diarization and speech recognition. This paper provides a novel approach, we term Neural Predictive Coding (NPC), to learn speaker-specific characteristics in a completely unsupervised manner from large amounts of unlabeled training data that even contain many non-speech events and multi-speaker audio streams. The NPC framework exploits the proposed short-term active-speaker stationarity hypothesis which assumes two temporally-close short speech segments belong to the same speaker, and thus a common representation that can encode the commonalities of both the segments should capture the vocal characteristics of that speaker. We train a convolutional deep siamese network to produce "speaker embeddings" by learning to separate 'same' vs 'different' speaker pairs which are generated from unlabeled audio streams. Two sets of experiments are done in different scenarios to evaluate the strength of NPC embeddings and compare with state-of-the-art in-domain supervised methods. First, two speaker identification experiments with different context lengths are performed in a scenario with comparatively limited within-speaker channel variability. NPC embeddings are found to perform best in the short-duration experiment, and they provide complementary information to i-vectors in the full-utterance experiments. Second, a large scale speaker verification task having a wide range of within-speaker channel variability is adopted as an upper-bound experiment where comparisons are drawn with in-domain supervised methods.
In this paper, we use data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition. The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings that we call x-vectors. Prior studies have found that embeddings leverage large-scale training datasets better than i-vectors. However, it can be challenging to collect substantial quantities of labeled data for training. We use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness. The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese. We find that while augmentation is beneficial in the PLDA classifier, it is not helpful in the i-vector extractor. However, the x-vector DNN effectively exploits data augmentation, due to its supervised training. As a result, the x-vectors achieve superior performance on the evaluation datasets.
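The additive-noise half of such augmentation can be sketched as mixing a noise recording into an utterance at a chosen signal-to-noise ratio. This is an illustrative implementation under our own interface assumptions; the reverberation half (convolving with room impulse responses) is not shown:

```python
import numpy as np

def add_noise(speech, noise, snr_db, rng=None):
    """Mix a noise recording into an utterance at a target SNR in dB.
    Signals are 1-D float arrays; the noise is tiled to cover the
    utterance and a random window of it is used, so repeated calls
    yield different augmented copies of the same utterance."""
    if rng is None:
        rng = np.random.default_rng()
    # Tile the noise so it covers the utterance, then cut a random window.
    reps = int(np.ceil(len(speech) / len(noise)))
    tiled = np.tile(noise, reps)
    start = rng.integers(0, len(tiled) - len(speech) + 1)
    cut = tiled[start:start + len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(cut ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * cut
```

Each noisy copy counts as a new training example with the same speaker label, which is what lets the supervised x-vector DNN exploit the multiplied data.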
Speaker diarization is an important front-end for many speech technologies in the presence of multiple speakers, but current methods that employ i-vector clustering for short segments of speech are potentially too cumbersome and costly for the front-end role. In this work, we propose an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely. The proposed architecture simultaneously learns a fixed-dimensional embedding for acoustic segments of variable length and a scoring function for measuring the likelihood that the segments originated from the same or different speakers. Through tests on the CALLHOME conversational telephone speech corpus, we demonstrate that, in addition to streamlining the diarization architecture, the proposed system matches or exceeds the performance of state-of-the-art baselines. We also show that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.
This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a Support-Vector-Machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are: Within-Class Covariance Normalization (WCCN), Linear Discriminant Analysis (LDA), and Nuisance Attribute Projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an EER of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10sec-10sec condition compared to the classical joint factor analysis scoring.
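The second system's decision rule, cosine similarity between two total-variability vectors, is simple enough to state directly. A minimal sketch (in the paper the i-vectors are first channel-compensated, e.g. by LDA followed by WCCN, which is omitted here):

```python
import numpy as np

def cosine_score(w_enroll, w_test):
    """Cosine similarity between an enrollment i-vector and a test
    i-vector, used directly as the verification score and compared
    against a decision threshold."""
    return float(
        np.dot(w_enroll, w_test)
        / (np.linalg.norm(w_enroll) * np.linalg.norm(w_test))
    )
```

Because the score depends only on the angle between the vectors, it discards magnitude information, which the paper exploits as a cheap alternative to SVM or joint-factor-analysis scoring.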
In this work, we compare the performance of three modern speaker verification systems and non-expert human listeners in the presence of voice mimicry. Our goal is to gain insights on how vulnerable speaker verification systems are to mimicry attacks and compare this to the performance of human listeners. We study both a traditional Gaussian mixture model-universal background model (GMM-UBM) and an i-vector based classifier with cosine scoring and probabilistic linear discriminant analysis (PLDA) scoring. For the studied material in the Finnish language, the mimicry attack slightly decreased the equal error rate (EER) of the GMM-UBM from 10.83 to 10.31, while for the i-vector systems the EER increased from 6.80 to 13.76 and from 4.36 to 7.38. The performance of the human listening panel shows that imitated speech increases the difficulty of the speaker verification task. It is even more difficult to recognize a person who is intentionally concealing his or her identity. For Impersonator A, the average listener made 8 errors from 34 trials while the automatic systems had 6 errors in the same set. The average listener for Impersonator B made 7 errors from the 28 trials, while the automatic systems made 7 to 9 errors. A statistical analysis of the listener performance was also conducted. We found a statistically significant association, with p = 0.00019 and R² = 0.59, between listener accuracy and self-reported factors only when familiar voices were present in the test.
Reliability of Automatic Speaker Verification (ASV) systems has always been a concern in dealing with spoofing attacks. Among these attacks, the replay attack is the simplest and most easily accessible method. This paper describes a replay spoofing detection system applied to the ASVspoof 2017 corpus. To reach this goal, features such as Constant-Q Cepstral Coefficients (CQCC), Modified Group Delay (MGD), Mel Frequency Cepstral Coefficients (MFCC), Relative Spectral Perceptual Linear Predictive (RASTA-PLP) and Linear Prediction Cepstral Coefficients (LPCC), and different classifiers including Gaussian Mixture Models (GMM), Multi-Layer Perceptron (MLP), Support Vector Machine (SVM) and Linear Gaussian (LG) classifiers have been employed. We also used identity vector (i-vector) based utterance representation. Finally, scores of the different subsystems have been fused to construct the proposed system. The results show that the best performance is attained using this score-level fusion.
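The final score-level fusion step can be sketched as a weighted sum of per-trial subsystem scores. The uniform default weight below is our own assumption for illustration; in practice fusion weights are usually trained on a development set, e.g. with logistic regression:

```python
import numpy as np

def fuse_scores(subsystem_scores, weights=None):
    """Linear score-level fusion: combine the scores that several
    subsystems assign to the same trials into a single score per
    trial via a weighted sum. `subsystem_scores` has shape
    (n_systems, n_trials)."""
    s = np.asarray(subsystem_scores, dtype=float)
    if weights is None:
        weights = np.full(s.shape[0], 1.0 / s.shape[0])  # uniform fallback
    return np.asarray(weights, dtype=float) @ s
```

Fusing at the score level, rather than the feature level, lets each subsystem keep its own front-end (CQCC, MGD, MFCC, etc.) and classifier while still contributing to one decision per trial.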
Automatic speaker verification (ASV) is the process of recognizing speakers from their voices as a biometric. ASV systems exhibit considerable recognition performance given sufficient speech from matched conditions. A major challenge for ASV technology is improving recognition performance with short-duration speech segments. Under short-duration conditions, model parameters cannot be estimated properly due to insufficient speech information, which leads to poor recognition accuracy even with state-of-the-art i-vector-based ASV systems. We hypothesize that accounting for the quality of the estimates during the recognition process will help to improve ASV performance, and that this can serve as a quality measure during ASV system fusion. This paper investigates a new quality measure for the i-vector representation of a speech utterance, computed directly from the Baum-Welch statistics. The proposed metric is subsequently used as a quality measure during ASV system fusion. Through experiments conducted on the NIST SRE 2008 corpus, we demonstrate that incorporating the proposed quality metric yields considerable improvement in speaker verification performance. The results also indicate the potential of the proposed method in realistic scenarios with short test utterances.
In this paper, we explore the use of the factorized hierarchical variational autoencoder (FHVAE) model to learn an unsupervised latent representation for dialect identification (DID). The FHVAE can learn a latent space that separates the more static attributes of an utterance from the more dynamic ones by encoding them into two distinct sets of latent variables. Factors useful for dialect identification, such as phonetic or linguistic content, are encoded by the segmental latent variables, while irrelevant factors that are relatively constant within a sequence, such as channel or speaker information, are encoded by the sequential latent variables. This disentanglement property makes the segmental latent variables less susceptible to channel and speaker variation, reducing the degradation caused by channel-domain mismatch. We demonstrate that on a fully supervised DID task, an end-to-end model trained on features extracted from the FHVAE model achieves the best performance, compared with the same model trained on conventional acoustic features and an i-vector-based system. Moreover, we show that the proposed approach can leverage a large amount of unlabeled data for FHVAE training to learn domain-invariant features for DID, significantly improving performance in a low-resource condition where labels for in-domain data are not available.
End-to-end deep learning language or dialect identification systems operate on spectrograms or other acoustic features and directly generate identification scores for each class. An important issue for end-to-end systems is having some knowledge of the application domain, because the system can easily be fooled by conditions not seen during the training phase; such a scenario is often referred to as a domain-mismatch condition. In general, we assume there is sufficient variation in the training dataset to expose the system to multiple domains. In this work, we study how to best use a training dataset in order to have maximum effectiveness on unknown target domains. Our goal is to process the input without any knowledge of the target domain, while preserving robust performance on other domains. To accomplish this, we propose a domain-attentive fusion approach for end-to-end dialect/language identification systems. To support the experiments, we collect datasets from three different domains and create an experimental protocol for the domain-mismatch condition. The results of our proposed approach, tested on a variety of broadcast and YouTube data, show significant performance gains compared with conventional approaches, even without any prior target-domain information.
Most existing datasets for speaker identification contain samples obtained under quite constrained conditions, and are usually hand-annotated, hence limited in size. The goal of this paper is to generate a large scale text-independent speaker identification dataset collected 'in the wild'. We make two contributions. First, we propose a fully automated pipeline based on computer vision techniques to create the dataset from open-source media. Our pipeline involves obtaining videos from YouTube; performing active speaker verification using a two-stream synchronization Convolutional Neural Network (CNN), and confirming the identity of the speaker using CNN based facial recognition. We use this pipeline to curate VoxCeleb which contains hundreds of thousands of 'real world' utterances for over 1,000 celebrities. Our second contribution is to apply and compare various state of the art speaker identification techniques on our dataset to establish baseline performance. We show that a CNN based architecture obtains the best performance for both identification and verification.
Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be performed either through logical access, such as speech synthesis and voice conversion, or through physical access, such as replaying a pre-recorded utterance. Inspired by state-of-the-art x-vector-based speaker verification approaches, this paper proposes a deep neural network (DNN) architecture for spoof detection from both logical and physical access. An advantage of the x-vector approach over conventional DNN-based systems is that it can handle variable-length utterances during testing. The performance of the proposed x-vector systems and a baseline Gaussian mixture model (GMM) system is analyzed on the ASVspoof 2019 dataset. The proposed system surpasses the GMM system for physical access, whereas the GMM system detects logical access better. Compared with the GMM system, the proposed x-vector approach gives an average relative improvement of 14.64% for physical access. When combined with the decision-level feature switching (DLFS) paradigm, the best system in the proposed approach outperforms the best baseline systems, with relative improvements of 67.48% and 40.04% for logical and physical access, respectively, in terms of minimum tandem detection cost function (min t-DCF).