There have been a number of studies on extracting bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases, and triphone states in order to improve the performance of text-dependent speaker verification (TD-SV). However, success has been limited. A recent study [1] proposed a time-contrastive learning (TCL) concept to explore the non-stationarity of brain signals for classifying brain states. Speech signals exhibit similar non-stationarity, and TCL has the further advantage of requiring no labeled data. We therefore propose a TCL-based BN feature extraction method. The method uniformly partitions each utterance in the training dataset into a predefined number of multi-frame segments. Each segment in an utterance corresponds to one class, and the class labels are shared across utterances. DNNs are then trained to discriminate all speech frames among these classes so as to exploit the temporal structure of speech. Furthermore, we propose a segment-based unsupervised clustering algorithm to assign class labels to the segments. TD-SV experiments were conducted on the RedDots challenge database. The TCL-DNNs were trained using speech data of fixed pass-phrases that were excluded from the TD-SV evaluation set, so the learned features can be considered phrase-independent. We compare the performance of the proposed TCL bottleneck (BN) features with short-time cepstral features and with BN features extracted from DNNs discriminating speakers, pass-phrases, speakers+pass-phrases, as well as monophones whose labels and boundaries are generated by three different automatic speech recognition (ASR) systems. Experimental results show that the proposed TCL-BN outperforms cepstral features and speaker+pass-phrase discriminant BN features, and its performance is comparable to that of the ASR-derived BN features. Moreover, ....
translated by Google Translate
When immersed in a noisy environment, people tend to change their speaking style; this reflex is known as the Lombard effect. Current deep-learning-based speech enhancement systems usually do not account for this change in speaking style, as they are trained on neutral (non-Lombard) speech recorded under quiet conditions to which noise is artificially added. In this paper, we investigate the impact of the Lombard reflex on the performance of deep-learning-based audio-visual speech enhancement systems. The results show a performance gap of approximately 5 dB between systems trained on neutral speech and systems trained on Lombard speech. This highlights the benefit of accounting for the mismatch between neutral and Lombard speech when designing audio-visual speech enhancement systems.
Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from the talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e., the quantity to be estimated, and of the objective function that quantifies the quality of this estimate during training, are critical to performance. This work presents the first experimental study of a range of different targets and objective functions for training a deep-learning-based AV-SE system. The results show that methods estimating a mask directly perform best overall in terms of estimated speech quality and intelligibility, although models estimating the log-magnitude spectrum directly also perform well in terms of estimated speech quality.
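The abstract does not list the specific targets compared, but a standard mask target in supervised speech enhancement is the ideal ratio mask (IRM). A minimal numpy sketch of this family of targets (function names are illustrative, not from the paper):

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """Ideal ratio mask: per-bin ratio of speech to mixture energy.

    speech_mag, noise_mag: magnitude spectrograms of the same shape.
    beta: compression exponent (0.5 is a common choice).
    """
    return (speech_mag**2 / (speech_mag**2 + noise_mag**2 + 1e-10)) ** beta

def apply_mask(mixture_mag, mask):
    """Estimate the clean speech magnitude by point-wise masking."""
    return mixture_mag * mask
```

A network trained on this target predicts the mask from the noisy (and visual) input, and enhancement is performed by multiplying the mixture spectrogram by the predicted mask.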
Home entertainment systems feature a variety of usage scenarios with one or more simultaneous users, and the complexity of choosing media to consume has increased rapidly over the last decade. Users' decision processes are complex and highly influenced by contextual settings, but data supporting the development and evaluation of context-aware recommender systems are scarce. In this paper, we present a dataset of self-reported TV consumption enriched with contextual information about the viewing situation. We show how the choice of genre correlates with the number of present users and their attention levels. Furthermore, we evaluate the performance of different configurations for predicting the chosen genre given the contextual information, and compare the results with context-free predictions. The results show that including contextual features in the predictions yields significant improvements, with both temporal and social context contributing significantly.
In this paper, we propose novel strategies for neutral vector variable decorrelation. Two fundamental invertible transformations, namely the serial nonlinear transformation and the parallel nonlinear transformation, are proposed to carry out the decorrelation. For a neutral vector variable, which is not multivariate Gaussian distributed, conventional principal component analysis (PCA) cannot yield mutually independent scalar variables. With the two proposed transformations, a highly negatively correlated neutral vector can be transformed into a set of mutually independent scalar variables with the same degrees of freedom. We also evaluate the decorrelation performance for vectors generated from a single Dirichlet distribution and from a mixture of Dirichlet distributions. The mutual independence is verified with the distance correlation measurement. The advantages of the proposed decorrelation strategies are studied intensively and demonstrated with synthesized data and practical application evaluations.
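For a Dirichlet-distributed (hence neutral) vector, a serial nonlinear transformation of the kind described above takes the form v_1 = x_1, v_k = x_k / (1 - x_1 - ... - x_{k-1}), which maps the negatively correlated coordinates to mutually independent Beta variables. A minimal numpy sketch of this idea (the function name is ours, not the paper's):

```python
import numpy as np

def serial_transform(x):
    """Serial nonlinear transformation of a neutral vector.

    x: array of shape (n, K) whose rows sum to <= 1 (the first K
    coordinates of a (K+1)-part Dirichlet vector).  Returns v with
    v[:, 0] = x[:, 0] and v[:, k] = x[:, k] / (1 - x[:, 0] - ... - x[:, k-1]).
    For a Dirichlet source, the columns of v are independent Beta variables.
    """
    x = np.asarray(x, dtype=float)
    # remaining[:, k] = 1 - sum of the first k coordinates
    remaining = 1.0 - np.concatenate(
        [np.zeros((x.shape[0], 1)), np.cumsum(x[:, :-1], axis=1)], axis=1
    )
    return x / remaining
```

Sampling a Dirichlet vector and comparing the correlation of raw versus transformed coordinates illustrates the decorrelation effect empirically.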
In this paper, we present a time-contrastive learning (TCL) based bottleneck (BN) feature extraction method for speech signals with an application to text-dependent (TD) speaker verification (SV). It is well-known that speech signals exhibit quasi-stationary behavior in and only in a short interval, and the TCL method aims to exploit this temporal structure. More specifically, it trains deep neural networks (DNNs) to discriminate temporal events obtained by uniformly segmenting speech signals, in contrast to existing DNN based BN feature extraction methods that train DNNs using labeled data to discriminate speakers or pass-phrases or phones or a combination of them. In the context of speaker verification, speech data of fixed pass-phrases are used for TCL-BN training, while the pass-phrases used for TCL-BN training are excluded from being used for SV, so that the learned features can be considered generic. The method is evaluated on the RedDots Challenge 2016 database. Experimental results show that TCL-BN is superior to the existing speaker and pass-phrase discriminant BN features and the Mel-frequency cepstral coefficient feature for text-dependent speaker verification.
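The uniform segmentation that defines the TCL classes can be sketched as follows; this is a minimal illustration of the frame-labeling scheme only (the DNN training itself is omitted):

```python
import numpy as np

def tcl_labels(num_frames, num_segments):
    """Assign time-contrastive class labels to the frames of one utterance.

    The utterance is split uniformly into `num_segments` multi-frame
    segments; every frame in segment m receives class label m.  Labels
    are shared across utterances, so no manual annotation is needed.
    """
    # np.array_split distributes any remainder over the leading segments
    segments = np.array_split(np.arange(num_frames), num_segments)
    labels = np.empty(num_frames, dtype=int)
    for m, idx in enumerate(segments):
        labels[idx] = m
    return labels
```

A DNN classifier is then trained on (frame, label) pairs from all utterances, and the BN layer activations serve as the TCL-BN features.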
With the development of speech synthesis techniques, automatic speaker verification systems face the serious challenge of spoofing attacks. In order to improve the reliability of speaker verification systems, we develop a new filter bank based cepstral feature, deep neural network filter bank cepstral coefficients (DNN-FBCC), to distinguish between natural and spoofed speech. The deep neural network filter bank is automatically generated by training a filter bank neural network (FBNN) using natural and synthetic speech. By adding restrictions on the training rules, the learned weight matrix of FBNN is band-limited and sorted by frequency, similar to a conventional filter bank. Unlike a manually designed filter bank, the learned filter bank has different filter shapes in different channels, which can capture the differences between natural and synthetic speech more effectively. The experimental results on the ASVspoof 2015 database show that the Gaussian mixture model maximum-likelihood (GMM-ML) classifier trained on the new feature performs better than the state-of-the-art linear frequency cepstral coefficients (LFCC) based classifier, especially on detecting unknown attacks.
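The abstract does not give the exact training restrictions. One plausible way to keep a learnable filter-bank layer band-limited and sorted by frequency is to multiply its weight matrix by a fixed 0/1 band mask; the sketch below is a hypothetical illustration of that idea, not the paper's implementation:

```python
import numpy as np

def band_limit_mask(num_bins, num_filters, width):
    """0/1 mask that restricts each learnable filter to a frequency band.

    Filter k may only have non-zero weights in a window of about `width`
    bins on each side of its (frequency-sorted) centre, so the learned
    filter bank stays band-limited and ordered like a conventional one.
    """
    centres = np.linspace(0, num_bins - 1, num_filters)
    mask = np.zeros((num_bins, num_filters))
    for k, c in enumerate(centres):
        lo = max(0, int(c - width))
        hi = min(num_bins, int(c + width) + 1)
        mask[lo:hi, k] = 1.0
    return mask

# During training, the filter-bank layer weights would be multiplied
# element-wise by the mask, e.g. W_eff = W * band_limit_mask(257, 20, 16),
# before being applied to the power spectrum.
```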
Coresets are important tools to generate concise summaries of massive datasets for approximate analysis. A coreset is a small subset of points extracted from the original point set such that certain geometric properties are preserved with provable guarantees. This paper investigates the problem of maintaining a coreset to preserve the minimum enclosing ball (MEB) for a sliding window of points that are continuously updated in a data stream. Although the problem has been extensively studied in batch and append-only streaming settings, no efficient sliding-window solution is available yet. In this work, we first introduce an algorithm, called AOMEB, to build a coreset for MEB in an append-only stream. AOMEB improves the practical performance of the state-of-the-art algorithm while having the same approximation ratio. Furthermore, using AOMEB as a building block, we propose two novel algorithms, namely SWMEB and SWMEB+, to maintain coresets for MEB over the sliding window with constant approximation ratios. The proposed algorithms also support coresets for MEB in a reproducing kernel Hilbert space (RKHS). Finally, extensive experiments on real-world and synthetic datasets demonstrate that SWMEB and SWMEB+ achieve speedups of up to four orders of magnitude over the state-of-the-art batch algorithm while providing coresets for MEB with rather small errors compared to the optimal ones.
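A classic building block for MEB coresets in the non-streaming setting is the Badoiu-Clarkson iteration, which yields a (1+eps)-approximate MEB after O(1/eps^2) steps; the points it touches form a coreset whose size is independent of n and of the dimension. A minimal numpy sketch of this baseline (not the paper's AOMEB or SWMEB algorithms):

```python
import numpy as np

def meb_badoiu_clarkson(points, eps):
    """(1+eps)-approximate minimum enclosing ball via Badoiu-Clarkson.

    Repeatedly moves the centre a step toward the current farthest
    point.  Returns (center, radius, coreset_indices), where the
    radius is the max distance from the final centre to any point,
    so the returned ball always encloses the input.
    """
    pts = np.asarray(points, dtype=float)
    c = pts[0].copy()
    coreset = {0}
    iters = int(np.ceil(1.0 / eps**2))
    for t in range(1, iters + 1):
        far = int(np.argmax(np.linalg.norm(pts - c, axis=1)))
        coreset.add(far)
        c += (pts[far] - c) / (t + 1)   # step size 1/(t+1)
    radius = float(np.max(np.linalg.norm(pts - c, axis=1)))
    return c, radius, sorted(coreset)
```

The sliding-window algorithms in the paper address the additional difficulty that points also expire, which this batch iteration does not handle.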
Freed from the limitations of hand-crafted representations, supervised deep learning techniques have achieved great success in various fields. However, most previous image retargeting algorithms still adopt fixed design principles, such as using gradient maps or hand-crafted features to compute importance maps, which inevitably limits their generality. Deep learning techniques may help to address this issue, but a challenging problem is that a large-scale image retargeting dataset would be needed to train deep retargeting models, and constructing such a dataset requires enormous human effort. In this paper, we propose a novel deep cyclic image retargeting approach, called Cycle-IR, which for the first time performs image retargeting with a single deep model without relying on any explicit user annotations. Our idea is built on the reverse mapping from the retargeted image back to the given input image. If the retargeted image suffers from severe distortion or excessive loss of important visual information, the reverse mapping is unlikely to recover the input image. We constrain this forward-reverse consistency by introducing a cycle perception coherence loss. In addition, we propose a simple yet effective image retargeting network (IRNet) to implement the image retargeting process. Our IRNet contains a spatial and channel attention layer that can effectively discriminate the visually important regions of the input image, especially in cluttered images. Given an input image of arbitrary size and a desired aspect ratio, our Cycle-IR can directly produce a visually pleasing target image. Extensive experiments on the standard RetargetMe dataset show the superiority of our Cycle-IR. Moreover, our Cycle-IR outperforms the Multiop method and obtains the best results in a user study. The code is available at https://github.com/mintanwei/Cycle-IR.
Pre-training and fine-tuning, e.g., BERT (Devlin et al., 2018), have achieved great success in language understanding by transferring knowledge from rich-resource pre-training tasks to low/zero-resource downstream tasks. Inspired by the success of BERT, we propose MAsked Sequence to Sequence pre-training (MASS) for encoder-decoder based language generation. MASS adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence: its encoder takes a sentence with a randomly masked fragment (several consecutive tokens) as input, and its decoder tries to predict this masked fragment. In this way, MASS can jointly train the encoder and decoder to develop the capability of representation extraction and language modeling. By further fine-tuning on a variety of zero/low-resource language generation tasks, including neural machine translation, text summarization and conversational response generation (3 tasks and 8 datasets in total), MASS achieves significant improvements over baselines without pre-training or with other pre-training methods. Specifically, we achieve state-of-the-art accuracy (37.5 in terms of BLEU score) on unsupervised English-French translation, even beating the early attention-based supervised model (Bahdanau et al., 2015b).
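The masking step MASS applies to each training sentence can be sketched as follows; this is a token-level toy version (the real model operates on subword sequences and typically masks about half of each sentence):

```python
import random

MASK = "[M]"

def mass_mask(tokens, frac=0.5, seed=None):
    """Create a MASS-style training pair from one sentence.

    A contiguous fragment covering `frac` of the tokens is replaced by
    mask symbols in the encoder input; the decoder target is exactly
    that fragment.  Returns (encoder_input, decoder_target, start).
    """
    rng = random.Random(seed)
    n = len(tokens)
    k = max(1, int(n * frac))
    start = rng.randrange(0, n - k + 1)
    enc = tokens[:start] + [MASK] * k + tokens[start + k:]
    dec = tokens[start:start + k]
    return enc, dec, start
```

Training the decoder to emit only the masked fragment, conditioned on the encoder's view of the unmasked context, is what jointly exercises representation extraction (encoder) and language modeling (decoder).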