直接语音到语音翻译(S2ST)模型与传统级联系统可用的数据量相比,几乎没有平行的S2ST数据遇到数据稀缺问题,该数据包括自动语音识别(ASR),机器翻译(MT)和文本到语音(TTS)合成。在这项工作中,我们使用未标记的语音数据和数据扩展来探索自我监督的预训练,以解决此问题。我们利用了最近提出的语音到单位翻译(S2UT)框架,该框架将目标语音编码为离散表示形式,并转移前训练前和有效的部分填充技术,可很好地适用于语音到文本翻译(S2T)通过研究语音编码器和离散单位解码器预训练,S2UT域。我们在西班牙语 - 英语翻译上进行的实验表明,与多任务学习相比,自我监督的预训练始终如一地提高模型性能,平均为6.6-12.1 BLEU增长,并且可以与数据增强技术相结合,以应用MT来创建弱监督监督的培训数据。音频样本可在以下网址获得:https://facebookresearch.github.io/speech_translation/enhanced_direct_s2st_units/index.html。
translated by 谷歌翻译
气道分割对于胸部CT图像分析至关重要。但是,由于固有的复杂树状结构和气道分支的不平衡大小,这仍然是一项具有挑战性的任务。当前的深度学习方法着眼于模型结构设计,而培训策略和损失功能的潜力尚未得到充分探索。因此,我们提出了一个简单而有效的气道分割管道,该管道表示为Naviairway,它发现具有支气管敏感的损失功能和人类视觉启发的迭代训练策略,发现了更细的细支气管。实验结果表明,Naverway的表现优于现有方法,尤其是在识别高产生的细支气管和对新CT扫描的鲁棒性方面。此外,纳维亚威是一般的。它可以与不同的骨干模型结合使用,并显着提高其性能。此外,我们建议对基于深度学习的气道细分方法进行更全面,更公平的评估,以更全面,更公平地评估。 Naveraway可以生成用于导航支气管镜检查的气道路线图,并且在生物医学图像中细分精细和长管结构时,也可以应用于其他情况。该代码可在https://github.com/antonotnawang/naviairway上公开获得。
translated by 谷歌翻译
我们介绍了一种无线文字语音转换(S2ST)系统,可以将来自一种语言的语音转换为另一种语言,并且可以在不需要任何文本数据的情况下构建。与文献中的现有工作不同,我们解决了模拟多扬声器目标语音的挑战,并用现实世界的S2ST数据训练系统。我们方法的关键是一种自我监督的单位语音标准化技术,该标准化技术将预先训练的语音编码器具有来自多个扬声器的配对声音,以及单个参考扬声器,以减少由于复印件引起的变化,同时保留词汇内容。只有10分钟的语音标准化的配对数据,我们在培训\ vp〜s2st数据集上的S2ST模型时获得平均3.2 BLEU增益,而不是在未标准化的语音目标上培训的基线。我们还将自动开采的S2ST数据纳入并显示额外的2.0 BLEU增益。据我们所知,我们是第一个建立无线的S2ST技术,可以用真实世界的数据培训,并为多种语言配对工作。
translated by 谷歌翻译
我们提出了直接同时的语音转换(SIMUL-S2ST)模型,此外,翻译的产生与中间文本表示无关。我们的方法利用了最近与离散单位直接语音转换的最新进展,其中从模型中预测了一系列离散表示,而不是连续频谱图特征,而不是以无监督的方式学习,并直接传递给语音的声码器综合在一起。我们还介绍了变分单调的多口语注意力(V-MMA),以处理语音同声翻译中效率低效的政策学习的挑战。然后,同时策略在源语音特征和目标离散单元上运行。我们开展实证研究,比较级联和直接方法对Fisher西班牙语 - 英语和必需的英语西班牙语数据集。直接同步模型显示通过在翻译质量和延迟之间实现更好的权衡来优于级联模型。
translated by 谷歌翻译
Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model. To demonstrates its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.
translated by 谷歌翻译
Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8\% with a ViT-L model trained for 150 epochs.
translated by 谷歌翻译
Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with data corresponding to 118 individual speakers and 6 train splits per speaker of different sizes. Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient. We adapt a general-purpose training algorithm NetAug for ASR and create a novel Conformer variant called the DisConformer (Disentangled Conformer). This algorithm produces ASR models consisting of a frozen 'core' network for general-purpose use and several tunable 'augment' networks for speaker-specific tuning. Using such models, we propose a novel compute-efficient continual learning algorithm called DisentangledCL. Our experiments show that the DisConformer models significantly outperform baselines on general ASR i.e. LibriSpeech (15.58% rel. WER on test-other). On speaker-specific LibriContinual they significantly outperform trainable-parameter-matched baselines (by 20.65% rel. WER on test) and even match fully finetuned baselines in some settings.
translated by 谷歌翻译
尽管视听模型与仅限音频模型相比可以产生卓越的性能和鲁棒性,但由于缺乏标记和未标记的视听数据以及每种方式部署一个模型的成本,它们的开发和采用受到阻碍。在本文中,我们提出了U-Hubert,这是一个自制的预训练框架,可以通过统一的蒙版群集预测目标来利用多模式和单峰语音。通过在预训练期间利用模态辍学,我们证明了一个微调模型可以在PAR上取得比较的性能或比最先进的模态特异性模型更好。此外,我们仅在音频上进行微调的模型可以通过视听和视觉语音输入来表现良好,从而实现了零击的模态概括,以实现语音识别和扬声器验证。特别是,我们的单个模型在带有音频/视听/视觉输入的LRS3上产生1.2%/1.4%/27.2%的语音识别单词错误率。
translated by 谷歌翻译
端到端的口语理解(SLU)使用单个模型直接从音频中预测意图。它有望通过利用中间文本表示中丢失的声学信息来提高助手系统的性能,并防止自动语音识别(ASR)中的级联错误。此外,在部署助手系统时,拥有一个统一模型具有效率优势。但是,具有语义解析标签的公共音频数据集有限的数量阻碍了该领域的研究进展。在本文中,我们发布了以任务为导向的语义解析(Stop)数据集,该数据集是公开可用的最大,最复杂的SLU数据集。此外,我们定义了低资源拆分,以建立有限的标记数据时改善SLU的基准。此外,除了人类录制的音频外,我们还发布了TTS生成版本,以基于端到端SLU系统的低资源域适应性的性能。最初的实验表明,端到端SLU模型的性能比级联的同行差一些,我们希望这能鼓励未来的工作。
translated by 谷歌翻译
本文调查了视听扬声器表示的自我监督的预训练,其中显示了视觉流,显示说话者的口腔区域与语音一起用作输入。我们的研究重点是视听隐藏单元BERT(AV-HUBERT)方法,该方法是最近开发的通用音频语音训练前训练框架。我们进行了广泛的实验,以探测预训练和视觉方式的有效性。实验结果表明,AV-Hubert可以很好地概括与说话者相关的下游任务,从而使标签效率提高了大约10倍的仅10倍,仅音频和视听扬声器验证。我们还表明,结合视觉信息,甚至仅仅是唇部区域,都大大提高了性能和噪声稳健性,在清洁条件下将EER降低了38%,在嘈杂的条件下将EER降低了75%。
translated by 谷歌翻译