Automatically predicting the outcome of subjective listening tests is a challenging task. Ratings may vary from person to person even if preferences are consistent across listeners. While previous work has focused on predicting listeners' ratings (mean opinion scores) of individual stimuli, we focus on the simpler task of predicting subjective preference given two speech stimuli for the same text. We propose a model based on anti-symmetric twin neural networks, trained on pairs of waveforms and their corresponding preference scores. We explore both attention and recurrent neural networks to account for the fact that the stimuli in a pair are not time-aligned. To obtain a large training set, we convert listeners' ratings from MUSHRA tests into values that reflect how often one stimulus in the pair was rated higher than the other. Specifically, we evaluate data obtained from twelve MUSHRA evaluations conducted over five years, containing different TTS systems built on data from different speakers. Our results compare favourably to a state-of-the-art model trained to predict MOS scores.
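As a rough, hedged sketch of the two ideas above (not the authors' implementation): converting parallel MUSHRA ratings for a pair of stimuli into a preference frequency, and an anti-symmetric twin scorer whose predicted preference flips to its complement when the inputs are swapped. The mean-pooled frame features, layer sizes and binary-cross-entropy loss are illustrative assumptions; the paper itself works on waveform pairs with attention or recurrent encoders.

import torch
import torch.nn as nn

def preference_frequency(ratings_a, ratings_b):
    # Fraction of parallel ratings in which stimulus A was scored above
    # stimulus B on the same MUSHRA screen (ties count as 0.5).
    a = torch.as_tensor(ratings_a, dtype=torch.float32)
    b = torch.as_tensor(ratings_b, dtype=torch.float32)
    wins = (a > b).float() + 0.5 * (a == b).float()
    return wins.mean()

class TwinPreferenceModel(nn.Module):
    # Anti-symmetric twin network: p(A preferred over B) = sigmoid(f(A) - f(B)),
    # so swapping the pair maps the prediction p to 1 - p by construction.
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        # Shared encoder applied to both stimuli (here: mean-pooled frame features).
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def score(self, x):
        # x: (batch, frames, feat_dim) -> one scalar quality score per stimulus
        return self.encoder(x.mean(dim=1)).squeeze(-1)

    def forward(self, x_a, x_b):
        return torch.sigmoid(self.score(x_a) - self.score(x_b))

# Toy usage: predicted preference for random pairs, trained against the
# preference frequency derived from listener ratings.
model = TwinPreferenceModel()
p = model(torch.randn(4, 200, 80), torch.randn(4, 200, 80))
target = preference_frequency([78, 65, 90], [70, 71, 60]) * torch.ones(4)
loss = nn.functional.binary_cross_entropy(p, target)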
This paper reports on the second GENEA Challenge, a benchmark for data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, the differences in results are due solely to differences between methods, enabling direct comparisons between systems. This year's dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluation decouples human-likeness from gesture appropriateness, which has been a major challenge in the field. The evaluation results are a revolution, and a revelation: some synthetic conditions were rated as significantly more human-like than human motion capture. To the best of our knowledge, this has never been shown before on a high-fidelity avatar. On the other hand, all synthetic motion was found to be vastly less appropriate for the speech than the original motion-capture recordings. Additional material is available via the project website https://youngwoo-yoon.github.io/geneachallenge2022/
Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic, and the use of non-monotonic attention both increases training time and introduces "babbling" failure modes that are unacceptable in production. This paper demonstrates that the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing the attention in Tacotron 2 with an autoregressive left-right no-skip hidden Markov model defined by a neural network. The result is an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We discuss how to combine innovations from classical and contemporary TTS for best results. The final system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, while achieving comparable naturalness prior to the post-net. Unlike Tacotron 2, our system also allows easy control over speaking rate. Audio examples and code are available at https://shivammehta007.github.io/neural-hmm/
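For intuition about the alignment model (not the paper's code), the sketch below computes the exact sequence log-likelihood of a left-right no-skip HMM with the forward algorithm; in the proposed system the per-frame emission and stay/move probabilities would be produced autoregressively by a neural network, whereas fixed toy values are used here.

import numpy as np

def forward_loglik(log_emit, log_stay):
    # Exact log-likelihood of a left-right no-skip HMM.
    # log_emit: (T, N) log p(frame_t | state_n), e.g. from a neural network.
    # log_stay: (T, N) log-probability of remaining in the same state at step t;
    #           the only alternative is moving to the next state (no skips).
    # The alignment must start in state 0 and end in state N-1.
    T, N = log_emit.shape
    log_move = np.log1p(-np.exp(log_stay))          # log(1 - p_stay)
    alpha = np.full(N, -np.inf)
    alpha[0] = log_emit[0, 0]                       # start in the first state
    for t in range(1, T):
        stay = alpha + log_stay[t - 1]
        move = np.full(N, -np.inf)
        move[1:] = alpha[:-1] + log_move[t - 1, :-1]
        alpha = np.logaddexp(stay, move) + log_emit[t]
    return alpha[-1]                                # must end in the last state

# Toy example: 50 frames, 10 states, random emission scores, 0.7 stay probability.
rng = np.random.default_rng(0)
T, N = 50, 10
log_emit = np.log(rng.dirichlet(np.ones(N), size=T))
log_stay = np.full((T, N), np.log(0.7))
print(forward_loglik(log_emit, log_stay))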
Dance requires skilful composition of complex movements that follow the rhythmic, tonal and timbral features of music. Formally, generating dance to a piece of music can be expressed as the problem of modelling a high-dimensional continuous motion signal conditioned on an audio signal. In this work, we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow, conditioned on previous poses and the music context through a multimodal transformer encoder. Second, we introduce the currently largest 3D dance motion dataset, obtained with a variety of motion-capture technologies and featuring both professional and casual dancers. Using this dataset, we compare our new model against two baselines via objective metrics and a user study, and show that both the ability to model a probability distribution and the ability to attend over a large motion and music context are necessary to produce interesting, diverse and realistic dance that matches the music.
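A toy, hedged sketch of the autoregressive idea (not the proposed architecture): a single conditional affine transform of a standard normal sample, which is the simplest possible normalizing flow, with an MLP standing in for the multimodal transformer encoder that conditions on recent poses and music; all dimensions and context lengths are arbitrary assumptions.

import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    # The next pose is drawn by shifting and scaling a standard normal sample,
    # with shift and log-scale predicted from the recent pose and music context.
    # A real model would stack many flow layers and use a transformer encoder.
    def __init__(self, pose_dim=20, music_dim=8, ctx_len=4, hidden=64):
        super().__init__()
        in_dim = ctx_len * (pose_dim + music_dim)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * pose_dim),   # predicts shift and log-scale
        )
        self.pose_dim = pose_dim

    def sample_next(self, pose_ctx, music_ctx):
        # pose_ctx: (ctx_len, pose_dim), music_ctx: (ctx_len, music_dim)
        h = torch.cat([pose_ctx, music_ctx], dim=-1).flatten()
        shift, log_scale = self.net(h).chunk(2)
        z = torch.randn(self.pose_dim)
        return shift + log_scale.exp() * z

# Autoregressive generation: slide the pose context window along the music features.
model = ConditionalAffineFlow()
poses = [torch.zeros(20) for _ in range(4)]
music = torch.randn(200, 8)                 # one music feature frame per output pose
for t in range(4, 60):
    next_pose = model.sample_next(torch.stack(poses[-4:]), music[t - 4:t])
    poses.append(next_pose)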
Multimodal deep learning has been used to predict clinical endpoints and diagnoses from clinical routine data. However, these models suffer from scaling issues: they have to learn pairwise interactions between each piece of information in each data type, thereby escalating model complexity beyond manageable scales. This has so far precluded the widespread use of multimodal deep learning. Here, we present a new technical approach of "learnable synergies", in which the model only selects relevant interactions between data modalities and keeps an "internal memory" of relevant data. Our approach is easily scalable and naturally adapts to multimodal data inputs from clinical routine. We demonstrate this approach on three large multimodal datasets from radiology and ophthalmology and show that it outperforms state-of-the-art models in clinically relevant diagnosis tasks. Our new approach is transferable and will allow the application of multimodal deep learning to a broad set of clinically relevant problems.
Our goal with this survey is to provide an overview of state-of-the-art deep learning technologies for face generation and editing. We cover the latest popular architectures and discuss key ideas that make them work, such as inversion, latent representation, loss functions, training procedures, editing methods, and cross-domain style transfer. We particularly focus on GAN-based architectures that have culminated in the StyleGAN approaches, which allow generation of high-quality face images and offer rich interfaces for controllable semantics editing while preserving photo quality. We aim to provide an entry point for readers who have basic knowledge of deep learning and are looking for an accessible introduction and overview of the field.
The success of Deep Learning applications critically depends on the quality and scale of the underlying training data. Generative adversarial networks (GANs) can generate arbitrarily large datasets, but diversity and fidelity are limited, which has recently been addressed by denoising diffusion probabilistic models (DDPMs), whose superiority has been demonstrated on natural images. In this study, we propose Medfusion, a conditional latent DDPM for medical images. We compare our DDPM-based model against GAN-based models, which constitute the current state-of-the-art in the medical domain. Medfusion was trained and compared with (i) StyleGAN-3 on n=101,442 images from the AIROGS challenge dataset to generate fundoscopies with and without glaucoma, (ii) ProGAN on n=191,027 images from the CheXpert dataset to generate radiographs with and without cardiomegaly, and (iii) WGAN on n=19,557 images from the CRCMS dataset to generate histopathological images with and without microsatellite stability. In the AIROGS, CRCMS, and CheXpert datasets, Medfusion achieved lower (=better) FID than the GANs (11.63 versus 20.43, 30.03 versus 49.26, and 17.28 versus 84.31). Also, fidelity (precision) and diversity (recall) were higher (=better) for Medfusion in all three datasets. Our study shows that DDPMs are a superior alternative to GANs for image synthesis in the medical domain.
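The FID values quoted above compare feature statistics of real and generated images. A minimal re-implementation of the metric itself (not the authors' evaluation pipeline; random feature matrices stand in for Inception activations) could look as follows:

import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real, feats_fake):
    # FID between two sets of feature vectors (rows = images):
    # ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^{1/2})
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):      # numerical noise can give tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))

# Sanity check: two samples from the same distribution give an FID near zero.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(500, 64)), rng.normal(size=(500, 64))
print(frechet_inception_distance(a, b))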
Partitioning an image into superpixels based on the similarity of pixels with respect to features such as colour or spatial location can significantly reduce data complexity and improve subsequent image processing tasks. Initial algorithms for unsupervised superpixel generation solely relied on local cues without prioritizing significant edges over arbitrary ones. On the other hand, more recent methods based on unsupervised deep learning either fail to properly address the trade-off between superpixel edge adherence and compactness or lack control over the generated number of superpixels. By using random images with strong spatial correlation as input, i.e., blurred noise images, in a non-convolutional image decoder, we can reduce the expected number of contrasts and enforce smooth, connected edges in the reconstructed image. We generate edge-sparse pixel embeddings by encoding additional spatial information into the piece-wise smooth activation maps from the decoder's last hidden layer and use a standard clustering algorithm to extract high quality superpixels. Our proposed method reaches state-of-the-art performance on the BSDS500, PASCAL-Context and a microscopy dataset.
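A heavily simplified sketch of two building blocks named above, under stated assumptions: the spatially correlated random input is approximated as Gaussian-blurred white noise, and k-means over per-pixel embeddings concatenated with weighted coordinates stands in for the "standard clustering algorithm". The trained non-convolutional decoder that produces the edge-sparse embeddings is omitted; blurred noise channels act as placeholder embeddings.

import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.cluster import KMeans

def blurred_noise(height, width, sigma=8.0, seed=0):
    # Random image with strong spatial correlation: Gaussian-blurred white noise.
    rng = np.random.default_rng(seed)
    return gaussian_filter(rng.normal(size=(height, width)), sigma=sigma)

def cluster_superpixels(pixel_embeddings, n_superpixels=200, spatial_weight=0.5, seed=0):
    # Assign each pixel to a superpixel by clustering its embedding concatenated
    # with (weighted) x/y coordinates; in the described method the embeddings
    # would come from the decoder's last hidden layer.
    h, w, d = pixel_embeddings.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys / h, xs / w], axis=-1) * spatial_weight
    feats = np.concatenate([pixel_embeddings, coords], axis=-1).reshape(-1, d + 2)
    labels = KMeans(n_clusters=n_superpixels, n_init=4, random_state=seed).fit_predict(feats)
    return labels.reshape(h, w)

# Toy run with 3-channel blurred noise standing in for learned pixel embeddings.
emb = np.stack([blurred_noise(64, 64, seed=s) for s in range(3)], axis=-1)
print(cluster_superpixels(emb, n_superpixels=50).shape)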
Recent advances in computer vision have shown promising results in image generation. Diffusion probabilistic models in particular have generated realistic images from textual input, as demonstrated by DALL-E 2, Imagen and Stable Diffusion. However, their use in medicine, where image data typically comprises three-dimensional volumes, has not been systematically evaluated. Synthetic images may play a crucial role in privacy-preserving artificial intelligence and can also be used to augment small datasets. Here we show that diffusion probabilistic models can synthesize high-quality medical imaging data, which we show for Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) images. We provide quantitative measurements of their performance through a reader study with two medical experts who rated the quality of the synthesized images in three categories: realistic image appearance, anatomical correctness and consistency between slices. Furthermore, we demonstrate that synthetic images can be used in self-supervised pre-training and improve the performance of breast segmentation models when data is scarce (Dice score 0.91 without vs. 0.95 with synthetic data).
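The segmentation comparison above is reported as a Dice score; for reference, a minimal Dice implementation over binary masks (illustrative, not the study's evaluation code) is:

import numpy as np

def dice_score(pred, target, eps=1e-8):
    # Dice coefficient between two binary masks: 2*|A intersect B| / (|A| + |B|).
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy example on two overlapping square masks.
a = np.zeros((64, 64), dtype=bool); a[10:40, 10:40] = True
b = np.zeros((64, 64), dtype=bool); b[15:45, 15:45] = True
print(dice_score(a, b))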
Reliable methods for automatic readability assessment have the potential to impact a variety of fields, ranging from machine translation to self-informed learning. Recently, large language models for the German language (such as GBERT and GPT-2-Wechsel) have become available, enabling the development of deep-learning-based approaches that promise further improvements in automatic readability assessment. In this contribution, we study the ability of ensembles of fine-tuned GBERT and GPT-2-Wechsel models to reliably predict the readability of German sentences. We combine these models with linguistic features and investigate the dependence of prediction performance on ensemble size and composition. Mixed ensembles of GBERT and GPT-2-Wechsel perform better than ensembles of the same size consisting only of GBERT or only of GPT-2-Wechsel models. Our models were evaluated in the GermEval 2022 Shared Task on Text Complexity Assessment on data of German sentences. On out-of-sample data, our best ensemble achieved a root mean squared error of 0.435.
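A schematic illustration of the ensembling and of the reported metric (the member predictions, simple averaging scheme and toy numbers below are assumptions for illustration; the actual ensembles combine fine-tuned GBERT and GPT-2-Wechsel regressors with linguistic features):

import numpy as np

def ensemble_predict(member_predictions):
    # Average the per-sentence readability predictions of all ensemble members.
    # member_predictions: (n_members, n_sentences) array-like.
    return np.asarray(member_predictions).mean(axis=0)

def rmse(predictions, targets):
    # Root mean squared error, the metric reported for GermEval 2022.
    predictions, targets = np.asarray(predictions), np.asarray(targets)
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))

# Toy example: three members (e.g. two GBERT-based and one GPT-2-Wechsel-based).
members = [[2.1, 3.4, 1.8], [2.3, 3.0, 2.0], [1.9, 3.6, 1.7]]
gold = [2.0, 3.5, 1.5]
print(rmse(ensemble_predict(members), gold))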