Removing reverb from reverberant music is a necessary step in cleaning up audio for downstream music manipulation. Music reverberation falls into two categories: natural reverb and artificial reverb. Artificial reverb is more diverse than natural reverb because of its many parameter settings and reverberation types. However, recent supervised dereverberation methods may fail because, to generalize to unseen observations at inference time, they rely on sufficiently diverse and numerous pairs of reverberant observations and retrieved data for training. To resolve these problems, we propose an unsupervised method that can remove a general kind of artificial reverb from music without requiring paired training data. The proposed method is based on diffusion models: it initializes the unknown reverberation operator with a conventional signal processing technique and simultaneously refines the estimate with the help of diffusion models. We show through objective and perceptual evaluations that our method outperforms the current leading vocal dereverberation benchmarks.
translated by 谷歌翻译
Score-based generative models learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise. These perturbed data densities are tied together by the Fokker-Planck equation (FPE), a PDE governing the spatial-temporal evolution of a density undergoing a diffusion process. In this work, we derive a corresponding equation characterizing the noise-conditional scores of the perturbed data densities (i.e., their gradients), termed the score FPE. Surprisingly, despite their impressive empirical performance, we observe that scores learned via denoising score matching (DSM) do not satisfy the underlying score FPE. We mathematically analyze three implications of satisfying the score FPE and offer a potential explanation for why the score FPE is not satisfied in practice. Finally, we propose to regularize the DSM objective to enforce satisfaction of the score FPE, and show its effectiveness on synthetic data and MNIST.
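The relationship between a perturbed density and its score can be checked numerically in a toy case. The sketch below is an illustration under stated assumptions, not the paper's experiment: it assumes 1-D data distributed as N(0, s0²) and a driftless diffusion with unit diffusion coefficient, so the perturbed density is N(0, s0² + t) and the exact score is s(x, t) = -x / (s0² + t). Taking the gradient of the log-density form of the FPE in this setting gives ∂s/∂t = ∂/∂x (½ ∂s/∂x + ½ s²). All function names are hypothetical.

```python
def exact_score(x, t, s0_sq=1.0):
    """Score of N(0, s0_sq + t), the toy perturbed density (an assumption)."""
    return -x / (s0_sq + t)


def frozen_score(x, t):
    """A time-independent score that ignores the growing noise level."""
    return -x


def score_fpe_residual(score, x, t, h=1e-3):
    """Finite-difference residual of ds/dt = d/dx (0.5 * ds/dx + 0.5 * s**2)."""
    ds_dt = (score(x, t + h) - score(x, t - h)) / (2 * h)

    def inner(y):
        # Term inside the outer spatial derivative, evaluated at position y.
        ds_dx = (score(y + h, t) - score(y - h, t)) / (2 * h)
        return 0.5 * ds_dx + 0.5 * score(y, t) ** 2

    rhs = (inner(x + h) - inner(x - h)) / (2 * h)
    return ds_dt - rhs


# The exact score gives a residual close to zero; the frozen one does not.
print(score_fpe_residual(exact_score, 0.7, 0.5))
print(score_fpe_residual(frozen_score, 0.7, 0.5))
```

A residual of this kind, evaluated on a learned score network instead of a closed-form one, is one plausible way to quantify how far a model is from satisfying the equation; whether the paper's regularizer takes this form is not stated in the abstract.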
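One noted issue of the vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, a problem also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called the stochastically quantized variational autoencoder (SQ-VAE). In SQ-VAE, we observe a trend in which the quantization is stochastic at the initial stage of training but gradually converges toward a deterministic quantization, which we call self-annealing. Our experiments show that SQ-VAE improves codebook utilization without using common heuristics. Furthermore, we empirically show that SQ-VAE is superior to VAE and VQ-VAE in vision- and speech-related tasks.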
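To make the idea of stochastic quantization that becomes deterministic more concrete, here is a minimal sketch. It assumes, purely for illustration, that the probability of assigning an encoding `z` to a codebook entry is a softmax over negative squared distances scaled by a variance parameter; the abstract does not specify the actual distribution, and `quantization_probs` is a hypothetical name.

```python
import math


def quantization_probs(z, codebook, variance):
    """Categorical probabilities over codebook entries for encoding z.

    Assumed form: p(k) proportional to exp(-||z - b_k||^2 / (2 * variance)).
    """
    sq_dists = [sum((zi - bi) ** 2 for zi, bi in zip(z, b)) for b in codebook]
    logits = [-d / (2.0 * variance) for d in sq_dists]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


codebook = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]]
z = [0.9, 0.8]

# Large variance: assignment is close to uniform (highly stochastic).
print(quantization_probs(z, codebook, variance=100.0))
# Small variance: nearly all mass sits on the nearest entry (deterministic).
print(quantization_probs(z, codebook, variance=1e-3))
```

Under this assumed form, shrinking the variance during training moves the assignment from near-uniform sampling toward nearest-neighbor lookup, which is the qualitative behavior the abstract calls self-annealing.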
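Image generative models can learn the distributions of the training data and consequently generate examples by sampling from these distributions. However, when the training dataset is corrupted with outliers, generative models will likely produce examples that are also similar to the outliers. In fact, a small portion of outliers may induce state-of-the-art generative models, such as the vector-quantized variational autoencoder (VQ-VAE), to learn a significant mode from the outliers. To mitigate this problem, we propose a robust generative model based on VQ-VAE, which we name Robust VQ-VAE (RVQ-VAE). To achieve robustness, RVQ-VAE uses two separate codebooks for the inliers and the outliers. To ensure that the codebooks embed the correct components, we iteratively update the sets of inliers and outliers during each training epoch. To ensure that the encoded data points are matched to the correct codebooks, we quantize using a weighted Euclidean distance whose weights are determined by the directional variances of the codebooks. Both codebooks, together with the encoder and decoder, are trained jointly according to the reconstruction loss and the quantization loss. We experimentally demonstrate that RVQ-VAE is able to generate examples from the inliers even if a large portion of the training data points are corrupted.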
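The quantization step with two codebooks and a weighted Euclidean distance could look roughly like the sketch below. The weighting rule is an assumption made for illustration: each codebook weights a coordinate by the inverse of that codebook's variance along the coordinate (plus a small `eps`), so deviations along directions in which a codebook barely varies are penalized heavily. The abstract does not give the exact formula, and all names here are hypothetical.

```python
def axis_variances(codebook):
    """Per-coordinate variance of the vectors in one codebook."""
    n = len(codebook)
    dims = len(codebook[0])
    means = [sum(b[d] for b in codebook) / n for d in range(dims)]
    return [sum((b[d] - means[d]) ** 2 for b in codebook) / n for d in range(dims)]


def weighted_sq_distance(z, b, weights):
    """Squared Euclidean distance with one weight per coordinate."""
    return sum(w * (zi - bi) ** 2 for w, zi, bi in zip(weights, z, b))


def quantize(z, codebooks, eps=1e-2):
    """Return (codebook name, entry index) minimizing the weighted distance."""
    best = None
    for name, codebook in codebooks.items():
        # Assumed rule: weight = 1 / (variance along that axis + eps).
        weights = [1.0 / (v + eps) for v in axis_variances(codebook)]
        for idx, b in enumerate(codebook):
            d = weighted_sq_distance(z, b, weights)
            if best is None or d < best[0]:
                best = (d, name, idx)
    return best[1], best[2]


codebooks = {
    "inlier": [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],
    "outlier": [[0.0, 5.0], [0.0, 7.0]],
}
print(quantize([1.2, 0.1], codebooks))
print(quantize([0.1, 6.2], codebooks))
```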
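Variational autoencoders (VAEs) often suffer from posterior collapse, a phenomenon in which the learned latent space becomes uninformative. This is often related to a hyperparameter resembling the data variance. Furthermore, determining an appropriate choice of this hyperparameter becomes infeasible if the data variance is non-uniform or conditional. Therefore, we propose VAE extensions with generalized parameterizations of the data variance and incorporate maximum likelihood estimation into the objective function to adaptively regularize the decoder smoothness. Images generated by the proposed VAE extensions show improved Fréchet inception distance (FID) on the MNIST and CelebA datasets.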
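The role of the data variance can be illustrated with the simplest case, which is an assumption of this sketch rather than one of the paper's parameterizations: an isotropic Gaussian decoder with a single scalar variance. For a fixed reconstruction, the negative log-likelihood is minimized when the variance equals the mean squared error, so the variance can be estimated by maximum likelihood instead of being hand-tuned. Function names are hypothetical.

```python
import math


def gaussian_nll(x, x_hat, variance):
    """Negative log-likelihood of x under N(x_hat, variance * I)."""
    dim = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return 0.5 * dim * math.log(2 * math.pi * variance) + sq_err / (2 * variance)


def mle_variance(x, x_hat):
    """Closed-form maximizer of the likelihood: the mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


x = [1.0, 2.0, 3.0]
x_hat = [1.5, 1.5, 3.5]
best = mle_variance(x, x_hat)

# The NLL at the MLE variance is lower than at nearby values.
print(best, gaussian_nll(x, x_hat, best))
print(gaussian_nll(x, x_hat, 0.8 * best), gaussian_nll(x, x_hat, 1.2 * best))
```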
With the rapid development of big data, it has become easier than ever to learn the optimal decision rule by updating it recursively and making decisions online. We study the online statistical inference of model parameters in a contextual bandit framework of sequential decision-making. We propose a general framework for online and adaptive data collection environments that updates decision rules via weighted stochastic gradient descent. We allow different weighting schemes for the stochastic gradient and establish the asymptotic normality of the parameter estimator. Our proposed estimator significantly improves the asymptotic efficiency over the previous averaged SGD approach via inverse probability weights. We also conduct an optimality analysis on the weights in a linear regression setting. We provide a Bahadur representation of the proposed estimator and show that the remainder term in the Bahadur representation entails a slower convergence rate compared to classical SGD due to the adaptive data collection.
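A single weighted SGD update can be sketched as follows. This is a minimal illustration that assumes a linear model with squared loss and treats the weight as a free parameter; inverse probability weighting, where the weight is one over the probability with which the logged action was chosen, is shown as one possible scheme. The function name and signature are hypothetical and not taken from the paper.

```python
def weighted_sgd_step(theta, x, y, weight, lr):
    """One SGD step on 0.5 * (theta . x - y)^2, with the gradient scaled by weight."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [t - lr * weight * residual * xi for t, xi in zip(theta, x)]


theta = [0.0, 0.0]
x, y = [1.0, 2.0], 1.0
action_prob = 0.5  # probability that the policy picked the observed action

# Inverse probability weighting as one choice of weighting scheme.
theta = weighted_sgd_step(theta, x, y, weight=1.0 / action_prob, lr=0.1)
print(theta)
```

Keeping the weight as an argument mirrors the abstract's point that different weighting schemes are allowed; setting `weight=1.0` recovers a plain SGD step.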
Model counting is a fundamental problem which has been influential in many applications, from artificial intelligence to formal verification. Due to the intrinsic hardness of model counting, approximate techniques have been developed to solve real-world instances of model counting. This paper designs a new anytime approach called PartialKC for approximate model counting. The idea is a form of partial knowledge compilation to provide an unbiased estimate of the model count which can converge to the exact count. Our empirical analysis demonstrates that PartialKC achieves significant scalability and accuracy over prior state-of-the-art approximate counters, including satss and STS. Interestingly, the empirical results show that PartialKC reaches convergence for many instances and therefore provides exact model counting performance comparable to state-of-the-art exact counters.
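For readers unfamiliar with the problem, the sketch below shows what an unbiased, anytime estimate of a model count means in the simplest possible form. It is not PartialKC and uses no knowledge compilation: it is plain uniform sampling, swapped in only to illustrate the concept. The CNF encoding (signed integers for literals) and all function names are assumptions of this example.

```python
import itertools
import random


def satisfies(assignment, clauses):
    """CNF check: literal k > 0 means variable k is True, k < 0 means False."""
    return all(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def exact_count(n_vars, clauses):
    """Count satisfying assignments by enumerating all 2^n of them."""
    return sum(
        satisfies(a, clauses)
        for a in itertools.product([False, True], repeat=n_vars)
    )


def sampled_count(n_vars, clauses, n_samples, seed=0):
    """Unbiased estimate: 2^n times the fraction of random assignments that satisfy."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        assignment = [rng.random() < 0.5 for _ in range(n_vars)]
        hits += satisfies(assignment, clauses)
    return (2 ** n_vars) * hits / n_samples


# (x1 or x2) and (not x1 or x3)
clauses = [[1, 2], [-1, 3]]
print(exact_count(3, clauses))
print(sampled_count(3, clauses, n_samples=20000))
```

Naive sampling like this converges slowly when models are rare, which is one reason more structured approximate counters are of interest.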
Robots are traditionally bounded by a fixed embodiment during their operational lifetime, which limits their ability to adapt to their surroundings. Co-optimizing control and morphology of a robot, however, is often inefficient due to the complex interplay between the controller and morphology. In this paper, we propose a learning-based control method that can inherently take morphology into consideration such that once the control policy is trained in the simulator, it can be easily deployed to robots with different embodiments in the real world. In particular, we present the Embodiment-aware Transformer (EAT), an architecture that casts this control problem as conditional sequence modeling. EAT outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired robot embodiment, past states, and actions, our EAT model can generate future actions that best fit the current robot embodiment. Experimental results show that EAT can outperform all other alternatives in embodiment-varying tasks, and succeed in an example of real-world evolution tasks: stepping down a stair through updating the morphology alone. We hope that EAT will inspire a new push toward real-world evolution across many domains, where algorithms like EAT can blaze a trail by bridging the field of evolutionary robotics and big data sequence modeling.
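The causal masking that such a sequence model relies on can be sketched in a few lines. The example below computes attention weights in which position i may only attend to positions up to and including i; it is a generic illustration of causal masking, not EAT's architecture. How embodiment, state, and action tokens are ordered in the sequence is not specified in the abstract, and `causal_attention_weights` is a hypothetical name.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions j > i masked out."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # only the current and earlier positions
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights


# Equal raw scores for a length-3 sequence.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(scores):
    print(row)
```

Because each row places zero weight on later positions, the action predicted at a given step cannot depend on future states or actions, which is what makes autoregressive generation of actions possible.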
Persuasion modeling is a key building block for conversational agents. Existing works in this direction are limited to analyzing textual dialogue corpora. We argue that visual signals also play an important role in understanding human persuasive behaviors. In this paper, we introduce the first multimodal dataset for modeling persuasion behaviors. Our dataset includes 199 dialogue transcriptions and videos captured in a multi-player social deduction game setting, 26,647 utterance-level annotations of persuasion strategy, and game-level annotations of deduction game outcomes. We provide extensive experiments to show how dialogue context and visual signals benefit persuasion strategy prediction. We also explore the generalization ability of language models for persuasion modeling and the role of persuasion strategies in predicting social deduction game outcomes. Our dataset, code, and models can be found at https://persuasion-deductiongame.socialai-data.org.
Deep reinforcement learning has recently emerged as an appealing alternative for legged locomotion over multiple terrains by training a policy in physical simulation and then transferring it to the real world (i.e., sim-to-real transfer). Despite considerable progress, the capacity and scalability of traditional neural networks are still limited, which may hinder their application in more complex environments. In contrast, the Transformer architecture has shown its superiority in a wide range of large-scale sequence modeling tasks, including natural language processing and decision-making problems. In this paper, we propose Terrain Transformer (TERT), a high-capacity Transformer model for quadrupedal locomotion control on various terrains. Furthermore, to better leverage the Transformer in sim-to-real scenarios, we present a novel two-stage training framework consisting of an offline pretraining stage and an online correction stage, which can naturally integrate the Transformer with privileged training. Extensive experiments in simulation demonstrate that TERT outperforms state-of-the-art baselines on different terrains in terms of return, energy consumption, and control smoothness. In further real-world validation, TERT successfully traverses nine challenging terrains, including a sand pit and descending stairs, which strong baselines cannot accomplish.