With the rapid development of big data, it has become easier than before to learn the optimal decision rule by updating the decision rule recursively and making online decisions. We study the online statistical inference of model parameters in a contextual bandit framework of sequential decision-making. We propose a general framework for online and adaptive data collection environments that can update decision rules via weighted stochastic gradient descent. We allow different weighting schemes of the stochastic gradient and establish the asymptotic normality of the parameter estimator. Our proposed estimator significantly improves the asymptotic efficiency over the previous averaged SGD approach via inverse probability weights. We also conduct an optimality analysis on the weights in a linear regression setting. We provide a Bahadur representation of the proposed estimator and show that the remainder term in the Bahadur representation entails a slower convergence rate compared to classical SGD due to the adaptive data collection.
translated by Google Translate
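The weighted-SGD idea in the abstract above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's estimator: the two-arm linear reward model, the epsilon-greedy policy, the inverse-propensity weight, the step-size schedule, and every name below (`weighted_sgd_bandit`, `TRUE_THETA`, `epsilon`, `eta0`, `alpha`) are hypothetical choices made for illustration.

```python
import random

def weighted_sgd_bandit(true_theta, n_steps, rng, epsilon=0.5,
                        eta0=0.1, alpha=0.6, noise_sd=0.5):
    """Averaged, propensity-weighted SGD for per-arm linear reward models.

    Assumes two-dimensional contexts x = (1, s) with s drawn from {-1, +1}.
    """
    n_arms = len(true_theta)
    theta = [[0.0, 0.0] for _ in range(n_arms)]      # current SGD iterates
    theta_bar = [[0.0, 0.0] for _ in range(n_arms)]  # running averages
    for t in range(1, n_steps + 1):
        x = [1.0, rng.choice([-1.0, 1.0])]
        preds = [sum(tj * xj for tj, xj in zip(th, x)) for th in theta]
        # Epsilon-greedy policy: the action probabilities depend on the
        # current estimates, so the data collection is adaptive.
        greedy = max(range(n_arms), key=lambda a: preds[a])
        probs = [epsilon / n_arms] * n_arms
        probs[greedy] += 1.0 - epsilon
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = (sum(tj * xj for tj, xj in zip(true_theta[arm], x))
                  + rng.gauss(0.0, noise_sd))
        # Weight the squared-loss gradient by the inverse of the
        # probability with which the chosen arm was selected.
        weight = 1.0 / probs[arm]
        eta = eta0 * t ** (-alpha)
        resid = preds[arm] - reward
        theta[arm] = [tj - eta * weight * resid * xj
                      for tj, xj in zip(theta[arm], x)]
        # Update the running average of the iterates for every arm.
        for a in range(n_arms):
            theta_bar[a] = [bj + (tj - bj) / t
                            for bj, tj in zip(theta_bar[a], theta[a])]
    return theta_bar

TRUE_THETA = [[1.0, 0.5], [0.2, -0.8]]  # hypothetical ground truth
estimates = weighted_sgd_bandit(TRUE_THETA, 20000, random.Random(0))
```

Other weighting schemes can be tried by changing the single line that sets `weight`; the abstract states that the choice of weights affects asymptotic efficiency, which this sketch does not attempt to measure.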
Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets with multiple speakers (LibriTTS) and high sampling rate (VCTK). Experimental results show that in comparison with other speed-up methods of DDPMs: 1) ResGrad achieves better sample quality with the same inference speed measured by real-time factor; 2) with similar speech quality, ResGrad synthesizes speech faster than baseline methods by more than 10 times. Audio samples are available at https://resgrad1.github.io/.
Video super-resolution is one of the most popular tasks on mobile devices, being widely used for automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and poor power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and ask the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models were evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating rates of up to 500 FPS and a power consumption of 0.2 [Watt / 30 FPS]. A detailed description of all models developed in the challenge is provided in this paper.
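Tracking multiple athletes in sports videos is a very challenging multi-object tracking (MOT) task, because athletes often look alike and closely overlap one another, which turns the common occlusion problem into troublesome duplicate detection. In this paper, duplicate detection is newly and precisely defined as occlusion misreporting on the same athlete by multiple detection boxes in one frame. To address this problem, we carefully design a novel transformer-based Duplicate Detection Decontaminator (D$^3$) for training and a dedicated algorithm, Rally-Hungarian (RH), for matching. Once duplicate detection occurs, D$^3$ immediately modifies the procedure by generating enhanced box losses. RH, which is triggered by the substitution rules of team sports, is particularly well suited to sports videos. Moreover, to complement tracking datasets without shot changes, we release a new dataset based on sports videos, named RallyTrack. Extensive experiments on RallyTrack show that combining D$^3$ and RH dramatically improves tracking performance, by 9.2 in MOTA and 4.5 in HOTA. Meanwhile, experiments on the MOT series and DanceTrack show that D$^3$ can accelerate convergence during training, in particular saving up to 80% of the original training time on MOT17. Finally, our model, trained only on volleyball videos, can be applied directly to basketball and soccer videos for multi-athlete tracking (MAT), which demonstrates the advantage of our method. Our dataset is available at https://github.com/heruihr/rallytrack.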
Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head/ear related filtrations, which, however, are difficult to accurately simulate in traditional digital signal processing. In this paper, we formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part that is shared by the left and right channels as well as a specific part that differs in each channel. Accordingly, we propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively. Specifically, in the first stage, the common information of the binaural audio is generated with a single-channel diffusion model conditioned on the mono audio, based on which the binaural audio is generated by a two-channel diffusion model in the second stage. Combining this novel perspective of two-stage synthesis with advanced generative models (i.e., the diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples. Experimental results show that on a benchmark dataset, BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics (Wave L2: 0.128 vs. 0.157, MOS: 3.80 vs. 3.61). The generated audio samples (https://speechresearch.github.io/binauralgrad) and code (https://github.com/microsoft/NeuralSpeech/tree/master/BinauralGrad) are available online.
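The common/specific decomposition described above can be made concrete with a few lines of code. This is a minimal sketch that assumes the common part is the per-sample average of the two channels and the specific part is what remains in each channel; the abstract does not spell out the exact definition, and the function names are hypothetical. The diffusion models themselves are not represented here.

```python
def split_common_specific(left, right):
    """Split two channels into a shared part and per-channel remainders.

    Assumption: the common part is the per-sample mean of both channels.
    """
    common = [(l + r) / 2.0 for l, r in zip(left, right)]
    left_specific = [l - c for l, c in zip(left, common)]
    right_specific = [r - c for r, c in zip(right, common)]
    return common, left_specific, right_specific

def merge_common_specific(common, left_specific, right_specific):
    """Rebuild both channels from the common and specific parts."""
    left = [c + s for c, s in zip(common, left_specific)]
    right = [c + s for c, s in zip(common, right_specific)]
    return left, right

# Toy waveform samples (hypothetical values).
left_in = [0.5, -0.25, 0.125]
right_in = [0.25, 0.75, -0.5]
common, left_spec, right_spec = split_common_specific(left_in, right_in)
left_out, right_out = merge_common_specific(common, left_spec, right_spec)
```

Under this assumption the two specific parts are negatives of each other, so the second stage only has to model how the channels deviate from their shared content.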
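Occlusions are very common in face images in the wild and degrade the performance of face-related tasks. Although much effort has been devoted to removing occlusions from face images, the varying shapes and textures of occlusions still challenge the robustness of current methods. As a result, current methods either rely on manual occlusion masks or apply only to specific occlusions. This paper proposes a novel face de-occlusion model based on face segmentation and 3D face reconstruction, which automatically removes occlusions, even those with blurred boundaries, e.g., hair. The proposed model consists of a 3D face reconstruction module, a face segmentation module, and an image generation module. Given the face prior and the occlusion mask predicted by the first two modules, the image generation module can faithfully recover the missing facial textures. To supervise the training, we further build a large occlusion dataset with both manually labeled and synthetic occlusions. Qualitative and quantitative results demonstrate the effectiveness and robustness of the proposed method.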
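Text content created by humans or language models is often stolen or misused by adversaries. Tracing text provenance can help claim ownership of text content or identify malicious users who distribute misleading content such as machine-generated fake news. There have been some attempts to achieve this, mainly based on watermarking techniques. Specifically, traditional text watermarking methods embed watermarks by slightly altering the text format, such as line spacing and font, which, however, is fragile under cross-media transmission such as OCR. Considering this, natural language watermarking methods represent watermarks by replacing words in the original sentences with synonyms from handcrafted lexical resources (e.g., WordNet), but they do not consider the impact of the substitution on the meaning of the overall sentence. Recently, a transformer-based network was proposed to embed watermarks by modifying unobtrusive words (e.g., function words), which also impairs the logical and semantic coherence of the sentence. Moreover, a well-trained network fails on other types of text content. To address the above limitations, we propose a natural language watermarking scheme based on context-aware lexical substitution (LS). Specifically, we employ BERT to suggest LS candidates by inferring the semantic relatedness between the candidates and the original sentence. Based on this, a selection strategy in terms of synchronicity and substitutability is further designed to test whether a word is exactly suitable for carrying the watermark signal. Extensive experiments demonstrate that, under both objective and subjective metrics, our watermarking scheme preserves the semantic integrity of the original sentences well and has better transferability than existing methods. In addition, the proposed LS approach outperforms the state-of-the-art approach on the Stanford Word Substitution Benchmark.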
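This paper investigates the problem of online statistical inference of model parameters in stochastic optimization problems via the Kiefer-Wolfowitz algorithm with random search directions. We first present the asymptotic distribution of the Polyak-Ruppert-averaging type Kiefer-Wolfowitz (AKW) estimators, whose asymptotic covariance matrices depend on the function query complexity and the distribution of the search directions. The distributional result reflects the trade-off between statistical efficiency and function query complexity. We further analyze the choice of random search directions that minimizes the asymptotic covariance matrix, and conclude that the optimal search direction depends on the optimality criterion with respect to different summary statistics of the Fisher information matrix. Based on the asymptotic distribution result, we conduct one-pass statistical inference by providing two constructions of valid confidence intervals. We provide numerical experiments that verify our theoretical results and the practical effectiveness of the procedures.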
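To make the averaged Kiefer-Wolfowitz idea concrete, the sketch below estimates a two-dimensional mean using only noisy function values. It is an illustration under assumptions, not the paper's procedure: Gaussian search directions, a two-point forward difference evaluated on the same data point, the step-size and spacing schedules, and all names (`akw_estimate`, `squared_loss`, `sample_point`, `TRUE_MEAN`) are hypothetical choices. Confidence-interval construction is not shown.

```python
import random

def akw_estimate(loss, sample_data, x0, n_steps, rng,
                 eta0=0.2, alpha=0.6, h0=0.1, gamma=0.5):
    """Gradient-free stochastic approximation with iterate averaging."""
    d = len(x0)
    x = list(x0)
    x_bar = [0.0] * d
    for n in range(1, n_steps + 1):
        eta = eta0 * n ** (-alpha)  # decaying step size
        h = h0 * n ** (-gamma)      # decaying finite-difference spacing
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random search direction
        z = sample_data(rng)
        x_plus = [xj + h * vj for xj, vj in zip(x, v)]
        # Two function queries on the same data point estimate the
        # directional derivative along v.
        slope = (loss(x_plus, z) - loss(x, z)) / h
        x = [xj - eta * slope * vj for xj, vj in zip(x, v)]
        # Polyak-Ruppert style running average of the iterates.
        x_bar = [bj + (xj - bj) / n for bj, xj in zip(x_bar, x)]
    return x_bar

TRUE_MEAN = [1.0, -2.0]  # hypothetical parameter to recover

def squared_loss(x, z):
    return 0.5 * sum((xj - zj) ** 2 for xj, zj in zip(x, z))

def sample_point(rng):
    return [m + rng.gauss(0.0, 0.5) for m in TRUE_MEAN]

estimate = akw_estimate(squared_loss, sample_point, [0.0, 0.0],
                        20000, random.Random(0))
```

Each iteration uses two function evaluations; using more queries per step or a different direction distribution would change the variance of the estimate, which is the trade-off the abstract refers to.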
Federated learning has recently been applied to recommendation systems to protect user privacy. In federated learning settings, recommendation systems can train recommendation models by collecting only the intermediate parameters instead of the real user data, which greatly enhances user privacy. Besides, federated recommendation systems make it possible to collaborate with other data platforms to improve recommendation model performance while meeting regulation and privacy constraints. However, federated recommendation systems face many new challenges such as privacy, security, heterogeneity and communication costs. While significant research has been conducted in these areas, gaps in the survey literature still exist. In this survey, we (1) summarize some common privacy mechanisms used in federated recommendation systems and discuss the advantages and limitations of each mechanism; (2) review some robust aggregation strategies and several novel attacks against security; (3) summarize some approaches to address heterogeneity and communication cost problems; (4) introduce some open source platforms that can be used to build federated recommendation systems; (5) present some prospective research directions for the future. This survey can help researchers and practitioners understand the research progress in these areas.
Is it possible for a first-order method, i.e., one in which only first derivatives are allowed, to be quadratically convergent? For univariate loss functions, the answer is yes -- the Steffensen method avoids second derivatives and is still quadratically convergent like Newton's method. By incorporating an optimal step size we can even push its convergence order beyond quadratic to $1+\sqrt{2} \approx 2.414$. While such high convergence orders are pointless overkill for a deterministic algorithm, they become rewarding when the algorithm is randomized for problems of massive sizes, as randomization invariably compromises convergence speed. We introduce two adaptive learning rates inspired by the Steffensen method, intended for use in a stochastic optimization setting and requiring no hyperparameter tuning aside from batch size. Extensive experiments show that they compare favorably with several existing first-order methods. When restricted to a quadratic objective, our stochastic Steffensen methods reduce to the randomized Kaczmarz method -- note that this is not true for SGD or SLBFGS -- and thus we may also view our methods as a generalization of randomized Kaczmarz to arbitrary objectives.
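For a univariate loss $f$, the classical Steffensen iteration applied to $g = f'$ reads $x_{k+1} = x_k - g(x_k)^2 / (g(x_k + g(x_k)) - g(x_k))$, which uses two first-derivative evaluations per step and no second derivative. The sketch below shows this deterministic iteration only; it is not the paper's stochastic method or its optimal-step-size variant, and the function name and the test loss are hypothetical.

```python
import math

def steffensen_minimize(grad, x0, tol=1e-12, max_iter=50):
    """Find a stationary point of a univariate loss from its first derivative."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            break
        # grad(x + g) - grad(x) divided by g stands in for the second
        # derivative, so the step mimics a Newton step g / f''(x).
        denom = grad(x + g) - g
        if denom == 0.0:
            break  # degenerate slope estimate; stop instead of dividing by zero
        x -= g * g / denom
    return x

# Illustrative loss f(x) = exp(x) - 2x, with f'(x) = exp(x) - 2;
# its minimizer is x = ln 2.
x_min = steffensen_minimize(lambda x: math.exp(x) - 2.0, x0=1.0)
```

Like Newton's method, the iteration is only locally convergent, so the starting point should be reasonably close to the minimizer.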