Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler-and-efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from image to pixel level. Our investigation starts with a straightforward extension as our baseline that generates semantic masks by comparing the similarity between text and patch embeddings extracted from CLIP. However, such a paradigm could heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple-but-effective designs and figure out that they can significantly retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization ability. Incorporating those modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming the state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP achieves a speedup of about 5 times faster during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.
translated by 谷歌翻译
Temporal action localization (TAL) aims to detect the boundary and identify the class of each action instance in a long untrimmed video. Current approaches treat video frames homogeneously, and tend to give background and key objects excessive attention. This limits their sensitivity to localize action boundaries. To this end, we propose a prior-enhanced temporal action localization method (PETAL), which only takes in RGB input and incorporates action subjects as priors. This proposal leverages action subjects' information with a plug-and-play subject-aware spatial attention module (SA-SAM) to generate an aggregated and subject-prioritized representation. Experimental results on THUMOS-14 and ActivityNet-1.3 datasets demonstrate that the proposed PETAL achieves competitive performance using only RGB features, e.g., boosting mAP by 2.41% or 0.25% over the state-of-the-art approach that uses RGB features or with additional optical flow features on the THUMOS-14 dataset.
translated by 谷歌翻译
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
translated by 谷歌翻译
This study investigates clustered federated learning (FL), one of the formulations of FL with non-i.i.d. data, where the devices are partitioned into clusters and each cluster optimally fits its data with a localized model. We propose a novel clustered FL framework, which applies a nonconvex penalty to pairwise differences of parameters. This framework can automatically identify clusters without a priori knowledge of the number of clusters and the set of devices in each cluster. To implement the proposed framework, we develop a novel clustered FL method called FPFC. Advancing from the standard ADMM, our method is implemented in parallel, updates only a subset of devices at each communication round, and allows each participating device to perform a variable amount of work. This greatly reduces the communication cost while simultaneously preserving privacy, making it practical for FL. We also propose a new warmup strategy for hyperparameter tuning under FL settings and consider the asynchronous variant of FPFC (asyncFPFC). Theoretically, we provide convergence guarantees of FPFC for general nonconvex losses and establish the statistical convergence rate under a linear model with squared loss. Our extensive experiments demonstrate the advantages of FPFC over existing methods.
translated by 谷歌翻译
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose the SegVit. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Differently, we make use of the fundamental component -- attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegVit using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40\%$ computations while maintaining competitive performance.
translated by 谷歌翻译
本文介绍了蒙古人的高质量开源文本到语音(TTS)合成数据集,蒙古是一种低资源的语言,该语言是全球超过1000万人所讲的。该数据集名为MNTTS,由一位22岁专业女性蒙古播音员说的大约8个小时的录音录音组成。它是第一个开发的公开数据集,旨在促进学术界和行业中的蒙古TTS应用程序。在本文中,我们通过描述数据集开发程序并面临挑战来分享我们的经验。为了证明数据集的可靠性,我们建立了一个基于FastSpeech2模型和HIFI-GAN Vocoder的强大的非自动回调基线系统,并使用主观平均意见分数(MOS)和实时因素(RTF)指标对其进行了评估。评估结果表明,在我们的数据集上训练的功能强大的基线系统可在4和RTF上获得MOS,大约3.30美元\ times10^{ - 1} $,这使其适用于实际使用。数据集,培训配方和预估计的TTS模型是免费可用的\ footNote {\ label {github} \ url {https://github.com/walker.com/walker-hyf/mntts}}}。
translated by 谷歌翻译
预先训练的语言模型在多大程度上了解有关分发性现象的语义知识?在本文中,我们介绍了Distnli,这是一种新的自然语言推理诊断数据集,该数据集针对分布式引起的语义差异,并采用因果中介分析框架来量化模型行为并探索该语义相关任务中的基本机制。我们发现,模型的理解程度与模型大小和词汇大小有关。我们还提供有关模型如何编码这种高级语义知识的见解。
translated by 谷歌翻译
连续的关系提取(CRE)要求该模型不断从课堂收入数据流中学习新关系。在本文中,我们提出了一种令人沮丧的简单但有效的方法(FEA)方法,其中有两个学习阶段的CRE:1)快速适应(FA)仅使用新数据加热模型。 2)平衡调整(BT)列出平衡内存数据上的模型。尽管它很简单,但FEA与最先进的基线相比,FEA取得了可比性(在诱人或优越(在少数情况下)性能。通过仔细的检查,我们发现新关系之间的数据失衡会导致偏斜的决策边界在预计编码器上的头部分类器中,从而损害了整体性能。在FEA中,FA阶段释放了后续填充的内存数据的潜力,而BT阶段有助于建立更平衡的决策边界。通过统一的视图,我们,我们发现可以将两个强大的CRE基准列入提议的培训管道中。FEEA的成功还为CRE中的未来模型设计提供了可行的见解和建议。
translated by 谷歌翻译
在这项工作中,我们探讨了用于语义分割知识蒸馏的数据增强。为了避免过度适合教师网络中的噪音,大量培训示例对于知识蒸馏至关重要。 Imagelevel论证技术(例如翻转,翻译或旋转)在先前的知识蒸馏框架中广泛使用。受到功能空间上语义方向的最新进展的启发,我们建议在功能空间中包括以进行有效蒸馏的功能。具体而言,给定语义方向,可以在功能空间中为学生获得无限数量的增强。此外,分析表明,可以通过最大程度地减少增强损失的上限来同时优化这些增强。基于观察结果,开发了一种用于语义分割的知识蒸馏的新算法。对四个语义分割基准测试的广泛实验表明,所提出的方法可以提高当前知识蒸馏方法的性能而没有任何明显的开销。代码可在以下网址获得:https://github.com/jianlong-yuan/fakd。
translated by 谷歌翻译
尽管在过去几年中取得了重大进展,但使用单眼图像进行深度估计仍然存在挑战。首先,训练度量深度预测模型的训练是不算气的,该预测模型可以很好地推广到主要由于训练数据有限的不同场景。因此,研究人员建立了大规模的相对深度数据集,这些数据集更容易收集。但是,由于使用相对深度数据训练引起的深度转移,现有的相对深度估计模型通常无法恢复准确的3D场景形状。我们在此处解决此问题,并尝试通过对大规模相对深度数据进行训练并估算深度转移来估计现场形状。为此,我们提出了一个两阶段的框架,该框架首先将深度预测到未知量表并从单眼图像转移,然后利用3D点云数据来预测深度​​移位和相机的焦距,使我们能够恢复恢复3D场景形状。由于两个模块是单独训练的,因此我们不需要严格配对的培训数据。此外,我们提出了图像级的归一化回归损失和基于正常的几何损失,以通过相对深度注释来改善训练。我们在九个看不见的数据集上测试我们的深度模型,并在零拍摄评估上实现最先进的性能。代码可用:https://git.io/depth
translated by 谷歌翻译