This work explores an efficient approach to establishing a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning, and video question-answering. We present VideoCoCa, which reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, a cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to ``flattened frame embeddings'', yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as input and generates \(N\) token embeddings per frame for a total of \(T\) video frames. We flatten the \(N \times T\) token embeddings into one long sequence of frozen video representations and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights, including the pooling layers, are loaded directly from the pretrained image-text CoCa model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as on zero-shot text-to-video retrieval on MSR-VTT and ActivityNet Captions. We also explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, YouCook2). Our approach establishes a simple and effective video-text baseline for future research.
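To make the flattening step concrete, here is a minimal sketch of applying an attentional pooler to flattened frame embeddings. The pooler below is a generic multi-head attention layer with learned queries; the layer names, sizes, and the pooler design itself are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    """Pools a variable-length token sequence into n_queries output tokens."""
    def __init__(self, dim=768, n_queries=256, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens):                      # tokens: (B, L, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)       # cross-attend to all tokens
        return out                                  # (B, n_queries, dim)

B, T, N, dim = 2, 8, 196, 768                       # batch, frames, tokens per frame
frame_tokens = torch.randn(B, T, N, dim)            # frozen image-encoder outputs
flat = frame_tokens.reshape(B, T * N, dim)          # flatten time into one sequence
pooled = AttentionalPooler(dim)(flat)               # CoCa's pooler weights would be reused here
print(pooled.shape)                                 # torch.Size([2, 256, 768])
```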
Downsampling and feature extraction are essential procedures for 3D point cloud understanding. Existing methods are limited by the inconsistent point densities of different parts of the point cloud. In this work, we analyze the limitations of the downsampling stage and propose a pre-abstraction group-wise window-normalization module. In particular, the window-normalization method is leveraged to unify the point densities in different parts. Furthermore, a group-wise strategy is proposed to obtain multi-type features, including texture and spatial information. We also propose a pre-abstraction module to balance local and global features. Extensive experiments show that our module performs better on several tasks. On segmentation tasks on S3DIS (Area 5), the proposed module performs better on small-object recognition, and the results have more precise boundaries than those of other methods. The recognition of the sofa and the column is improved from 69.2% to 84.4% and from 42.7% to 48.7%, respectively. The benchmark results are improved from 71.7%/77.6%/91.9% (mIoU/mAcc/OA) to 72.2%/78.2%/91.4%. The accuracies of 6-fold cross-validation on S3DIS are 77.6%/85.8%/91.7% (mIoU/mAcc/OA). Our method outperforms the previous best model, PointNeXt-XL (74.9%/83.0%/90.3%), by 2.7% in mIoU and achieves state-of-the-art performance. The code and models are available at https://github.com/DBDXSS/Window-Normalization.git.
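As a rough illustration of the density-unification idea, the sketch below normalizes point coordinates within fixed-size spatial windows so that sparse and dense regions present similar local scales to the feature extractor. The window partitioning and the unit-ball normalization are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def window_normalize(points, window_size=0.5):
    """points: (N, 3) array; returns per-window normalized coordinates."""
    keys = np.floor(points / window_size).astype(np.int64)   # voxel-style window ids
    normed = np.empty_like(points)
    for key in np.unique(keys, axis=0):
        mask = (keys == key).all(axis=1)
        local = points[mask]
        center = local.mean(axis=0)
        scale = np.linalg.norm(local - center, axis=1).max() + 1e-8
        normed[mask] = (local - center) / scale              # unit ball per window
    return normed

pts = np.random.rand(1024, 3)
print(window_normalize(pts).shape)   # (1024, 3)
```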
Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks. Notably, many vision-language models build two encoders (visual and textual) that map the two modalities into the same embedding space. As a result, the learned representations achieve good zero-shot performance on tasks like image classification. However, when there are only a few examples per category, the potential of large vision-language models is often under-realized, mainly due to the gap between the large number of parameters and the relatively small amount of training data. This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head. More interestingly, we can borrow imperfect category names, or even names from a foreign language, and still improve few-shot classification performance compared with random initialization. With the proposed category name initialization method, our model obtains state-of-the-art performance on a number of few-shot image classification benchmarks (e.g., 87.37\% on ImageNet and 96.08\% on Stanford Cars, both using five-shot learning). We also investigate and analyze when the benefit of category names diminishes and how to use distillation to improve the performance of smaller models, providing guidance for future research.
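A minimal sketch of category-name initialization is given below, assuming a CLIP-style text tower; `encode_text` is a hypothetical stand-in for whatever text encoder the pretrained model exposes.

```python
import torch
import torch.nn as nn

def init_head_from_names(encode_text, class_names, dim):
    head = nn.Linear(dim, len(class_names), bias=False)
    with torch.no_grad():
        embs = encode_text(class_names)                 # (num_classes, dim)
        embs = embs / embs.norm(dim=-1, keepdim=True)   # unit-normalize
        head.weight.copy_(embs)                         # name embedding -> class weight
    return head

# Dummy text encoder for illustration only.
encode_text = lambda names: torch.randn(len(names), 512)
head = init_head_from_names(encode_text, ["cat", "dog", "car"], dim=512)
logits = head(torch.randn(4, 512))                      # few-shot finetuning proceeds from here
print(logits.shape)                                     # torch.Size([4, 3])
```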
We present a strong object detector with encoder-decoder pretraining and finetuning. Our method, called Group DETR v2, is built upon a vision transformer encoder, ViT-Huge~\cite{dosovitskiy2020image}, a DETR variant, DINO~\cite{zhang2022dino}, and an efficient DETR training method, Group DETR~\cite{chen2022group}. The training process consists of self-supervised pretraining and finetuning of a ViT-Huge encoder on ImageNet-1K, pretraining the detector on Objects365, and finally finetuning it on COCO. Group DETR v2 achieves $\textbf{64.5}$ mAP on COCO test-dev and establishes a new SoTA on the COCO leaderboard (https://paperswithcode.com/sota/object-detection-on-coco).
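The three-stage recipe can be summarized as the schematic below; the stage functions are hypothetical placeholders, and only the ordering of datasets and objectives comes from the abstract.

```python
def pretrain_encoder_self_supervised(dataset="ImageNet-1K"):
    """Stage 1: self-supervised pretraining and finetuning of the ViT-Huge encoder."""
    ...

def pretrain_detector(dataset="Objects365"):
    """Stage 2: pretrain the DINO-based detector (with Group DETR training)."""
    ...

def finetune_detector(dataset="COCO"):
    """Stage 3: final finetuning, then evaluation on COCO test-dev."""
    ...

for stage in (pretrain_encoder_self_supervised, pretrain_detector, finetune_detector):
    stage()
```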
A brain-computer interface (BCI) provides a direct communication pathway between the human brain and external devices. Before a new subject can use a BCI, a calibration procedure is usually needed, because inter-subject and intra-subject variability is so large that a model trained on existing subjects performs poorly on new subjects. Effective subject-transfer and calibration methods are therefore essential. In this paper, we propose a semi-supervised meta-learning (SSML) method for subject-transfer learning in BCIs. The proposed SSML first learns a meta-model on existing subjects, and then fine-tunes the model in a semi-supervised learning manner, i.e., calibrating it with a few labeled and many unlabeled samples from the target subject. This is important for BCI applications where labeled data are scarce or expensive but unlabeled data are abundant. To validate the SSML method, we tested it on three different BCI paradigms: 1) event-related potential detection; 2) emotion recognition; and 3) sleep staging. SSML achieved significant improvements of 15% on the first two paradigms and 4.9% on the third. The experimental results demonstrate the effectiveness and potential of the SSML method for BCI applications.
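The sketch below illustrates one plausible form of the semi-supervised calibration step: finetuning a meta-trained model on a few labeled target-subject trials plus confidently pseudo-labeled unlabeled trials. The pseudo-labeling rule, the confidence threshold, and the toy classifier are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def semi_supervised_calibrate(model, x_lab, y_lab, x_unlab, steps=100, thresh=0.9):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    ce = nn.CrossEntropyLoss()
    for _ in range(steps):
        # Confident predictions on unlabeled trials become pseudo-labels.
        with torch.no_grad():
            probs = model(x_unlab).softmax(dim=-1)
            conf, pseudo = probs.max(dim=-1)
            keep = conf > thresh
        loss = ce(model(x_lab), y_lab)
        if keep.any():
            loss = loss + ce(model(x_unlab[keep]), pseudo[keep])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 128, 2))   # toy EEG classifier
x_lab, y_lab = torch.randn(8, 64, 128), torch.randint(0, 2, (8,))
x_unlab = torch.randn(200, 64, 128)
semi_supervised_calibrate(model, x_lab, y_lab, x_unlab, steps=5)
```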
In Industry 4.0, remaining useful life (RUL) prediction is crucial for modern manufacturing and automation workplaces. Clearly, continuous tool wear, or worse, sudden machine breakdown, causes various manufacturing failures, which evidently lead to economic losses. With the availability of deep learning approaches, their great potential and prospects for RUL prediction have led to several models driven by the operating data of manufacturing machines. Currently, these efforts based on fully supervised models rely heavily on prescribed labeled data. However, the data required for RUL prediction (i.e., annotated and labeled data from faulty and/or degraded machines) can only be obtained after machine breakdowns occur. The scarcity of broken machines in modern manufacturing and automation workplaces in real-world situations increases the difficulty of obtaining sufficient annotated and labeled data. In contrast, data collected from healthy machines are much easier to obtain. Noting this challenge, as well as the potential for improved effectiveness and applicability, we propose (and fully develop) a method based on the concept of masked autoencoders that exploits unlabeled data for self-supervision. Thus, in this work, a noteworthy masked self-supervised learning approach is developed and utilized. It aims to build a deep learning model for RUL prediction by leveraging unlabeled data. Experiments validating the effectiveness of this development were conducted on the C-MAPSS datasets (collected from data of NASA turbofan engines). The results clearly show that our development and approach here perform better, in both accuracy and effectiveness, than the use of fully supervised models.
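To make the masked self-supervision concrete, here is a toy sketch in which random timesteps of an unlabeled sensor window are masked and a small autoencoder is trained to reconstruct them; the architecture, mask ratio, and channel count are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class TinyMaskedAE(nn.Module):
    def __init__(self, n_sensors=14, hidden=64):
        super().__init__()
        self.enc = nn.GRU(n_sensors, hidden, batch_first=True)
        self.dec = nn.Linear(hidden, n_sensors)

    def forward(self, x, mask_ratio=0.5):
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        x_in = x.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out masked timesteps
        h, _ = self.enc(x_in)
        recon = self.dec(h)
        # Reconstruction loss only on masked positions, as in masked-autoencoder training.
        return ((recon - x) ** 2)[mask].mean()

x = torch.randn(32, 30, 14)      # (batch, time window, sensor channels), unlabeled
loss = TinyMaskedAE()(x)
loss.backward()                  # pretrain; then swap the head for RUL regression
```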
3D part segmentation is an essential step in advanced CAM/CAD workflows. Precise 3D segmentation contributes to a lower defect rate of workpieces produced by manufacturing equipment (such as computer-controlled CNCs), thereby improving work efficiency and attaining the consequent economic benefits. A large number of existing works on 3D model segmentation are mostly based on fully supervised learning, which trains AI models with large, annotated datasets. However, the drawback is that the resulting models of fully supervised learning methods depend highly on the completeness of the available dataset, and their generalization ability to new unseen segmentation types (i.e., further novel classes) is relatively poor. In this work, we propose and develop a noteworthy learning-based approach for effective part segmentation in CAM/CAD; it aims to significantly enhance generalization ability and to flexibly adapt to new segmentation tasks using only relatively few samples. As a result, it not only reduces the requirement for the usually unattainable and exhaustive completeness of supervised datasets, but also improves flexibility for real-world applications. As further improvements and innovations, we also adopt a transform net and a center-loss block in the network. These features help improve the comprehension of 3D features across the various possible instances of a whole workpiece and ensure close distributions of the same class in feature space. Moreover, our approach stores data in the point cloud format, which reduces space consumption and also makes the various involved procedures much easier to read and edit (thereby improving efficiency and effectiveness and reducing costs).
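The center-loss block mentioned above pulls each feature toward its class centroid so that same-class features cluster tightly in feature space; the sketch below is the standard center-loss formulation, assumed rather than taken from the paper's code.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        # One learnable centroid per class.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):          # feats: (N, D), labels: (N,)
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

loss_fn = CenterLoss(num_classes=8, feat_dim=128)
feats, labels = torch.randn(1024, 128), torch.randint(0, 8, (1024,))
print(loss_fn(feats, labels))
```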
We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, analogous to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into prior work on large language models, which have seen continued advances in capability and performance through scaling of data and model sizes. Our approach is simple: first, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and a finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our model in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
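Below is a schematic of the sequence-to-sequence view of text-to-image generation: a (hypothetical, untrained) encoder-decoder transformer autoregressively emits discrete image tokens conditioned on text tokens, which a ViT-VQGAN-style detokenizer would then decode to pixels. All vocabulary sizes and dimensions are placeholders.

```python
import torch
import torch.nn as nn

dim, text_vocab, image_vocab = 256, 1000, 8192
txt_emb = nn.Embedding(text_vocab, dim)
img_emb = nn.Embedding(image_vocab, dim)
seq2seq = nn.Transformer(d_model=dim, batch_first=True)
to_logits = nn.Linear(dim, image_vocab)

text = torch.randint(0, text_vocab, (1, 16))        # tokenized prompt
image_tokens = torch.zeros(1, 1, dtype=torch.long)  # start token
for _ in range(15):                                 # autoregressive generation
    h = seq2seq(txt_emb(text), img_emb(image_tokens))
    nxt = to_logits(h[:, -1]).argmax(dim=-1, keepdim=True)
    image_tokens = torch.cat([image_tokens, nxt], dim=1)
# image_tokens would then be decoded to pixels by the ViT-VQGAN decoder.
print(image_tokens.shape)   # torch.Size([1, 16])
```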
Exploring large-scale pretrained foundation models is of significant interest in computer vision, because these models can quickly be transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder, for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs that predict text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably, on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and a learned classification head, and a new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.
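The two objectives can be written compactly as in the sketch below, assuming unit-normalized pooled embeddings and teacher-forced caption logits; batch size, dimensions, and the temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def coca_loss(img_emb, txt_emb, caption_logits, caption_tokens, temp=0.07):
    # Contrastive loss between unimodal image and text embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t() / temp                  # (B, B) similarity matrix
    targets = torch.arange(sim.size(0))
    con = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2
    # Captioning loss: autoregressive next-token prediction from the multimodal decoder.
    cap = F.cross_entropy(caption_logits.flatten(0, 1), caption_tokens.flatten())
    return con + cap

B, D, L, V = 8, 512, 20, 32000
loss = coca_loss(torch.randn(B, D), torch.randn(B, D),
                 torch.randn(B, L, V), torch.randint(0, V, (B, L)))
print(float(loss))
```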
Recently, there has been much progress in the autonomous driving community, attracting a lot of attention from both academia and industry. However, existing works mainly focus on cars; autonomous truck algorithms and models still need additional development. In this paper, we introduce an intelligent self-driving truck system. Our presented system consists of three main components: 1) a realistic traffic simulation module for generating realistic traffic flows in test scenarios; 2) a high-fidelity truck model, designed and evaluated to imitate real truck responses in real-world deployment; and 3) an intelligent planning module with a learning-based decision-making algorithm and a multi-mode trajectory planner, taking into account the truck's constraints, road slope changes, and the surrounding traffic flow. We provide quantitative evaluations for each component individually to demonstrate the fidelity and performance of each part. We also deploy our proposed system on a real truck and conduct real-world experiments, showing our system's capability to mitigate the sim-to-real gap. Our code is available at https://github.com/inceptioresearch/iits
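A hypothetical sketch of how the three components could interact in a closed simulation loop is shown below; all class and method names are placeholders rather than the actual IITS API.

```python
class TrafficSim:
    def step(self):
        return {"vehicles": []}            # surrounding traffic state

class TruckModel:
    def apply(self, control):
        return {"pose": (0.0, 0.0, 0.0)}   # high-fidelity truck response

class Planner:
    def plan(self, truck_state, traffic, road_slope=0.0):
        # decision making + multi-mode trajectory planning under truck constraints
        return {"accel": 0.0, "steer": 0.0}

sim, truck, planner = TrafficSim(), TruckModel(), Planner()
state = truck.apply({"accel": 0.0, "steer": 0.0})
for _ in range(10):                        # closed simulation loop
    traffic = sim.step()
    control = planner.plan(state, traffic)
    state = truck.apply(control)
```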