To handle long videos of variable length, prior works extract multi-modal features and fuse them to predict students' engagement intensity. In this paper, we present Class Attention in Video Transformer (CAVT), a novel end-to-end method that involves a single vector to process the class embedding and uniformly performs end-to-end learning on both variable-length long videos and fixed-length short videos. Furthermore, to address the lack of sufficient samples, we propose a binary-order representative sampling method (BORS) that adds multiple video sequences of each video to augment the training set. BORS+CAVT not only achieves state-of-the-art MSE (0.0495) on the EmotiW-EP dataset, but also obtains state-of-the-art MSE (0.0377) on the DAiSEE dataset. The code and models will be made publicly available at https://github.com/mountainai/cavt.
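To make the class-attention idea above concrete, here is a minimal PyTorch sketch, not the authors' released code: a single learnable class token attends over a variable number of frame embeddings, so videos of any length reduce to one fixed-size vector for the intensity regressor. Module names, dimensions, and the regression head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClassAttentionPooling(nn.Module):
    """One learnable class token attends over T frame embeddings (T may vary)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, 1)  # engagement-intensity regressor (assumed)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, dim); T differs across videos
        B = frame_feats.size(0)
        cls = self.cls_token.expand(B, -1, -1)                # (B, 1, dim)
        pooled, _ = self.attn(cls, frame_feats, frame_feats)  # class token queries all frames
        return self.head(pooled.squeeze(1))                   # (B, 1) intensity

# Usage: a 7-frame clip and a 120-frame clip run through the same module.
model = ClassAttentionPooling(dim=256)
short, long = torch.randn(2, 7, 256), torch.randn(2, 120, 256)
print(model(short).shape, model(long).shape)  # torch.Size([2, 1]) twice
```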
Video text spotting (VTS) is the task of simultaneously detecting, tracking, and recognizing text in videos. Existing video text spotting methods typically build complicated pipelines with multiple models, which is not friendly for real-time applications. Here we propose CoText, a real-time end-to-end video text spotter with Contrastive representation learning. Our contributions are three-fold: 1) CoText simultaneously addresses the three tasks (text detection, tracking, and recognition) in a real-time, end-to-end trainable framework. 2) With contrastive learning, CoText models long-range dependencies and learns temporal information across multiple frames. 3) A simple, lightweight architecture is designed for effective and accurate performance, including GPU-parallel detection post-processing and a CTC-based recognition head with masked RoI. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting IDF1 of 72.0% at 41.0 FPS on ICDAR2015 video, an improvement of 10.5% and 32.0 FPS over the previous best method. The code can be found at github.com/weijiawu/cotext.
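The contrastive component (contribution 2) can be sketched as an InfoNCE objective over text-instance embeddings in consecutive frames; the function below is a hedged illustration of that idea, with the shapes, the temperature value, and the pairing scheme as assumptions rather than CoText's exact loss.

```python
import torch
import torch.nn.functional as F

def frame_contrastive_loss(feats_t: torch.Tensor, feats_t1: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over text-instance embeddings: row i of feats_t and feats_t1
    is the same text instance observed in consecutive frames (positive pair);
    all other rows act as negatives. Shapes: (N, D) each."""
    q = F.normalize(feats_t, dim=1)
    k = F.normalize(feats_t1, dim=1)
    logits = q @ k.t() / temperature                    # (N, N) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # matching rows are positives
    return F.cross_entropy(logits, targets)

# Toy usage: 5 tracked text instances with 128-d RoI embeddings per frame.
loss = frame_contrastive_loss(torch.randn(5, 128), torch.randn(5, 128))
```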
Open-vocabulary object detection (OVD) aims to scale up the vocabulary size to detect objects of novel categories beyond the training vocabulary. Recent works resort to the rich knowledge in pre-trained vision-language models. However, existing methods are ineffective at proposal-level vision-language alignment. Meanwhile, these models usually suffer from confidence bias toward base categories and perform worse on novel ones. To overcome these challenges, we propose MEDet, a novel and effective OVD framework with proposal Mining and prediction Equalization. First, we design an online proposal mining to refine the inherited vision-semantic knowledge from coarse to fine, allowing for proposal-level, detection-oriented feature alignment. Second, based on causal inference theory, we introduce a class-wise backdoor adjustment to reinforce the predictions on novel categories and improve overall OVD performance. Extensive experiments on the COCO and LVIS benchmarks verify the superiority of MEDet in detecting objects of novel categories, e.g., 32.6% AP50 on COCO and 22.4% mask mAP on LVIS.
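The class-wise backdoor adjustment builds on the textbook backdoor formula from causal inference, P(y|do(x)) = Σ_z P(y|x, z)P(z). The sketch below shows only that formula applied to per-stratum class probabilities; the confounder definition and the estimator MEDet actually uses are not specified by the abstract, so everything here is an assumption.

```python
import torch

def backdoor_adjusted_probs(cond_probs: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
    """Backdoor adjustment P(y|do(x)) = sum_z P(y|x, z) P(z).
    cond_probs: (Z, C) class probabilities conditioned on each confounder stratum z
    prior:      (Z,)  marginal P(z), summing to 1."""
    return (prior.unsqueeze(1) * cond_probs).sum(dim=0)  # (C,) deconfounded distribution

# Toy usage: 3 confounder strata over 4 classes.
cond = torch.softmax(torch.randn(3, 4), dim=1)
prior = torch.tensor([0.5, 0.3, 0.2])
adjusted = backdoor_adjusted_probs(cond, prior)
```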
Vision transformers (ViTs) are changing the landscape of object detection approaches. A natural way to use ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is straightforward and effective but brings a considerable computation burden at inference. A more subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder that demands an extra-long time to converge. As a result, transformer-based object detection cannot prevail in large-scale applications. To overcome these issues, we propose a novel Decoder-Free Fully Transformer-based (DFFT) object detector that, for the first time, achieves high efficiency in both the training and inference stages. We simplify object detection into an encoder-only, single-level anchor-based dense prediction problem by centering on two entry points: 1) eliminate the training-inefficient decoder and leverage two strong encoders to preserve the accuracy of single-level feature-map prediction; 2) explore low-level semantic features for the detection task with limited computational resources. In particular, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics, based on a well-conceived ablation study. Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL outperforms DETR by 2.5% AP with 28% lower computation cost and more than 10× fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT_SMALL obtains over 5.5% AP gain while cutting down 70% of the computation cost.
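An encoder-only, single-level anchor-based dense prediction head (entry point 1) can be sketched as two parallel convolutions over one feature map, with no decoder involved. The channel and anchor counts below are illustrative assumptions, not DFFT's actual configuration.

```python
import torch
import torch.nn as nn

class DenseAnchorHead(nn.Module):
    """Every location of a single-level feature map predicts class scores and
    box offsets for A anchors, so no decoder or multi-level FPN is needed."""
    def __init__(self, in_ch: int = 256, num_classes: int = 80, anchors: int = 3):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, anchors * num_classes, 3, padding=1)
        self.reg = nn.Conv2d(in_ch, anchors * 4, 3, padding=1)

    def forward(self, feat: torch.Tensor):
        # feat: (B, C, H, W), the single feature map from the transformer encoder
        return self.cls(feat), self.reg(feat)

head = DenseAnchorHead()
scores, boxes = head(torch.randn(2, 256, 32, 32))  # (2, 240, 32, 32), (2, 12, 32, 32)
```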
Almost all scene text spotting (detection and recognition) methods rely on costly box annotations (e.g., text-line boxes, word-level boxes, and character-level boxes). For the first time, we demonstrate that training scene text spotting models can be achieved with an extremely low-cost annotation of a single point for each instance. We propose an end-to-end scene text spotting method that formulates scene text spotting as a sequence prediction task, like language modeling. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive transformer to predict the sequence. We achieve promising results on several horizontal, multi-oriented, and arbitrarily shaped scene text benchmarks. Most significantly, we show that the performance is not very sensitive to the positions of the point annotations, meaning that they can be annotated, and even automatically generated, much more easily than bounding boxes, which require precise positions. We believe such a pioneering attempt indicates a significant opportunity for scene text spotting applications at a much larger scale than was previously possible.
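The sequence formulation is straightforward to sketch: each instance becomes a pair of quantized point coordinates followed by character tokens, and the auto-regressive transformer simply predicts the next token. The vocabulary layout, bin count, and special tokens below are illustrative assumptions.

```python
# Each text instance -> (x_bin, y_bin, char_1, ..., char_k, <sep>), so detection
# and recognition both reduce to next-token prediction over one vocabulary.
NUM_BINS = 1000                       # quantized coordinate bins (assumed)
CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_OFFSET = NUM_BINS                # character tokens follow coordinate bins
SEP = CHAR_OFFSET + len(CHARS)        # instance separator token
EOS = SEP + 1                         # end-of-sequence token

def encode_instance(x: float, y: float, text: str, w: int, h: int) -> list[int]:
    """Quantize the single annotated point to bins, then append character tokens."""
    xb = min(int(x / w * NUM_BINS), NUM_BINS - 1)
    yb = min(int(y / h * NUM_BINS), NUM_BINS - 1)
    return [xb, yb] + [CHAR_OFFSET + CHARS.index(c) for c in text.lower()] + [SEP]

# Two instances in a 640x480 image -> one flat target sequence for the transformer.
seq = encode_instance(120, 56, "stop", 640, 480) \
    + encode_instance(400, 300, "exit", 640, 480) + [EOS]
```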
Deep-learning-based pavement crack detection methods usually require large-scale labels with detailed crack-location information to learn accurate predictions. In practice, however, crack locations are difficult to annotate manually because of the diverse visual patterns of pavement cracks. In this paper, we propose a Deep Domain Adaptation-based Crack Detection Network (DDACDN), which learns to exploit source-domain knowledge to predict multi-category crack-location information in the target domain, where only image-level labels are available. Specifically, DDACDN first extracts crack features from both the source and target domains via a two-branch weight-sharing backbone network. Then, to achieve cross-domain adaptation, an intermediate domain is constructed by aggregating three-scale features from the feature space of each domain, adapting the crack features from the source domain to the target domain. Finally, the network involves the knowledge of both domains and is trained to recognize and localize pavement cracks. To facilitate accurate training and validation of the domain adaptation, we use two challenging pavement crack datasets, CQU-BPDD and RDD2020. Furthermore, we construct a new large-scale Bituminous Pavement Multi-label Disease Dataset named CQU-BPMDD, which contains 38,994 high-resolution pavement disease images, to further evaluate the robustness of our model. Extensive experiments demonstrate that DDACDN outperforms state-of-the-art pavement crack detection methods in predicting crack locations in the target domain.
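A hedged sketch of the intermediate-domain construction: blend the three-scale feature maps of the source and target branches so the adapted features lie between the two domains. The convex-blend aggregation and its weight are assumptions; the abstract does not specify the exact aggregation.

```python
import torch
import torch.nn.functional as F

def intermediate_domain(src_feats, tgt_feats, alpha: float = 0.5):
    """Build intermediate-domain features from the three-scale outputs of the
    weight-shared backbone. src_feats/tgt_feats: lists of three tensors
    (B, C, H_i, W_i), one per scale. The mixing weight alpha is an assumption."""
    mixed = []
    for s, t in zip(src_feats, tgt_feats):
        t = F.interpolate(t, size=s.shape[-2:], mode="bilinear", align_corners=False)
        mixed.append(alpha * s + (1 - alpha) * t)  # convex blend per scale
    return mixed

# Toy usage with three pyramid levels from the two-branch backbone.
src = [torch.randn(2, 64, s, s) for s in (64, 32, 16)]
tgt = [torch.randn(2, 64, s, s) for s in (64, 32, 16)]
inter = intermediate_domain(src, tgt)
```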
Although self-supervised representation learning (SSL) has received wide attention from the community, recent research argues that its performance suffers a cliff fall when the model size decreases. Current methods mainly rely on contrastive learning to train the network; in this work, we propose a simple yet effective Distilled Contrastive Learning (DisCo) method to ease this issue by a large margin. Specifically, we find that the final embedding obtained by mainstream SSL methods contains the most fruitful information, and we propose to distill the final embedding to maximally transmit a teacher's knowledge to a lightweight model by constraining the last embedding of the student to be consistent with that of the teacher. In addition, we find in our experiments that a phenomenon we term the distilling bottleneck exists, and we propose to enlarge the embedding dimension to alleviate it. Our method does not introduce any extra parameters into lightweight models during deployment. Experimental results show that our method achieves the state of the art on all lightweight models. In particular, when ResNet-101/ResNet-50 is used as the teacher to teach EfficientNet-B0, the linear-evaluation result of EfficientNet-B0 on ImageNet is very close to that of ResNet-101/ResNet-50, while the number of parameters of EfficientNet-B0 is only 9.4%/16.3% of that of ResNet-101/ResNet-50. Code is available at https://github.com/yuting-gao/disco-pytorch.
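The distillation objective can be sketched as a consistency loss between the student's and teacher's final embeddings; the train-time-only projector below also illustrates the enlarged-embedding remedy for the distilling bottleneck. The loss form (MSE on normalized embeddings) and the dimensions are assumptions, not necessarily DisCo's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillEmbeddingLoss(nn.Module):
    """Constrain the student's final embedding to match the teacher's."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Train-time-only projector, dropped at deployment, so no extra
        # parameters ship with the lightweight model.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_emb: torch.Tensor, teacher_emb: torch.Tensor):
        s = F.normalize(self.proj(student_emb), dim=1)
        t = F.normalize(teacher_emb.detach(), dim=1)  # teacher is frozen
        return F.mse_loss(s, t)

# Toy usage with assumed dims: EfficientNet-B0 (1280-d) student, ResNet-50 (2048-d) teacher.
loss_fn = DistillEmbeddingLoss(student_dim=1280, teacher_dim=2048)
loss = loss_fn(torch.randn(8, 1280), torch.randn(8, 2048))
```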
We present a new, embarrassingly simple approach to instance segmentation. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that has made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the "detect-then-segment" strategy (e.g., Mask R-CNN), or predict embedding vectors first and then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location and size, thus nicely converting instance segmentation into a single-shot classification-solvable problem. We demonstrate a much simpler and more flexible instance segmentation framework with strong performance, achieving on-par accuracy with Mask R-CNN and outperforming recent single-shot instance segmenters in accuracy. We hope that this simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation. Code is available at https://git.io/AdelaiDet
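The "instance categories" assignment can be sketched in a few lines: an instance's label is determined jointly by its semantic class and the grid cell containing its center. This hedged sketch uses a single grid and omits the additional size-based branching; the grid size is an assumption.

```python
def instance_category(cx: float, cy: float, num_classes: int,
                      cls_id: int, grid: int = 12) -> int:
    """Map an instance to an 'instance category': its semantic class crossed
    with the S x S grid cell containing its center, so a single-shot classifier
    can separate instances by location. cx, cy are normalized centers in [0, 1)."""
    i, j = int(cy * grid), int(cx * grid)         # grid cell of the instance center
    return (i * grid + j) * num_classes + cls_id  # one label per (cell, class) pair

# An instance of class 3 centered at (0.61, 0.27) on a 12x12 grid:
label = instance_category(0.61, 0.27, num_classes=80, cls_id=3)
```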
In-context learning, as a new paradigm in NLP, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. But in computer vision, the difficulty for in-context learning lies in the fact that tasks vary significantly in their output representations, so it is unclear how to define general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution, that is, to redefine the output of core vision tasks as images, and to specify task prompts also as images. With this idea, our training process is extremely simple: it performs standard masked image modeling on the stitch of input and output image pairs. This makes the model capable of performing tasks conditioned on visible image patches. Thus, during inference, we can adopt a pair of input and output images from the same task as the input condition, to indicate which task to perform. Without bells and whistles, our generalist Painter achieves competitive performance compared to well-established task-specific models on seven representative vision tasks, ranging from high-level visual understanding to low-level image processing. Painter significantly outperforms recent generalist models on several challenging tasks. Surprisingly, our model shows the capability of completing out-of-domain tasks that do not exist in the training data, such as open-category keypoint detection and object segmentation, validating the powerful task transferability of in-context learning.
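The training recipe reads directly as code: stitch an input image and its task-output image into one canvas, mask random patches of the output half, and train the model to reconstruct them. Patch size, mask ratio, and the vertical stitching below are illustrative assumptions, not Painter's exact configuration.

```python
import torch

def stitch_and_mask(inp: torch.Tensor, out: torch.Tensor,
                    mask_ratio: float = 0.75, patch: int = 16):
    """Stitch an (input, output) image pair into one canvas and mask random
    patches of the output half, producing a masked-image-modeling example."""
    canvas = torch.cat([inp, out], dim=-2)   # stack vertically: (C, 2H, W)
    masked = canvas.clone()
    _, H2, W = canvas.shape
    n_h, n_w = H2 // patch, W // patch
    mask_from = n_h // 2                     # only mask the output half
    for i in range(mask_from, n_h):
        for j in range(n_w):
            if torch.rand(()) < mask_ratio:
                masked[:, i*patch:(i+1)*patch, j*patch:(j+1)*patch] = 0.0
    return masked, canvas                    # model input, reconstruction target

# Toy usage: a 224x224 image and its task output (e.g., a segmentation map as an image).
model_in, target = stitch_and_mask(torch.randn(3, 224, 224), torch.randn(3, 224, 224))
```

At inference, the same stitching serves as the prompt: a completed input/output pair from the desired task is placed on the canvas, and the model fills in the output for the new query image.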
Long short-term memory (LSTM) is a type of powerful deep neural network that has been widely used in many sequence analysis and modeling applications. However, the large model size of LSTM networks makes their practical deployment still very challenging, especially for video recognition tasks that require high-dimensional input data. Aiming to overcome this limitation and fully unlock the potential of LSTM models, in this paper we propose to perform algorithm and hardware co-design towards high-performance, energy-efficient LSTM networks. At the algorithm level, we develop a fully decomposed hierarchical Tucker (FDHT) structure-based LSTM, namely FDHT-LSTM, which enjoys ultra-low model complexity while still achieving high accuracy. In order to fully reap this attractive algorithmic benefit, we further develop the corresponding customized hardware architecture to support the efficient execution of the proposed FDHT-LSTM model. With a delicate design of the memory access scheme, the complicated matrix transformation can be efficiently supported by the underlying hardware on the fly, without any access conflict. Our evaluation results show that both the proposed ultra-compact FDHT-LSTM models and the corresponding hardware accelerator achieve very high performance. Compared with state-of-the-art compressed LSTM models, FDHT-LSTM enjoys both an order-of-magnitude reduction in model size and significant accuracy improvements across different video recognition datasets. Meanwhile, compared with TIE, the state-of-the-art hardware for tensor-decomposed models, our proposed FDHT-LSTM architecture achieves better throughput, area efficiency, and energy efficiency on the LSTM-Youtube workload. For the LSTM-UCF workload, our proposed design also outperforms TIE with higher throughput, higher energy efficiency, and comparable area efficiency.
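The FDHT structure itself is involved; as a deliberately simplified stand-in, the sketch below uses a plain low-rank factorization (W ≈ UV) only to illustrate why factorizing the huge input-to-hidden matrices of video LSTMs yields order-of-magnitude size reductions. The rank and dimensions are assumptions, and this is not the paper's decomposition.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Not the paper's FDHT structure: a simplified low-rank stand-in (W ~ U V)
    that shows the compression principle behind tensor-decomposed LSTM weights."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 64):
        super().__init__()
        self.U = nn.Linear(in_dim, rank, bias=False)
        self.V = nn.Linear(rank, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.V(self.U(x))

# Assumed sizes: 57,600-d frame features feeding 4x256 LSTM gates. A full weight
# matrix would hold ~59M parameters; this rank-64 factorization holds ~3.8M.
layer = LowRankLinear(57600, 4 * 256)
print(sum(p.numel() for p in layer.parameters()))
```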