Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, thus not fully exploiting the unique characteristic of video, i.e., temporal. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representation. Besides, the inherent temporal relations are captured by aligning video-text pairs as a whole in different time resolutions with multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
translated by 谷歌翻译
尽管预训练的语言模型(PLM)已经在各种自然语言处理(NLP)任务上实现了最先进的表现,但在处理知识驱动的任务时,它们被证明缺乏知识。尽管为PLM注入知识做出了许多努力,但此问题仍然开放。为了应对挑战,我们提出\ textbf {dictbert},这是一种具有词典知识增强PLM的新方法,比知识图(kg)更容易获取。在预训练期间,我们通过对比度学习将两个新颖的预训练任务注入plms:\ textit {dictionary输入预测}和\ textit {entriT {输入描述歧视}。在微调中,我们使用预先训练的dictbert作为插件知识库(KB)来检索输入序列中确定的条目的隐式知识,并将检索到的知识注入输入中,以通过新颖的额外跳动来增强其表示形式注意机制。我们评估了我们的方法,包括各种知识驱动和语言理解任务,包括NER,关系提取,CommonSenseQA,OpenBookQa和Glue。实验结果表明,我们的模型可以显着改善典型的PLM:它的大幅提高了0.5 \%,2.9 \%,9.0 \%,7.1 \%\%和3.3 \%,分别对Roberta--也有效大的。
translated by 谷歌翻译
视频文本检索一直是多模式研究中的至关重要和基本任务。大型多模式对比预训练的发展,视频文本检索的开发已大大促进,这主要侧重于粗粒或细粒对比。然而,在先前的研究中很少探索过跨粒度的对比,这是粗粒表示和细粒度表示之间的对比。与细粒度或粗粒的对比相比,交叉粒度对比度计算了粗粒粒度特征与每个细粒特征之间的相关性,并且能够过滤出不必要的细颗粒特征,这些特征由粗粒度的特征引导相似性计算,从而提高了检索的准确性。为此,本文提出了一种新型的多透明对比模型,即X-CLIP,用于视频文本检索。但是,另一个挑战在于相似性聚集问题,该问题旨在将细粒度和跨粒度相似性矩阵与实例级别的相似性汇总。为了应对这一挑战,我们提出了对相似性矩阵(AOSM)模块的关注,以使模型重点放在基本帧和单词之间的对比度上,从而降低了不必要的帧和单词对检索结果的影响。 X-CLIP具有多透明的对比度和提议的AOSM模块,在五个广泛使用的视频文本检索数据集上取得了出色的性能,包括MSR-VTT(49.3 R@1),MSVD(50.4 R@1),LSMDC(26.11)(26.1 r@1),didemo(47.8 r@1)和ActivityNet(46.2 r@1)。它的表现优于先前的最先前, +6.3%, +6.6%, +11.1%, +6.7%, +3.8%的相对改善对这些基准测试,这表明了多透明的对比度和AOSM的优势。
translated by 谷歌翻译
数字医学图像的机器学习和流行的最新进展已经开辟了通过使用深卷积神经网络来解决挑战性脑肿瘤细分(BTS)任务的机会。然而,与非常广泛的RGB图像数据不同,在脑肿瘤分割中使用的医学图像数据在数据刻度方面相对稀缺,但在模态属性方面包含更丰富的信息。为此,本文提出了一种新的跨模型深度学习框架,用于从多种方式MRI数据分段脑肿瘤。核心思想是通过多模态数据挖掘丰富的模式以弥补数据量表不足。所提出的跨型号深度学习框架包括两个学习过程:跨模型特征转换(CMFT)过程和跨模型特征融合(CMFF)过程,其目的是通过跨越不同模态的知识来学习丰富的特征表示数据和融合知识分别来自不同的模态数据。在Brats基准上进行了综合实验,表明,与基线方法和最先进的方法相比,所提出的跨模型深度学习框架可以有效地提高大脑肿瘤分割性能。
translated by 谷歌翻译
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with limited several support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are two folds: Firstly, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Secondly, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we firstly design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method achieves a competitive performance compared to existing approaches across different shots, e.g., we boost nAP by noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
translated by 谷歌翻译
In this chapter, we review and discuss the transformation of AI technology in HCI/UX work and assess how AI technology will change how we do the work. We first discuss how AI can be used to enhance the result of user research and design evaluation. We then discuss how AI technology can be used to enhance HCI/UX design. Finally, we discuss how AI-enabled capabilities can improve UX when users interact with computing systems, applications, and services.
translated by 谷歌翻译
As one of the most important psychic stress reactions, micro-expressions (MEs), are spontaneous and transient facial expressions that can reveal the genuine emotions of human beings. Thus, recognizing MEs (MER) automatically is becoming increasingly crucial in the field of affective computing, and provides essential technical support in lie detection, psychological analysis and other areas. However, the lack of abundant ME data seriously restricts the development of cutting-edge data-driven MER models. Despite the recent efforts of several spontaneous ME datasets to alleviate this problem, it is still a tiny amount of work. To solve the problem of ME data hunger, we construct a dynamic spontaneous ME dataset with the largest current ME data scale, called DFME (Dynamic Facial Micro-expressions), which includes 7,526 well-labeled ME videos induced by 671 participants and annotated by more than 20 annotators throughout three years. Afterwards, we adopt four classical spatiotemporal feature learning models on DFME to perform MER experiments to objectively verify the validity of DFME dataset. In addition, we explore different solutions to the class imbalance and key-frame sequence sampling problems in dynamic MER respectively on DFME, so as to provide a valuable reference for future research. The comprehensive experimental results show that our DFME dataset can facilitate the research of automatic MER, and provide a new benchmark for MER. DFME will be published via https://mea-lab-421.github.io.
translated by 谷歌翻译
Face Anti-spoofing (FAS) is essential to secure face recognition systems from various physical attacks. However, recent research generally focuses on short-distance applications (i.e., phone unlocking) while lacking consideration of long-distance scenes (i.e., surveillance security checks). In order to promote relevant research and fill this gap in the community, we collect a large-scale Surveillance High-Fidelity Mask (SuHiFiMask) dataset captured under 40 surveillance scenes, which has 101 subjects from different age groups with 232 3D attacks (high-fidelity masks), 200 2D attacks (posters, portraits, and screens), and 2 adversarial attacks. In this scene, low image resolution and noise interference are new challenges faced in surveillance FAS. Together with the SuHiFiMask dataset, we propose a Contrastive Quality-Invariance Learning (CQIL) network to alleviate the performance degradation caused by image quality from three aspects: (1) An Image Quality Variable module (IQV) is introduced to recover image information associated with discrimination by combining the super-resolution network. (2) Using generated sample pairs to simulate quality variance distributions to help contrastive learning strategies obtain robust feature representation under quality variation. (3) A Separate Quality Network (SQN) is designed to learn discriminative features independent of image quality. Finally, a large number of experiments verify the quality of the SuHiFiMask dataset and the superiority of the proposed CQIL.
translated by 谷歌翻译
When using LiDAR semantic segmentation models for safety-critical applications such as autonomous driving, it is essential to understand and improve their robustness with respect to a large range of LiDAR corruptions. In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions. To rigorously evaluate the robustness and generalizability of current approaches, we propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups, namely adverse weather, measurement noise and cross-device discrepancy. Then, we systematically investigate 11 LiDAR semantic segmentation models, especially spanning different input representations (e.g., point clouds, voxels, projected images, and etc.), network architectures and training schemes. Through this study, we obtain two insights: 1) We find out that the input representation plays a crucial role in robustness. Specifically, under specific corruptions, different representations perform variously. 2) Although state-of-the-art methods on LiDAR semantic segmentation achieve promising results on clean data, they are less robust when dealing with noisy data. Finally, based on the above observations, we design a robust LiDAR segmentation model (RLSeg) which greatly boosts the robustness with simple but effective modifications. It is promising that our benchmark, comprehensive analysis, and observations can boost future research in robust LiDAR semantic segmentation for safety-critical applications.
translated by 谷歌翻译
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works utilize separated approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework named Panoptic-PartFormer. Moreover, we find the previous metric PartPQ biases to PQ. To handle both issues, we make the following contributions: Firstly, we design a meta-architecture that decouples part feature and things/stuff feature, respectively. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model as Panoptic-PartFormer. Secondly, we propose a new metric Part-Whole Quality (PWQ) to better measure such task from both pixel-region and part-whole perspectives. It can also decouple the error for part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former, based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross attention scheme to further boost part segmentation qualities. We design a new part-whole interaction method using masked cross attention. Finally, the extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost drop of 70% on GFlops and 50% on parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
translated by 谷歌翻译