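Cross-modal retrieval between videos and texts has attracted growing research interest due to the rapid emergence of videos on the web. Generally, a video contains rich instance and event information, while the query text describes only part of that information. A video can therefore correspond to multiple different text descriptions and queries. We call this phenomenon the "video-text correspondence ambiguity" problem. Current techniques mostly concentrate on mining local or multi-level alignment between the contents of a video and a text (e.g., object to entity and action to verb). These methods struggle to alleviate the video-text correspondence ambiguity because they describe a video with only a single feature, which must be matched against multiple different text features at the same time. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video by adaptively aggregating video token features. Given a query text, the similarity is determined by the most similar prototype in order to find the correspondence in the video, which we term text-adaptive matching. To learn diverse prototypes that represent the rich information in videos, we propose a variance loss that encourages different prototypes to attend to different contents of the video. Our method outperforms state-of-the-art methods on four public video retrieval datasets.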
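The text-adaptive matching step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the prototype queries, the softmax pooling used for aggregation, the cosine similarity, and all function names are hypothetical choices made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit length (left unchanged if it is all zeros).
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_prototypes(tokens, queries):
    # Assumption: each prototype is a softmax-weighted pool of the video
    # token features, with one hypothetical query vector per prototype.
    dim = len(tokens[0])
    prototypes = []
    for q in queries:
        weights = softmax([dot(q, t) for t in tokens])
        prototypes.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return prototypes

def text_adaptive_similarity(text_feature, prototypes):
    # The video-text similarity is the score of the most similar prototype.
    t = normalize(text_feature)
    sims = [dot(t, normalize(p)) for p in prototypes]
    best = max(range(len(sims)), key=sims.__getitem__)
    return sims[best], best

# Toy video with two token features and two prototype queries.
tokens = [[1.0, 0.0], [0.0, 1.0]]
queries = [[5.0, 0.0], [0.0, 5.0]]
prototypes = aggregate_prototypes(tokens, queries)

score_a, index_a = text_adaptive_similarity([1.0, 0.0], prototypes)
score_b, index_b = text_adaptive_similarity([0.0, 1.0], prototypes)
```

In this toy setup the two query texts select different prototypes of the same video, which is the behavior a single pooled video feature cannot provide.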
translated by Google Translate
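We study the task of learning robust feature representations that generalize well across multiple datasets for action recognition. We build our method on Transformers because of their efficacy. Although video action recognition has made great progress over the past decade, training a single model that performs well across multiple datasets remains challenging yet valuable. Here, we propose a novel multi-dataset training paradigm, MultiTrain, with two new loss terms, namely an informative loss and a projection loss, that aim to learn robust representations for action recognition. In particular, the informative loss maximizes the expressiveness of the feature embedding, while the projection loss for each dataset mines the intrinsic relations between classes across datasets. We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet, and Something-Something-v2. Extensive experimental results show that our method consistently improves on state-of-the-art performance.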
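Real-world text applications often involve composing a wide range of text control operations, such as editing text with respect to an attribute, manipulating keywords and structure, and generating new text with desired properties. Prior work typically learns or fine-tunes a language model (LM) to perform individual operations or specific subsets of them. Recent research has studied combining operations in a plug-and-play manner, often with costly search or optimization in the complex sequence space. This paper proposes a new, efficient approach for composable text operations in a compact latent space of text. The low dimensionality and differentiability of the text latent vector allow us to develop an efficient sampler based on ordinary differential equations (ODEs), given arbitrary plug-in operators (e.g., attribute classifiers). By connecting pretrained LMs (e.g., GPT2) to the latent space through efficient adaptation, we then decode the sampled vectors into the desired text sequences. This flexible approach permits diverse control operators (sentiment, tense, formality, keywords, etc.) acquired using any relevant data from different domains. Experiments show that composing these operators within our approach generates or edits high-quality text, substantially improving over previous methods in both generation quality and efficiency.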
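To make the idea of composing plug-in operators in a low-dimensional latent space concrete, here is a toy sketch. It is not the paper's sampler: it assumes each operator contributes a differentiable energy, composes operators by summing their gradients, and integrates the gradient-flow ODE dz/dt = -∇E(z) with forward Euler. The quadratic "operators" and every name below are hypothetical stand-ins for, say, attribute classifiers.

```python
def euler_gradient_flow(z0, grad_fns, step=0.1, num_steps=200):
    # Forward-Euler integration of dz/dt = -sum_k grad E_k(z).
    z = list(z0)
    for _ in range(num_steps):
        total_grad = [0.0] * len(z)
        for grad_fn in grad_fns:
            for i, g in enumerate(grad_fn(z)):
                total_grad[i] += g
        z = [zi - step * gi for zi, gi in zip(z, total_grad)]
    return z

def quadratic_grad(target):
    # Gradient of the toy energy 0.5 * ||z - target||^2 (hypothetical operator).
    return lambda z: [zi - ti for zi, ti in zip(z, target)]

# Two toy operators pulling the latent vector toward different targets.
operators = [quadratic_grad([1.0, 0.0]), quadratic_grad([0.0, 1.0])]
z_final = euler_gradient_flow([0.0, 0.0], operators)
```

With these two quadratic energies the summed energy is minimized at (0.5, 0.5), so the integrated latent vector settles at a compromise between both operators. In the setting the abstract describes, the resulting vector would then be decoded by the LM, a step this sketch omits.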
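Improving the resolution of magnetic resonance (MR) image data is critical for computer-aided diagnosis and brain function analysis. Higher resolution helps capture more detailed content, but typically leads to a lower signal-to-noise ratio and longer scanning time. For this reason, MR image super-resolution has recently become a topic of wide interest. Existing works build a broad range of deep models on conventional architectures based on convolutional neural networks (CNNs). In this work, to further advance this research field, we make an early effort to build a Transformer-based MR image super-resolution framework, carefully designed to exploit valuable domain prior knowledge. Specifically, we consider two-fold domain priors, namely the high-frequency structure prior and the inter-modality context prior, and establish a novel Transformer architecture, called Cross-modality High-frequency Transformer (COHF-T), to introduce these priors into the super-resolution of low-resolution (LR) MR images. Experiments on two datasets show that COHF-T achieves new state-of-the-art performance.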
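Due to the constraints of imaging devices and the high cost of operation time, computed tomography (CT) scans are usually acquired with low intra-slice resolution. Improving the intra-slice resolution benefits disease diagnosis by both human experts and computer-aided systems. To this end, this paper builds a novel medical slice synthesis method to increase slice resolution. Since ground-truth intermediate medical slices are always absent in clinical practice, we introduce an incremental cross-view mutual distillation strategy to accomplish this task in a self-supervised manner. Specifically, we model the problem from three different views. In this setting, the models learned from different views can distill valuable knowledge to guide one another's learning processes. We can repeat this process so that the models synthesize intermediate slice data with increasing slice resolution. To demonstrate the effectiveness of the proposed approach, we conduct comprehensive experiments on a large-scale CT dataset. Quantitative and qualitative comparisons show that our method outperforms state-of-the-art algorithms by clear margins.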
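As Alibaba's business expands across the world and into various industries, higher standards are imposed on the service quality and reliability of the big data cloud computing platforms that constitute the infrastructure of Alibaba Cloud. However, root cause analysis on these platforms is non-trivial because of their complicated system architecture. In this paper, we propose a root cause analysis framework called CloudRCA, which makes use of heterogeneous multi-source data, including key performance indicators (KPIs), logs, and topology, and extracts important features via state-of-the-art anomaly detection and log analysis techniques. The engineered features are then used in a Knowledge-informed Hierarchical Bayesian Network (KHBN) model to infer root causes with high accuracy and efficiency. An ablation study and comprehensive experimental comparisons demonstrate that, compared with existing frameworks, CloudRCA 1) consistently outperforms existing approaches in F1 score across different cloud systems; 2) can handle novel types of root causes thanks to the hierarchical structure of KHBN; 3) performs more robustly with respect to algorithmic configurations; and 4) scales more favorably with data and feature sizes. Experiments also show that a cross-platform transfer learning mechanism can be adopted to further improve accuracy by more than 10%. CloudRCA has been integrated into the diagnosis system of Alibaba Cloud and is used on three typical cloud computing platforms: MaxCompute, Realtime Compute, and Hologres. Over the past twelve months, it has saved site reliability engineers (SREs) more than 20% of the time spent resolving failures and has significantly improved service reliability.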
This paper focuses on designing efficient models with few parameters and low FLOPs for dense prediction. Although CNN-based lightweight methods have achieved impressive results after years of research, the trade-off between model accuracy and constrained resources still needs improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of the Meta-Mobile Block, and we argue that the specific instantiation matters greatly for model performance even when the framework is shared. Motivated by this observation, we derive a simple yet efficient modern Inverted Residual Mobile Block (iRMB) for mobile applications, which combines CNN-like efficiency for modeling short-distance dependencies with Transformer-like dynamic modeling capability for learning long-distance interactions. Furthermore, we design a ResNet-like 4-phase Efficient MOdel (EMO) based only on a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of EMO over state-of-the-art methods; e.g., EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 accuracy, surpassing SoTA CNN- and Transformer-based models while balancing model accuracy and efficiency well.
Decompilation aims to transform a low-level program language (LPL) (e.g., a binary file) into its functionally equivalent high-level program language (HPL) (e.g., C/C++). It is a core technology in software security, especially in vulnerability discovery and malware analysis. In recent years, following the successful application of neural machine translation (NMT) models in natural language processing (NLP), researchers have tried to build neural decompilers by borrowing ideas from NMT. They formulate decompilation as a translation problem between LPL and HPL, aiming to reduce the human cost of developing decompilation tools and to improve their generalizability. However, state-of-the-art learning-based decompilers do not cope well with compiler-optimized binaries. Since real-world binaries are mostly compiler-optimized, decompilers that do not consider optimized binaries have limited practical significance. In this paper, we propose a novel learning-based approach named NeurDP that targets compiler-optimized binaries. NeurDP uses a graph neural network (GNN) model to convert LPL to an intermediate representation (IR), which bridges the gap between source code and optimized binary. We also design an Optimized Translation Unit (OTU) that splits functions into smaller code fragments for better translation performance. Evaluation results on datasets containing various types of statements show that NeurDP can decompile optimized binaries with 45.21% higher accuracy than state-of-the-art neural decompilation frameworks.
Image virtual try-on aims to replace the clothing in a person image with a garment image (in-shop clothes), a task that has attracted increasing attention from the multimedia and computer vision communities. Prior methods successfully preserve the character of clothing images; however, occlusion remains a pernicious obstacle to realistic virtual try-on. In this work, we first present a comprehensive analysis of occlusions and categorize them into two types: i) Inherent-Occlusion: the ghost of the former cloth still exists in the try-on image; ii) Acquired-Occlusion: the target cloth is warped onto an unreasonable body part. Based on this analysis, we find that occlusions can be simulated by a novel semantically guided mixup module, which generates semantic-specific occluded images that work together with the try-on images to facilitate training a de-occlusion try-on (DOC-VTON) framework. Specifically, DOC-VTON first performs a sharpened semantic parsing of the try-on person. Aided by semantic guidance and a pose prior, textures of varying complexity are selectively blended with human parts in a copy-and-paste manner. The Generative Module (GM) is then used to synthesize the final try-on image while jointly learning to remove occlusions. Compared with state-of-the-art methods, DOC-VTON achieves better perceptual quality by reducing occlusion effects.
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they remain limited in extracting local attributes and global identity information, which are critical for person re-identification. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules to address this issue. Specifically, MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving a holistic perception of the input person. We combine the outputs of all stages for the final identification. In practice, to reduce the computational cost, Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to perform self-attention along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract informative and discriminative feature representations at different stages. All of them are realized with newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting informative and discriminative information from videos, and show that MSTAT achieves state-of-the-art accuracy on various standard benchmarks.
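The cost argument for running self-attention along the spatial and temporal dimensions separately can be checked with simple arithmetic. The sketch below only counts query-key score pairs per attention layer. The clip length of 8 frames and 196 patches per frame are assumed values chosen for illustration, and the function names are hypothetical rather than part of MSTAT.

```python
def joint_attention_pairs(num_frames, patches_per_frame):
    # Joint attention: every token attends to every token in the clip.
    tokens = num_frames * patches_per_frame
    return tokens * tokens

def separated_attention_pairs(num_frames, patches_per_frame):
    # Spatial attention: patches attend to each other within each frame.
    spatial = num_frames * patches_per_frame ** 2
    # Temporal attention: each patch position attends across frames.
    temporal = patches_per_frame * num_frames ** 2
    return spatial + temporal

# Assumed example sizes: 8 frames, 196 patches per frame.
num_frames, patches_per_frame = 8, 196
joint = joint_attention_pairs(num_frames, patches_per_frame)
separated = separated_attention_pairs(num_frames, patches_per_frame)
print(joint, separated)  # prints: 2458624 319872
```

For T frames and N patches per frame, the ratio of the two counts is TN / (T + N), so the saving grows with both clip length and spatial resolution.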