In this paper, we present our solutions to the Multimodal Sentiment Analysis Challenge (MuSe) 2022, which comprises the MuSe-Humor, MuSe-Reaction, and MuSe-Stress sub-challenges. MuSe 2022 focuses on humor detection, emotional reactions, and multimodal emotional stress, utilizing different modalities and datasets. In our work, different kinds of multimodal features are extracted, including acoustic, visual, textual, and biological features. These features are fused by TEMMA and GRU within a self-attention framework. In this paper, 1) several new audio features, facial expression features, and paragraph-level text embeddings are extracted to improve accuracy; 2) we substantially improve the accuracy and reliability of multimodal emotion prediction by mining and fusing multimodal features; 3) effective data augmentation strategies are applied in model training to alleviate the sample imbalance problem and prevent the model from learning biased subject characteristics. For the MuSe-Humor sub-challenge, our model obtains an AUC score of 0.8932. For the MuSe-Reaction sub-challenge, the Pearson correlation coefficient of our model on the test set is 0.3879, outperforming all other participants. For the MuSe-Stress sub-challenge, our approach outperforms the baseline on both arousal and valence on the test set, reaching a final combined result of 0.5151.
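As a rough illustration of the fusion stage described above, the sketch below combines per-modality feature sequences with a GRU followed by self-attention. All dimensions, the concatenation order, and the pooling/prediction head are illustrative assumptions, not the authors' exact TEMMA/GRU configuration.

```python
# Minimal sketch of GRU + self-attention fusion over multimodal features.
# Layer sizes and the prediction head are assumptions for illustration.
import torch
import torch.nn as nn

class SelfAttentionGRUFusion(nn.Module):
    """Fuse per-modality feature sequences with a GRU followed by self-attention."""
    def __init__(self, modality_dims, hidden_dim=128, num_heads=4):
        super().__init__()
        # Project each modality (acoustic/visual/textual/biological) to a shared space.
        self.proj = nn.ModuleList(nn.Linear(d, hidden_dim) for d in modality_dims)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # e.g., a humor/arousal/valence score

    def forward(self, feats):  # feats: list of (B, T_m, d_m) tensors, one per modality
        x = torch.cat([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, sum T_m, H)
        x, _ = self.gru(x)                  # temporal recurrence
        x, _ = self.attn(x, x, x)           # self-attention over the fused sequence
        return self.head(x.mean(dim=1))     # temporal pooling + prediction
```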
Transformer-based visual object tracking has been widely used. However, the Transformer structure lacks sufficient inductive bias. In addition, focusing only on encoding global features harms the modeling of local details, which restricts tracking capability on aerial robots. Specifically, following a local-modeling-to-global-search mechanism, the proposed tracker replaces the global encoder with a novel local-recognition encoder. In the employed encoder, local-recognition attention and a local element correction network are carefully designed to reduce interference from redundant global information and to increase local inductive bias. Meanwhile, the latter can model local object details precisely under aerial views through a detail-inquiry network. The proposed method achieves competitive accuracy and robustness on several authoritative aerial benchmarks with 316 sequences in total. The practicability and efficiency of the proposed tracker have been validated by real-world tests.
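The abstract does not spell out the local-recognition attention, but the general idea of restricting attention to local windows to inject locality can be sketched as follows. This is generic local-window attention under my own assumptions, not the paper's exact module.

```python
# Generic local-window attention: tokens attend only within fixed-size windows,
# which adds local inductive bias compared to global self-attention.
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4, window=7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, L, C) flattened feature tokens
        b, l, c = x.shape
        pad = (-l) % self.window
        x = nn.functional.pad(x, (0, 0, 0, pad))  # pad L to a multiple of window
        w = x.view(-1, self.window, c)            # split into (B * num_windows) windows
        out, _ = self.attn(w, w, w)               # attend within each window only
        return out.reshape(b, l + pad, c)[:, :l]  # restore shape, drop padding
```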
By using only a well-trained classifier, model-inversion (MI) attacks can recover the data used to train the classifier, leading to privacy leakage of the training data. To defend against MI attacks, previous work utilizes a unilateral dependency optimization strategy, i.e., minimizing the dependency between inputs (i.e., features) and outputs (i.e., labels) during training the classifier. However, such a minimization process conflicts with minimizing the supervised loss, which aims to maximize the dependency between inputs and outputs, resulting in an explicit trade-off between model robustness against MI attacks and model utility on the classification task. In this paper, we aim to minimize the dependency between latent representations and inputs while maximizing the dependency between latent representations and outputs, a strategy we call bilateral dependency optimization (BiDO). In particular, we use the dependency constraints as a universally applicable regularizer in addition to the commonly used losses for deep neural networks (e.g., cross-entropy), which can be instantiated with appropriate dependency criteria according to different tasks. To verify the efficacy of our strategy, we propose two implementations of BiDO with two different dependency measures: BiDO with constrained covariance (BiDO-COCO) and BiDO with the Hilbert-Schmidt Independence Criterion (BiDO-HSIC). Experiments show that BiDO opens a new path to defending against MI attacks.
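The BiDO-HSIC variant can be made concrete with the standard biased empirical HSIC estimator, tr(KHLH)/(n-1)^2. The kernel choice, bandwidth, and weighting coefficients below are assumptions for illustration, not the paper's tuned values.

```python
# Sketch of an HSIC-based bilateral dependency regularizer in the spirit of
# BiDO-HSIC. Kernel and coefficients are illustrative assumptions.
import torch

def gaussian_kernel(x, sigma=1.0):
    # x: (n, d) -> (n, n) Gaussian (RBF) kernel matrix
    d2 = torch.cdist(x, x).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimator: tr(K H L H) / (n - 1)^2."""
    n = x.size(0)
    h = torch.eye(n, device=x.device) - 1.0 / n   # centering matrix H = I - 11^T/n
    k, l = gaussian_kernel(x, sigma), gaussian_kernel(y, sigma)
    return torch.trace(k @ h @ l @ h) / (n - 1) ** 2

def bido_loss(ce_loss, inputs, latent, labels_onehot, lam_x=0.1, lam_y=0.1):
    # Penalize dependency(latent, inputs); reward dependency(latent, labels).
    return (ce_loss
            + lam_x * hsic(latent.flatten(1), inputs.flatten(1))
            - lam_y * hsic(latent.flatten(1), labels_onehot.float()))
```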
A novel simulator called VMAgent is introduced to help RL researchers better explore new methods, especially for virtual machine scheduling. VMAgent is inspired by practical virtual machine (VM) scheduling tasks and provides an efficient simulation platform that reflects the real situations of cloud computing. Three scenarios (fading, recovering, and expansion) are drawn from practical cloud computing and correspond to many reinforcement learning challenges (high-dimensional state and action spaces, high non-stationarity, and life-long demand). VMAgent provides flexible configurations for RL researchers to design customized scheduling environments that account for different problem features. From the VM scheduling perspective, VMAgent also helps to explore better learning-based scheduling solutions.
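To make the setting concrete, here is a toy gym-style VM-scheduling loop. The class and method names are hypothetical illustrations of this kind of environment; they do not reflect VMAgent's actual API.

```python
# Toy VM-scheduling environment: place incoming VM requests onto servers with
# limited CPU capacity. Hypothetical interface, NOT VMAgent's real API.
import random

class ToySchedulingEnv:
    def __init__(self, num_servers=4, cpu_capacity=16):
        self.num_servers, self.cpu_capacity = num_servers, cpu_capacity

    def reset(self):
        self.free = [self.cpu_capacity] * self.num_servers
        self.request = random.choice([1, 2, 4, 8])  # CPU cores requested
        return (tuple(self.free), self.request)

    def step(self, action):  # action: index of the target server
        if self.free[action] >= self.request:
            self.free[action] -= self.request
            reward, done = 1.0, False   # successful placement
        else:
            reward, done = -1.0, True   # failed placement ends the episode
        self.request = random.choice([1, 2, 4, 8])
        return (tuple(self.free), self.request), reward, done

env = ToySchedulingEnv()
state = env.reset()
for _ in range(100):
    # Worst-fit baseline: always pick the emptiest server.
    action = max(range(env.num_servers), key=lambda i: env.free[i])
    state, reward, done = env.step(action)
    if done:
        state = env.reset()
```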
The tremendous wave of technological innovation over the past several years, marked by advances in AI technologies, is profoundly reshaping industry and society. However, down the road, a key challenge awaits us: our capability to meet rapidly growing scenario-specific demands is severely limited by the cost of acquiring training data. This difficult situation is rooted in the limitations of the mainstream learning paradigm: we need to train a new model for each new scenario based on a large quantity of well-annotated data, and usually from scratch. In tackling this fundamental problem, we go beyond it and develop a new learning paradigm named INTERN. By learning from supervisory signals from multiple sources in multiple stages, the trained model acquires strong generalizability. We evaluate our model on 26 well-known datasets that cover four categories of tasks in computer vision. In most cases, our models, adapted with only 10% of the training data in the target domain, consistently outperform their counterparts trained with the full data, often by significant margins. This is an important step towards a promising prospect in which a model with general vision capability can considerably reduce reliance on data, thereby accelerating the adoption of AI technologies. Furthermore, revolving around our new paradigm, we also introduce a new data system, a new architecture, and a new benchmark, which together form a general vision ecosystem that supports its future development in an open and inclusive manner.
Humans express feelings or emotions through different channels. Take language as an example: it entails different sentiments under different visual-acoustic contexts. To precisely understand human intentions and reduce misunderstandings caused by ambiguity and sarcasm, we should consider multimodal signals, including textual, visual, and acoustic ones. The crucial challenge is how to fuse the different feature modalities for sentiment analysis. To effectively fuse the information carried by the different modalities and better predict sentiment, we design a novel multi-attention-based fusion network, inspired by the observation that the interactions between any two pairwise modalities are different and contribute differently to the final sentiment prediction. By assigning the acoustic-visual, acoustic-textual, and visual-textual features reasonable attention weights and exploiting residual structures, we attend to the important features. We conduct extensive experiments on four public multimodal datasets, including one in Chinese and three in English. The results show that our approach outperforms existing methods and can explain the contributions of bimodal interactions among multiple modalities.
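The pairwise fusion idea can be sketched as cross-attention over each modality pair with residual connections. The block below is a minimal illustration under my own assumptions about dimensions and pooling, not the authors' exact network.

```python
# Sketch of pairwise cross-modal attention fusion for three modalities
# (acoustic a, visual v, textual t) with residual connections.
import torch
import torch.nn as nn

class PairwiseCrossAttentionFusion(nn.Module):
    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        # One cross-attention block per modality pair: (a,v), (a,t), (v,t).
        self.pair_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(3))
        self.classifier = nn.Linear(3 * dim, 1)  # sentiment score

    def forward(self, a, v, t):  # each: (B, T, dim)
        fused = []
        for attn, (q, kv) in zip(self.pair_attn, [(a, v), (a, t), (v, t)]):
            out, _ = attn(q, kv, kv)             # query one modality with the other
            fused.append((out + q).mean(dim=1))  # residual + temporal pooling
        return self.classifier(torch.cat(fused, dim=-1))
```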
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
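One way to read "encoding the 3D points into multi-modal features" is as a coordinates-based position encoding shared by both modalities. The sketch below illustrates that general idea; the MLP design and dimensions are assumptions, not CMT's exact recipe.

```python
# Sketch: map 3D reference coordinates to position embeddings added to tokens,
# so image and LiDAR tokens are aligned in a shared 3D space (illustrative only).
import torch.nn as nn

class Coords3DPositionEncoding(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, tokens, coords3d):
        # tokens: (B, N, C) image or point-cloud tokens
        # coords3d: (B, N, 3) 3D points associated with each token
        return tokens + self.mlp(coords3d)
```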
In recent years, arbitrary image style transfer has attracted more and more attention. Given a pair of content and style images, the goal is a stylized image that retains the content of the former while capturing the style patterns of the latter. However, it is difficult to balance the trade-off between content details and style features. When stylizing an image with sufficient style patterns, the content details may be damaged, and sometimes objects in the image cannot be clearly distinguished. For this reason, we present STT, a new transformer-based method for image style transfer, together with an edge loss that noticeably enhances content details and avoids the blurred results caused by excessive rendering of style features. Qualitative and quantitative experiments demonstrate that STT achieves comparable performance to state-of-the-art image style transfer methods while alleviating the content leak problem.
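An edge loss of this kind can be illustrated with Sobel gradients: compare the edge maps of the content image and the stylized output so that structure is preserved. The operator choice below is an assumption; the paper's exact formulation may differ.

```python
# Sketch of an edge loss: L1 distance between Sobel edge maps of the content
# image and the stylized output (grayscale inputs assumed for simplicity).
import torch
import torch.nn.functional as F

def sobel_edges(img):  # img: (B, 1, H, W) grayscale
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)  # Sobel y kernel is the transpose of the x kernel
    gx = F.conv2d(img, kx.to(img.device), padding=1)
    gy = F.conv2d(img, ky.to(img.device), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)  # gradient magnitude

def edge_loss(stylized_gray, content_gray):
    # Penalize edge-map differences to keep content structure sharp.
    return F.l1_loss(sobel_edges(stylized_gray), sobel_edges(content_gray))
```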
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving a holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting informative and discriminative information from the videos, and illustrate that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
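Temporal patch shuffling admits a simple sketch: randomly permute the frame order of patch tokens so the model cannot over-rely on a fixed temporal order. The per-position permutation below is my assumption about the variant used.

```python
# Sketch of temporal patch shuffling as a data-augmentation step
# (the exact shuffling granularity is an assumption).
import torch

def temporal_patch_shuffle(tokens, shuffle_prob=0.5):
    """tokens: (B, T, N, C) = batch, frames, patches per frame, channels."""
    if torch.rand(()) > shuffle_prob:
        return tokens  # apply the augmentation only part of the time
    b, t, n, c = tokens.shape
    # Draw an independent temporal permutation for each spatial patch position.
    perm = torch.argsort(torch.rand(b, t, n, device=tokens.device), dim=1)
    return torch.gather(tokens, 1, perm.unsqueeze(-1).expand(b, t, n, c))
```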
Machine learning models are typically evaluated by computing similarity with reference annotations and trained by maximizing similarity with them. Especially in the bio-medical domain, annotations are subjective and suffer from low inter- and intra-rater reliability. Since annotations only reflect the annotating entity's interpretation of the real world, this can lead to sub-optimal predictions even though the model achieves high similarity scores. Here, the theoretical concept of Peak Ground Truth (PGT) is introduced. PGT marks the point beyond which an increase in similarity with the reference annotation stops translating to better Real World Model Performance (RWMP). Additionally, a quantitative technique to approximate PGT by computing inter- and intra-rater reliability is proposed. Finally, three categories of PGT-aware strategies to evaluate and improve model performance are reviewed.
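One plausible way to approximate the reliability ceiling is mean pairwise overlap between annotators; for segmentation, Dice is a natural similarity choice, though the paper's exact procedure may differ.

```python
# Sketch: approximate inter-rater reliability as the mean pairwise Dice overlap
# between annotators' binary masks (metric choice is an assumption).
import itertools
import numpy as np

def dice(a, b, eps=1e-8):
    a, b = a.astype(bool), b.astype(bool)
    return (2.0 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

def inter_rater_reliability(masks):
    """masks: list (>= 2) of binary arrays, one per rater, for the same image."""
    return float(np.mean([dice(a, b)
                          for a, b in itertools.combinations(masks, 2)]))

# Reading: once model-vs-reference similarity exceeds this reliability level,
# further gains may stop translating to better real-world performance (PGT).
```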