Knowledge graph embedding (KGE), which maps entities and relations in a knowledge graph into continuous vector spaces, has achieved great success in predicting missing links in knowledge graphs. However, knowledge graphs often contain incomplete triples that are difficult for KGE models to infer inductively. To address this challenge, we resort to analogical inference and propose AnKGE, a novel and general self-supervised framework that enhances KGE models with analogical inference capability. We propose an analogical object retriever that retrieves appropriate analogical objects at the entity, relation, and triple levels. In AnKGE, we train an analogy function for each level of analogical inference; it takes the original element embedding from a well-trained KGE model as input and outputs the analogical object embedding. To combine the inductive inference capability of the original KGE model with the analogical inference capability added by AnKGE, we interpolate the analogy score with the base model score and introduce adaptive weights in the score function used for prediction. Through extensive experiments on the FB15k-237 and WN18RR datasets, we show that AnKGE achieves competitive results on the link prediction task and performs analogical inference well.
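The score interpolation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the formula, so the additive form, the TransE-style base score, and the names `transe_score` and `interpolated_score` are all hypothetical, and the weights are passed in as plain numbers rather than computed adaptively.

```python
import math


def transe_score(head, relation, tail):
    """Assumed base model: TransE-style score, the negative L2 distance
    between (head + relation) and tail. Higher means more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def interpolated_score(base_score, analogy_scores, weights):
    """Add weighted analogy scores (one per level: entity, relation, triple)
    to the base model score. The additive form is an assumption."""
    if len(analogy_scores) != len(weights):
        raise ValueError("need exactly one weight per analogy score")
    return base_score + sum(w * s for w, s in zip(weights, analogy_scores))
```

Setting a weight to zero switches the corresponding analogy level off, so with all weights at zero the prediction falls back to the base model score.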
As an important variant of entity alignment (EA), multi-modal entity alignment (MMEA) aims to discover identical entities across different knowledge graphs (KGs) with multiple modalities such as images. However, current MMEA algorithms all adopt KG-level modality fusion strategies and ignore modality differences among individual entities, which hurts robustness to potential noise in the modalities (e.g., unidentifiable images and relations). In this paper, we present MEAformer, a multi-modal entity alignment transformer approach for meta modality hybrid, which dynamically predicts the mutual correlation coefficients among modalities for instance-level feature fusion. A modal-aware hard entity replay strategy is also proposed to address vague entity details. Extensive experimental results show that our model not only achieves SOTA performance in multiple training scenarios, including supervised, unsupervised, iterative, and low-resource settings, but also has a limited number of parameters, favorable speed, and good interpretability. Our code will be available soon.
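Multimodal aspect-based sentiment classification (MABSC) is an emerging classification task that aims to classify the sentiment of a given target, such as a mentioned entity, in data with different modalities. In typical multimodal data with text and images, previous approaches do not make full use of the fine-grained semantics of the image, especially in conjunction with the semantics of the text, and do not fully consider modeling the relationship between fine-grained image information and the target. This leads to insufficient use of the image and an inadequate ability to identify fine-grained aspects and opinions. To tackle these limitations, we propose a new framework, SeqCSG, which includes a method for constructing sequential cross-modal semantic graphs and an encoder-decoder model. Specifically, we extract fine-grained information from the original image, the image caption, and the scene graph, and treat it, together with the tokens of the text, as the elements of the cross-modal semantic graph. The cross-modal semantic graph is represented as a sequence with a multi-modal visible matrix that indicates the relationships between elements. To use the cross-modal semantic graph effectively, we propose an encoder-decoder method with a target prompt template. Experimental results show that our approach outperforms existing methods and achieves state-of-the-art results on two standard MABSC datasets. Further analysis demonstrates the effectiveness of each component and shows that our model can implicitly learn the correlation between the target and the fine-grained information of the image.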
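Visual question answering (VQA) often requires an understanding of visual concepts and language semantics, which relies on external knowledge. Most existing methods exploit pre-trained language models and/or unstructured text, but the knowledge in these resources is often incomplete and noisy. Some methods prefer to use knowledge graphs (KGs), which often hold dense structured knowledge, but this line of research is still quite preliminary. In this paper, we propose LaKo, a knowledge-driven VQA method based on late knowledge-to-text injection. To incorporate an external KG effectively, we convert triples into text and propose a late injection mechanism. Finally, we treat VQA as a text generation task with an effective encoder-decoder paradigm. In the evaluation on the OKVQA dataset, our method achieves state-of-the-art results.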
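As a rough illustration of the triple-to-text step only, the sketch below verbalizes KG triples and concatenates them with a question. The sentence template, the `question:`/`knowledge:` prompt layout, and both function names are assumptions made for this example; the abstract does not specify them, and the late injection mechanism itself is not modeled here.

```python
def verbalize_triple(head, relation, tail):
    """Turn a (head, relation, tail) triple into a short plain-text sentence.
    Underscores in the relation name are treated as word separators (assumed)."""
    relation_text = relation.replace("_", " ").strip()
    return f"{head} {relation_text} {tail}."


def build_generator_input(question, triples):
    """Concatenate a question with verbalized triples into one input string
    for a text-generation model. The layout is hypothetical."""
    knowledge = " ".join(verbalize_triple(*triple) for triple in triples)
    return f"question: {question} knowledge: {knowledge}"
```

Once the triples are plain text, the knowledge can be fed to the same encoder-decoder model that reads the question, with no graph-specific encoder.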
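Zero-shot learning (ZSL) aims to predict unseen classes whose samples never appear during training, often using additional semantic information (a.k.a. side information) to bridge the training (seen) classes and the unseen classes. One of the most effective and widely used kinds of semantic information for zero-shot image classification is attributes, which are annotations of class-level visual characteristics. However, because of the shortage of fine-grained annotations and because of attribute imbalance and co-occurrence, current methods often fail to discriminate subtle visual distinctions between images, which limits their performance. In this paper, we present a transformer-based end-to-end ZSL method named DUET, which integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) develop a cross-modal semantic grounding network to investigate the model's capability to disentangle semantic attributes from images, (2) apply an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics against attribute co-occurrence and imbalance, and (3) propose a multi-task learning policy for considering multi-model objectives. Through extensive experiments on three standard ZSL benchmarks and a ZSL benchmark equipped with a knowledge graph, we find that DUET can often achieve state-of-the-art performance, that its components are effective, and that its predictions are interpretable.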
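Knowledge graphs (KGs) and their variant, ontologies, have been widely used for knowledge representation and have been shown to be quite effective in augmenting zero-shot learning (ZSL). However, existing ZSL methods that use KGs all neglect the intrinsic complexity of the inter-class relationships represented in KGs. One typical feature is that a class is often related to other classes in different semantic aspects. In this paper, we focus on ontologies for augmenting ZSL and propose to learn disentangled ontology embeddings, guided by ontology properties, to capture and exploit more fine-grained class relationships in different aspects. We also contribute a new ZSL framework named DOZSL, which contains two new ZSL solutions, based on generative models and graph propagation models respectively, that make effective use of the disentangled ontology embeddings. Extensive evaluations have been conducted on five benchmarks across zero-shot image classification (ZS-IMGC) and zero-shot KG completion (ZS-KGC). DOZSL often performs better than the state of the art, and its components have been verified by ablation studies and case studies. Our code and datasets are available at https://github.com/zjukg/dozsl.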
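We provide a practical learning-based image dehazing network trained on an unpaired set of clear and hazy images. This paper offers a new perspective that treats image dehazing as a two-class factor disentanglement task, i.e., separating the task-relevant factors of clear image reconstruction from the task-irrelevant factors of the haze-related distribution. To disentangle these two classes of factors in the deep feature space, contrastive learning is introduced into a CycleGAN framework to learn disentangled representations by guiding the generated images to be associated with the latent factors. With this formulation, the proposed contrastive disentangled dehazing method (CDD-GAN) employs negative generators that cooperate with the encoder network and are updated alternately with it, so as to produce a queue of challenging negative adversaries. These negative adversaries are then trained end-to-end together with the backbone representation network to enhance the discriminative information and improve factor disentanglement by maximizing the adversarial contrastive loss. We further show that, during training, hard negative examples can suppress the task-irrelevant factors and unpaired clear examples can enhance the task-relevant factors, which better facilitates haze removal and helps image restoration. Extensive experiments on both synthetic and real-world datasets demonstrate that our method performs favorably against existing unpaired dehazing baselines.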
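To make the role of hard negatives concrete, the sketch below uses a standard InfoNCE-style contrastive loss as a stand-in. It is not the adversarial contrastive loss of CDD-GAN, whose exact form the abstract does not give; the cosine similarity, the temperature value, and the function names are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: small when the anchor is close to the positive and far
    from every negative; large when some negative is close to the anchor."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

A negative that lies close to the anchor (a hard negative) raises the loss far more than a distant one, so it provides a stronger training signal for separating the two kinds of factors.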
With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis.
- To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks.
- To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences.
- To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.
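The statement in (b) that attention approximates the conditional mean of the value given the key can be illustrated with a minimal single-query softmax attention. The output is a kernel-weighted average of the values, in the spirit of a Nadaraya-Watson estimator. This is a sketch for intuition only, not the construction analyzed in the paper; the function name and the unscaled dot-product scores are assumptions.

```python
import math


def softmax_attention(query, keys, values):
    """Single-query softmax attention: a weighted average of the values,
    with weights proportional to exp(query . key)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
```

When all keys are identical, the weights are uniform and the output is the plain mean of the values; when the query matches one key much more strongly than the others, the output approaches that key's value.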
One of the key challenges in deploying RL to real-world applications is adapting to variations in unknown environment contexts, such as changing terrains in robotic tasks and fluctuating bandwidth in congestion control. Existing works on adaptation to unknown environment contexts either assume the context stays the same for the whole episode or assume the context variables are Markovian. However, in many real-world applications, the environment context usually stays stable for a stochastic period and then changes in an abrupt and unpredictable manner within an episode, resulting in a segment structure that existing works fail to address. To leverage the segment structure of piecewise-stable context in real-world applications, in this paper we propose a Segmented Context Belief Augmented Deep (SeCBAD) RL method. Our method jointly infers the belief distribution over the latent context and the posterior over segment length, and it performs more accurate belief context inference using the data observed within the current context segment. The inferred belief context can be used to augment the state, leading to a policy that can adapt to abrupt variations in context. We demonstrate empirically that SeCBAD can infer context segment length accurately and outperform existing methods on a toy grid world environment and MuJoCo tasks with piecewise-stable context.
The ability to create realistic, animatable and relightable head avatars from casual video sequences would open up wide-ranging applications in communication and entertainment. Current methods either build on explicit 3D morphable meshes (3DMM) or exploit neural implicit representations. The former are limited by fixed topology, while the latter are non-trivial to deform and inefficient to render. Furthermore, existing approaches entangle lighting in the color estimation, so they are limited in re-rendering the avatar in new environments. In contrast, we propose PointAvatar, a deformable point-based representation that disentangles the source color into intrinsic albedo and normal-dependent shading. We demonstrate that PointAvatar bridges the gap between existing mesh-based and implicit representations, combining high-quality geometry and appearance with topological flexibility, ease of deformation and rendering efficiency. We show that our method is able to generate animatable 3D avatars using monocular videos from multiple sources including hand-held smartphones, laptop webcams and internet videos, achieving state-of-the-art quality in challenging cases where previous methods fail, e.g., thin hair strands, while being significantly more efficient in training than competing methods.
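The color decomposition described above, color = albedo × shading(normal), can be sketched for a single point as follows. The abstract does not state how the shading term is computed, so the Lambertian shading with one directional light used here is a stand-in assumption, as are the function names.

```python
import math


def normalize(v):
    """Scale a non-zero 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def shade_point(albedo, normal, light_dir):
    """Color of one point as intrinsic albedo times a normal-dependent shading
    term. Lambertian shading (clamped cosine) is an assumed stand-in."""
    n = normalize(normal)
    light = normalize(light_dir)
    shading = max(0.0, sum(a * b for a, b in zip(n, light)))
    return tuple(channel * shading for channel in albedo)
```

Because the albedo is stored separately from the lighting, the same point can be re-rendered under a new light by changing only `light_dir`, which is what makes the representation relightable.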