Knowledge graph completion has recently been widely studied to complete missing elements in triples, mainly by modeling graph structural features, but such methods are sensitive to the sparsity of the graph structure. Relevant text such as entity names and descriptions, which serves as another expression form of knowledge graphs (KGs), is expected to address this challenge. Several methods using two encoders for structural and textual messages have been proposed, but they are limited by failing to balance the weights between the two, and keeping both the structural and textual encoders during inference also incurs heavy parameter costs. Motivated by knowledge distillation, we view knowledge as a mapping from inputs to output probabilities and propose a plug-and-play framework, VEM2L, over sparse KGs to fuse the knowledge extracted from textual and structural messages into a unified whole. Specifically, we partition the knowledge acquired by the models into two non-overlapping parts: one part relates to the ability to fit the training triples, which can be fused by encouraging the two encoders to learn from each other on the training set; the other reflects the ability to generalize to unobserved queries. Accordingly, we propose a new fusion strategy, justified by the variational EM algorithm, to fuse the generalization ability of the models, during which we also apply a graph densification operation to further alleviate the sparse-graph problem. Combining these two fusion methods, we finally arrive at the VEM2L framework. Both detailed theoretical evidence and quantitative and qualitative experiments demonstrate the effectiveness and efficiency of our proposed framework.
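As a rough illustration of the first fusion step, here is a minimal sketch of mutual learning between two encoders on the training set; the function names, shapes, and the temperature and weighting hyperparameters are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_struct, logits_text, labels, alpha=0.5, tau=2.0):
    # Each encoder fits the training triples (cross-entropy) and, in
    # addition, imitates the other's softened output distribution (KL),
    # which fuses their fitting ability on observed triples.
    ce = F.cross_entropy(logits_struct, labels) + F.cross_entropy(logits_text, labels)
    log_p_s = F.log_softmax(logits_struct / tau, dim=-1)
    log_p_t = F.log_softmax(logits_text / tau, dim=-1)
    # Each KL term treats the other encoder's distribution as a fixed target.
    kl = F.kl_div(log_p_s, log_p_t.exp().detach(), reduction="batchmean") \
       + F.kl_div(log_p_t, log_p_s.exp().detach(), reduction="batchmean")
    return ce + alpha * tau * tau * kl

# toy usage: each row scores 100 candidate entities for one query triple
logits_s = torch.randn(8, 100, requires_grad=True)
logits_t = torch.randn(8, 100, requires_grad=True)
labels = torch.randint(0, 100, (8,))
mutual_learning_loss(logits_s, logits_t, labels).backward()
```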
We propose a cross-modal attention distillation framework for training a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. The dual-encoder model has a faster inference speed than fusion-encoder models and enables pre-computation of images and text during inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. To learn deep interactions between images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying cross-modal attention distillation to both the pre-training and fine-tuning stages achieves further improvements. Experimental results show that the distilled dual-encoder model achieves competitive performance on visual reasoning, visual entailment, and visual question answering while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
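A minimal sketch of the distillation term described above, under the assumption that the student's cross-modal attention is derived from dot products of its independently encoded token features (shapes and the temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(img_feats, txt_feats, teacher_i2t, teacher_t2i, tau=1.0):
    # img_feats: (B, Ni, D) and txt_feats: (B, Nt, D) come from the two
    # student encoders; teacher_i2t (B, Ni, Nt) and teacher_t2i (B, Nt, Ni)
    # are attention probabilities taken from the fusion-encoder teacher.
    scores = img_feats @ txt_feats.transpose(1, 2) / img_feats.shape[-1] ** 0.5
    log_i2t = F.log_softmax(scores / tau, dim=-1)                  # image-to-text
    log_t2i = F.log_softmax(scores.transpose(1, 2) / tau, dim=-1)  # text-to-image
    return F.kl_div(log_i2t, teacher_i2t, reduction="batchmean") \
         + F.kl_div(log_t2i, teacher_t2i, reduction="batchmean")

# toy usage with 4 image patches and 6 text tokens
img, txt = torch.randn(2, 4, 64), torch.randn(2, 6, 64)
t_i2t = torch.softmax(torch.randn(2, 4, 6), dim=-1)
t_t2i = torch.softmax(torch.randn(2, 6, 4), dim=-1)
loss = attention_distillation_loss(img, txt, t_i2t, t_t2i)
```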
The limited size of existing query-focused summarization datasets makes training data-driven summarization models challenging, while manually constructing a query-focused summarization corpus is costly and time-consuming. In this paper, we use Wikipedia to automatically collect a large query-focused summarization dataset (named WikiRef) of more than 280,000 examples, which can serve as a means of data augmentation. We also develop a BERT-based query-focused summarization model (Q-BERT) that extracts sentences from documents as summaries. To better adapt a huge model containing millions of parameters, we identify and fine-tune only a sparse sub-network, which corresponds to a small fraction of the whole model's parameters. Experimental results on three DUC benchmarks show that the model pre-trained on WikiRef already achieves reasonable performance. After fine-tuning on the specific benchmark datasets, the model with data augmentation outperforms strong comparison systems. Moreover, both the proposed Q-BERT model and sub-network fine-tuning further improve model performance. The dataset is publicly available at https://aka.ms/wikiref.
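The sub-network fine-tuning idea can be sketched as gradient masking; the magnitude-based selection rule below is an assumption for illustration, and the paper's actual criterion for identifying the sparse sub-network may differ:

```python
import torch

def make_subnetwork_masks(model, keep_ratio=0.05):
    # Keep only the largest-magnitude weights in each tensor (illustrative
    # criterion); everything else stays frozen during fine-tuning.
    masks = {}
    for name, p in model.named_parameters():
        k = max(1, int(keep_ratio * p.numel()))
        thresh = p.detach().abs().flatten().kthvalue(p.numel() - k + 1).values
        masks[name] = (p.detach().abs() >= thresh).float()
    return masks

def mask_gradients(model, masks):
    # Called after loss.backward() and before optimizer.step(), so a
    # standard optimizer only updates the selected fraction of parameters.
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(masks[name])

# toy usage
model = torch.nn.Linear(100, 10)
masks = make_subnetwork_masks(model, keep_ratio=0.05)
model(torch.randn(4, 100)).sum().backward()
mask_gradients(model, masks)
```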
Early-exiting dynamic neural networks (EDNNs), as one type of dynamic neural network, have been widely studied recently. A typical EDNN has multiple prediction heads at different layers of the network backbone. During inference, the model exits at either the last prediction head or an intermediate prediction head where the prediction confidence is higher than a predefined threshold. To optimize the model, these prediction heads together with the network backbone are trained on every batch of training data. This introduces a train-test mismatch problem: all the prediction heads are optimized on all types of data during training, while the deeper heads only see difficult inputs at test time. Treating inputs differently at the two phases causes a mismatch between the training and testing data distributions. To mitigate this problem, we formulate an EDNN as an additive model inspired by gradient boosting, and propose multiple training techniques to optimize the model effectively. We name our method BoostNet. Our experiments show that it achieves state-of-the-art performance on the CIFAR100 and ImageNet datasets in both anytime and budgeted-batch prediction modes. Our code is released at https://github.com/SHI-Labs/Boosted-Dynamic-Networks.
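For context, early-exit inference in an EDNN can be sketched as below; this is the generic confidence-thresholded scheme the abstract describes, not the BoostNet training code (the stage/head structure and batch size of 1 are assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def early_exit_predict(stages, heads, x, threshold=0.9):
    # Run backbone stages in order; stop at the first head whose softmax
    # confidence clears the threshold, otherwise fall through to the last.
    feat = x
    for i, (stage, head) in enumerate(zip(stages, heads)):
        feat = stage(feat)
        probs = F.softmax(head(feat), dim=-1)
        conf, pred = probs.max(dim=-1)
        if i == len(heads) - 1 or conf.item() >= threshold:  # assumes batch size 1
            return pred, i  # the class prediction and the exit used

# toy usage: two stages, two heads, 10 classes
stages = [torch.nn.Linear(16, 16), torch.nn.Linear(16, 16)]
heads = [torch.nn.Linear(16, 10), torch.nn.Linear(16, 10)]
pred, exit_idx = early_exit_predict(stages, heads, torch.randn(1, 16))
```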
Recently, neural architecture search (NAS) has achieved great success on classification tasks for mobile devices. The backbone network for object detection is usually obtained from the image classification task; however, an architecture searched through the classification task is sub-optimal because of the gap between image classification and object detection. Meanwhile, work on backbone architecture search for mobile-device object detection is limited, mainly because the backbone always requires expensive ImageNet pre-training. Accordingly, it is necessary to study network architecture search for mobile-device object detection without expensive pre-training. In this work, we propose a mobile object-detection backbone search algorithm, an evolutionary optimization method based on non-dominated sorting for NAS scenarios. It can quickly find backbone architectures within given constraints, and it better addresses the sub-optimality of scoring candidates by a linear combination of accuracy and computational cost. The proposed approach can search backbone networks with different depths, widths, or expansion sizes via a weight-mapping technique, making it possible to apply NAS to mobile-device detection tasks far more efficiently. In our experiments, we verify the effectiveness of the proposed approach on YoloX-Lite, a lightweight version of the object detection framework. Under similar computational complexity, the accuracy of the backbone architecture we find is 2.0% mAP higher than MobileDet. Our improved backbone network reduces computational effort while improving the accuracy of the object detection network. To prove its effectiveness, a series of ablation studies has been carried out and the working mechanism analyzed in detail.
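Non-dominated sorting, the selection step named above, can be sketched in a few lines; the objectives shown (mAP and negated FLOPs, both to be maximized) are illustrative assumptions:

```python
def dominates(a, b):
    # a dominates b if it is no worse in every objective and strictly
    # better in at least one (objectives are tuples to maximize).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(population):
    # Peel off successive Pareto fronts (NSGA-II style); the first front
    # holds the architectures no other candidate dominates.
    fronts, remaining = [], list(population)
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q is not p)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts

# toy usage: each candidate architecture scored as (mAP, -MFLOPs)
candidates = [(0.30, -600), (0.28, -400), (0.33, -900), (0.28, -650)]
print(non_dominated_sort(candidates))
```

An evolutionary loop would then preferentially mutate and recombine members of the earlier fronts and repeat the sort.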
Skip connections are fundamental units in encoder-decoder networks, capable of improving feature propagation in neural networks. However, most methods with skip connections only connect features of the same resolution in the encoder and decoder, which ignores the information lost in the encoder as its layers grow deeper. To exploit the information carried by features in the shallower layers of the encoder, we propose a Full Skip Connection Network (FSCN) for the monocular depth estimation task. In addition, to fuse the features within skip connections more closely, we present an Adaptive Concatenation Module (ACM). Furthermore, we conduct extensive experiments with FSCN on outdoor and indoor datasets, i.e., the KITTI dataset and the NYU Depth dataset.
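A minimal sketch of the full-skip idea with an adaptive concatenation step follows; the module below is an illustrative stand-in for the paper's ACM, not its exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConcat(nn.Module):
    # Fuse features from *all* encoder stages at one decoder resolution:
    # resize each stage's features, concatenate, and project with a 1x1 conv.
    def __init__(self, in_channels_list, out_channels):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_channels_list), out_channels, kernel_size=1)

    def forward(self, encoder_feats, target_hw):
        resized = [F.interpolate(f, size=target_hw, mode="bilinear",
                                 align_corners=False) for f in encoder_feats]
        return self.fuse(torch.cat(resized, dim=1))

# toy usage: three encoder stages feeding one decoder level
feats = [torch.randn(1, c, s, s) for c, s in [(64, 64), (128, 32), (256, 16)]]
acm = AdaptiveConcat([64, 128, 256], out_channels=128)
out = acm(feats, target_hw=(64, 64))  # -> (1, 128, 64, 64)
```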
Dropout is commonly used to quantify prediction uncertainty, i.e., the variation in model predictions on a given input example. However, using dropout in practice can be expensive, since it requires running dropout inference many times. In this paper, we study how to estimate dropout prediction uncertainty in a resource-efficient manner. We demonstrate that neuron activation strength can be used to estimate dropout prediction uncertainty under different dropout settings and on multiple tasks, using three large datasets: MovieLens, Criteo, and EMNIST. Our approach provides an inference-time method that treats dropout prediction uncertainty estimation as a cheap auxiliary task. We also demonstrate that using activation features from only a subset of the neural network layers is sufficient to achieve uncertainty-estimation performance almost comparable to using the activation features of all layers, further reducing the resources needed for uncertainty estimation.
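The idea can be sketched as an auxiliary regression task: an expensive MC-dropout target is computed once, and a cheap head learns to predict it from activation-strength features of a single deterministic pass. The mean-absolute-activation statistic and the tiny head below are illustrative assumptions:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_uncertainty(model, x, n_samples=20):
    # Expensive reference: variance of predictions across stochastic
    # dropout passes (model.train() keeps dropout active).
    model.train()
    preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.var(dim=0).mean(dim=-1)  # one scalar per example

def activation_strength(hidden):
    # Cheap per-example feature from one deterministic pass; other
    # strength statistics (active-unit counts, higher moments) also fit.
    return hidden.abs().mean(dim=-1, keepdim=True)

# toy setup: a dropout MLP plus a tiny auxiliary uncertainty head
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 4))
aux_head = nn.Linear(1, 1)  # maps activation strength -> predicted uncertainty
x = torch.randn(16, 8)
target = mc_dropout_uncertainty(model, x)
model.eval()
feats = activation_strength(model[:2](x))  # strength of the first hidden layer
loss = nn.functional.mse_loss(aux_head(feats).squeeze(-1), target)
```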
Benefiting from its ability to exploit intrinsic supervision information, contrastive learning has recently achieved promising performance in deep graph clustering. However, we observe that two drawbacks of the positive- and negative-sample construction mechanisms limit the performance of existing algorithms. 1) The quality of positive samples heavily depends on carefully designed data augmentations, and inappropriate augmentations easily lead to semantic drift and indiscriminative positive samples. 2) The constructed negative samples are not reliable, since they ignore important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) that mines the intrinsic supervision information in high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct positive samples from the same high-confidence cluster across the two views. Moreover, to construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed pairs. Lastly, we design an objective function that pulls samples from the same cluster together while pushing those from other clusters apart, by maximizing and minimizing the cross-view cosine similarity of positive and negative samples, respectively. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with existing state-of-the-art algorithms.
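A minimal sketch of the objective described above; tensor shapes and the loss weighting are illustrative, and the details of the actual CCGC objective may differ:

```python
import torch
import torch.nn.functional as F

def cluster_guided_contrastive_loss(z1, z2, centers, cluster_ids):
    # z1, z2: (N, D) embeddings of the two views of high-confidence nodes;
    # centers: (K, D) cluster centers; cluster_ids: (N,) assignments.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    centers = F.normalize(centers, dim=-1)
    pos = (z1 * z2).sum(dim=-1)              # cross-view cosine, maximized
    sim = z1 @ centers.t()                   # cosine to every cluster center
    own = F.one_hot(cluster_ids, centers.size(0)).bool()
    # average similarity to *other* clusters' centers, minimized
    neg = sim.masked_fill(own, 0.0).sum(-1) / (centers.size(0) - 1)
    return (neg - pos).mean()

# toy usage: 32 nodes, 4 clusters, 16-dim embeddings
z1, z2 = torch.randn(32, 16), torch.randn(32, 16)
centers = torch.randn(4, 16)
ids = torch.randint(0, 4, (32,))
loss = cluster_guided_contrastive_loss(z1, z2, centers, ids)
```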
To generate high-quality rendered images for real-time applications, it is common to trace only a few samples per pixel (spp) at a lower resolution and then supersample to the high resolution. Based on the observation that pixels rendered at a low resolution are typically highly aliased, we present a novel method for neural supersampling based on ray tracing 1/4-spp samples at the high resolution. Our key insight is that the ray-traced samples at the target resolution are accurate and reliable, which turns supersampling into an interpolation problem. We present a mask-reinforced neural network to reconstruct and interpolate high-quality image sequences. First, a novel temporal accumulation network is introduced to compute the correlation between current and previous features, significantly improving their temporal stability. Then a reconstruction network based on a multi-scale U-Net with skip connections is adopted to reconstruct and generate the desired high-resolution image. Experimental results and comparisons show that our method generates higher-quality supersampling results than current state-of-the-art methods without increasing the total number of ray-traced samples.
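One way to picture the interpolation framing is the sample placement below: the 1/4-spp samples are scattered into the high-resolution frame together with a validity mask, and the network only has to fill the gaps. The regular 2x2 layout is an assumption for illustration, not the paper's sampling pattern:

```python
import torch

def scatter_quarter_spp(samples, h, w):
    # samples: (B, C, H/2, W/2), one ray-traced sample per 2x2 pixel block.
    # Returns the sparse high-resolution frame and a mask marking which
    # pixels hold accurate ray-traced values.
    b, c = samples.shape[:2]
    frame = torch.zeros(b, c, h, w)
    mask = torch.zeros(b, 1, h, w)
    frame[:, :, ::2, ::2] = samples  # assume each sample sits at the block's top-left pixel
    mask[:, :, ::2, ::2] = 1.0
    return frame, mask

# toy usage: 1/4-spp color samples for a 128x128 target frame
frame, mask = scatter_quarter_spp(torch.rand(1, 3, 64, 64), 128, 128)
```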
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query. All existing works first use a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods overlook two indispensable issues: 1) boundary bias: the annotated target segment generally refers to two specific frames as the start and end timestamps, and the video downsampling process may lose these two frames and take adjacent irrelevant frames as the new boundaries; 2) reasoning bias: such incorrect new boundary frames also bias the frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a Siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationships among these frames and to generate soft labels on the boundaries for more accurate frame-query reasoning. This mechanism can also supplement the sampled sparse frames with the absent consecutive visual semantics for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
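One plausible reading of the soft boundary labels (an illustrative sketch, not the authors' exact formulation) is a Gaussian spread of the start/end targets over neighboring sampled frames:

```python
import torch

def boundary_soft_labels(num_frames, start_idx, end_idx, sigma=1.0):
    # Instead of one-hot start/end targets, spread probability mass over
    # nearby sampled frames so near-boundary frames still get supervision.
    t = torch.arange(num_frames, dtype=torch.float32)
    start = torch.exp(-(t - start_idx) ** 2 / (2 * sigma ** 2))
    end = torch.exp(-(t - end_idx) ** 2 / (2 * sigma ** 2))
    return start / start.sum(), end / end.sum()

# toy usage over 32 sampled frames with annotated boundaries at 7 and 21
start_label, end_label = boundary_soft_labels(32, start_idx=7, end_idx=21)
```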