The transformer, originally devised for natural language processing, has also achieved significant success in computer vision. Thanks to its strong expressive power, researchers are investigating ways to deploy transformers in reinforcement learning (RL), and transformer-based models have demonstrated their potential on representative RL benchmarks. In this paper, we collect and dissect recent advances in transforming RL with transformers (transformer-based RL, or TRL), in order to explore its development trajectory and future trends. We group existing developments into two categories, architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation, and autonomous driving. Architecture-enhancement methods consider how to apply the powerful transformer structure to RL problems within the traditional RL framework; they model agents and environments much more precisely than earlier deep RL methods, but remain limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". Trajectory-optimization methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework; they are able to extract policies from static datasets and fully exploit the long-sequence modeling capability of the transformer. Given these advancements, we review extensions and challenges in TRL and discuss proposals for future directions. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
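To make the trajectory-optimization view concrete, the following is a minimal sketch of treating RL as sequence modeling over (return-to-go, state, action) tokens, in the spirit of Decision Transformer-style methods; the module structure and all hyperparameters are illustrative assumptions, not a specific method from the surveyed papers.

```python
import torch
import torch.nn as nn

class TrajectorySequenceModel(nn.Module):
    """Illustrative sequence model over (return-to-go, state, action) tokens."""
    def __init__(self, state_dim, act_dim, d_model=128, n_layers=3, n_heads=4):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)            # scalar return-to-go
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions):
        # Interleave tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...).
        B, T, _ = states.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states),
             self.embed_action(actions)], dim=2).reshape(B, 3 * T, -1)
        causal = nn.Transformer.generate_square_subsequent_mask(3 * T).to(tokens.device)
        h = self.encoder(tokens, mask=causal)
        # Read action predictions off the state-token positions.
        return self.predict_action(h[:, 1::3])
```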
Fine-grained capture of 3D human-object interaction (HOI) boosts human activity understanding and facilitates downstream visual tasks, including action recognition, holistic scene reconstruction, and human motion synthesis. Despite its significance, existing works mostly assume that humans interact with rigid objects using only a few body parts, limiting their scope. In this paper, we address the challenging problem of full-body articulated human-object interaction (f-AHOI), wherein whole human bodies interact with articulated objects whose parts are connected by movable joints. We present CHAIRS, a large-scale motion-captured f-AHOI dataset consisting of 16.2 hours of versatile interactions between 46 participants and 81 articulated and rigid sittable objects. CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process, as well as realistic and physically plausible full-body interactions. We show the value of CHAIRS with object pose estimation. By learning the geometrical relationships in HOI, we devise the first model that leverages human pose estimation to tackle the estimation of articulated object poses and shapes during whole-body interactions. Given an image and an estimated human pose, our model first reconstructs the pose and shape of the object, then optimizes the reconstruction according to a learned interaction prior. Under both evaluation settings (i.e., with or without knowledge of objects' geometries/structures), our model significantly outperforms the baselines. We hope CHAIRS will push the community towards finer-grained interaction understanding. We will make the data/code publicly available.
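As a rough illustration of the second stage described above (refining an object reconstruction under a learned interaction prior), here is a hedged sketch of generic gradient-based refinement; `interaction_prior` is a hypothetical callable returning a scalar plausibility energy, not the paper's actual model.

```python
import torch

def refine_object_pose(init_pose, human_pose, interaction_prior,
                       steps=100, lr=1e-2):
    """Gradient-based refinement of an object pose estimate: descend the
    energy of a learned prior scoring how plausible the human-object
    configuration is (lower = more plausible). Names are illustrative."""
    obj_pose = init_pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([obj_pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy = interaction_prior(human_pose, obj_pose)
        energy.backward()
        opt.step()
    return obj_pose.detach()
```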
Reasoning, an essential ability for complex problem-solving, can provide back-end support for various real-world applications, such as medical diagnosis and negotiation. This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting. We introduce the research works with comparisons and summaries and provide systematic resources to help beginners. We also discuss the potential reasons for the emergence of such reasoning abilities and highlight future research directions.
This technical report briefly describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard. SuperGLUE is more challenging than the widely used General Language Understanding Evaluation (GLUE) benchmark, containing eight difficult language understanding tasks that span question answering, natural language inference, word sense disambiguation, coreference resolution, and reasoning. [Method] Instead of arbitrarily increasing the size of a pretrained language model (PLM), our aim is to 1) fully extract knowledge from the input pretraining data given a certain parameter budget, e.g., 6B, and 2) effectively transfer this knowledge to downstream tasks. To achieve goal 1), we propose self-evolution learning for PLMs to wisely predict the informative tokens that should be masked, and supervise the masked language modeling (MLM) process with rectified smooth labels. For goal 2), we leverage the prompt-transfer technique to improve low-resource tasks by transferring knowledge from the foundation model and related downstream tasks to the target task. [Results] According to our submission record (Oct. 2022), with our optimized pretraining and fine-tuning strategies, our 6B Vega method achieved new state-of-the-art performance on 4/8 tasks, sitting atop the SuperGLUE leaderboard on Oct. 8, 2022, with an average score of 91.3.
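The following is a minimal sketch of the two ingredients named for goal 1): selecting informative tokens to mask (approximated here by the model's own per-token loss) and supervising MLM with smoothed rather than one-hot labels. Both functions are illustrative assumptions, not the authors' released implementation.

```python
import torch.nn.functional as F

def select_informative_positions(logits, labels, mask_budget):
    """Rank positions by the model's own per-token loss; the hardest tokens
    are treated as the most informative candidates for masking."""
    per_token_loss = F.cross_entropy(
        logits.transpose(1, 2), labels, reduction="none")    # (batch, seq)
    return per_token_loss.topk(mask_budget, dim=-1).indices

def smoothed_mlm_loss(logits, labels, smoothing=0.1):
    """MLM loss against smoothed (rather than one-hot) target labels."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)     # probability mass spread over vocab
    return ((1 - smoothing) * nll + smoothing * uniform).mean()
```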
Although prediction models for delirium, a condition that commonly occurs during general hospitalization or post-surgery, have not gained wide adoption, evaluating their algorithmic bias is crucial given the established association between social determinants of health and delirium risk. In this context, using MIMIC-III and another academic hospital dataset, we present initial experimental evidence showing how sociodemographic features such as sex and race can impact model performance across subgroups. With this work, we intend to initiate a discussion about the intersectional effects of old age, race, and socioeconomic factors on the early-stage detection and prevention of delirium using ML.
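A minimal sketch of the kind of subgroup audit this paragraph describes, assuming a dataframe of model risk scores with ground-truth labels and demographic columns (all column names are illustrative):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auroc(df, label_col, score_col, group_cols):
    """AUROC per demographic subgroup, including intersections of group_cols."""
    rows = []
    for keys, grp in df.groupby(group_cols):
        if grp[label_col].nunique() < 2:   # AUROC undefined for one-class groups
            continue
        keys = keys if isinstance(keys, tuple) else (keys,)
        rows.append({**dict(zip(group_cols, keys)),
                     "n": len(grp),
                     "auroc": roc_auc_score(grp[label_col], grp[score_col])})
    return pd.DataFrame(rows)

# e.g., subgroup_auroc(preds, "delirium", "risk_score", ["sex", "race"])
```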
Given sufficient training data on the source domain, cross-domain few-shot learning (CD-FSL) aims to recognize new classes with a small number of labeled examples on the target domain. The key to addressing CD-FSL is to narrow the domain gap and transfer the knowledge of a network trained on the source domain to the target domain. To help knowledge transfer, this paper introduces an intermediate domain generated by mixing images from the source and target domains. Specifically, to generate the optimal intermediate domain for different target data, we propose a novel target-guided dynamic mixup (TGDM) framework that leverages the target data to guide the generation of mixed images via dynamic mixup. The proposed TGDM framework contains a Mixup-3T network for learning classifiers and a dynamic ratio generation network (DRGN) for learning the optimal mix ratio. To better transfer the knowledge, the Mixup-3T network contains three branches with shared parameters for classifying classes in the source, target, and intermediate domains. To generate the optimal intermediate domain, the DRGN learns to produce an optimal mix ratio according to the performance on auxiliary target data. The whole TGDM framework is then trained via bi-level meta-learning so that TGDM can rectify itself to achieve optimal performance on target data. Extensive experimental results on several benchmark datasets verify the effectiveness of our method.
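A minimal sketch of the dynamic mixup step described above, assuming image batches from both domains and a small network that predicts the mix ratio from target features; the structures are illustrative simplifications, and the paper's actual DRGN and bi-level training loop are not reproduced here.

```python
import torch
import torch.nn as nn

class DynamicRatioGenerator(nn.Module):
    """Illustrative DRGN stand-in: predicts a mix ratio in (0, 1) from target features."""
    def __init__(self, feat_dim):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, target_feats):
        return self.head(target_feats.mean(dim=0))  # one scalar ratio per batch

def dynamic_mixup(x_source, x_target, ratio):
    """Intermediate-domain images as a convex combination of the two domains."""
    return ratio * x_source + (1.0 - ratio) * x_target

# e.g., mixed = dynamic_mixup(x_src, x_tgt, drgn(target_feats))
```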
Synergistic drug combinations offer great potential for enhancing therapeutic efficacy and reducing adverse reactions. However, effective and synergistic drug combination prediction remains an open problem because the causal disease signaling pathways are unknown. Although various deep learning (AI) models have been proposed to quantitatively predict the synergism of drug combinations, a major limitation of existing deep learning methods is that they are inherently not interpretable, which makes the conclusions of AI models non-transparent to human experts and thus limits the robustness of model conclusions and the implementation of these models in real-world human healthcare. In this paper, we develop an interpretable graph neural network (GNN) that reveals the underlying essential therapeutic targets and mechanisms of synergy (MoS) by mining salient sub-molecular networks. The key component of the interpretable GNN prediction model is a novel graph pooling layer, self-attention-based node and edge pooling (henceforth SANEpool), which computes attention scores (importance) for nodes and edges based on node features and graph topology. The proposed GNN model therefore provides a systematic way to predict and interpret the synergism of drug combinations based on the detected crucial sub-molecular networks. We evaluate molecular networks formed by genes from 46 core cancer signaling pathways and drug combinations from the NCI ALMANAC drug combination screening data. Experimental results show that 1) SANEpool achieves state-of-the-art performance among popular graph neural networks, and 2) the sub-molecular networks detected by SANEpool are self-interpretable and can identify synergistic drug combinations.
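A hedged sketch of the core idea behind a self-attention-based node pooling layer: score nodes with a learned attention head and retain the top-scoring sub-network. This is a simplified stand-in for SANEpool; edge pooling and the full GNN are omitted.

```python
import torch
import torch.nn as nn

class AttentionNodePool(nn.Module):
    """Illustrative self-attention pooling: score nodes, keep the top-k subgraph."""
    def __init__(self, feat_dim, keep_ratio=0.5):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, x, adj):
        # x: (num_nodes, feat_dim); adj: (num_nodes, num_nodes)
        attn = torch.sigmoid(self.score(x)).squeeze(-1)   # node importance
        k = max(1, int(self.keep_ratio * x.size(0)))
        idx = attn.topk(k).indices                        # retained sub-network
        # Gate retained features by their attention scores, as in top-k pooling.
        return x[idx] * attn[idx].unsqueeze(-1), adj[idx][:, idx], idx
```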
As an important data mining technique, high-utility itemset mining (HUIM) is used to find interesting but hidden information (e.g., profit and risk). HUIM has been widely applied in many application scenarios, such as market analysis, medical detection, and web click-stream analysis. However, most previous HUIM approaches often ignore the relationships between items within an itemset; consequently, many irrelevant combinations (e.g., {gold, apple} and {notebook, book}) are discovered by HUIM. To address this limitation, many algorithms have been proposed to mine correlated high-utility itemsets (CoHUIs). In this paper, we propose a novel algorithm, called itemset utility maximization with correlation measure (CoIUM), which considers both strong correlation and the profitability of items. In addition, the algorithm adopts a database projection mechanism to reduce the cost of database scanning. Moreover, two upper bounds and four pruning strategies are utilized to effectively prune the search space, and a concise array-based structure named utility-bin is used to compute and store the adopted upper bounds in linear time and space. Finally, extensive experimental results on dense and sparse datasets demonstrate that CoIUM significantly outperforms state-of-the-art algorithms in terms of runtime and memory consumption.
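To make the mining target concrete, here is a brute-force sketch of what qualifies as a correlated high-utility itemset: total utility over a transaction database plus a correlation threshold. All-confidence is used here as one common correlation measure; the toy data, thresholds, and exhaustive enumeration are illustrative only and include none of CoIUM's projection or pruning.

```python
from itertools import combinations

# Transactions map item -> purchased quantity; profits give per-unit utility.
profits = {"gold": 50, "apple": 2, "notebook": 6, "book": 4}
transactions = [
    {"gold": 1, "apple": 3},
    {"notebook": 2, "book": 1, "apple": 1},
    {"gold": 2, "notebook": 1},
]

def utility(itemset, tx):
    """Utility of an itemset in one transaction (0 if not fully contained)."""
    if not all(i in tx for i in itemset):
        return 0
    return sum(profits[i] * tx[i] for i in itemset)

def total_utility(itemset):
    return sum(utility(itemset, tx) for tx in transactions)

def support(itemset):
    return sum(all(i in tx for i in itemset) for tx in transactions)

def all_confidence(itemset):
    """One common correlation measure: support(X) / max single-item support."""
    return support(itemset) / max(support((i,)) for i in itemset)

# A correlated high-utility itemset must clear both thresholds.
min_util, min_corr = 60, 0.5
items = list(profits)
cohuis = [s for r in range(2, len(items) + 1)
          for s in combinations(items, r)
          if total_utility(s) >= min_util and all_confidence(s) >= min_corr]
```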
Correspondence matching is a fundamental problem in computer vision and robotics applications. Solving correspondence matching with neural networks has recently been on the rise. Both rotation equivariance and scale equivariance are crucial in correspondence matching applications. Classical correspondence matching approaches are designed to withstand scaling and rotation transformations, but the features extracted with convolutional neural networks (CNNs) are only translation-equivariant to a certain extent. Recently, researchers have been working on improving the rotation equivariance of CNNs based on group theory. Sim(2) is the group of similarity transformations in the 2D plane. This paper presents a dedicated dataset for evaluating Sim(2)-equivariant correspondence algorithms. We compare the performance of 16 state-of-the-art (SOTA) correspondence matching methods. The experimental results demonstrate the importance of group-equivariant algorithms for correspondence matching under various Sim(2) transformation conditions. Since the sub-pixel accuracy achieved by CNN-based correspondence matching methods is unsatisfactory, this specific field requires more attention in future work. Our dataset is publicly available at mias.group/sim2e.
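For reference, a minimal sketch of evaluating correspondences under a known Sim(2) warp: build the similarity matrix, transform keypoints, and measure reprojection error. Function names are illustrative assumptions, not the dataset's official evaluation code.

```python
import numpy as np

def sim2_matrix(theta, scale, tx, ty):
    """Similarity transform in the plane: rotation, isotropic scale, translation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty],
                     [0.0,        0.0,       1.0]])

def transform_points(T, pts):
    """Apply a 3x3 Sim(2) matrix to an (N, 2) array of pixel coordinates."""
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    return (homo @ T.T)[:, :2]

def reprojection_error(T, src_pts, dst_pts):
    """Mean pixel error of predicted correspondences under a known Sim(2) warp."""
    return np.linalg.norm(transform_points(T, src_pts) - dst_pts, axis=1).mean()
```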
Visual (re)localization addresses the problem of estimating the 6-DoF (degrees of freedom) camera pose of a query image captured in a known scene, which is a key building block of many computer vision and robotics applications. Recent advances in structure-based localization build 2D-3D correspondences for camera pose optimization by memorizing the mapping from image pixels to scene coordinates with neural networks. However, such memorization requires training with a large number of images for each scene, which severely degrades efficiency. In contrast, a few images are usually sufficient to cover the main regions of a scene for a human operator to perform visual localization. In this paper, we propose a scene region classification approach that achieves fast and effective scene memorization from few-shot images. Our insight is to leverage a) pre-learned feature extractors, b) scene region classifiers, and c) meta-learning strategies to accelerate training while mitigating overfitting. We evaluate our method on both indoor and outdoor benchmarks. The experiments validate the effectiveness of our method in the few-shot setting, and the training time is significantly reduced to only a few minutes. Code available: \url{https://github.com/siyandong/src}
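A hedged sketch of the final step of a structure-based localization pipeline like the one described: once pixel-to-scene-coordinate predictions are available (from the scene region classifiers, abstracted away here), the 6-DoF pose is recovered with PnP + RANSAC.

```python
import cv2
import numpy as np

def localize(query_pixels, predicted_scene_coords, K):
    """Recover a 6-DoF camera pose from 2D-3D correspondences.
    query_pixels: (N, 2); predicted_scene_coords: (N, 3); K: 3x3 intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        predicted_scene_coords.astype(np.float32),
        query_pixels.astype(np.float32),
        K, distCoeffs=None, reprojectionError=8.0)
    if not ok:
        raise RuntimeError("pose estimation failed")
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix from axis-angle
    return R, tvec, inliers
```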