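Decision-making for urban autonomous driving is challenging due to the stochastic nature of interactive traffic participants and the complexity of road structures. Although reinforcement learning (RL)-based decision-making schemes are promising for handling urban driving scenarios, they suffer from low sample efficiency and poor adaptability. In this paper, we propose Scene-Rep Transformer to improve RL decision-making capabilities through better scene representation encoding and sequential predictive latent distillation. Specifically, a multi-stage Transformer (MST) encoder is constructed to model not only the interaction awareness between the ego vehicle and its neighbors but also the intention awareness between the agents and their candidate routes. A sequential latent Transformer (SLT) with self-supervised learning objectives is employed to distill future predictive information into the latent scene representation, in order to reduce the exploration space and speed up training. The final decision-making module, based on soft actor-critic (SAC), takes the refined latent scene representation from the Scene-Rep Transformer as input and outputs driving actions. The framework is validated in five challenging simulated urban scenarios, and its performance is demonstrated quantitatively by substantial improvements in data efficiency and in success rate, safety, and efficiency. Qualitative results show that our framework is able to extract the intentions of neighboring agents to help make decisions and to deliver more diversified driving behaviors.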
Proper functioning of connected and automated vehicles (CAVs) is crucial for the safety and efficiency of future intelligent transport systems. Meanwhile, the transition to fully autonomous driving requires a long period of mixed autonomy traffic, in which CAVs and human-driven vehicles coexist. Collaborative decision-making for CAVs is therefore essential to generate appropriate driving behaviors that enhance the safety and efficiency of mixed autonomy traffic. In recent years, deep reinforcement learning (DRL) has been widely used to solve decision-making problems. However, existing DRL-based methods have mainly focused on the decision-making of a single CAV. Applied to mixed autonomy traffic, they cannot accurately represent the mutual effects of vehicles or model dynamic traffic environments. To address these shortcomings, this article proposes a graph reinforcement learning (GRL) approach for multi-agent decision-making of CAVs in mixed autonomy traffic. First, a generic and modular GRL framework is designed. Then, a systematic review of DRL and GRL methods is presented, focusing on the problems addressed in recent research. Moreover, a comparative study of different GRL methods is conducted on the designed framework to verify the effectiveness of GRL methods. Results show that the GRL methods improve the performance of multi-agent decision-making for CAVs in mixed autonomy traffic compared to the DRL methods. Finally, challenges and future research directions are summarized. This study can provide a valuable research reference for solving the multi-agent decision-making problems of CAVs in mixed autonomy traffic and can promote the implementation of GRL-based methods in intelligent transportation systems. The source code of our work can be found at https://github.com/Jacklinkk/Graph_CAVs.
With the development of deep representation learning, reinforcement learning (RL) has become a powerful learning framework capable of learning complex policies in high-dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real-world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, and inverse reinforcement learning, which are related but are not classical RL algorithms. The role of simulators in training agents and methods to validate, test, and robustify existing RL solutions are discussed.
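Autonomous driving in multi-agent dynamic traffic scenarios is challenging: the behaviors of road users are uncertain and hard to model explicitly, and the ego vehicle should apply complicated negotiation skills with them, such as yielding, merging, and taking turns, to achieve both safe and efficient driving in various settings. In these complex dynamic scenarios, traditional planning methods are mainly rule-based and often lead to reactive or even overly conservative behaviors. They therefore require tedious human effort to remain workable. Recently, deep learning-based methods have shown promising results, with better generalization capability and less hand-engineering effort. However, they are either implemented with supervised imitation learning (IL), which suffers from dataset bias and distribution mismatch, or trained with deep reinforcement learning (DRL) but focused on one specific traffic scenario. In this work, we propose DQ-GAT to achieve scalable and proactive autonomous driving, where graph attention-based networks are used to model interactions implicitly and deep Q-learning is employed to train the network end-to-end in an unsupervised manner. Extensive experiments in a high-fidelity driving simulator show that our method achieves higher success rates than previous learning-based methods and a traditional rule-based method, and better trades off safety and efficiency in both seen and unseen scenarios. Moreover, qualitative results on a trajectory dataset indicate that our learned policy can be transferred to the real world at real-time speeds. Demonstration videos are available at https://caipeide.github.io/dq-gat/.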
End-to-end autonomous driving provides a feasible way to automatically maximize overall driving system performance by directly mapping the raw pixels from a front-facing camera to control signals. Recent advanced methods construct a latent world model to map the high-dimensional observations into a compact latent space. However, the latent states embedded by the world models proposed in previous works may contain a large amount of task-irrelevant information, resulting in low sampling efficiency and poor robustness to input perturbations. Meanwhile, the training data distribution is usually unbalanced, and the learned policy struggles to cope with corner cases during driving. To solve these challenges, we present a semantic masked recurrent world model (SEM2), which introduces a latent filter to extract key task-relevant features and reconstruct a semantic mask from the filtered features. It is trained with a multi-source data sampler, which aggregates common data and multiple corner case data in a single batch to balance the data distribution. Extensive experiments on CARLA show that our method outperforms the state-of-the-art approaches in terms of sample efficiency and robustness to input perturbations.
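The multi-source sampling idea in the SEM2 abstract can be illustrated with a minimal sketch. The abstract does not specify the mixing ratio or the sampling rule, so the function name `sample_multi_source_batch`, the `corner_fraction` parameter, and the equal split across corner-case sources are all assumptions made for illustration, not the authors' implementation.

```python
import random


def sample_multi_source_batch(common, corner_sources, batch_size,
                              corner_fraction=0.5, rng=None):
    """Build one batch that mixes common data with every corner-case source.

    `corner_fraction` and the equal per-source split are hypothetical choices;
    every source is assumed to be a non-empty list.
    """
    rng = rng or random.Random()
    n_corner_total = int(batch_size * corner_fraction)
    per_source = n_corner_total // max(len(corner_sources), 1)

    batch = []
    for source in corner_sources:
        # Sample with replacement so that small corner-case sets still fill their share.
        batch.extend(rng.choices(source, k=per_source))

    # Fill the remainder of the batch from the common data.
    batch.extend(rng.choices(common, k=batch_size - len(batch)))
    rng.shuffle(batch)
    return batch
```

Sampling with replacement lets a rare corner-case set appear in every batch even when it holds only a handful of episodes, which is one simple way to counter an unbalanced data distribution.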
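Efficient and effective exploration in continuous space is a central problem in applying reinforcement learning (RL) to autonomous driving. Skills learned from expert demonstrations or designed for specific tasks can benefit exploration, but they are usually costly to collect, unbalanced or sub-optimal, or fail to transfer to diverse tasks. Human drivers, however, can adapt to varied driving tasks without demonstrations by exploring efficiently and structurally in the entire skill space rather than in a limited space of task-specific skills. Inspired by this fact, we propose an RL algorithm that explores all feasible motion skills instead of a limited set of task-specific and object-centric skills. Without demonstrations, our method can still perform well in diverse tasks. First, we build a task-agnostic and ego-centric (TAEC) motion skill library from a pure motion perspective, which is diverse enough to be reusable in different complex tasks. The motion skills are then encoded into a low-dimensional latent skill space, in which RL can explore efficiently. Validation in various challenging driving scenarios demonstrates that our proposed method, TAEC-RL, significantly outperforms its counterparts in learning efficiency and task performance.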
Reinforcement learning (RL) requires skillful problem definition and considerable computational effort to solve optimization and control problems, which could impair its prospects. Introducing human guidance into reinforcement learning is a promising way to improve learning performance. In this paper, a comprehensive human guidance-based reinforcement learning framework is established. A novel prioritized experience replay mechanism that adapts to human guidance in the reinforcement learning process is proposed to boost the efficiency and performance of the reinforcement learning algorithm. To relieve the heavy workload on human participants, a behavior model is established based on an incremental online learning method to mimic human actions. We design two challenging autonomous driving tasks for evaluating the proposed algorithm. Experiments are conducted to assess the training and testing performance and the learning mechanism of the proposed algorithm. Comparative results against state-of-the-art methods suggest the advantages of our algorithm in terms of learning efficiency, performance, and robustness.
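One way to picture a prioritized replay buffer that adapts to human guidance is sketched below. The abstract does not give the priority formula, so the additive `human_bonus`, the exponent `alpha`, and the class name `GuidedReplayBuffer` are assumptions for illustration only. The sketch uses standard proportional prioritization on the absolute TD error and simply raises the priority of human-guided transitions.

```python
import random
from collections import deque


class GuidedReplayBuffer:
    """Replay buffer whose sampling priority is raised for human-guided transitions."""

    def __init__(self, capacity, alpha=0.6, human_bonus=1.0, eps=1e-3):
        self.transitions = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.alpha = alpha              # how strongly priorities skew sampling
        self.human_bonus = human_bonus  # hypothetical extra priority for guided steps
        self.eps = eps                  # keeps every priority strictly positive

    def add(self, transition, td_error, human_guided):
        bonus = self.human_bonus if human_guided else 0.0
        priority = (abs(td_error) + bonus + self.eps) ** self.alpha
        # Both deques share the same maxlen, so the oldest entries drop together.
        self.transitions.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size, rng=random):
        # Draw with replacement, proportionally to the stored priorities.
        return rng.choices(list(self.transitions),
                           weights=list(self.priorities), k=batch_size)
```

With equal TD errors, a human-guided transition is sampled more often than an autonomous one, so the learner revisits guided experience more frequently.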
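Autonomous driving has attracted significant research interest in the past two decades, as it offers many potential benefits, including relieving drivers of the burden of driving and mitigating traffic congestion. Despite promising progress, lane changing remains a great challenge for autonomous vehicles (AVs), especially in mixed and dynamic traffic scenarios. Recently, reinforcement learning (RL), a powerful data-driven control method, has been widely explored for lane-changing decision-making in AVs, with encouraging results. However, the majority of those studies focus on a single-vehicle setting, and lane changing in the context of multiple AVs coexisting with human-driven vehicles (HDVs) has received scarce attention. In this paper, we formulate the lane-changing decision-making of multiple AVs in a mixed-traffic highway environment as a multi-agent reinforcement learning (MARL) problem, where each AV makes lane-changing decisions based on the motions of both neighboring AVs and HDVs. Specifically, a multi-agent advantage actor-critic network (MA2C) is developed with a novel local reward design and a parameter sharing scheme. In particular, a multi-objective reward function is proposed to incorporate fuel efficiency, driving comfort, and the safety of autonomous driving. Comprehensive experimental results, obtained under three different traffic densities and various levels of human driver aggressiveness, show that our proposed MARL framework consistently outperforms several state-of-the-art benchmarks in terms of efficiency, safety, and driver comfort.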
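A multi-objective reward of the kind mentioned in the MA2C abstract is often written as a weighted sum of per-objective terms. The sketch below assumes that form. The individual terms (fuel rate, jerk, time headway), the weights, and the names `RewardWeights` and `lane_change_reward` are hypothetical stand-ins, since the abstract names only the three objectives and not their formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical trade-off weights for the three objectives."""
    fuel: float = 1.0
    comfort: float = 1.0
    safety: float = 1.0


def lane_change_reward(fuel_rate, jerk, time_headway, collided,
                       weights=RewardWeights(),
                       collision_penalty=100.0, safe_headway=1.5):
    """Weighted sum of fuel, comfort, and safety terms (all terms are assumptions)."""
    r_fuel = -fuel_rate        # less fuel burned per step is better
    r_comfort = -abs(jerk)     # smoother acceleration changes are more comfortable
    if collided:
        r_safety = -collision_penalty
    else:
        # Penalize only headways shorter than the assumed safe threshold.
        r_safety = min(0.0, time_headway - safe_headway)
    return (weights.fuel * r_fuel
            + weights.comfort * r_comfort
            + weights.safety * r_safety)
```

Changing the weights shifts the trade-off: a larger `safety` weight, for instance, makes short headways and collisions dominate the return.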
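Predicting the future states of surrounding traffic participants and planning a safe, smooth, and socially compliant trajectory accordingly is crucial for autonomous vehicles. Current autonomous driving systems have two major issues: the prediction module is often decoupled from the planning module, and the cost function for planning is hard to specify and tune. To tackle these issues, we propose an end-to-end differentiable framework that integrates the prediction and planning modules and is able to learn the cost function from data. Specifically, we employ a differentiable nonlinear optimizer as the motion planner, which takes the predicted trajectories of surrounding agents given by a neural network as input and optimizes the trajectory of the autonomous vehicle, so that all operations in the framework are differentiable, including the cost function weights. The proposed framework is trained on a large-scale real-world driving dataset to imitate human driving trajectories across the entire driving scene, and is validated in both open-loop and closed-loop settings. The open-loop test results show that the proposed method outperforms the baseline methods across a variety of metrics and delivers planning-centric prediction results, allowing the planning module to output close-to-human trajectories. In closed-loop testing, the proposed method shows the ability to handle complex urban driving scenarios and robustness against the distributional shift that imitation learning methods suffer from. Importantly, we find that joint training of the planning and prediction modules achieves better performance than planning with a separately trained prediction module in both open-loop and closed-loop tests. Moreover, the ablation study indicates that the learnable components in the framework are essential to ensure planning stability and performance.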
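Driving in a dynamic, multi-agent, and complex urban environment is a difficult task that requires a complex decision-making policy. Learning such a policy requires a state representation that can encode the entire environment. Mid-level representations that encode a vehicle's environment as images have become a popular choice. Still, they are quite high-dimensional, which limits their use in data-hungry approaches such as reinforcement learning. In this paper, we propose to learn a low-dimensional and rich latent representation of the environment by leveraging knowledge of relevant semantic factors. To do this, we train an encoder-decoder deep neural network to predict multiple application-relevant factors, such as the trajectories of other agents and of the ego vehicle. Furthermore, we propose a hazard signal based on other vehicles' future trajectories and the planned route, which is used together with the learned latent representation as input to a downstream policy. We demonstrate that using a multi-head encoder-decoder neural network results in a more informative representation than a standard single-head model. In particular, the proposed representation learning and the hazard signal help reinforcement learning to learn faster, with higher performance and less data than baseline methods.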
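The hazard signal described above combines other vehicles' predicted trajectories with the planned route. The abstract does not define how, so the following is only one plausible reading: a binary flag raised when any predicted position comes close to any route waypoint. The function name `hazard_signal` and the `radius` threshold are assumptions.

```python
import math


def hazard_signal(predicted_trajectories, planned_route, radius=2.0):
    """Return 1.0 if any predicted position lies within `radius` of a route waypoint.

    `predicted_trajectories` is a list of trajectories, each a list of (x, y) points;
    `planned_route` is a list of (x, y) waypoints. The distance rule is hypothetical.
    """
    for trajectory in predicted_trajectories:
        for px, py in trajectory:
            for rx, ry in planned_route:
                if math.hypot(px - rx, py - ry) <= radius:
                    return 1.0
    return 0.0
```

A scalar flag like this can be concatenated with the latent representation before it is passed to the policy network.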
Making safe and human-like decisions is an essential capability of autonomous driving systems and learning-based behavior planning is a promising pathway toward this objective. Distinguished from existing learning-based methods that directly output decisions, this work introduces a predictive behavior planning framework that learns to predict and evaluate from human driving data. Concretely, a behavior generation module first produces a diverse set of candidate behaviors in the form of trajectory proposals. Then the proposed conditional motion prediction network is employed to forecast other agents' future trajectories conditioned on each trajectory proposal. Given the candidate plans and associated prediction results, we learn a scoring module to evaluate the plans using maximum entropy inverse reinforcement learning (IRL). We conduct comprehensive experiments to validate the proposed framework on a large-scale real-world urban driving dataset. The results reveal that the conditional prediction model is able to forecast multiple possible future trajectories given a candidate behavior and the prediction results are reactive to different plans. Moreover, the IRL-based scoring module can properly evaluate the trajectory proposals and select close-to-human ones. The proposed framework outperforms other baseline methods in terms of similarity to human driving trajectories. Moreover, we find that the conditional prediction model can improve both prediction and planning performance compared to the non-conditional model, and learning the scoring module is critical to correctly evaluating the candidate plans to align with human drivers.
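Under the maximum entropy IRL formulation, a candidate plan is typically assigned a probability proportional to the exponential of its negative cost. The sketch below assumes a linear cost over hand-picked plan features; the feature vectors, the weights, and the names `plan_probabilities` and `select_plan` are illustrative assumptions, not the paper's scoring module.

```python
import math


def plan_probabilities(features, weights):
    """Softmax over negative linear costs: P(plan i) is proportional to exp(-w . f_i)."""
    costs = [sum(w * f for w, f in zip(weights, feat)) for feat in features]
    lowest = min(costs)  # subtract the minimum cost for numerical stability
    exps = [math.exp(-(c - lowest)) for c in costs]
    total = sum(exps)
    return [e / total for e in exps]


def select_plan(features, weights):
    """Return the index of the most probable (lowest-cost) candidate plan."""
    probs = plan_probabilities(features, weights)
    return max(range(len(probs)), key=probs.__getitem__)
```

In training, the weights would be fitted so that the human-driven trajectory receives high probability among the candidates; the sketch covers only the scoring and selection step.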
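Applying reinforcement learning to autonomous driving entails certain challenges, primarily due to large-scale traffic flows that change dynamically. To address such challenges, it is necessary to quickly determine response strategies to the changing intentions of surrounding vehicles. Accordingly, we propose a new policy optimization method for safe driving using graph-based interaction-aware constraints. In this framework, the motion prediction and control modules are trained simultaneously while sharing a latent representation that contains the social context. To reflect social interactions, we express the movements of agents in graph form and filter the features, which helps preserve the spatiotemporal locality of adjacent nodes. Furthermore, we create feedback loops to combine the two modules effectively. As a result, this approach encourages the learned controller to stay safe from dynamic risks and makes the motion prediction robust in various situations. In the experiments, we set up a navigation scenario comprising various situations with CARLA, an urban driving simulator. The experiments show state-of-the-art performance in both navigation strategy and motion prediction compared to the baselines.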
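Motion control algorithms that operate in the presence of pedestrians are critical for developing safe and reliable autonomous vehicles (AVs). Traditional motion control algorithms rely on manually designed decision-making policies that neglect the interactions between AVs and pedestrians. On the other hand, recent advances in deep reinforcement learning allow policies to be learned automatically without manual design. To tackle the problem of decision-making in the presence of pedestrians, the authors introduce a framework based on Social Value Orientation and deep reinforcement learning (DRL) that is capable of generating decision-making policies with different driving styles. The policy is trained using state-of-the-art DRL algorithms in a simulated environment. A novel, computationally efficient pedestrian model suitable for DRL training is also introduced. We perform experiments to validate our framework and conduct a comparative analysis of the policies obtained with two different model-free deep reinforcement learning algorithms. Simulation results show how the developed model exhibits natural driving behaviors, such as short-stopping, to facilitate the pedestrian's crossing.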
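Traversing intersections is a challenging problem for autonomous vehicles, especially when the intersections have no traffic control. Recently, deep reinforcement learning has received wide attention due to its success in handling autonomous driving tasks. In this work, we address the problem of traversing unsignalized intersections using a novel curriculum for deep reinforcement learning. The proposed curriculum leads to: 1) a faster training process for the reinforcement learning agent, and 2) better performance compared to an agent trained without a curriculum. Our main contribution is two-fold: 1) presenting a unique curriculum for training deep reinforcement learning agents, and 2) showing the application of the proposed curriculum to the unsignalized intersection traversal task. The framework expects processed observations of the surroundings from the perception system of the autonomous vehicle. We test our method in the CommonRoad motion planning simulator on T-intersections and four-way intersections.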
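Despite advances in hierarchical reinforcement learning, its application to path planning for autonomous driving on highways remains challenging. One reason is that conventional hierarchical reinforcement learning approaches are not amenable to autonomous driving because of its riskiness: the agent must move while avoiding multiple obstacles, such as other agents that are highly unpredictable, so safe regions are small, scattered, and change over time. To overcome this challenge, we propose a spatially hierarchical reinforcement learning method for the state space and the policy space. The high-level policy selects not only a behavioral sub-policy but also the regions to attend to in the state space and the outline in the policy space. Subsequently, the low-level policy elaborates the short-term goal position of the agent within the outline of the region selected by the high-level command. The network structure and optimization suggested in our method are as concise as those of single-level methods. Experiments on environments with roads of various shapes show that our method finds nearly optimal policies from early episodes and outperforms a baseline hierarchical reinforcement learning method, especially on narrow and complex roads. The resulting trajectories on the roads are similar to those of human strategies at the behavioral planning level.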
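Training intelligent agents that can drive autonomously in various urban and highway scenarios has been a hot topic in the robotics community over the past decades. However, the diversity of driving environments, in terms of road topology and the positioning of neighboring vehicles, makes this problem very challenging. It goes without saying that although scenario-specific driving policies for autonomous driving are promising and can improve transportation safety and efficiency, they are clearly not a universal, scalable solution. Instead, we seek decision-making schemes and driving policies that can generalize to novel and unseen environments. In this work, we capitalize on the key idea that human drivers learn abstract representations of their surroundings that are fairly similar across various driving scenarios and environments. Through these representations, human drivers are able to adapt quickly to novel environments and drive in unseen conditions. Formally, by imposing an information bottleneck, we extract a latent representation that minimizes the "distance" between driving scenarios, a quantity we introduce to gauge the similarity among different driving configurations. This latent space is then employed as the input to a Q-learning module to learn more generalizable driving policies. Our experiments show that using this latent representation can reduce the number of crashes to about half.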
The Transformer, originally devised for natural language processing, has also achieved significant success in computer vision. Thanks to its strong expressive power, researchers are investigating ways to deploy transformers in reinforcement learning (RL), and transformer-based models have shown their potential on representative RL benchmarks. In this paper, we collect and dissect recent advances in transforming RL with transformers (transformer-based RL, or TRL), in order to explore its development trajectory and future trends. We group existing developments into two categories, architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation, and autonomous driving. For architecture enhancement, these methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework. They model agents and environments much more precisely than deep RL methods, but are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". For trajectory optimization, these methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework. They are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, extensions and challenges in TRL are reviewed and proposals for future directions are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
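Driving safely requires multiple capabilities from human and intelligent agents, such as generalizability to unseen environments, safety awareness of the surrounding traffic, and decision-making in complex multi-agent settings. Despite the great success of reinforcement learning (RL), most RL research investigates each capability separately due to the lack of integrated environments. In this work, we develop a new driving simulation platform called MetaDrive to support research on generalizable reinforcement learning algorithms for machine autonomy. MetaDrive is highly compositional and can generate an infinite number of diverse driving scenarios through both procedural generation and real data importing. Based on MetaDrive, we construct a variety of RL tasks and baselines in both single-agent and multi-agent settings, including benchmarking generalizability across unseen scenes, safe exploration, and learning multi-agent traffic. Generalization experiments conducted on both procedurally generated scenarios and real-world scenarios show that increasing the diversity and size of the training set improves the generalizability of the RL agent. We further evaluate various safe reinforcement learning and multi-agent reinforcement learning algorithms in MetaDrive environments and provide the benchmarks. Source code, documentation, and a demo video are available at https://metadriverse.github.io/metadrive.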
Traditional planning and control methods could fail to find a feasible trajectory for an autonomous vehicle to execute amongst dense traffic on roads. This is because the obstacle-free volume in spacetime that the vehicle could drive through is very small in these scenarios. However, that does not mean the task is infeasible, since human drivers are known to be able to drive amongst dense traffic by leveraging the cooperativeness of other drivers to open a gap. The traditional methods fail to take into account the fact that the actions taken by an agent affect the behaviour of other vehicles on the road. In this work, we rely on the ability of deep reinforcement learning to implicitly model such interactions and learn a continuous control policy over the action space of an autonomous vehicle. The application we consider requires our agent to negotiate and open a gap in the road in order to successfully merge or change lanes. Our policy learns to repeatedly probe into the target road lane while trying to find a safe spot to move into. We compare against two model-predictive control-based algorithms and show that our policy outperforms them in simulation.
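We address the problem of ego-vehicle navigation in dense simulated traffic environments populated by road agents with varying driver behaviors. Navigation in such environments is challenging because the agents' heterogeneous behaviors make their actions unpredictable. We present a novel simulation technique that enriches existing traffic simulators with behavior-rich trajectories corresponding to varying levels of aggressiveness. We generate these trajectories with the help of a driver behavior modeling algorithm. We then use the enriched simulator to train a deep reinforcement learning (DRL) policy that consists of a set of high-level vehicle control commands, and use this policy at test time to perform local navigation in dense traffic. Our policy implicitly models the interactions between traffic agents and computes safe trajectories for the ego vehicle that account for aggressive driver maneuvers such as overtaking, over-speeding, weaving, and sudden lane changes. Our enhanced behavior-rich simulator can be used to generate datasets consisting of trajectories that correspond to diverse driver behaviors and traffic densities, and our behavior-based navigation scheme can be combined with state-of-the-art navigation algorithms.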