Advances in reinforcement learning have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces by employing a geometric lens. Central to our work is the idea that the transition dynamics induce a low dimensional manifold of reachable states embedded in the high-dimensional nominal state space. We prove that, under certain conditions, the dimensionality of this manifold is at most the dimensionality of the action space plus one. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments. We further demonstrate the applicability of our result by learning a policy in this low dimensional representation. To do so we introduce an algorithm that learns a mapping to a low dimensional representation, as a narrow hidden layer of a deep neural network, in tandem with the policy using DDPG. Our experiments show that a policy learnt this way perform on par or better for four MuJoCo control suite tasks.
translated by 谷歌翻译
在强化学习文献中,有许多用于上下文强盗(CB)或马尔可夫决策过程(MDP)环境的算法。但是,当在现实世界中部署强化学习算法时,即使有领域专业知识,通常也很难知道将顺序决策问题视为CB或MDP是否适合。换句话说,行动会影响未来的状态,还是仅影响即时奖励?关于环境的性质做出错误的假设可能会导致学习效率低下,甚至可以阻止该算法学习最佳政策,即使使用无限数据。在这项工作中,我们开发了一种在线算法,该算法使用贝叶斯假设测试方法来学习环境的性质。我们的算法允许从业人员合并有关环境是否是CB还是MDP的知识,并有效地在经典CB和基于MDP的算法之间插值,以减轻对环境分配环境的影响。我们进行仿真并证明,在CB设置中,我们的算法比基于MDP的算法降低了遗憾,而在非Bandit MDP设置中,我们的算法能够学习最佳策略,通常可以与基于MDP的算法相当地遗憾。
translated by 谷歌翻译
我们采用了近端迭代,以便在加固学习中进行价值函数优化。近端迭代是一种计算上有效的技术,使我们能够向更理想的解决方案偏置优化过程。作为近端迭代在深增强学习中的具体应用,我们将深度Q-Network(DQN)代理具有近期术语的目标函数,以确保DQN的在线网络组件仍保留在目标网络附近。我们用近端迭代调用DQN或DQNPRO的所得代理,在ATARI基准测试中对原始DQN的显着改进。我们的结果强调了采用深度增强学习的声音优化技术的力量。
translated by 谷歌翻译
顺序决策的违规政策评估方法可用于帮助识别拟议的决策政策优于当前基线政策。但是,新的决策政策可能比某些人的基线政策更好,但不是其他人。这有动力推动个性化和准确的单态治疗效果估算(HTES)。鉴于许多重要应用中存在的有限数据,个体预测可以以准确性和在这种预测中的准确性和置信度的成本。通过识别子组,我们开发一种平衡对个人化的需求,以通过识别相对于基线的新决策政策中的预期差异来自信地估计预期估计。我们提出了一种新的损失函数,用于在子组分区阶段期间的不确定性。在实验中,我们表明我们的方法可用于形成其他方法斗争的HTES的准确预测。
translated by 谷歌翻译
Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces the speed of learning and achieved reward. In this work, we present an Off-Policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high-level. The novelty of this work is the theoretical motivation of adding entropy to the RL objective in the HRL setting. We empirically show that the entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablative study to analyze the effects of entropy on hierarchy, in which adding entropy to high-level emerged as the most desirable configuration. Furthermore, a higher temperature in the low-level leads to Q-value overestimation and increases the stochasticity of the environment that the high-level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
translated by 谷歌翻译
Instruction tuning enables pretrained language models to perform new tasks from inference-time natural language descriptions. These approaches rely on vast amounts of human supervision in the form of crowdsourced datasets or user interactions. In this work, we introduce Unnatural Instructions: a large dataset of creative and diverse instructions, collected with virtually no human labor. We collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth. This set is then expanded by prompting the model to rephrase each instruction, creating a total of approximately 240,000 examples of instructions, inputs, and outputs. Experiments show that despite containing a fair amount of noise, training on Unnatural Instructions rivals the effectiveness of training on open-source manually-curated datasets, surpassing the performance of models such as T0++ and Tk-Instruct across various benchmarks. These results demonstrate the potential of model-generated data as a cost-effective alternative to crowdsourcing for dataset expansion and diversification.
translated by 谷歌翻译
Reranking methods in machine translation aim to close the gap between common evaluation metrics (e.g. BLEU) and maximum likelihood learning and decoding algorithms. Prior works address this challenge by training models to rerank beam search candidates according to their predicted BLEU scores, building upon large models pretrained on massive monolingual corpora -- a privilege that was never made available to the baseline translation model. In this work, we examine a simple approach for training rerankers to predict translation candidates' BLEU scores without introducing additional data or parameters. Our approach can be used as a clean baseline, decoupled from external factors, for future research in this area.
translated by 谷歌翻译
Multilingual machine translation models can benefit from synergy between different language pairs, but also suffer from interference. While there is a growing number of sophisticated methods that aim to eliminate interference, our understanding of interference as a phenomenon is still limited. This work identifies the main factors that contribute to interference in multilingual machine translation. Through systematic experimentation, we find that interference (or synergy) are primarily determined by model size, data size, and the proportion of each language pair within the total dataset. We observe that substantial interference occurs mainly when the model is very small with respect to the available training data, and that using standard transformer configurations with less than one billion parameters largely alleviates interference and promotes synergy. Moreover, we show that tuning the sampling temperature to control the proportion of each language pair in the data is key to balancing the amount of interference between low and high resource language pairs effectively, and can lead to superior performance overall.
translated by 谷歌翻译
A digital twin is defined as a virtual representation of a physical asset enabled through data and simulators for real-time prediction, optimization, monitoring, controlling, and improved decision-making. Unfortunately, the term remains vague and says little about its capability. Recently, the concept of capability level has been introduced to address this issue. Based on its capability, the concept states that a digital twin can be categorized on a scale from zero to five, referred to as standalone, descriptive, diagnostic, predictive, prescriptive, and autonomous, respectively. The current work introduces the concept in the context of the built environment. It demonstrates the concept by using a modern house as a use case. The house is equipped with an array of sensors that collect timeseries data regarding the internal state of the house. Together with physics-based and data-driven models, these data are used to develop digital twins at different capability levels demonstrated in virtual reality. The work, in addition to presenting a blueprint for developing digital twins, also provided future research directions to enhance the technology.
translated by 谷歌翻译
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that - across 6 architectures trained with 4 algorithms on massive datasets - they exhibit little compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark CREPE which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of three test datasets. The three test sets are designed to test models trained on three of the popular training datasets: CC-12M, YFCC-15M, and LAION-400M. They contain 385K, 385K, and 373K image-text pairs and 237K, 210K, and 178K hard negative captions. To test productivity, CREPE contains 17K image-text pairs with nine different complexities plus 246K hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 8%. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
translated by 谷歌翻译