A default assumption in reinforcement learning and optimal control is that experience arrives at discrete time points on a fixed clock cycle. Many applications, however, involve continuous systems where the time discretization is not fixed but instead can be managed by a learning algorithm. By analyzing Monte-Carlo value estimation for LQR systems in both finite-horizon and infinite-horizon settings, we uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently with respect to time discretization, which implies that there is an optimal choice for the temporal resolution that depends on the data budget. These findings show how adapting the temporal resolution can provably improve value estimation quality in LQR systems from finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and several non-linear environments.
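As a toy illustration of the trade-off described above (our own construction, not the paper's experiment), the sketch below forms a Monte-Carlo estimate of the integrated quadratic cost of a scalar linear system at a chosen step size `h`. Shrinking `h` reduces the Riemann-sum approximation error of the cost integral, but each trajectory then accumulates more noisy increments, so for a fixed trajectory budget the statistical error grows; all parameter values are illustrative.

```python
import numpy as np

# Toy sketch: Monte-Carlo value estimate for a scalar linear system
#   dx = a*x dt + sigma dW,   cost = integral of q*x^2 dt over [0, T],
# discretized with step size h (Euler-Maruyama).
def mc_value(a=-1.0, sigma=0.5, q=1.0, x0=1.0, T=5.0, h=0.1,
             n_traj=200, rng=None):
    rng = rng or np.random.default_rng(0)
    n_steps = int(T / h)
    totals = []
    for _ in range(n_traj):
        x, cost = x0, 0.0
        for _ in range(n_steps):
            cost += q * x * x * h  # Riemann-sum approximation of the integral
            x += a * x * h + sigma * np.sqrt(h) * rng.standard_normal()
        totals.append(cost)
    return float(np.mean(totals))
```

Comparing `mc_value(h=0.5)` against `mc_value(h=0.01)` at a fixed `n_traj` gives a rough feel for how the two error sources move in opposite directions as the temporal resolution changes.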
Deep latent variable models have achieved significant empirical successes in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.
The Cross-Entropy Method (CEM) is commonly used for planning in model-based reinforcement learning (MBRL), where a centralized approach is typically used to update the sampling distribution based only on the top-$k$ samples. In this paper, we show that such a centralized approach makes CEM vulnerable to local optima, thus impairing its sample efficiency. To tackle this issue, we propose Decentralized CEM (DecentCEM), a simple but effective improvement over classical CEM, which uses an ensemble of CEM instances running independently of one another, each performing a local improvement of its own sampling distribution. We provide both theoretical and empirical analysis to demonstrate the effectiveness of this simple decentralized approach. We empirically show that, compared to the classical centralized approach using either a single Gaussian distribution or even a mixture of Gaussians, our DecentCEM finds the global optimum much more consistently and thus improves sample efficiency. Furthermore, we plug DecentCEM into the planning component of MBRL and evaluate our approach in several continuous control environments, with comparison to state-of-the-art CEM-based MBRL approaches (PETS and POPLIN). Results show a sample efficiency improvement from simply replacing the classical CEM module with our DecentCEM module, at the cost of a reasonable amount of additional computation. Lastly, we conduct ablation studies for more in-depth analysis. Code is available at https://github.com/vincentzhang/decentCEM
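A minimal sketch of the ensemble idea on a one-dimensional multimodal objective (our own simplification, not the released DecentCEM code): each instance runs classical CEM independently from its own random initialization, and the best mean across instances is returned, so a single instance falling into a local optimum does not doom the search.

```python
import numpy as np

# One classical CEM instance: iteratively refit a Gaussian to the top-k samples.
def cem_instance(f, mu, sigma, iters=30, pop=50, k=10, rng=None):
    rng = rng or np.random.default_rng(0)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=pop)
        elites = samples[np.argsort(f(samples))[-k:]]  # top-k by objective value
        mu, sigma = elites.mean(), elites.std() + 1e-6
    return mu, float(f(np.array([mu]))[0])

# Decentralized ensemble: independent instances, best local optimum wins.
def decent_cem(f, n_instances=5, rng=None):
    rng = rng or np.random.default_rng(0)
    results = [cem_instance(f, rng.uniform(-5, 5), 2.0, rng=rng)
               for _ in range(n_instances)]
    return max(results, key=lambda r: r[1])

# Multimodal test objective with its global maximum near x = 3.
f = lambda x: np.exp(-(x - 3.0) ** 2) + 0.5 * np.exp(-(x + 3.0) ** 2)
```

A single CEM instance started near the left basin tends to settle at the local optimum around $x = -3$; the ensemble's max over instances recovers the global mode far more reliably.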
Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt score-based modeling to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
Neural sequence models, especially transformers, exhibit a remarkable capacity for in-context learning. They can construct new predictors from sequences of labeled examples $(x, f(x))$ presented in the input without further parameter updates. We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly, by encoding smaller models in their activations, and updating these implicit models as new examples appear in the context. Using linear regression as a prototypical problem, we offer three sources of evidence for this hypothesis. First, we prove by construction that transformers can implement learning algorithms for linear models based on gradient descent and closed-form ridge regression. Second, we show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression, transitioning between different predictors as transformer depth and dataset noise vary, and converging to Bayesian estimators for large widths and depths. Third, we present preliminary evidence that in-context learners share algorithmic features with these predictors: learners' late layers non-linearly encode weight vectors and moment matrices. These results suggest that in-context learning is understandable in algorithmic terms, and that (at least in the linear case) learners may rediscover standard estimation algorithms. Code and reference implementations are released at https://github.com/ekinakyurek/google-research/blob/master/incontext.
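The closed-form reference predictor that the abstract compares trained in-context learners against can be stated in a few lines. The sketch below (our own illustration) fits ridge regression on in-context examples $(X, y)$ and returns the predictor a transformer is hypothesized to match implicitly.

```python
import numpy as np

# Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y.
# Given in-context examples (X, y), this is the reference predictor that
# a trained in-context learner is hypothesized to approximate.
def ridge_predictor(X, y, lam=0.1):
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return lambda x_query: x_query @ w

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.standard_normal((32, 2))
y = X @ w_true + 0.01 * rng.standard_normal(32)
predict = ridge_predictor(X, y)
```

Varying `lam` against the label-noise level traces out the transition the abstract describes, from near least-squares behavior at low noise to more strongly regularized (Bayesian) estimates as noise grows.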
The Metropolis-Hastings (M-H) algorithm has been well studied in continuous spaces, but a similar understanding is lacking in discrete spaces. Recently, a locally balanced proposal (LBP) was proven to be asymptotically optimal, but the optimal scaling problem remained open. In this paper, we establish, for the first time, that the efficiency of M-H in discrete spaces can also be characterized by an asymptotic acceptance rate that is independent of the target distribution. Moreover, we verify, both theoretically and empirically, that the optimal acceptance rates for LBP and random walk Metropolis (RWM) are $0.574$ and $0.234$, respectively. These results also help establish that LBP is asymptotically $O(N^{\frac{2}{3}})$ times more efficient than RWM with respect to the model dimension $N$. Knowledge of the optimal acceptance rate allows one to automatically tune the neighborhood size of the proposal distribution in discrete spaces, directly analogous to step-size control in continuous spaces. We empirically demonstrate that such adaptive M-H sampling robustly improves sampling across various target distributions in discrete spaces, including training deep energy-based models.
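A minimal sketch of the tuning idea (our own illustration, not the paper's algorithm): random-walk Metropolis on the integers with a uniform proposal over a symmetric neighborhood of radius `r`, where `r` is adapted by a Robbins-Monro-style update so the empirical acceptance rate approaches the RWM target of $0.234$.

```python
import numpy as np

# Adaptive random-walk Metropolis on the integers: propose a jump of size
# 1..r in a random direction, accept with the usual M-H ratio (the proposal
# is symmetric), and adapt r toward a target acceptance rate.
def adaptive_rwm(log_p, n_iter=20000, target=0.234, rng=None):
    rng = rng or np.random.default_rng(0)
    x, r = 0, 1.0
    accepts, samples = 0, []
    for t in range(1, n_iter + 1):
        step = int(rng.integers(1, int(r) + 1)) * int(rng.choice([-1, 1]))
        y = x + step
        if np.log(rng.random()) < log_p(y) - log_p(x):
            x, accepts = y, accepts + 1
        # Grow r when accepting too often, shrink it otherwise.
        r = max(1.0, r + (accepts / t - target) / np.sqrt(t))
        samples.append(x)
    return np.array(samples), accepts / n_iter

# Unnormalized discrete Gaussian target on the integers.
log_p = lambda k: -0.5 * (k / 20.0) ** 2
```

The radius plays the role that the step size plays in continuous spaces: too small and the chain diffuses slowly despite accepting almost everything, too large and nearly every proposal is rejected.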
Representation learning often plays a critical role in reinforcement learning by managing the curse of dimensionality. A representative class of algorithms exploits a spectral decomposition of the stochastic transition dynamics to construct representations that enjoy strong theoretical properties in idealized settings. However, current spectral methods have limited applicability because they are constructed for state-only aggregation and are derived from a policy-dependent transition kernel, without considering the issue of exploration. To address these issues, we propose an alternative spectral method, Spectral Decomposition Representation (SPEDER), that extracts a state-action abstraction from the dynamics without inducing spurious dependence on the data collection policy, while also balancing the exploration-versus-exploitation trade-off during learning. A theoretical analysis establishes the sample efficiency of the proposed algorithm in both the online and offline settings. In addition, an experimental study demonstrates superior performance over current state-of-the-art algorithms across several benchmarks.
The curse of dimensionality in Markov decision processes (MDPs) is commonly addressed by exploiting low-rank representations, which has motivated recent theoretical study of linear MDPs. However, most approaches require the factorization to be normalized under unrealistic assumptions, or introduce unresolved computational challenges in practice. Instead, we consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning via contrastive estimation. The framework also admits confidence-adjusted index algorithms, enabling an efficient and principled approach to incorporating optimism or pessimism in the face of uncertainty. To the best of our knowledge, this provides the first practical representation learning method for linear MDPs that achieves both strong theoretical guarantees and empirical performance. Theoretically, we prove that the proposed algorithm is sample efficient in both the online and offline settings. Empirically, we demonstrate superior performance over existing model-based and model-free algorithms on several benchmarks.
Recent research has shown that rationales, or step-by-step chains of thought, can be used to improve performance on multi-step reasoning tasks. We reconsider rationale-augmented prompting for few-shot in-context learning, where (input -> output) prompts are expanded to (input, rationale -> output) prompts. For rationale-augmented prompting, we demonstrate how existing approaches, which rely on manual prompt engineering, are subject to sub-optimal rationales that may harm performance. To mitigate this brittleness, we propose a unified framework of rationale-augmented ensembles, in which we identify rationale sampling in the output space as the key component that robustly improves performance. The framework is general and can easily be extended to common natural language processing tasks, even those that do not traditionally leverage intermediate steps, such as question answering, word sense disambiguation, and sentiment analysis. We demonstrate that rationale-augmented ensembles achieve more accurate and interpretable results than existing prompting approaches (including standard prompting without rationales and rationale-based chain-of-thought prompting), while simultaneously improving the interpretability of model predictions through the associated rationales.
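The core ensembling step can be sketched in a few lines (a hedged illustration of the general idea, with the language model stubbed out by a caller-provided sampler): sample several (rationale, answer) pairs, discard the rationales for the purpose of aggregation, and majority-vote in the output space.

```python
from collections import Counter

# Aggregate sampled rationales in the output space: each call to the sampler
# returns a (rationale, answer) pair; only the answers are voted on.
def rationale_ensemble(sample_rationale_and_answer, n_samples=5):
    answers = [sample_rationale_and_answer()[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Because aggregation happens over final answers rather than over the rationales themselves, a single poorly worded rationale cannot flip the prediction on its own.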
Recently, a family of locally balanced (LB) samplers has demonstrated excellent performance at sampling and learning energy-based models (EBMs) in discrete spaces. However, the theoretical understanding of this success is limited. In this work, we show how LB functions give rise to LB dynamics corresponding to Wasserstein gradient flow in a discrete space. From first principles, previous LB samplers can then be seen as discretizations of the LB dynamics with respect to the Hamming distance. Based on this observation, we propose a new algorithm, the Locally Balanced Jump (LBJ), obtained by discretizing the LB dynamics with respect to simulation time. As a result, LBJ has a location-dependent "velocity" that allows it to propose moves over larger distances. Moreover, LBJ decouples each dimension into independent sub-processes, enabling convenient parallel implementation. We demonstrate the advantages of LBJ for sampling and learning across various binary and categorical distributions.
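To make the "locally balanced" notion concrete, here is a minimal sketch of one step of an LB sampler on binary strings (our own illustration with the common choice $g(t) = \sqrt{t}$, not the LBJ algorithm itself): each single-bit-flip neighbor is weighted by the square root of its probability ratio, and the move is corrected by a standard M-H acceptance step.

```python
import numpy as np

def flip(x, i):
    y = x.copy()
    y[i] = 1 - y[i]
    return y

# One step of a locally balanced sampler with g(t) = sqrt(t): neighbors are
# weighted by sqrt(p(y)/p(x)), and detailed balance is restored by an M-H
# correction using the reverse-move probability.
def lb_step(x, log_p, rng):
    d = len(x)
    logits = np.array([0.5 * (log_p(flip(x, i)) - log_p(x)) for i in range(d)])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    i = rng.choice(d, p=probs)
    y = flip(x, i)
    logits_y = np.array([0.5 * (log_p(flip(y, j)) - log_p(y)) for j in range(d)])
    probs_y = np.exp(logits_y - logits_y.max())
    probs_y /= probs_y.sum()
    log_alpha = log_p(y) - log_p(x) + np.log(probs_y[i]) - np.log(probs[i])
    return y if np.log(rng.random()) < log_alpha else x
```

Running this chain on an independent-bit target whose log-probability is the number of ones drives each bit toward its stationary marginal $e/(1+e) \approx 0.73$.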