As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
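A minimal sketch of the two phases described above, assuming a generic `generate(prompt)` sampling function stands in for the language model; the prompts and the principle list are illustrative placeholders, not the paper's actual constitution:

```python
# Hedged sketch of Constitutional AI's supervised (critique -> revision) phase
# and of AI-feedback preference labeling for the RL phase. `generate` is a
# hypothetical stand-in for sampling from a language model.

from typing import Callable, List

def critique_and_revise(
    generate: Callable[[str], str],
    prompt: str,
    principles: List[str],
    n_rounds: int = 1,
) -> str:
    """Sample an initial response, then critique and revise it against principles."""
    response = generate(prompt)
    for principle in principles[:n_rounds]:
        critique = generate(
            f"Critique the following response according to this principle: "
            f"{principle}\n\nPrompt: {prompt}\nResponse: {response}\nCritique:"
        )
        response = generate(
            f"Rewrite the response to address the critique.\n\n"
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRevision:"
        )
    return response  # revised responses are used to finetune the original model

def ai_preference_label(generate: Callable[[str], str],
                        prompt: str, a: str, b: str) -> int:
    """RL phase: ask the model which of two samples better follows the principles."""
    verdict = generate(
        f"Which response is more harmless and helpful?\n"
        f"Prompt: {prompt}\n(A) {a}\n(B) {b}\nAnswer with A or B:"
    )
    return 0 if verdict.strip().upper().startswith("A") else 1
```

The (prompt, preferred, rejected) pairs produced this way would then train the preference model whose score serves as the RL reward signal.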
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on how to turn it into one that can be productively studied empirically. We first present an experimental design centered on choosing tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models, and they bolster recent findings that large language models can productively assist humans with difficult tasks.
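As a rough illustration of the comparison this design reports, the sketch below computes accuracy under the three conditions: the model alone, unaided humans, and humans assisted by the model through chat. The answer records and field names are hypothetical placeholders, not the paper's data format:

```python
# Illustrative accuracy comparison across the three evaluation conditions.
# All field names below are assumed for the sketch.

from dataclasses import dataclass
from typing import List

@dataclass
class Trial:
    question_id: str
    correct_answer: str
    model_alone: str
    human_unaided: str
    human_with_assistant: str

def accuracy(trials: List[Trial], condition: str) -> float:
    """Fraction of trials where the answer given under `condition` is correct."""
    hits = sum(getattr(t, condition) == t.correct_answer for t in trials)
    return hits / len(trials)

def summarize(trials: List[Trial]) -> None:
    for condition in ("model_alone", "human_unaided", "human_with_assistant"):
        print(f"{condition:>22}: {accuracy(trials, condition):.1%}")
```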
Multi-chip-modules (MCMs) reduce the design and fabrication cost of machine learning (ML) accelerators while delivering performance and energy efficiency on par with a monolithic large chip. However, ML compilers targeting MCMs must solve complex optimization problems optimally and efficiently to achieve this high performance. One such problem is the multi-chip partitioning problem, in which the compiler determines the optimal partitioning and placement of tensor computation graph operations onto the chiplets of an MCM. Partitioning ML graphs for MCMs is especially hard, as the search space grows exponentially with the number of available chiplets and the number of nodes in the neural network. Furthermore, the constraints imposed by the underlying hardware produce a search space in which valid solutions are extremely sparse. In this paper, we present a strategy that uses a deep reinforcement learning (RL) framework to emit possibly invalid candidate partitions, which are then corrected by a constraint solver. Using the constraint solver ensures that RL encounters valid solutions in the sparse space frequently enough to converge with fewer samples than non-learned strategies. The architectural choices we made for the policy network allow us to generalize across different ML graphs. Our evaluation of a production-scale model, BERT, on real hardware shows that partitions generated using the RL policy achieve 6.11% and 5.85% higher throughput than random search and simulated annealing, respectively. In addition, fine-tuning the pre-trained RL policy reduces the search time from 3 hours to only 9 minutes, while achieving the same throughput as training the RL policy from scratch.
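The generate-then-repair loop described above can be sketched as follows; the policy, constraint solver, and reward below are toy placeholders for illustration, not the paper's implementation:

```python
# Hedged sketch: an RL policy proposes a (possibly invalid) assignment of graph
# nodes to chiplets, and a constraint solver repairs it into a valid partition
# before the reward (e.g. measured throughput) is computed.

import random
from typing import List, Sequence

def propose_partition(num_nodes: int, num_chiplets: int) -> List[int]:
    """Stand-in for the policy network: assign each graph node to a chiplet."""
    return [random.randrange(num_chiplets) for _ in range(num_nodes)]

def repair_with_constraint_solver(partition: Sequence[int],
                                  max_nodes_per_chiplet: int,
                                  num_chiplets: int) -> List[int]:
    """Toy 'constraint solver': move nodes off overloaded chiplets.

    A real system would encode the hardware constraints (memory, bandwidth,
    placement rules) and solve them exactly; this greedy repair only shows that
    invalid candidates are corrected rather than discarded.
    """
    counts = [0] * num_chiplets
    repaired = []
    for chiplet in partition:
        if counts[chiplet] >= max_nodes_per_chiplet:
            chiplet = min(range(num_chiplets), key=counts.__getitem__)
        counts[chiplet] += 1
        repaired.append(chiplet)
    return repaired

def measure_throughput(partition: Sequence[int]) -> float:
    """Placeholder reward: prefer balanced partitions."""
    counts = {c: partition.count(c) for c in set(partition)}
    return 1.0 / (1.0 + max(counts.values()) - min(counts.values()))

def training_step(num_nodes: int, num_chiplets: int,
                  max_nodes_per_chiplet: int) -> float:
    """One RL step: propose, repair to validity, then score the valid partition."""
    candidate = propose_partition(num_nodes, num_chiplets)
    valid = repair_with_constraint_solver(candidate, max_nodes_per_chiplet,
                                          num_chiplets)
    return measure_throughput(valid)
```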