To date, no "information-theoretic" frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy "surrogate" algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.
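The "noisy surrogate" construction discussed above can be sketched in a few lines: run gradient descent, then corrupt the final iterate with Gaussian noise. The toy quadratic objective, function names, and parameter values below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

def gd_final_iterate(grad, w0, eta, T):
    """Plain gradient descent; returns only the final iterate."""
    w = w0.copy()
    for _ in range(T):
        w -= eta * grad(w)
    return w

def noisy_surrogate(grad, w0, eta, T, sigma, rng):
    """Surrogate algorithm: the final iterate corrupted by Gaussian noise."""
    w = gd_final_iterate(grad, w0, eta, T)
    return w + sigma * rng.standard_normal(w.shape)

# Toy quadratic objective f(w) = 0.5 * ||w - 1||^2, so grad f(w) = w - 1.
rng = np.random.default_rng(0)
grad = lambda w: w - 1.0
w_clean = gd_final_iterate(grad, np.zeros(3), eta=0.1, T=200)
w_noisy = noisy_surrogate(grad, np.zeros(3), eta=0.1, T=200, sigma=0.01, rng=rng)
```

Information-theoretic bounds are then applied to the smoothed surrogate rather than to deterministic gradient descent itself, whose output has degenerate mutual information with the data.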
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
Stochastic gradient algorithms are widely used for optimization and sampling in large-scale learning and inference problems. In practice, however, tuning these algorithms is typically done using heuristics and trial-and-error rather than rigorous, generalizable theory. To address this gap between theory and practice, we provide new insight into the effect of tuning parameters by characterizing the large-sample behavior of the iterates of a very general class of preconditioned stochastic gradient algorithms with fixed step size. In the optimization setting, our results show that iterate averaging with a large fixed step size can yield statistically efficient approximations of the (local) M-estimator. In the sampling setting, our results show that with appropriate choices of the tuning parameters, the limiting stationary covariance can match either the Bernstein–von Mises limit of the posterior, adjustments to the posterior for model misspecification, or the asymptotic distribution of the MLE, whereas naive tuning yields a limit corresponding to none of these. Moreover, we argue that an essentially independent sample can be obtained after a fixed number of passes over the dataset. We validate our asymptotic results with multiple experiments on simulated and real data. Overall, we demonstrate that properly tuned stochastic gradient algorithms with constant step size offer a computationally efficient and statistically robust approach to obtaining point estimates or posterior samples.
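The central phenomenon, constant-step-size SGD whose averaged iterates approximate an M-estimator, can be illustrated on a toy mean-estimation problem. The step size, sample size, and function names below are hypothetical choices made only to make the averaging effect visible:

```python
import numpy as np

def sgd_with_averaging(data, eta, w0=0.0):
    """Constant-step-size SGD for the mean (the M-estimator for squared loss),
    with Polyak-Ruppert iterate averaging over the trajectory."""
    w, avg = w0, 0.0
    for t, x in enumerate(data, start=1):
        w -= eta * (w - x)       # stochastic gradient of 0.5 * (w - x)^2
        avg += (w - avg) / t     # running average of the iterates
    return w, avg

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=20_000)
last, averaged = sgd_with_averaging(data, eta=0.5)
```

With a large fixed step size, the last iterate keeps fluctuating at a scale governed by `eta`, while the iterate average concentrates near the sample mean, consistent with the statistical-efficiency result described above.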
We study the mutual information between (certain summaries of) the output of a learning algorithm and its $n$ training data, conditional on a supersample of $n+1$ i.i.d. data from which the training data is chosen at random without replacement. These leave-one-out variants of the conditional mutual information (CMI) of an algorithm (Steinke and Zakynthinou, 2020) are also seen to control the mean generalization error of learning algorithms with bounded loss functions. For learning algorithms achieving zero empirical risk under 0-1 loss (i.e., interpolating algorithms), we provide an explicit connection between leave-one-out CMI and the classical leave-one-out error estimate of the risk. Using this connection, we obtain upper and lower bounds on risk in terms of the (evaluated) leave-one-out CMI. When the limiting risk is constant or decays polynomially, the bounds converge to within a factor of two. As an application, we analyze the population risk of the one-inclusion graph algorithm, a general-purpose transductive learning algorithm for VC classes in the realizable setting. Using leave-one-out CMI, we match the optimal bound for learning VC classes in the realizable setting, answering an open challenge raised by Steinke and Zakynthinou (2020). Finally, in order to understand the role of leave-one-out CMI in studying generalization, we place it in a hierarchy of measures, with a novel unconditional mutual information at the root. For 0-1 loss and interpolating learning algorithms, this mutual information is observed to be precisely the risk.
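The classical leave-one-out error estimate to which leave-one-out CMI is connected can be sketched directly; the 1-nearest-neighbour rule and all names below are illustrative stand-ins for an interpolating algorithm, not the paper's objects:

```python
import numpy as np

def leave_one_out_error(fit_predict, X, y):
    """Classical leave-one-out estimate of the risk: train on n-1 points,
    test on the held-out point, and average over all n choices."""
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        errs.append(fit_predict(X[mask], y[mask], X[i]) != y[i])
    return float(np.mean(errs))

def one_nn(Xtr, ytr, x):
    """1-nearest-neighbour: an interpolating rule (zero empirical risk)."""
    return ytr[np.argmin(np.linalg.norm(Xtr - x, axis=1))]

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)   # labels from a linearly separable rule
loo = leave_one_out_error(one_nn, X, y)
```

For interpolating algorithms under 0-1 loss, the abstract's result relates quantities of this form to the evaluated leave-one-out CMI of the algorithm.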
The logit outputs of a feedforward neural network at initialization are conditionally Gaussian, given a random covariance matrix defined by the penultimate layer. In this work, we study the distribution of this random matrix. Recent work has shown that shaping the activation function as network depth grows large is necessary for this covariance matrix to be non-degenerate. However, the current infinite-width-style understanding of this shaping method is unsatisfactory for large depth: infinite-width analyses ignore the microscopic fluctuations from layer to layer, but these fluctuations accumulate over many layers. To overcome this shortcoming, we study the random covariance matrix in the shaped infinite-depth-and-width limit. We identify the precise scaling of the activation function necessary to arrive at a non-trivial limit, and show that the random covariance matrix is governed by a stochastic differential equation (SDE) that we call the Neural Covariance SDE. Using simulations, we show that the SDE closely matches the distribution of the random covariance matrix of finite networks. Additionally, we recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.
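A minimal simulation in the spirit of the abstract can track the penultimate-layer Gram matrix of a random MLP. The particular shaped activation below (identity plus a tanh perturbation shrinking like 1/sqrt(depth)), and all widths, depths, and names, are illustrative assumptions rather than the paper's construction:

```python
import numpy as np

def penultimate_covariance(x1, x2, depth, c, rng):
    """Forward two inputs through a random MLP at initialization with a
    'shaped' activation phi(z) = z + (c / sqrt(depth)) * tanh(z), and return
    the 2x2 Gram matrix of the penultimate features, normalized by width."""
    H = np.stack([x1, x2])                       # (2, width)
    width = H.shape[1]
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        Z = H @ W.T
        H = Z + (c / np.sqrt(depth)) * np.tanh(Z)
    return H @ H.T / width

rng = np.random.default_rng(4)
width, depth = 200, 50
x1 = rng.standard_normal(width)
x2 = rng.standard_normal(width)
G = penultimate_covariance(x1, x2, depth, c=0.5, rng=rng)
```

Rerunning with different seeds shows that `G` remains random (it does not concentrate) as width grows with depth, which is the layer-to-layer fluctuation accumulation that the Neural Covariance SDE is meant to capture.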
In this work, we investigate the expressiveness of the "conditional mutual information" (CMI) framework of Steinke and Zakynthinou (2020) and the prospect of using it to provide a unified framework for proving generalization bounds in the realizable setting. We first demonstrate that this framework can be used to express non-trivial (but sub-optimal) bounds for any learning algorithm that outputs hypotheses from a class of bounded VC dimension. We then prove that the CMI framework yields the optimal bound on the expected risk for learning halfspaces. This result is an application of our general result showing that stable compression schemes (Bousquet et al., 2020) of size $k$ have uniformly bounded CMI of order $O(k)$. We further show that an inherent limitation of proper learning of VC classes contradicts the existence of a proper learner with constant CMI, which implies a negative resolution of an open problem of Steinke and Zakynthinou (2020). We also study the CMI of empirical risk minimizers (ERMs) of a class $H$ and show that it is possible to output all consistent classifiers (the version space) with bounded CMI if and only if $H$ has a bounded star number (Hanneke and Yang, 2015). Moreover, we prove a general reduction showing that "leave-one-out" analyses are expressible via the CMI framework. As a corollary, we investigate the CMI of the one-inclusion graph algorithm proposed by Haussler et al. (1994). More generally, we show that the CMI framework is universal, in the sense that for every consistent algorithm and data distribution, the expected risk vanishes as the number of samples diverges if and only if the evaluated CMI grows sublinearly in the number of samples.
Quantile (and, more generally, KL) regret bounds, such as those achieved by NormalHedge (Chaudhuri, Freund, and Hsu 2009) and its variants, relax the goal of competing against the best individual expert to only competing against a majority of experts on adversarial data. More recently, the semi-adversarial paradigm (Bilodeau, Negrea, and Roy 2020) provides an alternative relaxation of adversarial online learning by considering data that may be neither fully adversarial nor stochastic (i.i.d.). We achieve the minimax optimal regret in both paradigms using FTRL with separate, novel, root-logarithmic regularizers, both of which can be interpreted as yielding variants of NormalHedge. We extend existing KL regret upper bounds, which hold uniformly over target distributions, to possibly uncountable expert classes with arbitrary priors; provide the first full-information lower bounds for quantile regret on finite expert classes (which are tight); and provide an adaptively minimax optimal algorithm for the semi-adversarial paradigm that adapts to the true, unknown constraint faster, leading to uniform improvements in regret over existing methods.
We consider prediction with expert advice when the data are generated from distributions varying arbitrarily within an unknown constraint set. This semi-adversarial setting includes, at the extremes, the classical i.i.d. setting, when the unknown constraint set is restricted to be a singleton, and the unconstrained adversarial setting, when the constraint set is the set of all distributions. The Hedge algorithm, long known to be minimax (rate) optimal in the adversarial regime, was recently shown to also be simultaneously minimax optimal for i.i.d. data. In this work, we propose to relax the i.i.d. assumption by seeking adaptivity at all levels of a natural ordering on constraint sets. We provide matching upper and lower bounds on the minimax regret at all levels, show that Hedge with deterministic learning rates is suboptimal outside the extremes, and prove that one can adaptively obtain minimax regret at all levels. We achieve this optimal adaptivity using the follow-the-regularized-leader (FTRL) framework with a novel adaptive regularization scheme that implicitly scales as the square root of the entropy of the current predictive distribution, rather than the entropy of the initial predictive distribution. Finally, we provide novel technical tools to study the statistical performance of FTRL along the semi-adversarial spectrum.
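For reference, the baseline that both of the preceding abstracts build on, Hedge with a deterministic learning rate, is a few lines of FTRL with an entropic regularizer. The classical sqrt(2 ln K / T) tuning and all names below are illustrative:

```python
import numpy as np

def hedge(losses, eta):
    """Hedge / exponential weights: FTRL with an entropic regularizer and a
    fixed learning rate eta. losses is a (T, K) array of expert losses in [0, 1].
    Returns the regret against the best single expert in hindsight."""
    T, K = losses.shape
    cum = np.zeros(K)      # cumulative loss of each expert
    total = 0.0            # cumulative loss of the algorithm
    for t in range(T):
        w = np.exp(-eta * cum)
        w /= w.sum()                 # predictive distribution over experts
        total += w @ losses[t]       # expected loss this round
        cum += losses[t]
    return total - cum.min()

rng = np.random.default_rng(3)
T, K = 2000, 10
losses = rng.uniform(size=(T, K))
eta = np.sqrt(2 * np.log(K) / T)     # classical deterministic tuning
regret = hedge(losses, eta)
```

The adaptive schemes described above replace this fixed `eta` (equivalently, the fixed regularizer) with regularization that tracks the entropy of the current predictive distribution `w`.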
We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. We use this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy. We find that these subnetworks only reach full accuracy when they are stable to SGD noise, which either occurs at initialization for small-scale settings (MNIST) or early in training for large-scale settings (ResNet-50 and Inception-v3 on ImageNet).
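Linear connectivity of two solutions is typically probed by evaluating the loss along the straight line between their parameters. A minimal sketch, with a hypothetical toy loss standing in for a trained network's loss surface:

```python
import numpy as np

def linear_interpolation_barrier(loss_fn, w1, w2, num=11):
    """Evaluate the loss along the straight line between two solutions.
    The two solutions lie in the same linearly connected region if no point
    on the path rises above the endpoints by more than a small barrier."""
    alphas = np.linspace(0.0, 1.0, num)
    path = [loss_fn((1 - a) * w1 + a * w2) for a in alphas]
    return max(path) - max(path[0], path[-1])

# Toy convex loss: any two near-minimum points are trivially connected.
loss = lambda w: float(np.sum((w - 1.0) ** 2))
w1 = np.full(5, 0.9)
w2 = np.full(5, 1.1)
barrier = linear_interpolation_barrier(loss, w1, w2)
```

In the stability analyses described above, `w1` and `w2` would be the weights reached from a shared starting point under two independent samples of SGD noise; a near-zero barrier indicates the network has become stable to that noise.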
Diabetic Retinopathy (DR) is a leading cause of vision loss worldwide, and early DR detection is necessary to prevent vision loss and support appropriate treatment. In this work, we leverage interactive machine learning and introduce a joint learning framework, termed DRG-Net, to effectively learn both disease grading and multi-lesion segmentation. Our DRG-Net consists of two modules: (i) DRG-AI-System, which classifies DR grading, localizes lesion areas, and provides visual explanations; and (ii) DRG-Expert-Interaction, which receives feedback from expert users and improves the DRG-AI-System. To deal with sparse data, we utilize transfer learning mechanisms to extract invariant feature representations using Wasserstein distance and adversarial-learning-based entropy minimization. In addition, we propose a novel attention strategy at both low- and high-level features to automatically select the most significant lesion information and provide explainable properties. In terms of human interaction, we further develop DRG-Net as a tool that enables expert users to correct the system's predictions, which may then be used to update the system as a whole. Moreover, thanks to the attention mechanism and the loss-function constraints between lesion features and classification features, our approach remains robust given a certain level of noise in user feedback. We benchmark DRG-Net on the two largest DR datasets, IDRiD and FGADR, and compare it to various state-of-the-art deep learning networks. In addition to outperforming other SOTA approaches, DRG-Net is effectively updated using user feedback, even in a weakly supervised manner.