Privacy noise may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP^2), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and non-convex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP^2 across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4x relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data.
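One way to see why a delayed preconditioner can be less noisy is the following worked step. It is a sketch under stated assumptions, not the paper's derivation: it assumes the preconditioner is built from an average of $s$ privatized gradients, each perturbed by independent Gaussian noise of scale $\sigma$. The symbols $s$, $\sigma$, $g_t$, and $z_t$ are introduced here for illustration only.

```latex
\tilde{g}_t = g_t + z_t, \qquad z_t \sim \mathcal{N}(0, \sigma^2 I) \ \text{independently for } t = 1, \dots, s

\bar{g} = \frac{1}{s} \sum_{t=1}^{s} \tilde{g}_t
        = \frac{1}{s} \sum_{t=1}^{s} g_t + \bar{z},
\qquad \bar{z} \sim \mathcal{N}\!\left(0, \tfrac{\sigma^2}{s} I\right)
```

Under these assumptions the noise variance in the averaged gradient shrinks by a factor of $s$, at the cost of the average describing gradients that are up to $s$ steps old. This is the trade-off between delay and privacy noise reduction that the abstract refers to.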
Household environments are visually diverse. Embodied agents performing Vision-and-Language Navigation (VLN) in the wild must be able to handle this diversity, while also following arbitrary language instructions. Recently, Vision-Language models like CLIP have shown great performance on the task of zero-shot object recognition. In this work, we ask if these models are also capable of zero-shot language grounding. In particular, we utilize CLIP to tackle the novel problem of zero-shot VLN using natural language referring expressions that describe target objects, in contrast to past work that used simple language templates describing object classes. We examine CLIP's capability in making sequential navigational decisions without any dataset-specific finetuning, and study how it influences the path that an agent takes. Our results on the coarse-grained instruction following task of REVERIE demonstrate the navigational capability of CLIP, surpassing the supervised baseline in terms of both success rate (SR) and success weighted by path length (SPL). More importantly, we show quantitatively that our CLIP-based zero-shot approach generalizes better, performing more consistently across environments than state-of-the-art fully supervised approaches as measured by Relative Change in Success (RCS).
The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks, and we use a broad set of ablations to highlight the importance of different components of our method.
Despite their popularity in deep learning and machine learning in general, the theoretical properties of adaptive optimizers such as Adagrad, RMSProp, Adam or AdamW are not yet fully understood. In this paper, we develop a novel framework to study the stability and generalization of these optimization methods. Based on this framework, we show provable guarantees about such properties that depend heavily on a single parameter $\beta_2$. Our empirical experiments support our claims and provide practical insights into the stability and generalization properties of adaptive optimization methods.
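Weight norm $\|w\|$ and margin $\gamma$ participate in learning theory via the normalized margin $\gamma/\|w\|$. Since standard neural net optimizers do not control normalized margin, it is hard to test whether this quantity causally relates to generalization. This paper designs a series of experimental studies that explicitly control normalized margin and thereby tackle two central questions. First: does normalized margin always have a causal effect on generalization? The paper finds that it does not: cases can be produced in which normalized margin appears to have no relationship with generalization, contrary to the theory of Bartlett et al. (2017). Second: does normalized margin ever have a causal effect on generalization? The paper finds that it does: in a standard training setup, test performance closely tracks normalized margin. The paper suggests a Gaussian process model as a promising explanation for this behavior.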
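As a small illustration of the quantity $\gamma/\|w\|$, the sketch below computes the normalized margin for a linear classifier with labels in {-1, +1}. The linear model, the function name `normalized_margin`, and the toy data are assumptions made for this example; the paper itself studies neural networks.

```python
import math


def normalized_margin(w, xs, ys):
    """Return gamma / ||w|| for a linear classifier (illustrative assumption).

    gamma is the smallest signed score y * <w, x> over the dataset,
    and ||w|| is the Euclidean norm of the weight vector.
    """
    margin = min(
        y * sum(wi * xi for wi, xi in zip(w, x))
        for x, y in zip(xs, ys)
    )
    weight_norm = math.sqrt(sum(wi * wi for wi in w))
    return margin / weight_norm


# Toy data: two points, one per class.
xs = [[1.0, 0.0], [0.0, -1.0]]
ys = [1, -1]

nm_small = normalized_margin([3.0, 4.0], xs, ys)
# Rescaling w by 10 grows the raw margin but leaves gamma / ||w|| unchanged.
nm_large = normalized_margin([30.0, 40.0], xs, ys)
```

The two calls return the same value because rescaling the weights multiplies both the margin and the norm by the same factor. This is why the raw margin alone says little, and why the normalized quantity is the one that appears in the theory.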
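Adaptive optimization methods have become the default solvers for many machine learning tasks. Unfortunately, the benefits of adaptivity may degrade when training with differential privacy, as the noise added to ensure privacy reduces the effectiveness of the adaptive preconditioner. To this end, we propose AdaDPS, a general framework that uses non-sensitive side information to precondition the gradients, allowing the effective use of adaptive methods in private settings. We formally show that AdaDPS reduces the amount of noise needed to achieve similar privacy guarantees, thereby improving optimization performance. Empirically, we leverage simple and readily available side information to explore the performance of AdaDPS in practice, comparing to strong baselines in both centralized and federated settings. Our results show that AdaDPS improves accuracy by 7.7% (absolute) on average, yielding state-of-the-art privacy-utility trade-offs on large-scale text and image benchmarks.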
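In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises a fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve robust and memory-efficient training via the following general recipe: (1) modify the architecture to make it scale invariant, i.e., the scale of the parameters does not affect the output of the network, (2) train with SGD and weight decay, and (3) clip the global gradient norm proportional to the weight norm multiplied by $\sqrt{\tfrac{2\lambda}{\eta}}$, where $\eta$ is the learning rate and $\lambda$ is the weight decay. We show that this general approach is robust to rescaling of the parameters and the loss by proving that its convergence depends only on the scale of the initialization and the loss, whereas standard SGD might not even converge for many initializations. Following our recipe, we design a scale-invariant version of BERT, called SIBERT, which, when trained simply by vanilla SGD, achieves performance comparable to BERT trained by adaptive methods on downstream tasks.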
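Steps (2) and (3) of the recipe can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes a proportionality constant of 1 in the clipping threshold, weight decay applied as an added `weight_decay * w` term in the gradient, and parameters stored as a flat list. The function names are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def clip_grad_to_weight_norm(grad, weights, lr, weight_decay):
    """Clip grad so that ||grad|| <= sqrt(2 * weight_decay / lr) * ||weights||.

    The proportionality constant is assumed to be 1 for this sketch.
    """
    max_norm = math.sqrt(2.0 * weight_decay / lr) * l2_norm(weights)
    grad_norm = l2_norm(grad)
    if grad_norm > max_norm:
        scale = max_norm / grad_norm
        return [g * scale for g in grad]
    return list(grad)


def sgd_step(weights, grad, lr, weight_decay):
    """One SGD step with weight decay on the clipped gradient."""
    clipped = clip_grad_to_weight_norm(grad, weights, lr, weight_decay)
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, clipped)]


# Example: ||weights|| = 5 and sqrt(2 * 0.25 / 0.5) = 1, so the threshold is 5.
weights = [3.0, 4.0]
big_grad = [30.0, 40.0]  # norm 50, will be rescaled to norm 5
clipped = clip_grad_to_weight_norm(big_grad, weights, lr=0.5, weight_decay=0.25)
new_weights = sgd_step(weights, big_grad, lr=0.5, weight_decay=0.25)
```

Because the threshold scales with the weight norm, rescaling the parameters rescales the permitted gradient norm by the same factor, which is in keeping with the scale invariance that step (1) builds into the architecture.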
