Low-rank and nonsmooth matrix optimization problems capture many fundamental tasks in statistics and machine learning. While significant progress has been made in recent years towards efficient methods for \textit{smooth} low-rank optimization problems that avoid maintaining high-rank matrices and computing expensive high-rank SVDs, advances for nonsmooth problems have been slow. In this paper we consider standard convex relaxations for such problems. Mainly, we prove that under a \textit{strict complementarity} condition and under the relatively mild assumption that the nonsmooth objective can be written as a maximum of smooth functions, approximated variants of two popular \textit{mirror-prox} methods: the \textit{extragradient method} and mirror-prox with \textit{matrix exponentiated gradient updates}, when initialized with a "warm-start", converge to an optimal solution with rate $O(1/t)$, while requiring only two \textit{low-rank} SVDs per iteration. Moreover, for the extragradient method we also consider a relaxed version of strict complementarity, which yields a trade-off between the rank of the SVDs required and the radius of the ball in which we need to initialize the method. We support our theoretical results with empirical experiments on several nonsmooth low-rank matrix recovery tasks, demonstrating both the plausibility of the strict complementarity assumption and the efficient convergence of our proposed low-rank mirror-prox variants.
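For intuition, a minimal numerical sketch of one mirror-prox / extragradient iteration, in which the exact projection onto a nuclear-norm ball is replaced by a rank-$r$ truncated one, is given below. This is a schematic illustration of the low-rank-SVD idea only, not the paper's exact algorithm; the helper names, the use of a full SVD (a real implementation would call a truncated solver), and the choice of feasible set are assumptions.

```python
import numpy as np

def project_capped_simplex(s, tau):
    """Euclidean projection of a nonnegative vector s onto {x >= 0, sum(x) <= tau}."""
    if s.sum() <= tau:
        return s
    u = np.sort(s)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - tau))[0][-1]
    theta = (css[rho] - tau) / (rho + 1.0)
    return np.maximum(s - theta, 0.0)

def rank_r_projection(Y, r, tau):
    """Rank-r surrogate for projecting Y onto the nuclear-norm ball of radius tau:
    only the top-r singular triplets are kept (the paper's point is that, under
    strict complementarity and with a warm start, such low-rank SVDs suffice)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)   # stand-in for a truncated SVD
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
    return (U * project_capped_simplex(s, tau)) @ Vt

def extragradient_step(X, grad, eta, r, tau):
    """One mirror-prox / extragradient iteration: extrapolate, then correct,
    each time using only a rank-r SVD. `grad` returns a (sub)gradient of the
    nonsmooth objective, e.g. the gradient of the currently-maximal smooth piece."""
    Y = rank_r_projection(X - eta * grad(X), r, tau)     # extrapolation point
    return rank_r_projection(X - eta * grad(Y), r, tau)  # corrected iterate
```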
Tyler's M-estimator is a well-known procedure for robust and heavy-tailed covariance estimation. Tyler himself suggested an iterative fixed-point algorithm for computing his estimator, however, it requires super-linear (in the size of the data) runtime per iteration, which may be prohibitive at large scale. In this work we propose, to the best of our knowledge, the first Frank-Wolfe-based algorithms for computing Tyler's estimator. One variant uses standard Frank-Wolfe steps, the second also considers \textit{away-steps} (AFW), and the third is a \textit{geodesic} version of AFW (GAFW). AFW provably requires, up to a log factor, only linear time per iteration, while GAFW runs in linear time (up to a log factor) in a large-$n$ (number of data points) regime. Under standard assumptions, all three variants are shown to converge to an optimal solution with sublinear rate, despite the fact that the underlying optimization problem is neither convex nor smooth. Under an additional fairly mild assumption, which holds with probability 1 when the (normalized) data points are i.i.d. samples from a continuous distribution supported on the entire unit sphere, AFW and GAFW are shown to converge with linear rate. Importantly, all three variants are parameter-free and use adaptive step-sizes.
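For reference, the classical fixed-point iteration mentioned above, whose super-linear per-iteration cost the Frank-Wolfe variants are designed to avoid, can be sketched in a few lines of numpy. The trace normalization and stopping rule used here are common conventions and are assumptions rather than the paper's exact choices.

```python
import numpy as np

def tyler_fixed_point(X, num_iters=100, tol=1e-8):
    """Tyler's classical fixed-point iteration for the M-estimator of scatter.
    X: (n, d) array of data points (rows). Returns a (d, d) scatter matrix
    normalized to have trace d."""
    n, d = X.shape
    Sigma = np.eye(d)
    for _ in range(num_iters):
        inv = np.linalg.inv(Sigma)
        # Mahalanobis-type weights x_i^T Sigma^{-1} x_i for every data point
        w = np.einsum('ij,jk,ik->i', X, inv, X)
        Sigma_new = (d / n) * (X.T * (1.0 / w)) @ X   # sum_i x_i x_i^T / w_i
        Sigma_new *= d / np.trace(Sigma_new)          # fix the scale (trace = d)
        if np.linalg.norm(Sigma_new - Sigma, 'fro') < tol:
            return Sigma_new
        Sigma = Sigma_new
    return Sigma
```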
We present new efficient \textit{projection-free} algorithms for online convex optimization (OCO), where by projection-free we refer to algorithms that avoid computing projections onto the feasible set and instead rely on different and potentially much more efficient oracles. While most state-of-the-art projection-free algorithms are based on the \textit{follow-the-leader} framework, our algorithms are fundamentally different and are based on the \textit{online gradient descent} algorithm, together with a novel and efficient approach for computing so-called \textit{infeasible projections}. As a consequence, we obtain the first projection-free algorithms which naturally yield \textit{adaptive regret} guarantees, i.e., regret bounds that hold w.r.t. any sub-interval of the sequence. Concretely, when assuming the availability of a linear optimization oracle (LOO) for the feasible set, on a sequence of length $T$, our algorithms guarantee $O(T^{3/4})$ adaptive regret and $O(T^{3/4})$ adaptive expected regret, for the full-information and bandit settings, respectively, using only $O(T)$ calls to the LOO. These bounds match the current state-of-the-art regret bounds for LOO-based projection-free OCO, which are \textit{not adaptive}. We also consider a new natural setting in which the feasible set is accessible through a separation oracle. We present algorithms which, using overall $O(T)$ calls to the separation oracle, guarantee $O(\sqrt{T})$ adaptive regret and $O(T^{3/4})$ adaptive expected regret for the full-information and bandit settings, respectively.
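The following is a rough sketch of the core idea in the LOO setting: online gradient descent in which the exact Euclidean projection is replaced by an "infeasible projection" computed with a few Frank-Wolfe-style steps, each costing one LOO call. It is a schematic reading of the abstract, not the paper's actual procedure; the stopping criterion, step-size rule, and helper names are assumptions.

```python
import numpy as np

def infeasible_projection(y, x_init, loo, eps, max_steps=50):
    """Sketch of an 'infeasible projection': run Frank-Wolfe steps on
    0.5*||x - y||^2 starting from the (possibly infeasible) previous iterate,
    each step using one call to the linear optimization oracle `loo`, and stop
    once the duality gap is small. The returned point need not be feasible."""
    x = x_init
    for _ in range(max_steps):
        g = x - y                    # gradient of 0.5*||x - y||^2
        v = loo(g)                   # argmin_{u in K} <g, u>
        gap = g @ (x - v)            # Frank-Wolfe duality gap
        if gap <= eps:
            break
        step = min(1.0, gap / np.dot(x - v, x - v))   # exact line search on [0, 1]
        x = x + step * (v - x)
    return x

def ogd_with_infeasible_projections(grads, x0, loo, eta, eps):
    """Schematic online-gradient-descent loop in which the exact projection is
    replaced by the (cheaper) infeasible projection above. `grads` yields the
    gradient of the t-th loss at the current iterate."""
    x = x0
    iterates = []
    for g in grads:
        iterates.append(x)
        x = infeasible_projection(x - eta * g, x, loo, eps)
    return iterates
```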
We consider convex optimization problems which are widely used as convex relaxations for low-rank matrix recovery problems. In particular, in several important problems, such as phase retrieval and robust PCA, the underlying assumption in many cases is that the optimal solution is rank-one. In this paper we consider a simple and natural condition on the objective under which the optimal solution to these relaxations is indeed unique and rank-one. Mainly, we show that under this condition the standard Frank-Wolfe method with line-search (i.e., without any tuning of parameters), which only requires a single rank-one SVD computation per iteration, finds an $\epsilon$-approximated solution in only $O(\log{1/\epsilon})$ iterations (as opposed to the previously best known bound of $O(1/\epsilon)$), despite the fact that the objective is not strongly convex. We consider several variants of the basic method with improved complexities, as well as an extension motivated by robust PCA, and finally, an extension to nonsmooth problems.
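As a concrete illustration of the per-iteration cost, below is a minimal sketch of Frank-Wolfe with line-search over a unit-trace spectrahedron, where the linear minimization oracle reduces to a single leading-eigenvector (rank-one) computation. The choice of feasible set, the grid-based line-search, and the eigensolver are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def frank_wolfe_spectrahedron(f, grad_f, X0, tau=1.0, num_iters=200):
    """Frank-Wolfe with line-search over {X PSD, trace(X) = tau}. X0 must lie in
    the set. The only spectral work per iteration is one leading-eigenvector
    computation (the rank-one 'SVD' mentioned in the abstract)."""
    X = X0
    for _ in range(num_iters):
        G = grad_f(X)
        # rank-one LMO: leading eigenvector of -G gives argmin_V <G, V>
        _, v = eigsh(-G, k=1, which='LA')
        V = tau * np.outer(v[:, 0], v[:, 0])
        D = V - X
        # crude grid line-search over step sizes in [0, 1] (the paper uses exact line-search)
        steps = np.linspace(0.0, 1.0, 51)
        step = min(steps, key=lambda s: f(X + s * D))
        X = X + step * D
    return X
```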
Reading comprehension of legal text can be a particularly challenging task due to the length and complexity of legal clauses and a shortage of expert-annotated datasets. To address this challenge, we introduce the Merger Agreement Understanding Dataset (MAUD), an expert-annotated reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points Study, with over 39,000 examples and over 47,000 total annotations. Our fine-tuned Transformer baselines show promising results, with models performing well above random on most questions. However, on a large subset of questions, there is still room for significant improvement. As the only expert-annotated merger agreement dataset, MAUD is valuable as a benchmark for both the legal profession and the NLP community.
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
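For readers unfamiliar with the sparsity patterns mentioned above, the sketch below shows what one-shot 50% unstructured pruning and the 2:4 semi-structured pattern mean for a single weight matrix. This is only a magnitude-based illustration of the patterns; SparseGPT itself selects and updates weights with a more accurate, Hessian-aware reconstruction that is not shown here.

```python
import numpy as np

def magnitude_prune(W, sparsity=0.5):
    """One-shot unstructured pruning baseline: zero out the smallest-magnitude
    fraction of weights (an illustration of '50% sparsity', not SparseGPT)."""
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) <= thresh, 0.0, W)

def prune_2_4(W):
    """Semi-structured 2:4 sparsity: in every group of 4 consecutive weights
    along a row, keep only the 2 with largest magnitude."""
    out = W.copy()
    rows, cols = W.shape
    assert cols % 4 == 0
    groups = out.reshape(rows, cols // 4, 4)
    # indices of the 2 smallest-magnitude entries in each group of 4
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)
```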
Despite the success of large language models (LLMs) in various natural language processing (NLP) tasks, the stored knowledge in these models may inevitably be incomplete, out-of-date, or incorrect. This motivates the need to utilize external knowledge to assist LLMs. Unfortunately, current methods for incorporating external knowledge often require additional training or fine-tuning, which can be costly and may not be feasible for LLMs. To address this issue, we propose a novel post-processing approach, rethinking with retrieval (RR), which retrieves relevant external knowledge based on the decomposed reasoning steps obtained from the chain-of-thought (CoT) prompting. This lightweight approach does not require additional training or fine-tuning and is not limited by the input length of LLMs. We evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks: commonsense reasoning, temporal reasoning, and tabular reasoning. Our results show that RR can produce more faithful explanations and improve the performance of LLMs.
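A highly simplified sketch of the kind of post-processing pipeline described above is given below. Everything in it is a placeholder: `llm`, `retrieve`, and the token-overlap faithfulness score are hypothetical stand-ins used only to show where retrieval enters relative to chain-of-thought decomposition, and they do not reproduce the paper's actual prompts or scoring.

```python
def overlap(step, docs):
    """Toy faithfulness score: token overlap between a reasoning step and its
    retrieved documents (a crude stand-in for the paper's scoring)."""
    step_tokens = set(step.lower().split())
    doc_tokens = set(" ".join(docs).lower().split())
    return len(step_tokens & doc_tokens) / max(1, len(step_tokens))

def rethinking_with_retrieval(question, llm, retrieve, n_chains=5):
    """Hypothetical sketch: sample several chain-of-thought answers, retrieve
    evidence for each decomposed reasoning step, and return the answer whose
    steps are best supported. `llm(prompt)` returns a string; `retrieve(query)`
    returns a list of strings."""
    candidates = []
    for _ in range(n_chains):
        cot = llm(question + "\nLet's think step by step.")   # CoT prompting
        steps = [s for s in cot.split("\n") if s.strip()]     # decomposed reasoning steps
        evidence = [retrieve(step) for step in steps]         # per-step retrieval
        support = sum(overlap(s, docs) for s, docs in zip(steps, evidence))
        candidates.append((support, cot))
    return max(candidates)[1]   # answer from the best-supported chain
```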
Model quantization enables the deployment of deep neural networks under resource-constrained devices. Vector quantization aims at reducing the model size by indexing model weights with full-precision embeddings, i.e., codewords, while the index needs to be restored to 32-bit during computation. Binary and other low-precision quantization methods can reduce the model size up to 32$\times$; however, this comes at the cost of a considerable accuracy drop. In this paper, we propose an efficient framework for ternary quantization to produce smaller and more accurate compressed models. By integrating hyperspherical learning, pruning and reinitialization, our proposed Hyperspherical Quantization (HQ) method reduces the cosine distance between the full-precision and ternary weights, thus reducing the bias of the straight-through gradient estimator during ternary quantization. Compared with existing work at similar compression levels ($\sim$30$\times$, $\sim$40$\times$), our method significantly improves the test accuracy and reduces the model size.
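To make the quantities in the abstract concrete, the sketch below shows a plain ternary projection and the cosine distance between full-precision and ternary weights that HQ aims to shrink. The threshold rule and scaling used here are generic illustrations, not the HQ procedure itself.

```python
import numpy as np

def ternarize(w, sparsity=0.5):
    """Plain ternary projection: weights below a magnitude threshold go to 0,
    the rest to +/- alpha, with alpha chosen as the mean magnitude of the kept
    weights (a common convention, assumed here)."""
    thresh = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) > thresh
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

def cosine_distance(w, t):
    """Cosine distance between full-precision and ternary weights -- the quantity
    the hyperspherical-learning step is meant to shrink, which in turn reduces
    the bias of the straight-through gradient estimator."""
    return 1.0 - (w @ t) / (np.linalg.norm(w) * np.linalg.norm(t) + 1e-12)
```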
Most existing pruning works are resource-intensive, requiring retraining or fine-tuning of the pruned models to recover accuracy. We propose a retraining-free pruning method based on hyperspherical learning and loss penalty terms. The proposed loss penalty term pushes some of the model weights far from zero, while the remaining weights are pushed near zero and can be safely pruned with no need for retraining and with a negligible accuracy drop. In addition, our proposed method can instantly recover the accuracy of a pruned model by replacing the pruned values with their mean value. Our method obtains state-of-the-art results in retraining-free pruning and is evaluated on ResNet-18/50 and MobileNetV2 with the ImageNet dataset. One can easily obtain a 50\% pruned ResNet-18 model with a 0.47\% accuracy drop. With fine-tuning, the experimental results show that our method can significantly boost the accuracy of pruned models compared with existing works. For example, the accuracy of a 70\% pruned (except the first convolutional layer) MobileNetV2 model drops by only 3.5\%, much less than the 7\%$\sim$10\% accuracy drop of conventional methods.
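The mean-replacement recovery step described above can be illustrated in a few lines. This sketch uses a simple magnitude threshold and omits the hyperspherical training and loss-penalty terms that the paper relies on to make the pruned weights cluster near zero in the first place; those details are assumptions beyond the abstract.

```python
import numpy as np

def prune_with_mean_replacement(W, sparsity=0.5):
    """Prune the smallest-magnitude weights, then replace the pruned entries with
    the mean of the pruned values instead of zero (the 'instant recovery' trick
    described in the abstract). Returns the adjusted weights and the prune mask."""
    thresh = np.quantile(np.abs(W), sparsity)
    pruned_mask = np.abs(W) <= thresh
    mean_val = W[pruned_mask].mean() if pruned_mask.any() else 0.0
    W_adjusted = np.where(pruned_mask, mean_val, W)
    return W_adjusted, pruned_mask
```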
Most of the existing works use projection functions for ternary quantization in discrete space. Scaling factors and thresholds are used in some cases to improve the model accuracy. However, the gradients used for optimization are inaccurate and result in a notable accuracy gap between the full-precision and ternary models. To obtain more accurate gradients, some works gradually increase the discrete portion of the full-precision weights in the forward propagation pass, e.g., using a temperature-based Sigmoid function. Instead of directly performing ternary quantization in discrete space, we push the full-precision weights close to ternary ones through a regularization term prior to ternary quantization. In addition, inspired by the temperature-based method, we introduce a re-scaling factor to obtain more accurate gradients by simulating the derivative of the Sigmoid function. The experimental results show that our method can significantly improve the accuracy of ternary quantization on both image classification and object detection tasks.
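Below is a small sketch of the two ingredients named above: a regularization term that pulls full-precision weights toward ternary values, and a temperature-based Sigmoid-derivative surrogate gradient with a re-scaling factor. The exact penalty, threshold rule, and re-scaling used in the paper are not specified in the abstract, so the forms here are assumptions.

```python
import numpy as np

def ternary_regularizer(w, alpha, thresh):
    """Assumed squared-distance penalty pulling each full-precision weight toward
    its nearest ternary value in {-alpha, 0, +alpha}; `thresh` decides which
    weights are attracted to 0 rather than to +/- alpha."""
    target = alpha * np.sign(w) * (np.abs(w) > thresh)
    return 0.5 * np.sum((w - target) ** 2)

def rescaled_sigmoid_grad(w, temperature=10.0, scale=1.0):
    """Surrogate gradient in the spirit of temperature-based Sigmoid methods: the
    derivative of sigmoid(temperature * w), multiplied by a re-scaling factor
    `scale`, used in place of the (zero almost everywhere) true derivative of the
    discrete ternary projection."""
    s = 1.0 / (1.0 + np.exp(-temperature * w))
    return scale * temperature * s * (1.0 - s)
```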