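When training overparameterized deep networks for classification tasks, it has been widely observed that the learned features exhibit a so-called "neural collapse" phenomenon. More specifically, for the output features of the penultimate layer, the within-class features of each class converge to their mean, and the means of different classes exhibit a certain tight frame structure, which is also aligned with the last layer's classifier. As feature normalization in the last layer has become a common practice in modern representation learning, in this work we theoretically justify the neural collapse phenomenon for normalized features. Based on an unconstrained feature model, we simplify the empirical loss function in a multi-class classification task by constraining all features and classifiers to the sphere. In this setting, we analyze the nonconvex landscape of the Riemannian optimization problem over the product of spheres, showing a benign global landscape in the sense that the only global minimizers are the neural collapse solutions, while all other critical points are strict saddles. Experimental results on practical deep networks corroborate our theory and demonstrate that better representations can be learned faster via feature normalization.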
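To make the geometry concrete, the following is a minimal sketch that assumes the "tight frame structure" of the class means is the simplex equiangular tight frame commonly reported in the neural collapse literature. It builds K unit-norm vectors on the sphere and checks that every pair has cosine similarity -1/(K-1). The function names `simplex_etf`, `normalize`, and `cosine` are hypothetical and chosen only for this illustration; they are not taken from the paper.

```python
import math


def simplex_etf(k):
    """Return k unit vectors in R^k forming a simplex equiangular tight frame.

    Vector j is column j of sqrt(k / (k - 1)) * (I - (1/k) * ones * ones^T).
    """
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for i in range(k)]
        for j in range(k)
    ]


def normalize(v):
    """Project a nonzero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity, computed on sphere-normalized copies of u and v."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


# Assumed toy setting: K = 4 classes, one (collapsed) mean per class.
K = 4
means = simplex_etf(K)
norms = [math.sqrt(sum(x * x for x in m)) for m in means]
pairwise = [cosine(means[i], means[j]) for i in range(K) for j in range(i + 1, K)]
print(norms)     # every mean already lies on the unit sphere
print(pairwise)  # every pair shares the same cosine, -1 / (K - 1)
```

In the collapsed configuration described above, the normalized last-layer classifier vectors would coincide with these class means.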
Translated by Google Translate
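Online optimization is a well-established optimization paradigm that aims to make a sequence of correct decisions given knowledge of the correct answers to previous decision tasks. Bilevel programming involves a hierarchical optimization problem in which the feasible region of the so-called outer problem is restricted by the solution set mapping of the inner problem. This paper brings these two ideas together and studies an online bilevel optimization setting in which a sequence of time-varying bilevel problems is revealed one after another. We extend the known regret bounds for single-level online algorithms to the bilevel setting. Specifically, we introduce new notions of bilevel regret, develop an online alternating time-averaged gradient method that is able to leverage smoothness, and provide regret bounds in terms of the path length of the inner and outer minimizer sequences.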
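The K-subspaces (KSS) method is a generalization of the K-means method for subspace clustering. In this work, we present a local convergence analysis and a recovery guarantee for KSS, assuming the data are generated by the semi-random union of subspaces model, in which $N$ points are randomly sampled from $K \ge 2$ overlapping subspaces. We show that if the initial assignment of the KSS method lies within a neighborhood of a true clustering, it converges at a superlinear rate and finds the correct clustering within $\Theta(\log \log N)$ iterations. Moreover, we propose a thresholding inner-product based spectral method for initialization and prove that it produces a point in this neighborhood. We also present numerical results for the studied method to support our theoretical developments.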
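The alternating structure of KSS can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes every subspace has the same known dimension, assigns each point to the subspace onto which its projection is largest, and refits each basis from the top eigenvectors of the cluster's scatter matrix. The function name `kss`, the toy data, and the hand-picked initial bases (standing in for the spectral initialization) are all hypothetical.

```python
import numpy as np


def kss(X, init_bases, n_iter=20):
    """Alternate between point assignment and subspace refitting.

    X: (n, D) array of data points (rows).
    init_bases: list of (D, d) arrays with orthonormal columns.
    Returns the final labels and the list of fitted bases.
    """
    bases = [U.copy() for U in init_bases]
    d = bases[0].shape[1]
    labels = None
    for _ in range(n_iter):
        # Assignment step: largest projection norm wins.
        proj = np.stack([np.linalg.norm(X @ U, axis=1) for U in bases], axis=1)
        new_labels = proj.argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Update step: top-d eigenvectors of each cluster's scatter matrix.
        for k in range(len(bases)):
            pts = X[labels == k]
            if len(pts) == 0:
                continue  # keep the previous basis for an empty cluster
            _, vecs = np.linalg.eigh(pts.T @ pts)  # eigenvalues ascending
            bases[k] = vecs[:, -d:]
    return labels, bases


# Toy data (assumed): three points on the x-axis, three on the y-axis of R^3.
X = np.array([[1.0, 0, 0], [2.0, 0, 0], [-1.5, 0, 0],
              [0, 1.0, 0], [0, -2.0, 0], [0, 0.5, 0]])
u1 = np.array([[1.0], [0.3], [0.0]])
u2 = np.array([[0.2], [1.0], [0.1]])
init = [u1 / np.linalg.norm(u1), u2 / np.linalg.norm(u2)]
labels, bases = kss(X, init)
print(labels.tolist())
```

The initial bases here are deliberately close to the true lines, mirroring the local (neighborhood) setting of the analysis.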
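Inference of community structure in probabilistic graphical models may not be consistent with fairness constraints when nodes have demographic attributes. Certain demographics may be over-represented in some detected communities and under-represented in others. This paper defines a novel $\ell_1$-regularized pseudo-likelihood approach for fair graphical model selection. In particular, we assume there is some community or clustering structure in the true underlying graph, and we seek to learn a sparse undirected graph and its communities from the data such that demographic groups are fairly represented within the communities. Our optimization approach uses the demographic parity definition of fairness, but the framework is easily extended to other definitions of fairness. We establish statistical consistency of the proposed method for both a Gaussian graphical model and an Ising model, for continuous and binary data respectively, proving that our method can recover the graphs and their fair communities with high probability.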
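Learning how to effectively control unknown dynamical systems is crucial for intelligent autonomous systems. This task becomes a significant challenge when the underlying dynamics change over time. Motivated by this challenge, this paper considers the problem of controlling an unknown Markov jump linear system (MJS) to optimize a quadratic objective. Taking a model-based perspective, we consider identification-based adaptive control for MJSs. We first provide a system identification algorithm that learns the MJS, including what underlies the evolution of the mode switches, from a single trajectory of the system states, inputs, and modes. Through mixing-time arguments, the sample complexity of this algorithm is shown to be $\mathcal{O}(1/\sqrt{T})$. We then propose an adaptive control scheme that performs system identification together with certainty equivalent control to adapt the controllers in an episodic fashion. Combining our sample complexity results with recent perturbation results for certainty equivalent control, we prove that when the episode lengths are appropriately chosen, the proposed adaptive control scheme achieves $\mathcal{O}(\sqrt{T})$ regret, which can be improved to $\mathcal{O}(\mathrm{polylog}(T))$ with partial knowledge of the system. Our proof strategy introduces innovations to handle Markovian jumps and a weaker notion of stability common in MJSs. Our analysis provides insights into the system-theoretic quantities that affect learning accuracy and control performance. Numerical simulations are presented to further reinforce these insights.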
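Methods for supervised principal component analysis (SPCA) aim to incorporate label information into principal component analysis (PCA), so that the extracted features are more useful for a prediction task of interest. Prior work on SPCA has focused primarily on optimizing prediction error and has neglected the value of maximizing the variance explained by the extracted features. We propose a new method for SPCA that addresses both of these objectives jointly, and we demonstrate empirically that our approach dominates existing approaches, i.e., it outperforms them with respect to both prediction error and variation explained. Our approach accommodates arbitrary supervised learning losses and, through a statistical reformulation, provides a novel low-rank extension of generalized linear models.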
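We propose a new fast streaming algorithm for imputing the missing entries of a low-tubal-rank tensor, using the tensor singular value decomposition (t-SVD) algebraic framework. We show that the t-SVD is a specialization of the well-studied block-term decomposition for third-order tensors, and we present an algorithm under this model that can track changing free submodules from incomplete streaming 2-D data. The proposed algorithm uses principles from incremental gradient descent on the Grassmann manifold of subspaces to solve the tensor completion problem with linear complexity and constant memory in the number of time samples. We provide a local expected linear convergence result for our algorithm. Our empirical results are competitive in accuracy but faster in compute time than state-of-the-art tensor completion algorithms on real applications, recovering temporal chemo-sensing and MRI data under limited sampling.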
Models of sensory processing and learning in the cortex need to efficiently assign credit to synapses in all areas. In deep learning, a known solution is error backpropagation, which however requires biologically implausible weight transport from feed-forward to feedback paths. We introduce Phaseless Alignment Learning (PAL), a bio-plausible method to learn efficient feedback weights in layered cortical hierarchies. This is achieved by exploiting the noise naturally found in biophysical systems as an additional carrier of information. In our dynamical system, all weights are learned simultaneously with always-on plasticity and using only information locally available to the synapses. Our method is completely phase-free (no forward and backward passes or phased learning) and allows for efficient error propagation across multi-layer cortical hierarchies, while maintaining biologically plausible signal transport and learning. Our method is applicable to a wide class of models and improves on previously known biologically plausible ways of credit assignment: compared to random synaptic feedback, it can solve complex tasks with fewer neurons and learn more useful latent representations. We demonstrate this on various classification tasks using a cortical microcircuit model with prospective coding.
Language models (LMs) have demonstrated remarkable performance on downstream tasks, using in-context exemplars or human instructions. Recent works have shown that chain-of-thought (CoT) prompting can elicit models to solve complex reasoning tasks, step-by-step. However, the efficacy of prompt-based CoT methods is restricted to very large LMs such as GPT-3 (175B), thus limiting deployability. In this paper, we revisit the fine-tuning approach to enable complex reasoning in smaller LMs, optimized to efficiently perform a specific task. We propose Fine-tune-CoT, a method that leverages the capabilities of very large LMs to generate reasoning samples and teach smaller models via fine-tuning. We evaluate our method on publicly available LMs across a wide range of complex tasks and model sizes. We find that Fine-tune-CoT enables substantial reasoning capability in small models, whereas previous prompt-based baselines exhibit near-random performance. Student models can even outperform the teacher in some tasks while reducing model size requirements by several orders of magnitude. We conduct extensive ablations and sample studies to understand the reasoning capabilities of student models. We also identify several important nuances that have been overlooked in concurrent fine-tuning works on CoT and address them in our analysis.
After just a few hundred training updates, a standard probabilistic model for language generation has likely not yet learnt many semantic or syntactic rules of natural language, which inherently makes it difficult to estimate the right probability distribution over next tokens. Yet around this point, these models have identified a simple, loss-minimising behaviour: to output the unigram distribution of the target training corpus. The use of such a crude heuristic raises the question: Rather than wasting precious compute resources and model capacity for learning this strategy at early training stages, can we initialise our models with this behaviour? Here, we show that we can effectively endow our model with a separate module that reflects unigram frequency statistics as prior knowledge. Standard neural language generation architectures offer a natural opportunity for implementing this idea: by initialising the bias term in a model's final linear layer with the log-unigram distribution. Experiments in neural machine translation demonstrate that this simple technique: (i) improves learning efficiency; (ii) achieves better overall performance; and (iii) appears to disentangle strong frequency effects, encouraging the model to specialise in non-frequency-related aspects of language.
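The initialisation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes token ids in the range 0 to vocab_size - 1 and adds add-one smoothing (our assumption, not stated in the abstract) so that unseen tokens do not produce log(0). The helper names `log_unigram_bias` and `softmax` are hypothetical.

```python
import math
from collections import Counter


def log_unigram_bias(token_ids, vocab_size, smoothing=1.0):
    """Return a bias vector equal to the log of the smoothed unigram distribution."""
    counts = Counter(token_ids)
    total = len(token_ids) + smoothing * vocab_size
    return [
        math.log((counts.get(t, 0) + smoothing) / total)
        for t in range(vocab_size)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Toy target-side corpus (assumed): token 0 three times, 1 twice, 2 once, 3 never.
corpus = [0, 0, 0, 1, 1, 2]
bias = log_unigram_bias(corpus, vocab_size=4)

# If the final linear layer contributes nothing beyond its bias (for example,
# all-zero weights), the logits equal the bias and the softmax output is the
# smoothed unigram distribution itself.
probs = softmax(bias)
print(probs)
```

In a real model the resulting vector would be copied into the bias of the output projection before training starts; the weights would keep their usual initialisation.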