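Analyzing geometric properties of high-dimensional loss functions, such as local curvature and the existence of other optima around a given point in loss space, can help provide a better understanding of the interplay between neural network structure, implementation attributes, and learning performance. In this work, we combine concepts from high-dimensional probability and differential geometry to study how curvature properties in lower-dimensional loss representations depend on the curvature in the original loss space. We show that saddle points in the original space are rarely identified correctly in lower-dimensional representations if random projections are used. In such projections, the expected curvature in the lower-dimensional representation is proportional to the mean curvature in the original loss space. Hence, the mean curvature in the original loss space determines whether saddle points appear, on average, as minima, maxima, or almost flat regions. We use the connection between expected curvature and mean curvature (i.e., the normalized Hessian trace) to estimate the trace of the Hessian without computing the Hessian or Hessian-vector products as in Hutchinson's method. Because random projections cannot correctly identify saddle information, we propose to study projections along the Hessian directions associated with the largest and smallest principal curvatures. We connect our findings to the ongoing debate on loss landscape flatness and generalizability. Finally, we illustrate our method in numerical experiments on different image classifiers with up to $7 \times 10^6$ parameters.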
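The link between expected curvature and the normalized Hessian trace can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes directions drawn uniformly from the unit sphere, for which $E[u^\top H u] = \mathrm{tr}(H)/n$, and approximates each directional curvature with a central second difference of the loss. The function name `estimate_hessian_trace`, the step size `eps`, and the toy quadratic loss are illustrative choices.

```python
import numpy as np

def estimate_hessian_trace(loss, theta, num_directions=2000, eps=1e-3, seed=0):
    """Estimate tr(H) of `loss` at `theta` from curvatures along random directions."""
    rng = np.random.default_rng(seed)
    n = theta.size
    base = loss(theta)
    curvatures = []
    for _ in range(num_directions):
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)  # uniform random direction on the unit sphere
        # Central second difference approximates the directional curvature u^T H u.
        c = (loss(theta + eps * u) - 2.0 * base + loss(theta - eps * u)) / eps**2
        curvatures.append(c)
    # E[u^T H u] = tr(H) / n for u uniform on the unit sphere.
    return n * float(np.mean(curvatures))

# Toy quadratic loss with a saddle at the origin: Hessian = diag(d), tr(H) = 5.5.
d = np.array([3.0, 1.0, -0.5, -2.0, 4.0])
quadratic = lambda th: 0.5 * np.sum(d * th**2)
trace_est = estimate_hessian_trace(quadratic, np.zeros(5))
```

Only loss evaluations are needed here, with no Hessian and no Hessian-vector products. In this toy case the mean curvature is positive although the point is a saddle, so an average random projection would make it look like a minimum.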
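Optimal control problems arise naturally in many scientific applications where one wishes to steer a dynamical system from a certain initial state $\mathbf{x}_0$ to a desired target state $\mathbf{x}^*$ in finite time $T$. Recent advances in deep learning and neural network-based optimization have contributed to the development of methods that can help solve control problems involving high-dimensional dynamical systems. In particular, the framework of neural ordinary differential equations (neural ODEs) provides an efficient means to iteratively approximate continuous-time control functions associated with analytically intractable and computationally demanding control tasks. Although neural ODE controllers have shown great potential in solving complex control problems, the understanding of the effects of hyperparameters such as network structure and optimizers remains very limited. Our work aims to address some of these knowledge gaps in order to enable efficient hyperparameter optimization. To this end, we first analyze how truncated and non-truncated backpropagation through time affect runtime performance and the ability of neural networks to learn optimal control functions. Using analytical and numerical methods, we then study the role of parameter initialization, optimizers, and neural network architecture. Finally, we connect our results to the ability of neural ODE controllers to implicitly regularize control energy.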
In the era of noisy intermediate-scale quantum devices, variational quantum circuits (VQCs) are currently one of the main strategies for building quantum machine learning models. These models are made up of a quantum part and a classical part. The quantum part is given by a parametrization $U$, which, in general, is obtained from the product of different quantum gates. The classical part, in turn, corresponds to an optimizer that updates the parameters of $U$ in order to minimize a cost function $C$. However, despite the many applications of VQCs, there are still questions to be answered, such as: What is the best sequence of gates to use? How should their parameters be optimized? Which cost function should be used? How does the architecture of the quantum chips influence the final results? In this article, we focus on answering the last question. We show that, in general, the cost function tends to a typical average value the closer the parametrization used is to a $2$-design. Therefore, the closer this parametrization is to a $2$-design, the less the result of the quantum neural network model depends on its parametrization. As a consequence, we can use the architecture of the quantum chips itself to define the VQC parametrization, avoiding the use of additional swap gates and thus reducing the VQC depth and the associated errors.
Dataset scaling, also known as normalization, is an essential preprocessing step in a machine learning pipeline. It aims to adjust attribute scales so that they all vary within the same range. This transformation is known to improve the performance of classification models, but there are several scaling techniques to choose from, and this choice is generally not made carefully. In this paper, we carry out a broad experiment comparing the impact of 5 scaling techniques on the performance of 20 classification algorithms, covering both monolithic and ensemble models, applying them to 82 publicly available datasets with varying imbalance ratios. Results show that the choice of scaling technique matters for classification performance, and the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases. They also indicate that choosing an inadequate technique can be more detrimental to classification performance than not scaling the data at all. We also show how the performance variation of an ensemble model, considering different scaling techniques, tends to be dictated by that of its base model. Finally, we discuss the relationship between a model's sensitivity to the choice of scaling technique and its performance, and provide insights into its applicability in different model deployment scenarios. Full results and source code for the experiments in this paper are available in a GitHub repository: https://github.com/amorimlb/scaling_matters
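To make the notion of a scaling technique concrete, here is a minimal sketch of two widely used ones, min-max scaling and standardization. The abstract does not name the 5 techniques it compares, so these two are assumptions chosen for illustration, and the helper names `fit_min_max` and `fit_standard` are hypothetical. Each scaler is fitted on the training data only and then applied to held-out data.

```python
import numpy as np

def fit_min_max(X):
    """Return a function that maps each column of X's range onto [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant columns
    return lambda Z: (Z - lo) / span

def fit_standard(X):
    """Return a function that centers each column and scales it to unit variance."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma > 0, sigma, 1.0)  # leave constant columns unscaled
    return lambda Z: (Z - mu) / sigma

# Two attributes on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.5, 400.0]])

min_max = fit_min_max(X_train)
standard = fit_standard(X_train)
X_train_mm = min_max(X_train)    # each column now spans [0, 1]
X_train_std = standard(X_train)  # each column now has mean 0 and std 1
X_test_mm = min_max(X_test)      # test data reuses the training statistics
```

Both transforms put the two attributes on a comparable range, but they react differently to outliers and skew, which is one reason the choice can affect a classifier.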
We describe a Physics-Informed Neural Network (PINN) that simulates the flow induced by the astronomical tide in a synthetic port channel, with dimensions based on the Santos - São Vicente - Bertioga Estuarine System. PINN models aim to combine the knowledge of physical systems and data-driven machine learning models. This is done by training a neural network to minimize the residuals of the governing equations at sample points. In this work, our flow is governed by the Navier-Stokes equations with some approximations. There are two main novelties in this paper. First, we design our model to assume that the flow is periodic in time, which is not feasible in conventional simulation methods. Second, we evaluate the benefit of resampling the function evaluation points during training, which has a near-zero computational cost and has been verified to improve the final model, especially for small batch sizes. Finally, we discuss some limitations of the approximations used in the Navier-Stokes equations regarding the modeling of turbulence and how it interacts with PINNs.
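The two ideas named above, time periodicity and resampling of evaluation points, can be sketched as follows. The abstract does not say how periodicity is imposed, so this sketch assumes one common option: feeding the network $\sin$ and $\cos$ of the phase instead of raw time. The period `PERIOD`, the channel length, and both function names are hypothetical values and names used only for illustration.

```python
import numpy as np

PERIOD = 12.0  # hypothetical period in hours, for illustration only

def periodic_time_features(t, period=PERIOD):
    """Map time to (sin, cos) of its phase, so any function of the features is periodic in t."""
    phase = 2.0 * np.pi * np.asarray(t) / period
    return np.stack([np.sin(phase), np.cos(phase)], axis=-1)

def sample_collocation_points(rng, batch_size, length=1.0, period=PERIOD):
    """Draw fresh (x, t) points at which the equation residuals would be evaluated."""
    x = rng.uniform(0.0, length, size=batch_size)
    t = rng.uniform(0.0, period, size=batch_size)
    return x, t

rng = np.random.default_rng(0)
# Resampling: every training step gets a new batch instead of reusing a fixed set.
batches = [sample_collocation_points(rng, 4) for _ in range(3)]

# The features at t and t + PERIOD coincide, so the model output would too.
f0 = periodic_time_features(np.array([1.5, 7.0]))
f1 = periodic_time_features(np.array([1.5, 7.0]) + PERIOD)
```

Drawing a new batch costs only a few random numbers per step, which is consistent with the near-zero overhead the abstract reports for resampling.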
Language modeling, a central task in natural language processing, involves estimating a probability distribution over strings. In most cases, the estimated distribution sums to 1 over all finite strings. However, in some pathological cases, probability mass can "leak" onto the set of infinite sequences. In order to characterize the notion of leakage more precisely, this paper offers a measure-theoretic treatment of language modeling. We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense. We also generalize characterizations of tightness proposed in previous works.
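A toy calculation can make leakage concrete. The sketch below is not taken from the paper: it assumes a simplified model that emits the end-of-sequence symbol at step $t$ with probability `eos_prob(t)` regardless of history, so the mass on strings of length at most $T$ is $1 - \prod_{t=1}^{T} (1 - p_t)$. The function name `finite_string_mass` and both example schedules are illustrative.

```python
import math

def finite_string_mass(eos_prob, max_len):
    """Probability that generation stops within `max_len` steps.

    `eos_prob(t)` is the probability of emitting EOS at step t (t = 1, 2, ...),
    given that the sequence has not ended earlier.
    """
    log_survive = 0.0
    for t in range(1, max_len + 1):
        # Accumulate log(1 - p_t) to keep the product numerically stable.
        log_survive += math.log1p(-eos_prob(t))
    return 1.0 - math.exp(log_survive)

# Constant EOS probability: the mass on finite strings approaches 1 (tight).
tight_mass = finite_string_mass(lambda t: 0.1, 10_000)

# EOS probability decaying like 1/(t+1)^2: the survival product converges to 1/2,
# so half of the probability mass ends up on infinite sequences (not tight).
leaky_mass = finite_string_mass(lambda t: 1.0 / (t + 1) ** 2, 10_000)
```

In the second schedule the EOS probabilities shrink so quickly that their sum is finite, which is what lets the model keep generating forever with positive probability.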
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of the challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for method development. 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision
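The speed/accuracy tradeoff and the idea of drawing a patch size at random can be sketched in a few lines. This is an illustration under stated assumptions, not the FlexiViT code: the 240x240 image size, the list `patch_sizes`, and the helper `patchify` are hypothetical, and the sketch omits how a real model would adapt its patch-embedding and position-embedding parameters to each patch size.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into a (num_patches, patch_size*patch_size*C) sequence."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "patch size must divide image size"
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((240, 240, 3))
patch_sizes = [8, 12, 16, 24, 30, 48]  # hypothetical choices that divide 240

# Smaller patches give longer sequences, hence more compute per image.
seq_lengths = {p: patchify(image, p).shape[0] for p in patch_sizes}

# One simulated training step: draw a patch size at random, then tokenize.
p = int(rng.choice(patch_sizes))
tokens = patchify(image, p)
```

With these assumed sizes the sequence length ranges from 900 tokens at patch size 8 down to 25 tokens at patch size 48, which is the range a single set of weights would have to cover.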
We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects. The shape and coordinate system of the novel object are provided as inputs to the network by rendering multiple synthetic views of the object's CAD model. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner. Third, we introduce a large-scale synthetic dataset of photorealistic images of thousands of objects with diverse visual and shape properties and show that this diversity is crucial to obtain good generalization performance on novel objects. We train our approach on this large synthetic dataset and apply it without retraining to hundreds of novel objects in real images from several pose estimation benchmarks. Our approach achieves state-of-the-art performance on the ModelNet and YCB-Video datasets. An extensive evaluation on the 7 core datasets of the BOP challenge demonstrates that our approach achieves performance competitive with existing approaches that require access to the target objects during training. Code, dataset and trained models are available on the project page: https://megapose6d.github.io/.