Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
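A minimal sketch of the upcycling idea described in this abstract (not the authors' released code): each expert in the new Mixture-of-Experts layer is initialized as a copy of the pretrained dense FFN, and only the router starts from a fresh initialization. Class and argument names are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn


class UpcycledMoE(nn.Module):
    """Hypothetical MoE block built by copying a pretrained dense FFN into every expert."""

    def __init__(self, dense_ffn: nn.Module, d_model: int, num_experts: int = 8, top_k: int = 1):
        super().__init__()
        # Reuse the sunk pretraining cost: every expert starts as the dense FFN.
        self.experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
        # The router has no dense counterpart, so it is freshly initialized.
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)
        weights, indices = torch.topk(gate, self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out


# Usage sketch: swap a subset of a pretrained Transformer's FFN blocks for
# UpcycledMoE(dense_ffn=block.ffn, d_model=768), then continue training.
```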
Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI generates text from visual and textual inputs, and uses this interface to perform many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing language Transformers are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes, from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19-task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.
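A hedged sketch of the kind of size-based fine-tuning heuristic the abstract alludes to (BiT's "HyperRule" picks its settings from the target dataset size alone); the thresholds and values below are illustrative assumptions, not the published rule.

```python
def finetune_schedule(num_examples: int) -> dict:
    """Pick fine-tuning settings purely from the size of the target dataset."""
    if num_examples < 20_000:        # small regime, e.g. few examples per class
        steps = 500
    elif num_examples < 500_000:     # medium regime
        steps = 10_000
    else:                            # large regime
        steps = 20_000
    return {
        "total_steps": steps,
        "base_lr": 0.003,                          # scaled with batch size in practice
        "lr_decay_milestones": [0.3, 0.6, 0.9],    # fractions of total_steps
        "use_mixup": num_examples >= 20_000,       # regularize more on larger targets
    }


print(finetune_schedule(50_000))
```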
Synaptic plasticity allows cortical circuits to learn new tasks and to adapt to changing environments. How do cortical circuits use plasticity to acquire functions such as decision-making or working memory? Neurons are connected in complex ways, forming recurrent neural networks, and learning modifies the strength of their connections. Moreover, neurons communicate by emitting brief, discrete electrical signals. Here we describe how to train recurrent neural networks on tasks like those used to train animals in neuroscience laboratories, and how computations emerge in the trained networks. Surprisingly, artificial networks and real brains can use similar computational strategies.
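A minimal sketch (an assumption, not the authors' setup, and rate-based rather than spiking) of training a recurrent network on a laboratory-style perceptual decision task: report the sign of a noisy evidence stream from the network's final state.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=1, hidden_size=64, batch_first=True)
readout = nn.Linear(64, 2)
opt = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)

for step in range(2000):
    coherence = torch.randn(32, 1, 1) * 0.5         # trial-specific mean evidence
    evidence = coherence + torch.randn(32, 50, 1)   # 50 noisy time steps per trial
    target = (coherence.squeeze() > 0).long()       # decision: sign of the mean evidence
    _, h = rnn(evidence)                            # h: final hidden state of each trial
    loss = nn.functional.cross_entropy(readout(h.squeeze(0)), target)
    opt.zero_grad(); loss.backward(); opt.step()
```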
Generative Adversarial Networks (GANs) were introduced by Goodfellow in 2014, and have since become popular for constructing generative artificial intelligence models. However, such networks have numerous drawbacks, such as long training times, sensitivity to hyperparameter tuning, the many types of loss and optimization functions to choose from, and other difficulties like mode collapse. Current applications of GANs include generating photo-realistic human faces, animals, and objects. However, I wanted to explore the artistic ability of GANs in more detail, by using existing models and learning from them. This dissertation covers the basics of neural networks and works its way up to the particular aspects of GANs, together with experimentation on and modification of existing available models, from least complex to most. The intention is to see whether state-of-the-art GANs (specifically StyleGAN2) can generate album art covers and whether it is possible to tailor them by genre. This was attempted by first familiarizing myself with three existing GAN architectures, including the state-of-the-art StyleGAN2. The StyleGAN2 code was then used to train a model on a dataset containing 80K album cover images, which was in turn used to generate styled images by picking curated images and mixing their styles.
Solar forecasting from ground-based sky images using deep learning models has shown great promise in reducing the uncertainty in solar power generation. One of the biggest challenges for training deep learning models is the availability of labeled datasets. With more and more sky image datasets open sourced in recent years, the development of accurate and reliable solar forecasting methods has seen a huge growth in potential. In this study, we explore three different training strategies for deep-learning-based solar forecasting models by leveraging three heterogeneous datasets collected around the world with drastically different climate patterns. Specifically, we compare the performance of models trained individually on local datasets (local models) and models trained jointly on the fusion of multiple datasets from different locations (global models), and we further examine the knowledge transfer from pre-trained solar forecasting models to a new dataset of interest (transfer learning models); a sketch of these strategies follows this abstract. The results suggest that the local models work well when deployed locally, but significant errors in the scale of the predictions are observed when they are applied off-site. The global model can adapt well to individual locations, although the possible increase in training effort needs to be taken into account. Pre-training models on a large and diversified source dataset and transferring to a local target dataset generally achieves superior performance over the other two training strategies. Transfer learning brings the most benefits when local data are limited: with 80% less training data, it can achieve a 1% improvement over the local baseline model trained using the entire dataset. Therefore, we call on the efforts from the solar forecasting community to contribute to a global dataset containing a massive amount of imagery and displaying diversified samples with a range of sky conditions.
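A minimal, runnable sketch of the transfer-learning strategy described above: pretrain a forecasting model on pooled source-site data, then fine-tune on a small target-site dataset with a lower learning rate. The toy features, targets, and model here are stand-ins, not the study's actual pipeline.

```python
import torch
import torch.nn as nn

def fit(model, x, y, epochs=100, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

# Pretrain on pooled data from several source sites (stand-ins for sky-image features).
x_src, y_src = torch.randn(5000, 16), torch.randn(5000, 1)
fit(model, x_src, y_src)

# Fine-tune on a small target-site dataset with a reduced learning rate.
x_tgt, y_tgt = torch.randn(200, 16), torch.randn(200, 1)
fit(model, x_tgt, y_tgt, epochs=50, lr=1e-4)
```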
Research on interpersonal conflict has a long history and contains many proposals for conflict typologies. We use these as the basis for a new annotation scheme and release a new dataset of situations annotated with conflict aspects. We then build a classifier to predict whether a person's action in a given situation will be seen as right or wrong, outperforming prior work on this task. Our analysis includes conflict aspects, but also yields clusters that are validated by humans and shows differences in conflict content depending on the participants' relationship to the author. Our findings have important implications for understanding conflict and social norms.
We study two fundamental models (or \emph{ansätze}) of antisymmetric functions, that is, functions $f$ of the form $f(x_{\sigma(1)}, \ldots, x_{\sigma(n)}) = \text{sign}(\sigma) f(x_1, \ldots, x_n)$, where $\sigma$ is any permutation. These arise in the context of quantum chemistry and are the basic modeling tools for wave functions of Fermionic systems. Specifically, we consider two popular antisymmetric ansätze: the Slater representation, which leverages the alternating structure of determinants, and the Jastrow ansatz, which augments Slater determinants with a product with an arbitrary symmetric function. We construct an antisymmetric function that can be efficiently expressed in Jastrow form, yet cannot be approximated by Slater determinants unless exponentially (in $n^2$) many terms are used. This represents the first explicit quantitative separation between these two ansätze.
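For reference, the two ansätze contrasted above can be written in their standard forms; this is a sketch of the usual definitions, and the paper's exact conventions (for example, how many terms are allowed) may differ.

```latex
% Slater representation: a sum of determinants of single-particle functions
f_{\mathrm{Slater}}(x_1,\ldots,x_n) \;=\; \sum_{k=1}^{K} \det\big[\phi^{(k)}_i(x_j)\big]_{i,j=1}^{n}

% Jastrow ansatz: a Slater determinant multiplied by an arbitrary symmetric factor
f_{\mathrm{Jastrow}}(x_1,\ldots,x_n) \;=\; p(x_1,\ldots,x_n)\,\det\big[\phi_i(x_j)\big]_{i,j=1}^{n},
\qquad p \text{ symmetric under permutations of its arguments}
```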
Datasets play a central role in the training and evaluation of machine learning (ML) models. But they are also the root cause of many undesired model behaviors, such as biased predictions. To overcome this situation, the ML community is proposing a data-centric cultural shift, in which data issues are given the attention they deserve and more standard practices around the collection and processing of datasets start to be discussed and established. So far, these proposals are mostly high-level guidelines described in natural language and, as such, they are difficult to formalize and apply to particular datasets. In this sense, and inspired by these proposals, we define a new domain-specific language (DSL) to precisely describe machine learning datasets in terms of their structure, data provenance, and social concerns. We believe this DSL will facilitate any ML initiative to leverage and benefit from this data-centric shift in ML (e.g., selecting the most appropriate dataset for a new project or better replicating other ML results). The DSL is implemented as a Visual Studio Code plugin and has been released under an open source license.
Graph self-supervised learning (GSSL) paves the way for learning graph embeddings without expert annotation, which is particularly impactful for molecular graphs, since the number of possible molecules is enormous and labels are expensive to obtain. However, by design, GSSL methods are not trained to perform well on a single downstream task, but rather aim to transfer to many, making evaluation less straightforward. As a step toward obtaining profiles of molecular graph embeddings with diverse and interpretable properties, we introduce Molecular Graph Representation Evaluation (MolGraphEval), a suite of probe tasks categorized into (i) topological, (ii) substructure, and (iii) embedding-space properties. By benchmarking existing GSSL methods on both existing downstream datasets and MolGraphEval, we find surprising discrepancies between conclusions drawn from existing datasets alone and from more fine-grained probing, suggesting that current evaluation protocols do not provide the whole picture. Our modular, automated end-to-end GSSL pipeline code will be released upon acceptance, including standardized graph loading, experiment management, and embedding evaluation.
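A hedged sketch of the probing idea behind such a suite: freeze the embeddings produced by a pretrained GSSL encoder and check whether a simple linear probe can recover a topological property (here, atom count). The random arrays below stand in for real embeddings and labels; this is not the MolGraphEval implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))     # frozen molecule-level embeddings (placeholder)
num_atoms = rng.integers(5, 50, size=1000)    # topological probe target (placeholder)

x_tr, x_te, y_tr, y_te = train_test_split(embeddings, num_atoms, random_state=0)
probe = Ridge(alpha=1.0).fit(x_tr, y_tr)      # linear probe on frozen features
print("probe R^2 on held-out graphs:", probe.score(x_te, y_te))
```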