The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, the enormous success of data augmentation currently remains limited to single-modality tasks like image classification. Indeed, it is particularly difficult to augment each modality while preserving the overall semantic structure of the data; for example, a caption may no longer be a good description of an image after standard augmentations, such as translation, have been applied. Moreover, it is challenging to specify reasonable transformations that are not tailored to a particular modality. In this paper, we introduce LeMDA, Learning Multimodal Data Augmentation, an easy-to-use method that automatically learns to jointly augment multimodal data in feature space, with no constraints on the identities of the modalities or the relationship between modalities. We show that LeMDA can (1) profoundly improve the performance of multimodal deep learning architectures, (2) apply to combinations of modalities that have not been previously considered, and (3) achieve state-of-the-art results on a wide range of applications comprising image, text, and tabular data.
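The core idea of the abstract above, augmenting in feature space rather than on raw inputs, can be sketched as follows. This is a simplified illustration only, not LeMDA's actual architecture: the joint perturbation network, its `tanh` form, and the `noise_scale` parameter are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_in_feature_space(features, W, noise_scale=0.1):
    """Jointly perturb per-modality feature vectors with a shared learnable
    map, so the augmentation can respect cross-modal structure instead of
    transforming each modality in isolation."""
    joint = np.concatenate(features)           # joint view of all modalities
    delta = np.tanh(W @ joint) * noise_scale   # learnable joint perturbation
    out, start = [], 0
    for f in features:
        out.append(f + delta[start:start + f.size])
        start += f.size
    return out

# toy features: one "image" vector and one "text" vector
img, txt = rng.normal(size=16), rng.normal(size=8)
W = rng.normal(size=(24, 24)) * 0.1            # stand-in for learned weights
aug_img, aug_txt = augment_in_feature_space([img, txt], W)
print(aug_img.shape, aug_txt.shape)  # (16,) (8,)
```

In the actual method the augmentation network would be trained jointly with the task network; here `W` is fixed purely to show the data flow.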
Embedding tables dominate the size of industrial-scale recommendation models and can consume terabytes of memory. A popular benchmark on recommendation data, and the largest in the public MLPerf machine learning suite, is the Deep Learning Recommendation Model (DLRM) trained on a terabyte of click-through data; it contains 100 GB of embedding memory (25 billion parameters). Owing to their sheer size and the volume of data involved, DLRMs are difficult to train, and the large embedding tables create inference and memory bottlenecks. This paper analyzes and extensively evaluates a generic parameter sharing setup (PSS) for compressing DLRM models. We show theoretical upper bounds on the learnable memory required for a $(1 \pm \epsilon)$ approximation of the embedding table; our bounds indicate that fewer parameters suffice for accuracy. Accordingly, we demonstrate that a PSS DLRM can reach $10000\times$ compression on Criteo-TB without losing quality. Such compression comes with a caveat, however: it requires $4.5\times$ more iterations to reach the same saturation quality. We argue that this trade-off merits further investigation, as it may be quite favorable. Exploiting the small size of the compressed model, we show a $4.3\times$ improvement in training latency, yielding similar overall training time. Thus, in the trade-off between the systems advantages of a small DLRM model and its slower convergence, we show that the scale tips toward the smaller model, which enables faster inference, easier deployment, and similar training times.
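A minimal sketch of the general flavor of hashing-based parameter sharing for embedding tables: a huge virtual table is backed by a much smaller shared weight array. The specific hash function and sizes here are illustrative assumptions, not the paper's construction.

```python
import numpy as np

class SharedEmbedding:
    """Parameter-sharing sketch: entries of a large virtual embedding table
    are mapped by a fixed random hash into a small shared weight array."""
    def __init__(self, num_rows, dim, shared_size, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.normal(scale=0.01, size=shared_size)
        self.dim = dim
        self.m = shared_size
        # fixed random linear hash: (row, col) -> index into the shared array
        self.a = rng.integers(1, 2**31 - 1)
        self.b = rng.integers(0, 2**31 - 1)

    def lookup(self, row):
        cols = np.arange(self.dim)
        idx = (self.a * (row * self.dim + cols) + self.b) % self.m
        return self.weights[idx]

# a 1M-row, 16-dim virtual table backed by only 10k shared parameters
emb = SharedEmbedding(num_rows=1_000_000, dim=16, shared_size=10_000)
v = emb.lookup(123_456)
print(v.shape)  # (16,)
```

Training updates flow back into the shared array, so many virtual entries move together; this is the source of both the compression and the slower convergence the abstract describes.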
Advances in deep learning are often associated with increasing model sizes, and model size greatly affects the deployment cost and latency of deep models. For example, models such as BERT cannot be deployed on edge devices and mobile phones because of their size; as a result, most advances in deep learning have yet to reach the edge. Model compression has deservedly attracted attention in the natural language processing, vision, and recommendation literature. This paper proposes a model-agnostic, cache-friendly model compression approach: Random Operation Access Specific Tile (ROAST) hashing. ROAST collapses the parameters by clubbing them together through a lightweight mapping. Notably, while clubbing these parameters, ROAST exploits the cache hierarchy by aligning the memory access pattern with the parameter access pattern. ROAST is up to $\sim 25\times$ faster to train and up to $\sim 50\times$ faster to infer than the popular parameter sharing method HashedNet. In addition, ROAST introduces global weight sharing, which is empirically and theoretically superior to HashedNet's local weight sharing and may be of independent interest. With ROAST, we present the first compressed BERT that is $100\times$-$1000\times$ smaller without quality degradation. Such compression levels on universal architectures such as Transformers are promising for the future of SOTA models on resource-constrained devices like mobile phones.
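The cache-alignment idea can be illustrated by contrasting per-weight hashing (scattered reads) with tile-level hashing (contiguous reads). The offset-hashing scheme below is a hypothetical sketch of that contrast, not ROAST's actual mapping.

```python
import numpy as np

def tile_hashed_lookup(layer_size, tile, shared, seed=0):
    """Tile-level parameter sharing sketch: whole contiguous tiles (rather
    than single weights) are hashed into the shared array, so every read
    is a contiguous, cache-friendly slice."""
    rng = np.random.default_rng(seed)
    n_tiles = -(-layer_size // tile)            # ceil division
    # each tile gets a random starting offset into the shared memory
    offsets = rng.integers(0, shared.size - tile, size=n_tiles)
    out = np.empty(n_tiles * tile)
    for t, off in enumerate(offsets):
        out[t * tile:(t + 1) * tile] = shared[off:off + tile]  # contiguous read
    return out[:layer_size]

shared = np.random.default_rng(1).normal(size=4096)   # global shared weights
w = tile_hashed_lookup(layer_size=10_000, tile=64, shared=shared)
print(w.shape)  # (10000,)
```

A HashedNet-style scheme would instead hash each of the 10,000 weights independently, producing 10,000 scattered single-element reads; the tile variant touches the same amount of shared memory in far fewer cache lines.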
Conditional gradient methods (CGM) are widely used in modern machine learning. The overall running time of a CGM typically consists of two parts: the number of iterations and the cost per iteration. Most efforts focus on reducing the number of iterations as the means of reducing overall running time. In this work, we focus instead on improving the per-iteration cost of CGM. The bottleneck step in most CGMs is maximum inner product search (MaxIP), which requires a linear scan over the parameters. In practice, approximate MaxIP data structures have been found to be useful heuristics; however, theoretically, nothing was known about combining approximate MaxIP data structures with CGM. In this work, we answer this question positively by providing a formal framework for combining locality-sensitive-hashing-type approximate MaxIP data structures with CGM algorithms. As a result, we present the first algorithms whose per-iteration cost is sublinear in the number of parameters, for many fundamental optimization methods such as Frank-Wolfe, Herding, and policy gradient.
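To see where MaxIP enters a conditional gradient method, consider Frank-Wolfe over the probability simplex: the linear minimization oracle at each iteration is exactly an inner product search over the vertices. The sketch below uses an exact `argmax`; an LSH-type approximate MaxIP structure would replace that line to make the per-iteration cost sublinear.

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, steps=2000):
    """Frank-Wolfe sketch: each iteration calls a linear minimization
    oracle, which over the simplex reduces to a maximum inner product
    search against the vertices e_1..e_d (here done by exact linear scan)."""
    x = x0.copy()
    for t in range(steps):
        g = grad_f(x)
        i = np.argmax(-g)                 # MaxIP step: best vertex direction
        s = np.zeros_like(x)
        s[i] = 1.0
        gamma = 2.0 / (t + 2)             # standard Frank-Wolfe step size
        x = (1 - gamma) * x + gamma * s
    return x

# minimize ||x - b||^2 over the probability simplex
b = np.array([0.7, 0.2, 0.1])
grad = lambda x: 2 * (x - b)
x = frank_wolfe_simplex(grad, np.ones(3) / 3)
print(np.round(x, 2))  # close to b, which already lies in the simplex
```

Every iterate stays a convex combination of vertices, so feasibility is maintained for free; the only expensive operation per iteration is the inner product search, which is precisely what the paper accelerates.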
Implicit Neural Representations (INRs) have recently been shown to be a powerful tool for high-quality video compression. However, existing works are limited in that they do not explicitly exploit the temporal redundancy in videos, leading to a long encoding time. Additionally, these methods have fixed architectures which do not scale to longer videos or higher resolutions. To address these issues, we propose NIRVANA, which treats videos as groups of frames and fits separate networks to each group performing patch-wise prediction. This design shares computation within each group, in the spatial and temporal dimensions, resulting in reduced encoding time of the video. The video representation is modeled autoregressively, with networks fit on a current group initialized using weights from the previous group's model. To further enhance efficiency, we perform quantization of the network parameters during training, requiring no post-hoc pruning or quantization. When compared with previous works on the benchmark UVG dataset, NIRVANA improves encoding quality from 37.36 to 37.70 (in terms of PSNR) and the encoding speed by 12X, while maintaining the same compression rate. In contrast to prior video INR works which struggle with larger resolution and longer videos, we show that our algorithm is highly flexible and scales naturally due to its patch-wise and autoregressive designs. Moreover, our method achieves variable bitrate compression by adapting to videos with varying inter-frame motion. NIRVANA achieves 6X decoding speed and scales well with more GPUs, making it practical for various deployment scenarios.
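The "quantization during training, no post-hoc step" idea can be illustrated with plain uniform weight quantization: weights are snapped to a small integer grid and dequantized on the fly, so the stored model is already quantized when training ends. This is a generic sketch, not NIRVANA's specific quantization scheme.

```python
import numpy as np

def quantize(w, bits=8):
    """Uniform quantization sketch: map weights onto a symmetric integer
    grid; the int8 codes plus one scale are what would be stored."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, s = quantize(w)
w_hat = dequantize(q, s)
# per-weight error is bounded by half a quantization step
print(float(np.max(np.abs(w - w_hat))) <= s / 2 + 1e-6)  # True
```

Applying this inside the training loop (with a straight-through estimator for the rounding) lets the network adapt to the quantization error, which is why no pruning or quantization pass is needed afterwards.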
This work presents a physics-informed deep learning-based super-resolution framework to enhance the spatio-temporal resolution of the solution of time-dependent partial differential equations (PDE). Prior works on deep learning-based super-resolution models have shown promise in accelerating engineering design by reducing the computational expense of traditional numerical schemes. However, these models heavily rely on the availability of high-resolution (HR) labeled data needed during training. In this work, we propose a physics-informed deep learning-based framework to enhance the spatial and temporal resolution of coarse-scale (both in space and time) PDE solutions without requiring any HR data. The framework consists of two trainable modules independently super-resolving the PDE solution, first in spatial and then in temporal direction. The physics based losses are implemented in a novel way to ensure tight coupling between the spatio-temporally refined outputs at different times and improve framework accuracy. We analyze the capability of the developed framework by investigating its performance on an elastodynamics problem. It is observed that the proposed framework can successfully super-resolve (both in space and time) the low-resolution PDE solutions while satisfying physics-based constraints and yielding high accuracy. Furthermore, the analysis and obtained speed-up show that the proposed framework is well-suited for integration with traditional numerical methods to reduce computational complexity during engineering design.
Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, it is not well explored how varied their behavior is under different learning paradigms. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models. Finally, we show how the "best" layer for a given task varies by both supervision method and task, further demonstrating the differing order of information processing in ViTs.
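An Offset Local Attention Head, as described above, concentrates its attention mass at a fixed query-to-key offset. One simple way to detect such heads (a hypothetical diagnostic, not the paper's analysis code) is to average the attention matrix along each diagonal and look for a sharp peak at a nonzero offset:

```python
import numpy as np

def offset_profile(attn):
    """Mean attention mass at each fixed query->key offset; an offset
    local attention head shows a sharp peak at one nonzero offset."""
    n = attn.shape[0]
    # np.diagonal(attn, offset=d) collects attn[i, i+d] (d<0: attn[i, i+d] with i>=-d)
    return {d: float(np.mean(np.diagonal(attn, offset=d)))
            for d in range(-(n - 1), n)}

# synthetic head that mostly attends one position to the left
n = 8
attn = np.full((n, n), 0.01)
for i in range(1, n):
    attn[i, i - 1] = 1.0
attn /= attn.sum(axis=1, keepdims=True)   # rows are attention distributions

prof = offset_profile(attn)
best = max(prof, key=prof.get)
print(best)  # -1: the head attends to the previous token
```

Running such a profile per head and per layer across differently supervised ViTs would be one way to check how consistently these heads emerge.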
This paper aims to provide an unsupervised modelling approach that allows for a more flexible representation of text embeddings. It jointly encodes the words and the paragraphs as individual matrices of arbitrary column dimension with unit Frobenius norm. The representation is also linguistically motivated by the introduction of a novel similarity metric. The proposed modelling and the novel similarity metric exploit the matrix structure of embeddings. We then show that the same matrices can be reshaped into vectors of unit norm, transforming our problem into an optimization problem over the spherical manifold. We exploit manifold optimization to efficiently train the matrix embeddings. We also quantitatively verify the quality of our text embeddings by showing improved results in document classification, document clustering, and semantic textual similarity benchmark tests.
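The reshaping argument above is simple to make concrete: a d x k matrix with unit Frobenius norm is the same object as a unit vector in R^{dk}, so standard Riemannian gradient descent on the sphere (tangent projection plus renormalization retraction) applies directly. The toy objective below is an assumption for illustration, not the paper's training loss.

```python
import numpy as np

def sphere_retract(v):
    """Renormalization retraction: map any nonzero vector back to the sphere."""
    return v / np.linalg.norm(v)

def riemannian_step(v, grad, lr=0.1):
    # project the Euclidean gradient onto the tangent space at v, step, retract
    tangent = grad - np.dot(grad, v) * v
    return sphere_retract(v - lr * tangent)

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))               # a word/paragraph matrix embedding
v = sphere_retract(M.ravel())             # unit Frobenius norm <-> unit vector
target = sphere_retract(rng.normal(size=12))

# toy objective: pull v toward a target direction on the sphere
for _ in range(500):
    v = riemannian_step(v, 2 * (v - target))   # gradient of ||v - target||^2
print(round(float(np.dot(v, target)), 3))      # 1.0: aligned with the target
```

Because the retraction is just a renormalization, the per-step overhead versus unconstrained gradient descent is negligible, which is what makes the spherical reformulation attractive for training.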
Often questions provided to open-domain question answering systems are ambiguous. Traditional QA systems that provide a single answer are incapable of answering ambiguous questions, since the question may be interpreted in several ways and may have multiple distinct answers. In this paper, we address multi-answer retrieval, which entails retrieving passages that can capture the majority of the diverse answers to the question. We propose a re-ranking based approach using Determinantal Point Processes (DPPs) with BERT-based kernels. Our method jointly considers query-passage relevance and passage-passage correlation to retrieve passages that are both query-relevant and diverse. Results demonstrate that our re-ranking technique outperforms the state-of-the-art method on the AmbigQA dataset.
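The relevance-plus-diversity trade-off of DPP re-ranking can be sketched with greedy MAP inference: repeatedly add the passage that most increases the determinant of the selected submatrix. The toy relevance and similarity numbers below stand in for BERT-derived scores and are purely illustrative.

```python
import numpy as np

def greedy_dpp(L, k):
    """Greedy MAP inference sketch for a DPP: at each step add the item
    that most increases det(L_S), balancing relevance against redundancy."""
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best, best_det = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            S = selected + [i]
            d = np.linalg.det(L[np.ix_(S, S)])
            if d > best_det:
                best, best_det = i, d
        selected.append(best)
    return selected

# kernel: L_ij = rel_i * sim_ij * rel_j (query relevance x passage similarity)
rel = np.array([1.0, 0.95, 0.4])
sim = np.array([[1.0, 0.99, 0.1],     # passages 0 and 1 are near-duplicates
                [0.99, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
L = np.outer(rel, rel) * sim
print(greedy_dpp(L, 2))  # [0, 2]: skips the highly relevant but redundant passage 1
```

Even though passage 1 is almost as relevant as passage 0, its near-duplicate similarity collapses the determinant, so the less relevant but distinct passage 2 is chosen instead; this is exactly the behavior wanted for ambiguous questions with multiple distinct answers.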
Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured texts. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin.