Recently, many neural network-based image compression methods have shown promising results superior to conventional tool-based codecs. However, most of them are trained as separate models for different target bit rates, which increases model complexity. Several studies have therefore investigated learned compression that supports variable rates with a single model, but these approaches either require additional network modules, layers, or inputs that often lead to complexity overhead, or do not provide sufficient coding efficiency. In this paper, we propose, for the first time, a selective compression method that partially encodes the latent representations in a fully generalized manner for deep learning-based variable-rate image compression. The proposed method adaptively determines the representation elements that are essential for compression at different target quality levels. To this end, we first generate a 3D importance map, which reflects the nature of the input content, to represent the underlying importance of the representation elements. The 3D importance map is then adjusted for different target quality levels using importance adjustment curves. The adjusted 3D importance map is finally converted into a 3D binary mask that determines the essential representation elements for compression. The proposed method can be easily integrated with existing compression models with negligible overhead. Our method also enables continuously variable-rate compression via simple interpolation of the importance adjustment curves between quality levels. Extensive experimental results show that the proposed method achieves compression efficiency comparable to that of the separately trained reference compression models and reduces decoding time owing to the selective compression. Sample code is publicly available at https://github.com/JooyoungLeeETRI/SCR.
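The three-step mask generation described in this abstract (importance map, quality-dependent adjustment, binarization) can be illustrated with a minimal sketch. This is not the authors' implementation: the power-law shape of the adjustment curve, the `gamma` values, the 0.5 threshold, the flattened toy importance map, and all function names are assumptions made purely for illustration.

```python
def adjust_importance(importance, gamma):
    """Apply a hypothetical power-law adjustment curve to importances in [0, 1]."""
    return [v ** gamma for v in importance]


def binary_mask(adjusted, threshold=0.5):
    """Binarize the adjusted importance map (the threshold is an assumption)."""
    return [1 if v >= threshold else 0 for v in adjusted]


def interpolate_gamma(gamma_a, gamma_b, t):
    """Linearly interpolate between two curve parameters, with t in [0, 1]."""
    return (1.0 - t) * gamma_a + t * gamma_b


def select_elements(latent, mask):
    """Keep only the latent elements that the mask marks as essential."""
    return [x for x, m in zip(latent, mask) if m]


# Toy data: a 3D importance map flattened to a list, and matching latent values.
importance = [0.9, 0.6, 0.3, 0.1]
latent = [5, -2, 1, 0]

# A larger gamma suppresses importances (lower quality, fewer elements kept).
low_quality_mask = binary_mask(adjust_importance(importance, gamma=2.0))
high_quality_mask = binary_mask(adjust_importance(importance, gamma=0.5))

# An intermediate quality level obtained by interpolating the curve parameter.
mid_gamma = interpolate_gamma(2.0, 0.5, t=0.5)
mid_quality_mask = binary_mask(adjust_importance(importance, mid_gamma))
```

In this toy setting the number of selected elements grows with the target quality, and only the selected elements would need to be entropy-coded and decoded.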
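Recent successes have shown that an image can be manipulated with a text prompt; for example, a landscape scene on a sunny day can be turned into the same scene on a rainy day, driven by the text input "raining". These methods often rely on a style-based image generator that leverages a multi-modal (text and image) embedding space. However, we observe that such text inputs are often a bottleneck in providing and synthesizing rich semantic cues, for example when distinguishing heavy rain from other kinds of rain. To address this issue, we advocate leveraging an additional modality, sound, which has notable advantages for image manipulation because it can convey more diverse semantic cues than text (vivid emotions or dynamic expressions of the natural world). In this paper, we propose a novel approach that first extends the image-text joint embedding space with sound and then applies a direct latent optimization method to manipulate a given image based on an audio input (e.g., the sound of rain). Our extensive experiments show that our sound-guided image manipulation approach produces semantically and visually more plausible manipulation results than state-of-the-art text- and sound-guided image manipulation methods, which is further confirmed by our human evaluations. Our downstream task evaluations also show that the learned joint embedding space of image, text, and sound effectively encodes sound inputs.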
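Many recent semi-supervised learning (SSL) studies build a teacher-student architecture and train the student network with the supervisory signal generated by the teacher. The data augmentation strategy plays an important role in the SSL framework, because it is hard to create a weak-strong augmented input pair without losing label information. In particular, when SSL is extended to semi-supervised object detection (SSOD), many strong augmentation methods related to image geometry and interpolation regularization are hard to use, because they may corrupt the location information of the bounding boxes in the object detection task. To address this problem, we introduce a simple yet effective data augmentation method, Mix/UnMix (MUM), which unmixes the tiles of mixed image tiles within the SSOD framework. Our proposed method mixes the tiles of the input images and reconstructs them in the feature space. MUM can therefore enjoy the interpolation-regularization effect from non-interpolated pseudo-labels and successfully generates a meaningful weak-strong pair. Furthermore, MUM can easily be added on top of various SSOD methods. Extensive experiments on the MS-COCO and PASCAL VOC datasets demonstrate the superiority of MUM, which consistently improves mAP over the baseline in all tested SSOD benchmark protocols.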
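The mix/unmix idea in this abstract (shuffle image tiles across a batch, then undo the shuffle later) can be sketched as below. This is only an illustration under stated assumptions: tiles are represented by plain strings, the feature extractor that would sit between mixing and unmixing is omitted (the identity stands in for it), and the function names and per-position permutation scheme are hypothetical rather than taken from the MUM code.

```python
import random


def make_permutations(batch_size, num_tiles, seed=0):
    """Draw one permutation of the batch indices for every tile position."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_tiles):
        perm = list(range(batch_size))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def mix_tiles(images, perms):
    """Build mixed images: tile p of mixed image b comes from image perms[p][b]."""
    batch_size, num_tiles = len(images), len(perms)
    return [[images[perms[p][b]][p] for p in range(num_tiles)]
            for b in range(batch_size)]


def unmix_tiles(mixed, perms):
    """Invert mix_tiles by sending every tile back to its source image."""
    batch_size, num_tiles = len(mixed), len(perms)
    restored = [[None] * num_tiles for _ in range(batch_size)]
    for b in range(batch_size):
        for p in range(num_tiles):
            restored[perms[p][b]][p] = mixed[b][p]
    return restored


# Toy batch: 3 images, each split into 4 tiles (strings stand in for pixel blocks).
images = [[f"img{b}_tile{p}" for p in range(4)] for b in range(3)]
perms = make_permutations(batch_size=3, num_tiles=4, seed=0)
mixed = mix_tiles(images, perms)
restored = unmix_tiles(mixed, perms)
```

Because each tile keeps its spatial position and only its source image changes, the unmixing step is a pure index inversion and needs no interpolation.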
Recent deep learning models are difficult to train with a large batch size, because commodity machines may not have enough memory to accommodate both the model and a large data batch. The batch size is one of the hyper-parameters of model training, and it is limited by the memory capacity of the target machine, because the batch must fit into the memory that remains after the model is loaded. The size of each data item is also an important factor: the larger each item is, the smaller the batch that can fit into the remaining memory. This paper proposes a framework called Micro-Batch Streaming (MBS) to address this problem. MBS helps deep learning models train by splitting a batch into micro-batches that fit in the remaining memory and streaming them sequentially. A loss normalization algorithm based on gradient accumulation is used to maintain performance. The purpose of our method is to allow deep learning models to train with batch sizes that exceed the memory capacity of a system, without increasing the memory size or using multiple devices (GPUs).
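The reasoning behind loss normalization with gradient accumulation can be checked on a toy problem. The sketch below is an assumption-laden illustration, not the MBS implementation: it uses a one-parameter linear model with a mean-squared-error loss, weights each micro-batch gradient by its share of the full batch, and uses hypothetical function names.

```python
def mse_gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def streamed_gradient(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, each normalized by its share of the batch."""
    total = len(xs)
    accumulated = 0.0
    for start in range(0, total, micro_batch_size):
        micro_x = xs[start:start + micro_batch_size]
        micro_y = ys[start:start + micro_batch_size]
        # The mean loss of a micro-batch is rescaled by len(micro) / total so that
        # the accumulated value matches the mean over the whole batch.
        accumulated += mse_gradient(w, micro_x, micro_y) * len(micro_x) / total
    return accumulated


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 11.0]
w = 0.5

full = mse_gradient(w, xs, ys)
streamed = streamed_gradient(w, xs, ys, micro_batch_size=2)
```

With this weighting the streamed gradient matches the full-batch gradient up to floating-point rounding, even when the last micro-batch is smaller than the others.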
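Consistency loss has played a key role in solving problems in recent semi-supervised learning research. However, existing studies that use a consistency loss are limited to classification tasks, and existing studies on semi-supervised semantic segmentation rely on pixel-wise classification, which does not reflect the structured nature of the features in the prediction. We propose a structured consistency loss to address this limitation of existing studies. The structured consistency loss promotes consistency of inter-pixel similarity between the teacher and student networks. Specifically, combining it with CutMix dramatically reduces the computational burden, which makes semi-supervised semantic segmentation with the structured consistency loss efficient. The superiority of the proposed method is verified on Cityscapes; the Cityscapes benchmark results on the validation and test data are 81.9 mIoU and 83.84 mIoU, respectively. This ranks first in the pixel-level semantic labeling task of the Cityscapes benchmark suite. To the best of our knowledge, we are the first to demonstrate the superiority of state-of-the-art semi-supervised learning in semantic segmentation.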
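A consistency loss on inter-pixel similarity, as described in this abstract, compares how pixels relate to one another in the teacher and in the student rather than comparing per-pixel predictions directly. The sketch below is a minimal illustration under assumptions that the abstract does not confirm: cosine similarity as the similarity measure, a mean squared difference as the penalty, and hypothetical function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_matrix(features):
    """Pairwise similarity between all pixel feature vectors."""
    return [[cosine(u, v) for v in features] for u in features]


def structured_consistency_loss(teacher_features, student_features):
    """Mean squared difference between teacher and student similarity matrices."""
    sim_t = similarity_matrix(teacher_features)
    sim_s = similarity_matrix(student_features)
    n = len(sim_t)
    return sum((sim_t[i][j] - sim_s[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


# Toy example: three "pixels" with two-dimensional features.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_same = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]       # same structure, rescaled
student_collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # structure lost

loss_same = structured_consistency_loss(teacher, student_same)
loss_collapsed = structured_consistency_loss(teacher, student_collapsed)
```

The similarity matrix grows quadratically with the number of pixels, which is why restricting the computation to a subset of pixels matters for the computational burden mentioned in the abstract.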
3D-aware image synthesis focuses on preserving spatial consistency in addition to generating high-resolution images with fine details. Recently, Neural Radiance Field (NeRF) has been introduced for synthesizing novel views with low computational cost and superior performance. While several works investigate generative NeRFs and show remarkable achievements, they cannot handle conditional and continuous feature manipulation in the generation procedure. In this work, we introduce a novel model, called Class-Continuous Conditional Generative NeRF ($\text{C}^{3}$G-NeRF), which can synthesize conditionally manipulated photorealistic 3D-consistent images by projecting conditional features to the generator and the discriminator. The proposed $\text{C}^{3}$G-NeRF is evaluated on three image datasets: AFHQ, CelebA, and Cars. Our model shows strong 3D consistency with fine details and smooth interpolation in conditional feature manipulation. For instance, $\text{C}^{3}$G-NeRF achieves a Fr\'echet Inception Distance (FID) of 7.64 in 3D-aware face image synthesis at a $\text{128}^{2}$ resolution. Additionally, we report FIDs of the generated 3D-aware images for each class of the datasets, since $\text{C}^{3}$G-NeRF can synthesize class-conditional images.
Cellular automata (CA) captivate researchers due to the emergent, complex individualized behavior that simple global rules of interaction enact. Recent advances in the field have combined CA with convolutional neural networks to achieve self-regenerating images. This new branch of CA is called neural cellular automata [1]. The goal of this project is to use the idea of neural cellular automata to grow prediction machines. We place many different convolutional neural networks in a grid. Each conv net cell outputs a prediction of what the next state will be and minimizes predictive error. Cells receive their neighbors' colors and fitnesses as input. Each cell's fitness score describes how accurate its predictions are. Cells can also move to explore their environment, and some stochasticity is applied to movement.
There is a dramatic shortage of skilled labor for modern vineyards. The Vinum project is developing a mobile robotic solution to autonomously navigate through vineyards for winter grapevine pruning. This necessitates an autonomous navigation stack for the robot pruning a vineyard. The Vinum project is using the quadruped robot HyQReal. This paper introduces an architecture for a quadruped robot to autonomously move through a vineyard by identifying and approaching grapevines for pruning. The higher level control is a state machine switching between searching for destination positions, autonomously navigating towards those locations, and stopping for the robot to complete a task. The destination points are determined by identifying grapevine trunks using instance segmentation from a Mask Region-Based Convolutional Neural Network (Mask-RCNN). These detections are sent through a filter to avoid redundancy and remove noisy detections. The combination of these features is the basis for the proposed architecture.
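The high-level control described in this abstract (a state machine that switches between searching, navigating, and stopping, fed by filtered trunk detections) can be sketched as follows. Everything concrete here is hypothetical: the state and event names, the `(x, y, score)` detection format, the score threshold, and the minimum separation are illustrative assumptions, not details of the Vinum architecture.

```python
import math
from enum import Enum, auto


class State(Enum):
    SEARCHING = auto()         # looking for the next destination position
    NAVIGATING = auto()        # moving autonomously towards the destination
    STOPPED_FOR_TASK = auto()  # standing still while the task is carried out


# Hypothetical event names that trigger the transitions between states.
TRANSITIONS = {
    (State.SEARCHING, "destination_found"): State.NAVIGATING,
    (State.NAVIGATING, "destination_reached"): State.STOPPED_FOR_TASK,
    (State.STOPPED_FOR_TASK, "task_done"): State.SEARCHING,
}


def step(state, event):
    """Return the next state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def filter_detections(detections, min_score=0.5, min_separation=0.3):
    """Drop low-confidence detections and near-duplicates of stronger ones.

    Each detection is an (x, y, score) tuple, with positions in metres.
    """
    kept = []
    for x, y, score in sorted(detections, key=lambda d: -d[2]):
        if score < min_score:
            continue  # treated as noise
        if all(math.hypot(x - kx, y - ky) >= min_separation for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept


detections = [(1.0, 0.0, 0.9), (1.1, 0.0, 0.8), (3.0, 0.0, 0.95), (5.0, 0.0, 0.2)]
targets = filter_detections(detections)

state = State.SEARCHING
for event in ["destination_found", "destination_reached", "task_done"]:
    state = step(state, event)
```

Keeping the transition table as data makes it straightforward to add states or events without touching the stepping logic.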
Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning (RL), but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality and outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.
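The greedy policy mentioned in this abstract picks, at each step, the unobserved feature with the largest conditional mutual information with the response given what has been observed so far. The sketch below shows only the oracle version on a tiny discrete distribution, where the joint distribution is known exactly; the amortized learning approach from the abstract is not shown, and the data layout and function names are assumptions made for illustration.

```python
import math
from collections import defaultdict


def conditional_mutual_information(joint, observed, candidate):
    """I(Y; X_candidate | X_S = observed) in nats for a discrete joint distribution.

    joint maps tuples (x_0, ..., x_{d-1}, y) to probabilities;
    observed maps feature indices to their observed values.
    """
    pair = defaultdict(float)
    for outcome, p in joint.items():
        features, y = outcome[:-1], outcome[-1]
        if all(features[i] == v for i, v in observed.items()):
            pair[(features[candidate], y)] += p
    total = sum(pair.values())
    if total == 0:
        raise ValueError("observed values have zero probability")
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in pair.items():
        px[x] += p / total
        py[y] += p / total
    return sum((p / total) * math.log((p / total) / (px[x] * py[y]))
               for (x, y), p in pair.items() if p > 0)


def greedy_next_feature(joint, observed, num_features):
    """Pick the unobserved feature with the largest conditional mutual information."""
    candidates = [i for i in range(num_features) if i not in observed]
    return max(candidates,
               key=lambda i: conditional_mutual_information(joint, observed, i))


# Toy distribution: y copies x_0, while x_1 is independent noise.
joint = {(x0, x1, x0): 0.25 for x0 in (0, 1) for x1 in (0, 1)}

first_choice = greedy_next_feature(joint, observed={}, num_features=2)
info_x0 = conditional_mutual_information(joint, {}, 0)
info_x1_after_x0 = conditional_mutual_information(joint, {0: 1}, 1)
```

In this toy case the informative feature `x_0` is chosen first, and once it is observed the remaining feature carries no further information about `y`.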
In this paper, we learn a diffusion model to generate 3D data on a scene-scale. Specifically, our model crafts a 3D scene consisting of multiple objects, while recent diffusion research has focused on a single object. To realize our goal, we represent a scene with discrete class labels, i.e., categorical distribution, to assign multiple objects into semantic categories. Thus, we extend discrete diffusion models to learn scene-scale categorical distributions. In addition, we validate that a latent diffusion model can reduce computation costs for training and deploying. To the best of our knowledge, our work is the first to apply discrete and latent diffusion for 3D categorical data on a scene-scale. We further propose to perform semantic scene completion (SSC) by learning a conditional distribution using our diffusion model, where the condition is a partial observation in a sparse point cloud. In experiments, we empirically show that our diffusion models not only generate reasonable scenes, but also perform the scene completion task better than a discriminative model. Our code and models are available at https://github.com/zoomin-lee/scene-scale-diffusion