Numerous deep learning applications benefit from multitask learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
translated by 谷歌翻译
We propose a novel multi-task learning architecture, which allows learning of task-specific feature-level attention. Our design, the Multi-Task Attention Network (MTAN), consists of a single shared network containing a global feature pool, together with a soft-attention module for each task. These modules allow for learning of taskspecific features from the global features, whilst simultaneously allowing for features to be shared across different tasks. The architecture can be trained end-to-end and can be built upon any feed-forward neural network, is simple to implement, and is parameter efficient. We evaluate our approach on a variety of datasets, across both image-toimage predictions and image classification tasks. We show that our architecture is state-of-the-art in multi-task learning compared to existing methods, and is also less sensitive to various weighting schemes in the multi-task loss function. Code is available at https://github.com/ lorenmt/mtan.
translated by 谷歌翻译
There are two major types of uncertainty one can model. Aleatoric uncertainty captures noise inherent in the observations. On the other hand, epistemic uncertainty accounts for uncertainty in the model -uncertainty which can be explained away given enough data. Traditionally it has been difficult to model epistemic uncertainty in computer vision, but with new Bayesian deep learning tools this is now possible. We study the benefits of modeling epistemic vs. aleatoric uncertainty in Bayesian deep learning models for vision tasks. For this we present a Bayesian deep learning framework combining input-dependent aleatoric uncertainty together with epistemic uncertainty. We study models under the framework with per-pixel semantic segmentation and depth regression tasks. Further, our explicit uncertainty formulation leads to new loss functions for these tasks, which can be interpreted as learned attenuation. This makes the loss more robust to noisy data, also giving new state-of-the-art results on segmentation and depth regression benchmarks.
translated by 谷歌翻译
我们介绍了MGNET,这是一个多任务框架,用于单眼几何场景。我们将单眼几何场景的理解定义为两个已知任务的组合:全景分割和自我监管的单眼深度估计。全景分段不仅在语义上,而且在实例的基础上捕获完整场景。自我监督的单眼深度估计使用摄像机测量模型得出的几何约束,以便从单眼视频序列中测量深度。据我们所知,我们是第一个在一个模型中提出这两个任务的组合的人。我们的模型专注于低潜伏期,以实时在单个消费级GPU上实时提供快速推断。在部署过程中,我们的模型将产生密集的3D点云,其中具有来自单个高分辨率摄像头图像的实例意识到语义标签。我们对两个流行的自动驾驶基准(即CityScapes and Kitti)评估了模型,并在其他能够实时的方法中表现出竞争性能。源代码可从https://github.com/markusschoen/mgnet获得。
translated by 谷歌翻译
我们提出了自我监督单眼深度估计(SDE)的通用多任务培训框架。深入培训的深度模型,具有在标准单任务SDE框架中培训的相同型号。通过将额外的自蒸馏任务引入标准的SDE训练框架,低置训练深度网络,不仅可以预测图像重建任务的深度图,而且还用于从培训的教师网络蒸馏出具有未标记数据的知识。为了利用这种多任务设置,我们为每个任务提出了同性恋的不确定性配方,以惩罚可能受教师网络噪声影响的区域,或违反SDE假设。我们对Kitti提供了广泛的评估,以展示使用拟议框架培训一系列现有网络实现的改进,我们在此任务上实现了最先进的表现。此外,子深度使模型能够估计深度输出的不确定性。
translated by 谷歌翻译
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-ofthe-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, topperforming method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
translated by 谷歌翻译
多任务学习(MTL)是深度学习中的一个活动字段,其中我们通过利用任务之间的关系来共同学习多项任务。已经证明,与独立学习每个任务时,MTL有助于该模型共享任务之间的学习功能并增强预测。我们为2任务MTL问题提出了一个新的学习框架,它使用一个任务的预测作为另一个网络的输入来预测其他任务。我们定义了由循环一致性损失和对比学习,对齐和跨任务一致性损失的两个新的损失术语。这两个损耗都旨在实施模型以对准多个任务的预测,以便模型一致地预测。理论上我们证明,两次损失都帮助模型更有效地学习,并且在与直接预测的对齐方面更好地了解跨任务一致性损失。实验结果还表明,我们的拟议模型在基准城市景观和NYU数据集上实现了显着性能。
translated by 谷歌翻译
We propose a novel deep learning architecture for regressing disparity from a rectified pair of stereo images. We leverage knowledge of the problem's geometry to form a cost volume using deep feature representations. We learn to incorporate contextual information using 3-D convolutions over this volume. Disparity values are regressed from the cost volume using a proposed differentiable soft argmin operation, which allows us to train our method end-to-end to sub-pixel accuracy without any additional post-processing or regularization. We evaluate our method on the Scene Flow and KITTI datasets and on KITTI we set a new stateof-the-art benchmark, while being significantly faster than competing approaches.
translated by 谷歌翻译
多任务学习最近已成为对复杂场景的全面理解的有前途的解决方案。不仅具有适当设计的记忆效率,多任务模型都可以跨任务交换互补信号。在这项工作中,我们共同解决了2D语义分割,以及两个与几何相关的任务,即密集的深度,表面正常估计以及边缘估计,显示了它们对室内和室外数据集的好处。我们提出了一种新颖的多任务学习体系结构,该体系结构通过相关引导的注意力和自我注意力来利用配对的交叉任务交换,以增强所有任务的平均表示学习。我们考虑了三个多任务设置的广泛实验,与合成基准和真实基准中的竞争基准相比,我们的提案的好处。我们还将方法扩展到新型的多任务无监督域的适应设置。我们的代码可在https://github.com/cv-rits/densemtl上找到。
translated by 谷歌翻译
Jaccard索引,也称为交叉联盟(iou),是图像语义分段中最关键的评估度量之一。然而,由于学习目的既不可分解也不是可分解的,则iou得分的直接优化是非常困难的。虽然已经提出了一些算法来优化其代理,但没有提供泛化能力的保证。在本文中,我们提出了一种边缘校准方法,可以直接用作学习目标,在数据分布上改善IOO的推广,通过刚性下限为基础。本方案理论上,根据IOU分数来确保更好的分割性能。我们评估了在七个图像数据集中所提出的边缘校准方法的有效性,显示使用深度分割模型的其他学习目标的IOU分数大量改进。
translated by 谷歌翻译
In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation for enabling prediction of an extra unknown class which helps better resolve the conflicts between semantic and instance segmentation. Additionally, it handles the challenge caused by the varying number of instances and permits back propagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that our UPSNet achieves stateof-the-art performance with much faster inference. Code has been made available at: https://github.com/ uber-research/UPSNet. * Equal contribution.† This work was done when Hengshuang Zhao was an intern at Uber ATG.
translated by 谷歌翻译
Deep multitask networks, in which one neural network produces multiple predictive outputs, can offer better speed and performance than their single-task counterparts but are challenging to train properly. We present a gradient normalization (GradNorm) algorithm that automatically balances training in deep multitask models by dynamically tuning gradient magnitudes. We show that for various network architectures, for both regression and classification tasks, and on both synthetic and real datasets, GradNorm improves accuracy and reduces overfitting across multiple tasks when compared to single-task networks, static baselines, and other adaptive multitask loss balancing techniques. GradNorm also matches or surpasses the performance of exhaustive grid search methods, despite only involving a single asymmetry hyperparameter α. Thus, what was once a tedious search process that incurred exponentially more compute for each task added can now be accomplished within a few training runs, irrespective of the number of tasks. Ultimately, we will demonstrate that gradient manipulation affords us great control over the training dynamics of multitask networks and may be one of the keys to unlocking the potential of multitask learning.
translated by 谷歌翻译
作为许多自主驾驶和机器人活动的基本组成部分,如自我运动估计,障碍避免和场景理解,单眼深度估计(MDE)引起了计算机视觉和机器人社区的极大关注。在过去的几十年中,已经开发了大量方法。然而,据我们所知,对MDE没有全面调查。本文旨在通过审查1970年至2021年之间发布的197个相关条款来弥补这一差距。特别是,我们为涵盖各种方法的MDE提供了全面的调查,介绍了流行的绩效评估指标并汇总公开的数据集。我们还总结了一些代表方法的可用开源实现,并比较了他们的表演。此外,我们在一些重要的机器人任务中审查了MDE的应用。最后,我们通过展示一些有希望的未来研究方向来结束本文。预计本调查有助于读者浏览该研究领域。
translated by 谷歌翻译
对于现代自治系统来说,可靠的场景理解是必不可少的。当前基于学习的方法通常试图根据仅考虑分割质量的细分指标来最大化其性能。但是,对于系统在现实世界中的安全操作,考虑预测的不确定性也至关重要。在这项工作中,我们介绍了不确定性感知的全景分段的新任务,该任务旨在预测每个像素语义和实例分割,以及每个像素不确定性估计。我们定义了两个新颖的指标,以促进其定量分析,不确定性感知的综合质量(UPQ)和全景预期校准误差(PECE)。我们进一步提出了新型的自上而下的证据分割网络(EVPSNET),以解决此任务。我们的架构采用了一个简单而有效的概率融合模块,该模块利用了预测的不确定性。此外,我们提出了一种新的LOV \'ASZ证据损失函数,以优化使用深度证据学习概率的分割的IOU。此外,我们提供了几个强大的基线,将最新的泛型分割网络与无抽样的不确定性估计技术相结合。广泛的评估表明,我们的EVPSNET可以实现标准综合质量(PQ)的新最新技术,以及我们的不确定性倾斜度指标。
translated by 谷歌翻译
我们提出了一个统一的查看,即通过通用表示,一个深层神经网络共同学习多个视觉任务和视觉域。同时学习多个问题涉及最大程度地减少具有不同幅度和特征的多个损失函数的加权总和,从而导致一个损失的不平衡状态,与学习每个问题的单独模型相比,一个损失的不平衡状态主导了优化和差的结果。为此,我们提出了通过小容量适配器将多个任务/特定于域网络的知识提炼到单个深神经网络中的知识。我们严格地表明,通用表示在学习NYU-V2和CityScapes中多个密集的预测问题方面实现了最新的表现,来自视觉Decathlon数据集中的不同域中的多个图像分类问题以及MetadataSet中的跨域中的几个域中学习。最后,我们还通过消融和定性研究进行多次分析。
translated by 谷歌翻译
Semantic segmentation is a classic computer vision problem dedicated to labeling each pixel with its corresponding category. As a basic task for advanced tasks such as industrial quality inspection, remote sensing information extraction, medical diagnostic aid, and autonomous driving, semantic segmentation has been developed for a long time in combination with deep learning, and a lot of works have been accumulated. However, neither the classic FCN-based works nor the popular Transformer-based works have attained fine-grained localization of pixel labels, which remains the main challenge in this field. Recently, with the popularity of autonomous driving, the segmentation of road scenes has received increasing attention. Based on the cross-task consistency theory, we incorporate edge priors into semantic segmentation tasks to obtain better results. The main contribution is that we provide a model-agnostic method that improves the accuracy of semantic segmentation models with zero extra inference runtime overhead, verified on the datasets of road and non-road scenes. From our experimental results, our method can effectively improve semantic segmentation accuracy.
translated by 谷歌翻译
尽管最近的密集预测问题的多任务学习的进步,但大多数方法都依赖于昂贵的标记数据集。在本文中,我们介绍了一个标签有效的方法,并在部分注释的数据上关注多密集预测任务,我们称之为多任务部分监督学习。我们提出了一种多任务培训程序,该程序成功利用任务关系在数据部分注释时监督其多任务学习。特别地,我们学会将每个任务对映射到联合成对任务空间,这使得通过在任务对上的另一个网络通过另一个网络以计算有效的方式共享信息,并通过保留高级信息来避免学习琐碎的交叉任务关系关于输入图像。我们严格证明,我们的提出方法有效利用了未标记的任务的图像,并且在三个标准基准测试中优于现有的半监督学习方法和相关方法。
translated by 谷歌翻译
In multi-task learning, multiple tasks are solved jointly, sharing inductive bias between them. Multi-task learning is inherently a multi-objective problem because different tasks may conflict, necessitating a trade-off. A common compromise is to optimize a proxy objective that minimizes a weighted linear combination of pertask losses. However, this workaround is only valid when the tasks do not compete, which is rarely the case. In this paper, we explicitly cast multi-task learning as multi-objective optimization, with the overall objective of finding a Pareto optimal solution. To this end, we use algorithms developed in the gradient-based multiobjective optimization literature. These algorithms are not directly applicable to large-scale learning problems since they scale poorly with the dimensionality of the gradients and the number of tasks. We therefore propose an upper bound for the multi-objective loss and show that it can be optimized efficiently. We further prove that optimizing this upper bound yields a Pareto optimal solution under realistic assumptions. We apply our method to a variety of multi-task deep learning problems including digit classification, scene understanding (joint semantic segmentation, instance segmentation, and depth estimation), and multilabel classification. Our method produces higher-performing models than recent multi-task learning formulations or per-task training.
translated by 谷歌翻译
视频分析的图像分割在不同的研究领域起着重要作用,例如智能城市,医疗保健,计算机视觉和地球科学以及遥感应用。在这方面,最近致力于发展新的细分策略;最新的杰出成就之一是Panoptic细分。后者是由语义和实例分割的融合引起的。明确地,目前正在研究Panoptic细分,以帮助获得更多对视频监控,人群计数,自主驾驶,医学图像分析的图像场景的更细致的知识,以及一般对场景更深入的了解。为此,我们介绍了本文的首次全面审查现有的Panoptic分段方法,以获得作者的知识。因此,基于所采用的算法,应用场景和主要目标的性质,执行现有的Panoptic技术的明确定义分类。此外,讨论了使用伪标签注释新数据集的Panoptic分割。继续前进,进行消融研究,以了解不同观点的Panoptic方法。此外,讨论了适合于Panoptic分割的评估度量,并提供了现有解决方案性能的比较,以告知最先进的并识别其局限性和优势。最后,目前对主题技术面临的挑战和吸引不久的将来吸引相当兴趣的未来趋势,可以成为即将到来的研究研究的起点。提供代码的文件可用于:https://github.com/elharroussomar/awesome-panoptic-egation
translated by 谷歌翻译
Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage.We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.
translated by 谷歌翻译