Smart Paper Notes

Revisiting Whole-Slide Image Pyramids for Cancer Prognosis via Dual-Stream Networks

Pei Liu , Bo Fu , Feng Ye , Rui Yang , Bin Xu , Luping Ji

Categories: Computer Vision | Machine Learning

2022-06-12

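Cancer prognosis on gigapixel whole-slide images (WSIs) has always been a challenging task. Most existing approaches focus solely on single-resolution images. Multi-resolution schemes, which use image pyramids to enhance WSI visual representations, have not yet received enough attention. To explore a multi-resolution solution for improving cancer prognosis accuracy, this paper proposes a dual-stream architecture that models WSIs with an image pyramid strategy. The architecture consists of two sub-streams: one for low-resolution WSIs and the other for high-resolution WSIs. Compared with other approaches, our scheme has three highlights: (i) there is a one-to-one relation between streams and resolutions; (ii) a square pooling layer is added to align the patches of the two resolution streams, which greatly reduces computational cost and enables natural stream feature fusion; (iii) a cross-attention-based method is proposed to pool high-resolution patches spatially under the guidance of low-resolution ones. We validate our scheme on three publicly available datasets, with a total of 3,101 WSIs from 1,911 patients. Experimental results verify that (1) the hierarchical dual-stream representation is more effective for cancer prognosis than single-stream ones, with average C-index gains of 5.0% and 1.8% over a single low-resolution and a single high-resolution stream, respectively; (2) our dual-stream scheme can outperform current state-of-the-art schemes, with an average C-index improvement of 5.1%; (3) cancers with observable survival differences may have different preferences for model complexity. Our scheme could serve as an alternative tool for further facilitating WSI prognosis research.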
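
As a rough illustration of highlight (ii), the sketch below average-pools a grid of high-resolution patch features over non-overlapping r x r squares, so that each pooled vector lines up with one low-resolution patch. The function name `square_pool`, the grid layout, and the use of plain averaging are assumptions made for illustration; the abstract does not specify how the paper's layer is implemented.

```python
def square_pool(grid, r):
    """Average-pool an H x W grid of feature vectors over r x r squares.

    grid[i][j] is the feature vector (a list of floats) of the high-resolution
    patch at row i, column j. Hypothetical layout, not taken from the paper.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % r or cols % r:
        raise ValueError("grid height and width must be divisible by r")
    dim = len(grid[0][0])
    pooled = []
    for i in range(0, rows, r):
        out_row = []
        for j in range(0, cols, r):
            acc = [0.0] * dim
            # Sum the r * r high-resolution vectors under one low-resolution patch.
            for di in range(r):
                for dj in range(r):
                    for k, v in enumerate(grid[i + di][j + dj]):
                        acc[k] += v
            out_row.append([a / (r * r) for a in acc])
        pooled.append(out_row)
    return pooled


# Four high-resolution patches with 1-d features collapse into one vector.
high_res = [[[1.0], [3.0]], [[5.0], [7.0]]]
print(square_pool(high_res, 2))  # [[[4.0]]]
```

After pooling, the high-resolution stream has the same number of positions as the low-resolution stream, which is what makes element-wise fusion of the two streams straightforward in this toy setting.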

AdvMIL: Adversarial Multiple Instance Learning for the Survival Analysis on Whole-Slide Images

Pei Liu , Luping Ji , Feng Ye , Bo Fu

Categories: Computer Vision

2022-12-13

Survival analysis on histological whole-slide images (WSIs) is one of the most important means of estimating patient prognosis. Although many weakly-supervised deep learning models have been developed for gigapixel WSIs, their potential is generally restricted by classical survival analysis rules and full-supervision requirements. As a result, these models provide patients with only a completely certain point estimate of time-to-event, and they can learn only from well-annotated WSI data, which is currently available at a small scale. To tackle these problems, we propose a novel adversarial multiple instance learning (AdvMIL) framework. This framework is based on adversarial time-to-event modeling, and it integrates the multiple instance learning (MIL) that is necessary for WSI representation learning. It is plug-and-play, so most existing WSI-based models with embedding-level MIL networks can be easily upgraded by applying it, gaining improved abilities in survival distribution estimation and semi-supervised learning. Our extensive experiments show that AdvMIL not only brings performance improvements to mainstream WSI models at a relatively low computational cost, but also enables these models to learn from unlabeled data through semi-supervised learning. Our AdvMIL framework could promote research on time-to-event modeling in computational pathology with its novel paradigm of adversarial MIL.

Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics

Chunyuan Li , Xinliang Zhu , Jiawen Yao , Junzhou Huang

Categories: Computer Vision | Machine Learning

2022-11-29

Learning good representations of gigapixel whole slide pathology images (WSIs) for downstream tasks is critical. Previous studies employ multiple instance learning (MIL) to represent WSIs as bags of sampled patches because, in most cases, only slide-level labels are available and only a tiny region of the WSI is disease-positive. However, WSI representation learning remains an open problem because: (1) patches sampled at a higher resolution may be incapable of depicting microenvironment information, such as the relative position between tumor cells and surrounding tissues, while patches at a lower resolution lose fine-grained detail; (2) extracting patches from giant WSIs results in a large bag size, which tremendously increases the computational cost. To solve these problems, this paper proposes a hierarchical multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes. Specifically, we randomly extract instance-level patch features from WSIs at different magnifications. A co-attention mapping between imaging and genomics is then learned to uncover their pairwise interaction and reduce the space complexity of the imaging features. Such early fusion makes it computationally feasible to use a MIL Transformer for the survival prediction task. Our architecture requires fewer GPU resources than benchmark methods while maintaining better WSI representation ability. We evaluate our approach on five cancer types from the Cancer Genome Atlas database and achieve an average c-index of 0.673, outperforming state-of-the-art multimodality methods.
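
Several abstracts on this page report results as a c-index. For readers unfamiliar with the metric, the sketch below computes a basic Harrell-style concordance index: among all comparable patient pairs, it is the fraction in which the patient who experienced the event earlier was assigned the higher predicted risk. The function name and the simplified handling of ties are assumptions made for illustration; the evaluation code of the listed papers may differ in detail.

```python
def concordance_index(times, events, risks):
    """Basic Harrell-style c-index for right-censored survival data.

    times:  observed follow-up time of each patient
    events: 1 if the event (e.g. death) was observed, 0 if censored
    risks:  predicted risk score, higher meaning shorter expected survival
    A pair (i, j) is comparable when patient i had an observed event and
    times[i] < times[j]. Pairs with tied times are skipped (a simplification).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Risks perfectly ordered against survival time give a c-index of 1.0.
print(concordance_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0]))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the 0.673 reported above should be read.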

ScoreNet: Learning Non-Uniform Attention and Augmentation for Transformer-Based Histopathological Image Classification

Thomas Stegmüller , Behzad Bozorgtabar , Antoine Spahr , Jean-Philippe Thiran

Categories: Computer Vision

2022-02-15

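Progress in digital pathology is hindered by high-resolution images and the prohibitive cost of exhaustive localized annotations. The commonly used paradigm for classifying pathology images is patch-based processing, which often incorporates multiple instance learning (MIL) to aggregate local patch-level representations into an image-level prediction. Nonetheless, diagnostically relevant regions may take up only a small fraction of the whole tissue, and current MIL-based approaches often process images uniformly, discarding inter-patch interactions. To alleviate these issues, we propose ScoreNet, a new efficient transformer that exploits a differentiable recommendation stage to extract discriminative image regions and dedicate computational resources accordingly. The proposed transformer leverages the local and global attention of a few dynamically recommended high-resolution regions at an efficient computational cost. We further introduce a novel mixing data augmentation, namely ScoreMix, which leverages the image's semantic distribution to guide the data mixing and produce coherent sample-label pairs. ScoreMix is embarrassingly simple and mitigates the pitfalls of previous augmentations, which assume a uniform semantic distribution and risk mislabeling the samples. Thorough experiments and ablation studies on three breast cancer histology datasets of Haematoxylin & Eosin (H&E) stained images validate the superiority of our approach over prior art, including transformer-based models for tumour regions-of-interest (TRoIs) classification. Compared with other mixing augmentation variants, ScoreNet equipped with the proposed ScoreMix augmentation demonstrates better generalization capabilities and achieves new state-of-the-art (SOTA) results with only 50% of the data. Finally, ScoreNet yields high efficacy and outperforms SOTA efficient transformers, namely TransPath and SwinTransformer.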

Handcrafted Histological Transformer (H2T): Unsupervised Representation of Whole Slide Images

Quoc Dang Vu , Kashif Rajpoot , Shan E Ahmed Raza , Nasir Rajpoot

Categories: Computer Vision

2022-02-14

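Diagnostic, prognostic, and therapeutic decision-making for cancer in pathology clinics can now be based on the analysis of multi-gigapixel tissue images, also known as whole-slide images (WSIs). Recently, deep convolutional neural networks (CNNs) have been proposed to derive unsupervised WSI representations. These are attractive because they rely less on cumbersome expert annotation. However, a major trade-off is that higher predictive power generally comes at the cost of interpretability, which poses a challenge to their clinical use, where transparency in decision-making is generally expected. To address this challenge, we present a handcrafted framework based on deep CNNs for constructing holistic WSI-level representations. Building on recent findings about the internal workings of the Transformer in natural language processing, we break down its processes and handcraft them into a more transparent framework that we term the Handcrafted Histological Transformer, or H2T. In experiments on various datasets comprising a total of 5,306 WSIs, H2T-based holistic WSI-level representations offer competitive performance compared with recent state-of-the-art methods and can be readily used for various downstream analysis tasks. Finally, our results show that the H2T framework can be up to 14 times faster than Transformer models.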

A Survey of Visual Transformers

Yang Liu , Yao Zhang , Yixin Wang , Feng Hou , Jin Yuan , Jiang Tian , Yang Zhang , Zhongchao Shi , Jianping Fan , Zhiqiang He

Categories: Computer Vision

2021-11-11

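The Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently adapted Transformer-like architectures to computer vision (CV), demonstrating their effectiveness on various CV tasks. Relying on competitive modeling capability, visual Transformers have achieved impressive performance compared with modern convolutional neural networks (CNNs). In this paper, we provide a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), and propose a taxonomy that organizes these methods according to their motivations, structures, and usage scenarios. Because of differences in training settings and target tasks, we also evaluate these methods under different configurations, rather than only on various benchmarks, for easy and intuitive comparison. Furthermore, we reveal a series of essential but unexploited aspects that may enable Transformers to stand out from numerous architectures, such as slack high-level semantic embeddings that bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investment.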

Transformers in Medical Image Analysis: A Review

Kelei He , Chen Gan , Zhuoyuan Li , Islem Rekik , Zihao Yin , Wen Ji , Yang Gao , Qian Wang , Junfeng Zhang , Dinggang Shen

Categories: Computer Vision

2022-02-24

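Transformers have dominated the field of natural language processing and have recently influenced computer vision. In medical image analysis, Transformers have also been successfully applied to full-stack clinical applications, including image synthesis/reconstruction, registration, segmentation, detection, and diagnosis. Our paper aims to promote awareness and application of Transformers in the field of medical image analysis. Specifically, we first give an overview of the core concepts of the attention mechanism built into Transformers and of their other basic components. Second, we review various Transformer architectures tailored for medical image applications and discuss their limitations. Within this review, we investigate key challenges around the use of Transformers in different learning paradigms, improving model efficiency, and their coupling with other techniques. We hope this review gives readers in the field of medical image analysis a comprehensive picture of Transformers.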

TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification

Zhuchen Shao , Hao Bian , Yang Chen , Yifeng Wang , Jian Zhang , Xiangyang Ji , Yongbing Zhang

Categories: Computer Vision

2021-06-02

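Multiple instance learning (MIL) is a powerful tool for weakly supervised classification in whole slide image (WSI) based pathology diagnosis. However, current MIL methods are usually based on the independent and identically distributed (i.i.d.) hypothesis and thus neglect the correlation among different instances. To address this problem, we propose a new framework, called correlated MIL, and provide a proof of convergence. Based on this framework, we devise a Transformer-based MIL (TransMIL), which explores both morphological and spatial information. The proposed TransMIL can effectively handle unbalanced/balanced and binary/multi-class classification, with good visualization and interpretability. We conducted various experiments on three different computational pathology problems and achieved better performance and faster convergence than state-of-the-art methods. The test AUC for binary tumor classification reaches up to 93.09% on the CAMELYON16 dataset. For cancer subtype classification, the AUC reaches up to 96.03% and 98.82% on the TCGA-NSCLC and TCGA-RCC datasets, respectively. The implementation is available at: https://github.com/szc19990412/transmil.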

Multi-Scale Relational Graph Convolutional Network for Multiple Instance Learning in Histopathology Images

Roozbeh Bazargani , Ladan Fazli , Larry Goldenberg , Martin Gleave , Ali Bashashati , Septimiu Salcudean

Categories: Computer Vision

2022-12-17

Graph convolutional neural networks have shown significant potential in natural and histopathology images. However, their use has only been studied in a single magnification or multi-magnification with late fusion. In order to leverage the multi-magnification information and early fusion with graph convolutional networks, we handle different embedding spaces at each magnification by introducing the Multi-Scale Relational Graph Convolutional Network (MS-RGCN) as a multiple instance learning method. We model histopathology image patches and their relation with neighboring patches and patches at other scales (i.e., magnifications) as a graph. To pass the information between different magnification embedding spaces, we define separate message-passing neural networks based on the node and edge type. We experiment on prostate cancer histopathology images to predict the grade groups based on the extracted features from patches. We also compare our MS-RGCN with multiple state-of-the-art methods with evaluations on both source and held-out datasets. Our method outperforms the state-of-the-art on both datasets and especially on the classification of grade groups 2 and 3, which are significant for clinical decisions for patient management. Through an ablation study, we test and show the value of the pertinent design features of the MS-RGCN.

Incorporating intratumoral heterogeneity into weakly-supervised deep learning models via variance pooling

Iain Carmichael , Andrew H. Song , Richard J. Chen , Drew F. K. Williamson , Tiffany Y. Chen , Faisal Mahmood

Categories: Computer Vision | Machine Learning

2022-06-17

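Supervised learning tasks such as cancer survival prediction from gigapixel whole slide images (WSIs) are a critical challenge in computational pathology that requires modeling complex features of the tumor microenvironment. These learning tasks are often solved with deep multi-instance learning (MIL) models that do not explicitly capture intratumoral heterogeneity. We develop a novel variance pooling architecture that enables a MIL model to incorporate intratumoral heterogeneity into its predictions. Two interpretability tools based on representative patches are illustrated to probe the biological signals captured by these models. An empirical study with 4,479 gigapixel WSIs from the Cancer Genome Atlas shows that adding variance pooling to MIL frameworks improves survival prediction performance for five cancer types.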
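
The idea of variance pooling can be sketched in a few lines: besides an average over patch features, the bag representation also carries the spread of the patches along a set of projection directions, which is one way to expose heterogeneity to the downstream predictor. Everything below is a simplified assumption for illustration: the projection vectors are fixed rather than learned, the mean is unweighted rather than attention-based, and the names `variance_pool`, `mean_pool`, and `bag_representation` are hypothetical.

```python
from statistics import fmean, pvariance


def mean_pool(bag):
    """Element-wise mean of the patch feature vectors in a bag."""
    return [fmean(column) for column in zip(*bag)]


def variance_pool(bag, projections):
    """Variance of the patch features along each projection vector.

    bag:         list of patch feature vectors (lists of floats)
    projections: list of direction vectors; fixed here, whereas a trained
                 model would presumably learn them (an assumption).
    """
    pooled = []
    for w in projections:
        # Scalar score of every patch along direction w.
        scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in bag]
        pooled.append(pvariance(scores))
    return pooled


def bag_representation(bag, projections):
    """Concatenate mean-pooled and variance-pooled features."""
    return mean_pool(bag) + variance_pool(bag, projections)


bag = [[0.0, 1.0], [2.0, 1.0]]          # two patches, two features each
axes = [[1.0, 0.0], [0.0, 1.0]]         # project onto each feature axis
print(bag_representation(bag, axes))    # [1.0, 1.0, 1.0, 0.0]
```

In the toy bag, the two patches differ only in the first feature, so the variance term is nonzero along the first axis and zero along the second, information that mean pooling alone would discard.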

Multiplex-detection Based Multiple Instance Learning Network for Whole Slide Image Classification

Zhikang Wang , Yue Bi , Tong Pan , Chris Bain , Richard Bassed , Seiya Imoto , Jianhua Yao , Jiangning Song

Categories: Computer Vision

2022-08-06

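Multiple instance learning (MIL) is a powerful approach for classifying whole slide images (WSIs) in diagnostic pathology. A fundamental challenge of MIL for WSI classification is to discover the critical instances that trigger the bag label. However, previous methods are primarily designed under the independent and identically distributed (i.i.d.) hypothesis, ignoring either the correlations between instances or the heterogeneity of tumours. In this paper, we propose a novel multiplex-detection-based multiple instance learning (MDMIL) method to tackle these issues. Specifically, MDMIL is built from an internal query generation module (IQGM) and a multiplex detection module (MDM), and is assisted by a memory-based contrastive loss during training. First, the IQGM gives the probability of instances and generates the internal query (IQ) for the subsequent MDM by aggregating highly reliable features after a distribution analysis. Second, within the MDM, multiplex-detection cross-attention (MDCA) and multi-head self-attention (MHSA) cooperate to generate the final representation of the WSI. In this process, the IQ and a trainable variational query (VQ) build up the connections between instances and significantly improve the model's robustness to heterogeneous tumours. Finally, to further enforce constraints in the feature space and stabilize the training process, we adopt a memory-based contrastive loss, which is practicable for WSI classification even with a single sample as input in each iteration. We conduct experiments on three computational pathology datasets: CAMELYON16, TCGA-NSCLC, and TCGA-RCC. The superior accuracy and AUC demonstrate the superiority of the proposed MDMIL over other state-of-the-art methods.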

A Data-scalable Transformer for Medical Image Segmentation: Architecture, Model Efficiency, and Benchmark

Yunhe Gao , Mu Zhou , Di Liu , Zhennan Yan , Shaoting Zhang , Dimitris N. Metaxas

Categories: Computer Vision

2022-02-28

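Transformers, the new generation of neural architectures, have demonstrated remarkable performance in natural language processing and computer vision. However, existing vision Transformers struggle to learn from limited medical data and are unable to generalize across diverse medical image tasks. To tackle these challenges, we present MedFormer, a data-scalable Transformer for generalizable medical image segmentation. The key designs incorporate a desirable inductive bias, hierarchical modeling with linear-complexity attention, and multi-scale feature fusion in a spatially and semantically global manner. MedFormer can learn from tiny- to large-scale data without pre-training. Extensive experiments demonstrate the potential of MedFormer as a general segmentation backbone, outperforming CNNs and vision Transformers on three public datasets with multiple modalities (e.g., CT and MRI) and diverse medical targets (e.g., healthy organs, diseased tissue, and tumors). We make the models and evaluation pipeline publicly available, offering solid baselines and unbiased comparisons to promote a wide range of downstream clinical applications.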

Colorectal cancer survival prediction using deep distribution based multiple-instance learning

Xingyu Li , Jitendra Jonnagaddala , Min Cen , Hong Zhang , Xu Steven Xu

Categories: Computer Vision

2022-04-24

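Several deep learning algorithms have been developed to predict the survival of cancer patients from whole slide images (WSIs). However, identifying image phenotypes within WSIs that are relevant to patient survival and disease progression is difficult for both clinicians and deep learning algorithms. Most deep learning based multiple instance learning (MIL) algorithms for survival prediction use either top instances (e.g., max pooling) or top/bottom instances (e.g., MesoNet) to identify image phenotypes. In this study, we hypothesize that holistic information about the distribution of patch scores within a WSI can better predict cancer survival. We developed a distribution-based multiple-instance survival learning algorithm (DeepDisMISL) to validate this hypothesis. We designed and executed experiments using two large international colorectal cancer WSI datasets: MCO CRC and TCGA COAD-READ. Our results suggest that the more information about the distribution of patch scores for a WSI is included, the better the prediction performance. Including multiple neighborhood instances around each selected distribution location (e.g., percentiles) can further improve the prediction. DeepDisMISL demonstrated superior predictive ability compared with other recently published state-of-the-art algorithms. Furthermore, our algorithm is interpretable and can help in understanding the relationship between cancer morphological phenotypes and patients' cancer survival risk.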
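
The contrast between top-instance pooling and distribution-based selection can be made concrete with a small sketch. Instead of keeping only the highest patch scores, the function below picks the scores found at several percentiles of the sorted score distribution, plus up to `k` neighbours on each side. The function name, the percentile-to-index rule, and the neighbour window are assumptions made for illustration and are not taken from the paper.

```python
def percentile_instances(scores, percentiles, k=0):
    """Select patch scores at given percentiles of the score distribution.

    scores:      patch-level scores of one slide
    percentiles: positions in [0, 100] to sample from the sorted scores
    k:           number of neighbouring instances kept on each side
    Index selection by rounding is a simplification chosen for this sketch.
    """
    ordered = sorted(scores)
    n = len(ordered)
    picked = []
    for p in percentiles:
        center = round(p / 100 * (n - 1))
        lo = max(0, center - k)        # clip the window at the lowest score
        hi = min(n - 1, center + k)    # clip the window at the highest score
        picked.extend(ordered[lo:hi + 1])
    return picked


scores = [7, 3, 10, 0, 5, 1, 9, 2, 8, 4, 6]
# Max pooling would keep only the top score.
print(max(scores))                                        # 10
# Distribution-based selection also keeps the middle and the bottom.
print(percentile_instances(scores, [0, 50, 100], k=1))    # [0, 1, 4, 5, 6, 9, 10]
```

The selected scores would then be passed to the survival predictor in place of the single top score, so that the shape of the whole score distribution, not just its extreme, can inform the prediction.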

DQnet: Cross-Model Detail Querying for Camouflaged Object Detection

Wei Sun , Chengao Liu , Linyan Zhang , Yu Li , Pengxu Wei , Chang Liu , Jialing Zou , Jianbin Jiao , Qixiang Ye

Categories: Computer Vision

2022-12-16

Camouflaged objects blend seamlessly into their surroundings, which makes detecting them a challenging task in computer vision. Optimizing a convolutional neural network (CNN) for camouflaged object detection (COD) tends to activate local discriminative regions while ignoring the complete object extent, causing a partial activation issue that inevitably leads to missing or redundant object regions. In this paper, we argue that partial activation is caused by the intrinsic characteristics of CNNs, where convolution operations produce local receptive fields and have difficulty capturing long-range feature dependencies among image regions. To obtain feature maps that activate the full object extent while keeping the segmentation results from being overwhelmed by noisy features, a novel framework termed Cross-Model Detail Querying network (DQnet) is proposed. It reasons about the relations between long-range-aware representations and multi-scale local details so that the enhanced representation fully highlights the object regions and eliminates noise in non-object regions. Specifically, a vanilla ViT pretrained with self-supervised learning (SSL) is employed to model long-range dependencies among image regions. A ResNet is employed to learn fine-grained spatial local details at multiple scales. Then, to effectively retrieve object-related details, a Relation-Based Querying (RBQ) module is proposed to explore window-based interactions between the global representations and the multi-scale local details. Extensive experiments on widely used COD datasets show that our DQnet outperforms the current state of the art.

Representation Separation for Semantic Segmentation with Vision Transformers

Yuanduo Hong , Huihui Pan , Weichao Sun , Xinghu Yu , Huijun Gao

Categories: Computer Vision | Artificial Intelligence

2022-12-28

Vision transformers (ViTs), which encode an image as a sequence of patches, bring new paradigms for semantic segmentation. We present an efficient framework of representation separation at the local-patch level and the global-region level for semantic segmentation with ViTs. It targets the peculiar over-smoothness of ViTs in semantic segmentation, and therefore differs from currently popular paradigms of context modeling and from most existing related methods, which reinforce the advantage of attention. We first deliver a decoupled two-pathway network in which a second pathway enhances and passes down local-patch discrepancy complementary to the global representations of transformers. We then propose a spatially adaptive separation module to obtain more separate deep representations, and a discriminative cross-attention that yields more discriminative region representations through novel auxiliary supervisions. The proposed methods achieve some impressive results: 1) incorporated with large-scale plain ViTs, our methods achieve new state-of-the-art performance on five widely used benchmarks; 2) using masked pre-trained plain ViTs, we achieve 68.9% mIoU on Pascal Context, setting a new record; 3) pyramid ViTs integrated with the decoupled two-pathway network even surpass well-designed high-resolution ViTs on Cityscapes; 4) the representations improved by our framework transfer favorably to images with natural corruptions. The code will be released publicly.

Transformers in Vision: A Survey

Salman Khan , Muzammal Naseer , Munawar Hayat , Syed Waqas Zamir , Fahad Shahbaz Khan , Mubarak Shah

Categories:

2021-01-04

Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.

MSHT: Multi-stage Hybrid Transformer for the ROSE Image Analysis of Pancreatic Cancer

Tianyi Zhang , Yunlu Feng , Yu Zhao , Guangda Fan , Aiming Yang , Shangqin Lyu , Peng Zhang , Fan Song , Chenbin Ma , Yangyang Sun

Categories: Computer Vision | Machine Learning

2021-12-27

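Pancreatic cancer is one of the most malignant cancers in the world; it deteriorates rapidly and has a very high mortality rate. The rapid on-site evaluation (ROSE) technique innovates the workflow by having on-site pathologists immediately analyze fast-stained cytopathological images, which enables faster diagnosis in this time-pressured process. However, the wider adoption of ROSE diagnosis has been hindered by the lack of experienced pathologists. To overcome this problem, we propose a hybrid high-performance deep learning model that enables an automated workflow, thus freeing up pathologists' valuable time. By introducing the Transformer block into this field with our particular multi-stage hybrid design, the spatial features generated by a convolutional neural network (CNN) significantly enhance the Transformer's global modeling. Using multi-stage spatial features as global attention guidance, this design combines the robustness from the inductive bias of CNNs with the sophisticated global modeling power of the Transformer. A dataset of 4,240 ROSE images is collected to evaluate the method in this unexplored field. The proposed multi-stage hybrid Transformer (MSHT) achieves a classification accuracy of 95.68%, which is distinctly higher than that of state-of-the-art models. Facing the need for interpretability, MSHT outperforms its counterparts with more accurate attention regions. The results demonstrate that MSHT can distinguish cancer samples accurately at an unprecedented image scale, laying the foundation for deploying automatic decision systems and for expanding ROSE in clinical practice. The code and records are available at: https://github.com/sagizty/multi-stage-ybrid-transformer.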

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Chun-Fu Chen , Quanfu Fan , Rameswar Panda

Categories:

2021-03-27

The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer to combine image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity and these tokens are then fused purely by attention multiple times to complement each other. Furthermore, to reduce computation, we develop a simple yet effective token fusion module based on cross attention, which uses a single token for each branch as a query to exchange information with other branches. Our proposed cross-attention only requires linear time for both computational and memory complexity instead of quadratic time otherwise. Extensive experiments demonstrate that our approach performs better than or on par with several concurrent works on vision transformer, in addition to efficient CNN models. For example, on the ImageNet1K dataset, with some architectural changes, our approach outperforms the recent DeiT by a large margin of 2% with a small to moderate increase in FLOPs and model parameters. Our source codes and models are available at https://github.com/IBM/CrossViT.
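
The linear-cost claim follows from using a single query: when only one token per branch attends to the tokens of the other branch, the attention map has one row instead of a full token-by-token matrix. The sketch below shows that single-query cross-attention in plain Python. It is a simplification made for illustration: learned query/key/value projections, multiple heads, and residual connections are omitted, and the function name `cls_cross_attention` is hypothetical rather than part of the CrossViT code base.

```python
import math


def cls_cross_attention(cls_token, tokens):
    """Single-query scaled dot-product attention.

    cls_token: one query vector from branch A (list of floats)
    tokens:    patch token vectors from branch B, used as keys and values
    Cost is linear in len(tokens) because there is exactly one query.
    """
    d = len(cls_token)
    # One logit per token of the other branch.
    logits = [sum(q * k for q, k in zip(cls_token, t)) / math.sqrt(d)
              for t in tokens]
    # Numerically stable softmax over the logits.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors gives the updated query token.
    fused = [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)]
    return fused, weights


fused, weights = cls_cross_attention([4.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights[0] > weights[1])  # True: the query aligns with the first token
```

With N tokens in the other branch, this performs N dot products and one weighted sum, whereas full self-attention over all tokens would need on the order of N squared dot products.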

Fully Transformer Networks for Semantic Image Segmentation

Sitong Wu , Tianyi Wu , Fangjian Lin , Shengwei Tian , Guodong Guo

Categories: Computer Vision

2021-06-08

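Transformers have shown impressive performance in various natural language processing and computer vision tasks, thanks to their ability to model long-range dependencies. Recent progress has shown that combining such transformers with CNN-based semantic image segmentation models is very promising. However, how well a pure transformer-based approach can perform on image segmentation has not yet been well studied. In this work, we explore a novel framework for semantic image segmentation: encoder-decoder based Fully Transformer Networks (FTN). Specifically, we first propose a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features while reducing the computational complexity of the standard vision transformer (ViT). Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation. Surprisingly, this simple baseline achieves better results on multiple challenging semantic segmentation and face parsing benchmarks, including PASCAL Context, ADE20K, COCO-Stuff, and CelebAMask-HQ. The source code will be released at https://github.com/br -dl/paddlevit.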

CSformer: Bridging Convolution and Transformer for Compressive Sensing

Dongjie Ye , Zhangkai Ni , Hanli Wang , Jian Zhang , Shiqi Wang , Sam Kwong

Categories: Computer Vision

2021-12-31

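Convolutional neural networks (CNNs) have succeeded in compressive image sensing. However, due to the inductive biases of locality and weight sharing, convolution operations have intrinsic limitations in modeling long-range dependencies. The Transformer, originally designed as a sequence-to-sequence model, excels at capturing global context thanks to its self-attention-based architecture, even though it may have limited localization abilities. This paper proposes CSformer, a hybrid framework that integrates the advantages of both the detailed spatial information from CNNs and the global context provided by the Transformer for enhanced representation learning. The proposed approach is an end-to-end compressive image sensing method composed of adaptive sampling and recovery. In the sampling module, images are measured block by block with a learned sampling matrix. In the reconstruction stage, the measurements are projected into dual stems. One is a CNN stem that models neighborhood relationships by convolution, and the other is a Transformer stem that adopts a global self-attention mechanism. The dual-branch structure is concurrent, and the local features and global representations are fused at different resolutions to maximize the complementarity of the features. Furthermore, we explore a progressive strategy and a window-based Transformer block to reduce the number of parameters and the computational complexity. Experimental results demonstrate the effectiveness of the dedicated Transformer-based architecture for compressive sensing, which achieves superior performance compared with state-of-the-art methods on different datasets.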
