Smart Paper Notes

RTN: Reinforced Transformer Network for Coronary CT Angiography Vessel-level Image Quality Assessment

Yiting Lu , Jun Fu , Xin Li , Wei Zhou , Sen Liu , Xinxin Zhang , Congfu Jia , Ying Liu , Zhibo Chen

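Category: Computer Vision

2022-07-13

Coronary CT angiography (CCTA) is susceptible to various distortions (e.g., artifacts and noise), which severely compromise the accurate diagnosis of cardiovascular diseases. An appropriate CCTA vessel-level image quality assessment (CCTA VIQA) algorithm can reduce the risk of misdiagnosis. The primary challenge of CCTA VIQA is that the local part of the coronary artery that determines the final quality is hard to locate. To tackle this challenge, we formulate CCTA VIQA as a multiple-instance learning (MIL) problem and use a Transformer-based MIL backbone (termed T-MIL) to aggregate the multiple instances along the coronary centerline into the final quality. However, not all instances are informative for the final quality. Some quality-irrelevant/negative instances interfere with accurate quality assessment (e.g., instances that cover only background, or instances in which the coronary artery is not identifiable). We therefore propose a progressive reinforcement-learning-based instance discarding module (termed PRID) to progressively remove quality-irrelevant/negative instances for CCTA VIQA. Based on these two modules, we propose a Reinforced Transformer Network (RTN) for automatic CCTA VIQA with end-to-end optimization. Extensive experimental results show that the proposed method achieves state-of-the-art performance on a real-world CCTA dataset, exceeding previous MIL methods.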
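
To make the idea of aggregating instances while discarding uninformative ones concrete, here is a minimal sketch. It is not the paper's T-MIL backbone or its reinforcement-learning PRID module: the per-instance `scores` and the boolean `keep` mask are hypothetical inputs that a trained model would have to produce, and the aggregation is a plain softmax-weighted mean.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_bag(features, scores, keep):
    """Softmax-weighted mean of the instances that were not discarded.

    features: list of equal-length feature vectors, one per instance
    scores:   one relevance score per instance (hypothetical, normally learned)
    keep:     list of booleans; False marks an instance discarded as uninformative
    """
    kept = [(f, s) for f, s, k in zip(features, scores, keep) if k]
    if not kept:
        raise ValueError("at least one instance must be kept")
    weights = softmax([s for _, s in kept])
    dim = len(kept[0][0])
    # Weighted sum over the kept instances, one output value per feature dimension.
    return [sum(w * f[d] for w, (f, _) in zip(weights, kept)) for d in range(dim)]


# Hypothetical bag: the third instance is treated as background and discarded.
bag = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
pooled = aggregate_bag(bag, scores=[0.0, 0.0, 10.0], keep=[True, True, False])
```

With the third instance masked out, its large score no longer dominates the pooled representation, which is the intuition behind removing quality-irrelevant instances before aggregation.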

translated by Google Translate

Multiplex-detection Based Multiple Instance Learning Network for Whole Slide Image Classification

Zhikang Wang , Yue Bi , Tong Pan , Chris Bain , Richard Bassed , Seiya Imoto , Jianhua Yao , Jiangning Song

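Category: Computer Vision

2022-08-06

Multiple instance learning (MIL) is a powerful approach for classifying whole slide images (WSIs) in diagnostic pathology. A fundamental challenge of MIL for WSI classification is to discover the critical instances that trigger the bag label. However, previous methods are primarily designed under the independent and identically distributed (i.i.d.) assumption, ignoring either the correlations between instances or the heterogeneity of tumours. In this paper, we propose a novel multiplex-detection-based multiple instance learning method (MDMIL) to address these issues. Specifically, MDMIL is built from an internal query generation module (IQGM) and a multiplex detection module (MDM), and is assisted by a memory-based contrastive loss during training. First, IQGM estimates the probability of each instance and generates the internal query (IQ) for the subsequent MDM by aggregating highly reliable features after distribution analysis. Second, in MDM, multiplex-detection cross-attention (MDCA) and multi-head self-attention (MHSA) cooperate to generate the final representation of the WSI. In this process, the IQ and a trainable variational query (VQ) build connections between instances and significantly improve the model's robustness to heterogeneous tumours. Finally, to further enforce constraints in the feature space and stabilize training, we adopt a memory-based contrastive loss, which is practicable for WSI classification even with a single sample as input in each iteration. We conduct experiments on three computational pathology datasets: CAMELYON16, TCGA-NSCLC, and TCGA-RCC. The higher accuracy and AUC demonstrate the advantage of the proposed MDMIL over other state-of-the-art methods.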

Learning Transformer Features for Image Quality Assessment

Chao Zeng , Sam Kwong

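Category: Computer Vision

2021-12-01

Objective image quality assessment (IQA) is a challenging task that aims to measure the quality of a given image automatically. Depending on the availability of a reference image, there are full-reference (FR) and no-reference (NR) IQA tasks. Most deep learning approaches regress quality from deep features extracted by convolutional neural networks. For the FR task, another option is a statistical comparison of deep features. All of these methods usually neglect non-local information. In addition, the relationship between the FR and NR tasks is little explored. Motivated by the recent success of Transformers in modeling contextual information, we propose a unified IQA framework that uses a CNN backbone and a Transformer encoder to extract features. The proposed framework is compatible with both FR and NR modes and allows for a joint training scheme. Evaluation experiments on three standard IQA datasets (LIVE, CSIQ, and TID2013) and on KONIQ-10K show that the proposed model achieves state-of-the-art FR performance. In addition, comparable NR performance is achieved in extensive experiments, and the results show that the joint training scheme can improve NR performance.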

RLogist: Fast Observation Strategy on Whole-slide Images with Deep Reinforcement Learning

Boxuan Zhao , Jun Zhang , Deheng Ye , Jian Cao , Xiao Han , Qiang Fu , Wei Yang

Category: Computer Vision | Artificial Intelligence | Machine Learning

2022-12-04

Whole-slide images (WSI) in computational pathology have high resolution with gigapixel size, but generally contain sparse regions of interest, which leads to weak diagnostic relevance and data inefficiency for each area in the slide. Most existing methods rely on a multiple instance learning framework that requires densely sampling local patches at high magnification. The limitation is evident in the application stage, where the heavy computation for extracting patch-level features is unavoidable. In this paper, we develop RLogist, a benchmarking deep reinforcement learning (DRL) method for fast observation strategy on WSIs. Imitating the diagnostic logic of human pathologists, our RL agent learns how to find regions of observation value and obtain representative features across multiple resolution levels, without having to analyze each part of the WSI at high magnification. We benchmark our method on two whole-slide level classification tasks: detection of metastases in WSIs of lymph node sections, and subtyping of lung cancer. Experimental results demonstrate that RLogist achieves competitive classification performance compared to typical multiple instance learning algorithms, while having a significantly shorter observation path. In addition, the observation path given by RLogist provides good decision-making interpretability, and its ability of reading path navigation can potentially be used by pathologists for educational/assistive purposes. Our code is available at: https://github.com/tencent-ailab/RLogist.

TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification

Zhuchen Shao , Hao Bian , Yang Chen , Yifeng Wang , Jian Zhang , Xiangyang Ji , Yongbing Zhang

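Category: Computer Vision

2021-06-02

Multiple instance learning (MIL) is a powerful tool for weakly supervised classification in whole slide image (WSI) based pathology diagnosis. However, current MIL methods are usually based on the independent and identically distributed assumption and thus neglect the correlation among different instances. To address this problem, we propose a new framework, called correlated MIL, and provide a proof of convergence. Based on this framework, we devise a Transformer-based MIL (TransMIL), which explores both morphological and spatial information. The proposed TransMIL can effectively handle unbalanced/balanced and binary/multi-class classification, with good visualization and interpretability. We conducted various experiments on three different computational pathology problems and achieved better performance and faster convergence than state-of-the-art methods. The test AUC for binary tumor classification reaches up to 93.09% on the CAMELYON16 dataset. The AUC for cancer subtype classification reaches up to 96.03% and 98.82% on the TCGA-NSCLC and TCGA-RCC datasets, respectively. The implementation is available at: https://github.com/szc19990412/transmil.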

Revisiting Whole-Slide Image Pyramids for Cancer Prognosis via Dual-Stream Networks

Pei Liu , Bo Fu , Feng Ye , Rui Yang , Bin Xu , Luping Ji

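Category: Computer Vision | Machine Learning

2022-06-12

Cancer prognosis on gigapixel whole-slide images (WSIs) has always been a challenging task. Most existing approaches focus solely on single-resolution images. Multi-resolution schemes, which use image pyramids to enhance WSI visual representations, have not yet received enough attention. To explore a multi-resolution solution for improving cancer prognosis accuracy, this paper proposes a dual-stream architecture that models WSIs with an image pyramid strategy. The architecture consists of two sub-streams: one for low-resolution WSIs and the other for high-resolution WSIs. Compared with other approaches, our scheme has three highlights: (i) there is a one-to-one relation between stream and resolution; (ii) a square pooling layer is added to align the patches from the two resolution streams, greatly reducing computation cost and enabling natural stream feature fusion; (iii) a cross-attention-based method is proposed to pool high-resolution patches spatially under the guidance of low-resolution ones. We validate our scheme on three publicly available datasets with a total of 3,101 WSIs from 1,911 patients. Experimental results verify that (1) the hierarchical dual-stream representation is more effective for cancer prognosis than single-stream ones, with an average C-Index rise of 5.0% and 1.8% over a single low-resolution and a single high-resolution stream, respectively; (2) our dual-stream scheme can outperform current state-of-the-art ones, with an average C-Index improvement of 5.1%; (3) cancer diseases with observable survival differences may have different preferences for model complexity. Our scheme can serve as an alternative tool for further facilitating WSI prognosis research.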
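
The results above are reported as a C-Index. As a reading aid, the sketch below computes Harrell's concordance index for right-censored survival data in plain Python. It is a generic textbook formulation, not the authors' evaluation code; the function name and the convention that a higher `risks` value means a shorter expected survival are assumptions made for this example.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data.

    times:  observed time for each subject
    events: 1 if the event (e.g., death) was observed, 0 if censored
    risks:  predicted risk score; higher is assumed to mean earlier event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event received the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: the third subject is censored at time 3.
c = concordance_index(times=[1, 2, 3], events=[1, 1, 0], risks=[2.0, 3.0, 1.0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a 5.0% rise means five more percentage points of correctly ordered comparable pairs.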

Radiomics-Guided Global-Local Transformer for Weakly Supervised Pathology Localization in Chest X-Rays

Yan Han , Gregory Holste , Ying Ding , Ahmed Tewfik , Yifan Peng , Zhangyang Wang

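Category: Computer Vision

2022-07-10

Before the recent success of deep learning methods for automated medical image analysis, practitioners used handcrafted radiomic features to quantitatively describe local patches of medical images. However, extracting discriminative radiomic features relies on accurate pathology localization, which is difficult to obtain in real-world settings. Despite advances in disease classification and localization from chest X-rays, many approaches fail to incorporate clinically informed domain knowledge. For these reasons, we propose a Radiomics-Guided Transformer (RGT) that fuses global image information with local knowledge-guided radiomics information to provide accurate cardiopulmonary pathology localization and classification without any bounding box annotations. RGT consists of an image Transformer branch, a radiomics Transformer branch, and fusion layers that aggregate image and radiomic information. Using the learned self-attention of its image branch, RGT extracts a bounding box for which radiomic features are computed; these features are further processed by the radiomics branch. The learned image and radiomic features are then fused through cross-attention layers. RGT thus uses a novel end-to-end feedback loop that can bootstrap accurate pathology localization using only image-level disease labels. Experiments on the NIH ChestXRay dataset show that RGT outperforms prior work in weakly supervised disease localization (by an average margin of 3.6% over various intersection-over-union thresholds) and classification (by 1.1% in average area under the receiver operating characteristic curve). Code and trained models will be released upon acceptance.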
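
Localization above is scored at several intersection-over-union (IoU) thresholds. The sketch below shows the standard IoU of two axis-aligned boxes and the fraction of predictions that clear a given threshold. The function names, the `(x1, y1, x2, y2)` box layout, and the example boxes are assumptions for illustration, not the paper's evaluation code.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so that disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(preds, gts, threshold):
    """Fraction of predicted boxes whose IoU with the paired ground truth meets the threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) >= threshold)
    return hits / len(preds)


# Hypothetical predictions and ground-truth boxes.
preds = [(0, 0, 2, 2), (0, 0, 1, 1)]
gts = [(1, 1, 3, 3), (5, 5, 6, 6)]
accuracies = {t: localization_accuracy(preds, gts, t) for t in (0.1, 0.3, 0.5)}
```

Averaging such accuracies over the thresholds gives one number per method, which is the kind of quantity an "average margin over various IoU thresholds" compares.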

Transformers in Medical Image Analysis: A Review

Kelei He , Chen Gan , Zhuoyuan Li , Islem Rekik , Zihao Yin , Wen Ji , Yang Gao , Qian Wang , Junfeng Zhang , Dinggang Shen

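Category: Computer Vision

2022-02-24

Transformers have dominated the field of natural language processing and have recently made an impact on computer vision. In medical image analysis, Transformers have also been successfully applied to full-stack clinical applications, including image synthesis/reconstruction, registration, segmentation, detection, and diagnosis. Our paper aims to promote awareness and application of Transformers in the field of medical image analysis. Specifically, we first give an overview of the core concepts of the attention mechanism built into Transformers and of other basic components. Second, we review various Transformer architectures tailored for medical image applications and discuss their limitations. Within this review, we investigate key challenges around the use of Transformers in different learning paradigms, improving model efficiency, and coupling Transformers with other techniques. We hope this review gives readers in the field of medical image analysis a comprehensive picture of Transformers.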

Deep Reinforced Attention Learning for Quality-Aware Visual Recognition

Duo Li , Qifeng Chen

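Category: Computer Vision

2020-07-13

In this paper, we build upon the weakly supervised generation mechanism of intermediate attention maps in any convolutional neural network and expose the effectiveness of attention modules more directly in order to fully exploit their potential. Given an existing neural network equipped with arbitrary attention modules, we introduce a meta critic network to evaluate the quality of the attention maps in the main network. Because our designed reward is discrete, the proposed learning method is arranged in a reinforcement learning setting, where the attention actors and recurrent critics are optimized alternately to provide instant critique and revision of the temporary attention representation; hence the name Deep REinforced Attention Learning (DREAL). It can be applied universally to network architectures with different types of attention modules and promotes their expressive ability by maximizing the relative gain in final recognition performance arising from each individual attention module, as demonstrated by extensive experiments on both category and instance recognition benchmarks.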

Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework

Botao Ye , Hong Chang , Bingpeng Ma , Shiguang Shan , Xilin Chen

Category: Computer Vision

2022-03-22

The current popular two-stream, two-stage tracking framework extracts the template and the search region features separately and then performs relation modeling. As a result, the extracted features lack awareness of the target and have limited target-background discriminability. To tackle this issue, we propose a novel one-stream tracking (OSTrack) framework that unifies feature learning and relation modeling by bridging the template-search image pairs with bidirectional information flows. In this way, discriminative target-oriented features can be dynamically extracted by mutual guidance. Since no extra heavy relation modeling module is needed and the implementation is highly parallelized, the proposed tracker runs at a fast speed. To further improve the inference efficiency, an in-network candidate early elimination module is proposed based on the strong similarity prior calculated in the one-stream framework. As a unified framework, OSTrack achieves state-of-the-art performance on multiple benchmarks. In particular, it shows impressive results on the one-shot tracking benchmark GOT-10k, achieving 73.7% AO and improving the existing best result (SwinTrack) by 4.3%. In addition, our method maintains a good performance-speed trade-off and shows faster convergence. The code and models are available at https://github.com/botaoye/OSTrack.

Learning Tracking Representations via Dual-Branch Fully Transformer Networks

Fei Xie , Chunyu Wang , Guangting Wang , Wankou Yang , Wenjun Zeng

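Category: Computer Vision

2021-12-05

We present a Siamese-like dual-branch network for tracking that is based solely on Transformers. Given a template and a search image, we divide them into non-overlapping patches and extract a feature vector for each patch based on its matching results with the other patches within an attention window. For each token, we estimate whether it contains the target object and the corresponding size. The advantage of this approach is that the features are learned from matching and, ultimately, for matching, so they are aligned with the object tracking task. The method achieves results better than or comparable to the best-performing methods that first use a CNN to extract features and then use a Transformer to fuse them. It outperforms state-of-the-art methods on the GOT-10k and VOT2020 benchmarks. In addition, the method achieves real-time inference speed (about 40 fps) on one GPU. The code and models will be released.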
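
The first step described above, dividing an image into non-overlapping patches, can be sketched in a few lines. This is a generic illustration on a 2D list of values; the function name and the requirement that the patch size divides both image dimensions are assumptions of the example, and the attention-window matching that follows in the paper is not shown.

```python
def split_into_patches(image, patch):
    """Split an H x W grid (list of rows) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order, which is the order in which
    they would typically be turned into a token sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("patch size must divide both image dimensions")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            patches.append(tile)
    return patches


# A 4 x 4 toy image with values 0..15 gives four 2 x 2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = split_into_patches(image, patch=2)
```

Each tile would then be flattened and projected to a feature vector before any matching between template and search patches takes place.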

Feature Re-calibration based MIL for Whole Slide Image Classification

Philip Chikontwe , Soo Jeong Nam , Heounjeong Go , Meejeong Kim , Hyun Jung Sung , Sang Hyun Park

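Category: Computer Vision

2022-06-22

Whole slide image (WSI) classification is a fundamental task for the diagnosis and treatment of diseases; however, curating accurate labels is time-consuming and limits the application of fully supervised methods. To address this, multiple instance learning (MIL) is a popular approach that treats classification as a weakly supervised learning task using only slide-level labels. While current MIL methods apply variants of the attention mechanism to re-weight instance features with stronger models, little attention is paid to the properties of the data distribution. In this work, we propose to re-calibrate the distribution of a WSI bag (its instances) using the statistics of the max-instance (critical) feature. We assume that in binary MIL, positive bags have larger feature magnitudes than negative ones, so we can force the model to maximize the discrepancy between bags with a metric feature loss that models positive bags as out-of-distribution. To achieve this, unlike existing MIL methods that use single-batch training, we propose balanced-batch sampling to use the feature loss effectively, i.e., with positive and negative (+/-) bags simultaneously. Furthermore, we employ a position encoding module (PEM) to model spatial/morphological information, and perform pooling by multi-head self-attention (PSMA) with a Transformer encoder. Experimental results on existing benchmark datasets show that our approach is effective and improves over state-of-the-art MIL methods.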

Multiple Instance Learning with Mixed Supervision in Gleason Grading

Hao Bian , Zhuchen Shao , Yang Chen , Yifeng Wang , Haoqian Wang , Jian Zhang , Yongbing Zhang

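Category: Computer Vision

2022-06-26

With the development of computational pathology, deep learning methods for Gleason grading from whole slide images (WSIs) have excellent prospects. Because WSIs are extremely large, the image annotation usually consists of only a slide-level label or limited pixel-level labels. The current mainstream approach adopts multiple instance learning to predict Gleason grades. However, some methods consider only the slide-level label and ignore the limited pixel-level labels, which contain rich local information. Furthermore, methods that additionally consider the pixel-level labels ignore the inaccuracy of those labels. To address these problems, we propose a mixed-supervision Transformer based on the multiple instance learning framework. The model uses both the slide-level label and instance-level labels to achieve more accurate Gleason grading at the slide level. The impact of inaccurate instance-level labels is further reduced by introducing an efficient random masking strategy in the mixed-supervision training process. We achieve state-of-the-art performance on the SICAPv2 dataset, and visual analysis shows accurate prediction results at the instance level. The source code is available at https://github.com/bianhao123/mixed_supervision.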

Class-Aware Adversarial Transformers for Medical Image Segmentation

Chenyu You , Ruihan Zhao , Fenglin Liu , Siyuan Dong , Sandeep Chinchali , Ufuk Topcu , Lawrence Staib , James S. Duncan

Category: Computer Vision | Artificial Intelligence | Machine Learning

2022-01-26

Transformers have made remarkable progress towards modeling long-range dependencies within the medical image analysis domain. However, current transformer-based models suffer from several disadvantages: (1) existing methods fail to capture the important features of the images due to the naive tokenization scheme; (2) the models suffer from information loss because they only consider single-scale feature representations; and (3) the segmentation label maps generated by the models are not accurate enough without considering rich semantic contexts and anatomical textures. In this work, we present CASTformer, a novel type of adversarial transformers, for 2D medical image segmentation. First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations. We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures. Lastly, we utilize an adversarial training strategy that boosts segmentation accuracy and correspondingly allows a transformer-based discriminator to capture high-level semantically correlated contents and low-level anatomical features. Our experiments demonstrate that CASTformer dramatically outperforms previous state-of-the-art transformer-based approaches on three benchmarks, obtaining 2.54%-5.88% absolute improvements in Dice over previous models. Further qualitative experiments provide a more detailed picture of the model's inner workings, shed light on the challenges in improved transparency, and demonstrate that transfer learning can greatly improve performance and reduce the size of medical image datasets in training, making CASTformer a strong starting point for downstream medical image analysis tasks.
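
The improvements above are stated in Dice. For readers unfamiliar with the metric, the following minimal sketch computes the Dice coefficient for two binary masks given as flat lists. The function name and the convention of returning 1.0 when both masks are empty are assumptions of this example rather than anything specified in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy masks: one of two predicted foreground pixels overlaps the target.
score = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

An absolute improvement of 2.54% in Dice then means the score rises by 2.54 percentage points, for example from 0.80 to roughly 0.825.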

An Entropy-guided Reinforced Partial Convolutional Network for Zero-Shot Learning

Yun Li , Zhe Liu , Lina Yao , Xianzhi Wang , Julian McAuley , Xiaojun Chang

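Category: Computer Vision | Machine Learning

2021-11-03

Zero-shot learning (ZSL) aims to transfer knowledge learned from observed classes to unseen classes via semantic correlations. A promising strategy is to learn a global-local representation that combines global information with additional localities (i.e., small parts/regions of the input). However, existing methods discover localities based on explicit features without exploring the inherent properties of, and relationships among, regions. In this work, we propose a novel Entropy-guided Reinforced Partial Convolutional Network (ERPCNet), which extracts and aggregates localities based on semantic relevance and visual correlations without human-annotated regions. ERPCNet uses reinforced partial convolution and entropy guidance; it not only discovers global-cooperative localities dynamically but also converges faster under policy gradient optimization. We demonstrate ERPCNet's performance through comparisons under ZSL and generalized zero-shot learning (GZSL) settings on four benchmark datasets. We also show through visualization analysis that ERPCNet is time-efficient and explainable.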

TransCrowd: weakly-supervised crowd counting with transformers

Dingkang Liang , Xiwu Chen , Wei Xu , Yu Zhou , Xiang Bai

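Category: Computer Vision

2021-04-19

Mainstream crowd counting methods usually use a convolutional neural network (CNN) to regress a density map, which requires point-level annotations. However, annotating each person with a point is an expensive and laborious process. During the testing phase, point-level annotations are not used to evaluate counting accuracy, which means they are redundant. It is therefore desirable to develop weakly supervised counting methods that rely only on count-level annotations, a more economical way of labeling. Current weakly supervised counting methods adopt a CNN to regress the total count of the crowd in an image-to-count paradigm. However, a limited receptive field for context modeling is an intrinsic limitation of these weakly supervised CNN-based methods. These methods therefore cannot achieve satisfactory performance, and their real-world applications are limited. The Transformer is a popular sequence-to-sequence prediction model in natural language processing (NLP) that has a global receptive field. In this paper, we propose TransCrowd, which reformulates the weakly supervised crowd counting problem from a Transformer-based sequence-to-count perspective. We observe that the proposed TransCrowd can effectively extract semantic crowd information using the self-attention mechanism of the Transformer. To the best of our knowledge, this is the first work to adopt a pure Transformer for crowd counting research. Experiments on five benchmark datasets show that the proposed TransCrowd outperforms all weakly supervised CNN-based counting methods and achieves highly competitive counting performance compared with some popular fully supervised counting methods.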

DGMIL: Distribution Guided Multiple Instance Learning for Whole Slide Image Classification

Linhao Qu , Xiaoyuan Luo , Shaolei Liu , Manning Wang , Zhijian Song

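Category: Computer Vision

2022-06-17

Multiple instance learning (MIL) is widely used for analyzing histopathological whole slide images (WSIs). However, existing MIL methods do not explicitly model the data distribution; instead, they only learn a bag-level or instance-level decision boundary discriminatively by training a classifier. In this paper, we propose DGMIL: a feature-distribution-guided deep MIL framework for WSI classification and positive patch localization. Instead of designing complex discriminative network architectures, we show that the inherent feature distribution of histopathological image data can serve as a very effective guide for classification. We propose a cluster-conditioned feature distribution modeling method and a pseudo-label-based iterative feature space refinement strategy so that positive and negative instances can be easily separated in the final feature space. Experiments on the CAMELYON16 dataset and the TCGA Lung Cancer dataset show that our method achieves new SOTA results for both global classification and positive patch localization tasks.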

Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics

Chunyuan Li , Xinliang Zhu , Jiawen Yao , Junzhou Huang

Category: Computer Vision | Machine Learning

2022-11-29

Learning good representation of giga-pixel level whole slide pathology images (WSI) for downstream tasks is critical. Previous studies employ multiple instance learning (MIL) to represent WSIs as bags of sampled patches because, for most occasions, only slide-level labels are available, and only a tiny region of the WSI is disease-positive area. However, WSI representation learning still remains an open problem due to: (1) patch sampling on a higher resolution may be incapable of depicting microenvironment information such as the relative position between the tumor cells and surrounding tissues, while patches at lower resolution lose the fine-grained detail; (2) extracting patches from giant WSI results in large bag size, which tremendously increases the computational cost. To solve the problems, this paper proposes a hierarchical-based multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes. Precisely, we randomly extract instant-level patch features from WSIs with different magnification. Then a co-attention mapping between imaging and genomics is learned to uncover the pairwise interaction and reduce the space complexity of imaging features. Such early fusion makes it computationally feasible to use MIL Transformer for the survival prediction task. Our architecture requires fewer GPU resources compared with benchmark methods while maintaining better WSI representation ability. We evaluate our approach on five cancer types from the Cancer Genome Atlas database and achieved an average c-index of 0.673, outperforming the state-of-the-art multimodality methods.

Learning Spatial-Frequency Transformer for Visual Object Tracking

Chuanming Tang , Xiao Wang , Yuanchao Bai , Zhe Wu , Jianlin Zhang , Yongmei Huang

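Category: Computer Vision

2022-08-18

Recent trackers adopt the Transformer to complement or replace the widely used ResNet as their new backbone network. Although these trackers work well in regular scenarios, they simply flatten the 2D features into a sequence to better fit the Transformer. We believe these operations ignore the spatial prior of the target object, which may lead to sub-optimal results. In addition, many works demonstrate that self-attention is actually a low-pass filter, independent of the input features or key/query. That is, it may suppress the high-frequency components of the input features and preserve or even amplify the low-frequency information. To handle these issues, in this paper we propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously. Specifically, the Gaussian spatial prior is generated using dual multi-layer perceptrons (MLPs) and injected into the similarity matrix produced by multiplying the query and key features in self-attention. The output is fed into a softmax layer and then decomposed into two components, i.e., the direct signal and the high-frequency signal. The low-pass and high-pass branches are rescaled and combined to achieve all-pass, so that high-frequency features are well preserved in stacked self-attention layers. We further integrate the Spatial-Frequency Transformer into the Siamese tracking framework and propose a novel tracking algorithm, termed SFTransT. A cross-scale fusion based SwinTransformer is adopted as the backbone, and a multi-head cross-attention module is used to boost the interaction between search and template features. The output is fed into the tracking head for target localization. Extensive experiments on both short-term and long-term tracking benchmarks demonstrate the effectiveness of the proposed framework.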
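
The decomposition of a softmax attention matrix into a direct (low-frequency) part and a high-frequency part can be illustrated with a small sketch. It assumes one common formulation, in which the direct component is the uniform averaging matrix with every entry equal to 1/n and the high-frequency component is the remainder; the abstract does not give the exact formula, and the rescaling weights `alpha` and `beta` are hypothetical parameters, so this should be read as an illustration rather than the authors' GPHA module.

```python
def decompose_attention(attn):
    """Split a square row-stochastic attention matrix into low- and high-frequency parts.

    low:  uniform averaging matrix (assumed form of the direct component)
    high: attn - low, whose rows sum to zero
    """
    n = len(attn)
    low = [[1.0 / n] * n for _ in range(n)]
    high = [[attn[i][j] - low[i][j] for j in range(n)] for i in range(n)]
    return low, high


def recombine(low, high, alpha, beta):
    """Rescale the two branches and add them back together."""
    n = len(low)
    return [[alpha * low[i][j] + beta * high[i][j] for j in range(n)] for i in range(n)]


# Hypothetical 3 x 3 attention matrix (each row sums to 1).
attn = [[0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.3, 0.3, 0.4]]
low, high = decompose_attention(attn)
emphasized = recombine(low, high, alpha=1.0, beta=2.0)  # boosts the high-frequency part
```

With `alpha = beta = 1` the original matrix is recovered exactly, and because each row of the high-frequency part sums to zero, changing `beta` alone leaves every row sum at 1.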

Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification

Ziyi Tang , Ruimao Zhang , Zhanglin Peng , Jinrui Chen , Liang Lin

Category: Computer Vision

2023-01-02

In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving the holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save the computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract the informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting the informative and discriminative information from the videos, and illustrate the MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
