Vision Transformer (ViT) extracts the final representation from either the class token or an average of all patch tokens, following the architecture of the Transformer in Natural Language Processing (NLP) or of Convolutional Neural Networks (CNNs) in computer vision. However, studies on the best way to aggregate the patch tokens remain limited to average pooling, even though widely used pooling strategies such as max and GeM pooling could be considered. Despite their effectiveness, the existing pooling strategies do not consider the architecture of ViT or the channel-wise differences in the activation maps, aggregating crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. As ViT groups the channels via its multi-head attention mechanism, grouping the channels in GGeM leads to lower head-wise dependence while amplifying important channels in the activation maps. Using GGeM yields performance boosts of 0.1%p to 0.7%p over the baselines and achieves state-of-the-art performance for ViT-Base and ViT-Large models on the ImageNet-1K classification task. Moreover, GGeM outperforms the existing pooling strategies on image retrieval and multi-modal representation learning tasks, demonstrating its superiority across a variety of tasks. GGeM is simple to adopt, requiring only a few lines of code to implement.
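Since the abstract notes that GGeM needs only a few lines of code, a minimal PyTorch sketch is given below; the function name, the choice of 12 groups (matching ViT-Base's head count), and the initial exponent value are assumptions rather than the authors' reference implementation.

```python
import torch

def ggem_pool(tokens, p, num_groups=12, eps=1e-6):
    """Minimal GGeM sketch: split channels into groups and apply GeM pooling
    over patch tokens with one shared learnable exponent per group."""
    B, N, D = tokens.shape
    x = tokens.clamp(min=eps).view(B, N, num_groups, D // num_groups)
    pg = p.view(1, 1, num_groups, 1)            # one exponent per channel group
    pooled = x.pow(pg).mean(dim=1)              # (B, num_groups, D // num_groups)
    pooled = pooled.pow(1.0 / p.view(1, num_groups, 1))
    return pooled.reshape(B, D)                 # final aggregated representation

# Usage with ViT-Base-like shapes (the initial exponent 3.0 is an assumption):
tokens = torch.randn(2, 196, 768)               # (batch, patches, channels)
p = torch.nn.Parameter(torch.full((12,), 3.0))  # learnable, one value per group
feat = ggem_pool(tokens, p)                     # shape: (2, 768)
```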
Recently, webly supervised learning (WSL) has been studied to leverage the abundant and accessible data on the Internet. Most existing methods focus on learning noise-robust models from web images while neglecting the performance drop caused by the differences between the web domain and the real-world domain. However, only by tackling this performance gap can we fully exploit the practical value of web datasets. To this end, we propose a Few-shot guided Prototypical (FoPro) representation learning method, which needs only a few labeled examples from reality and can significantly improve performance in the real-world domain. Specifically, we initialize each class center with few-shot real-world data as the "realistic" prototype. Then, the intra-class distance between web instances and "realistic" prototypes is narrowed by contrastive learning. Finally, we measure the image-prototype distance with a learnable metric. Prototypes are polished by adjacent high-quality web images and are also involved in removing distant out-of-distribution samples. In experiments, FoPro is trained on web datasets with the guidance of a few real-world examples and evaluated on real-world datasets. Our method achieves state-of-the-art performance on three fine-grained datasets and two large-scale datasets. Compared with existing WSL methods under the same few-shot settings, FoPro still excels in real-world generalization. Code is available at https://github.com/yuleiqin/fopro.
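The instance-to-prototype contrastive step could look roughly like the following sketch, assuming an InfoNCE-style objective over normalized embeddings; the function name and temperature are hypothetical, and the paper's learnable metric and prototype polishing are not modeled here.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(embeddings, prototypes, labels, temperature=0.1):
    """Hypothetical sketch: pull each web instance toward its class's
    'realistic' prototype and away from the other prototypes."""
    z = F.normalize(embeddings, dim=1)         # (B, d) instance embeddings
    c = F.normalize(prototypes, dim=1)         # (K, d) class prototypes
    logits = z @ c.t() / temperature           # (B, K) similarity scores
    return F.cross_entropy(logits, labels)     # labels: (B,) class indices

# Usage:
z = torch.randn(8, 128)
protos = torch.randn(10, 128)
y = torch.randint(0, 10, (8,))
loss = prototype_contrastive_loss(z, protos, y)
```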
Self-supervised learning (SSL) has achieved excellent performance on various medical imaging tasks by pretraining on massive unlabeled data. However, for a specific downstream task, there is still no instruction book on how to select suitable pretext tasks and implementation details. In this work, we first review recent applications of self-supervised methods in the field of medical image analysis. Then, we conduct extensive experiments to explore four important issues in SSL for medical imaging, including (1) the effect of self-supervised pretraining on imbalanced datasets, (2) network architectures, (3) the applicability of upstream tasks to downstream tasks, and (4) the stacking effect of SSL with commonly used policies for deep learning, including data resampling and augmentation. Based on the experimental results, potential guidelines are presented for self-supervised pretraining in medical imaging. Finally, we discuss future research directions and raise issues that deserve attention when designing new SSL methods and paradigms.
Detailed segmentation of the pulmonary airways is a clinically important task for endobronchial intervention and the treatment of peripherally located lung cancer lesions. Convolutional neural networks (CNNs) are promising tools for medical image analysis but perform poorly on cases with imbalanced feature distributions, which is true of airway data: the trachea and main bronchi dominate most of the foreground voxels, while the distal segmental bronchi occupy only a small fraction. In this paper, we propose a Differentiable Topology-Preserved Distance Transform (DTPDT) framework to improve the performance of airway segmentation. A Topology-Preserved Surrogate (TPS) learning strategy is first proposed to equalize the training progress across class distributions. In addition, a Convolutional Distance Transform (CDT) is designed to identify breakage phenomena with improved sensitivity, minimizing the variation between the predicted and ground-truth distance maps. The proposed method has been validated on publicly available reference airway segmentation datasets.
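One way to make a distance transform differentiable with convolution-style operations is sketched below; this is a simplified illustration under assumed binary masks and Chebyshev distance, not the paper's exact CDT.

```python
import torch
import torch.nn.functional as F

def soft_distance_transform(mask, num_iters=20):
    """Hedged sketch of a convolution-based, differentiable distance transform:
    grow the foreground with max-pooling; a voxel first reached at step t is
    assigned (Chebyshev) distance t, capped at num_iters."""
    dist = torch.zeros_like(mask)
    covered = mask                                   # (B, 1, D, H, W), values in [0, 1]
    for step in range(1, num_iters + 1):
        grown = F.max_pool3d(covered, kernel_size=3, stride=1, padding=1)
        dist = dist + step * (grown - covered)       # newly covered voxels get distance `step`
        covered = grown
    return dist
```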
The LIDC-IDRI database is the most popular benchmark for lung cancer prediction. However, with malignancy annotated subjectively by radiologists, nodules in LIDC may have entirely different malignancy labels from the pathological ground truth, introducing label assignment errors and subsequent supervision bias during training. The LIDC database therefore requires more objective labels for learning-based cancer prediction. Based on an extra small dataset containing 180 nodules diagnosed by pathological examination, we propose to re-label the LIDC data to mitigate the effect of the original annotation bias on this robust benchmark. We demonstrate in this paper that providing new labels by metric-learning-based similar-nodule retrieval is an effective re-labeling strategy. Training on these re-labeled LIDC nodules improves model performance, and the improvement is enhanced when new labels for uncertain nodules are added. We further infer that re-labeled LIDC is a convenient route toward eventually sound lung cancer prediction, while building a large pathologically proven nodule database provides the long-term solution.
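A retrieval-based re-labeling step of the kind described could be sketched as follows, assuming L2-normalized embeddings from a learned metric and a simple majority vote; the function name and the value of k are hypothetical.

```python
import numpy as np

def retrieval_relabel(query_emb, ref_emb, ref_labels, k=5):
    """Hedged sketch of retrieval-based re-labeling: assign each LIDC nodule
    the majority pathological label of its k most similar neighbors among the
    pathology-confirmed reference nodules."""
    sims = query_emb @ ref_emb.T                     # cosine similarity (normalized inputs)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]        # indices of top-k references
    votes = ref_labels[nn_idx]                       # (num_queries, k) binary labels
    return (votes.mean(axis=1) > 0.5).astype(int)    # majority vote: 1 = malignant
```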
We develop a novel framework that adds a sparse group lasso regularizer to the family of adaptive optimizers in deep learning, such as Momentum, AdaGrad, Adam, AMSGrad, and AdaHessian, creating new optimizers accordingly named Group Momentum, Group AdaGrad, Group Adam, Group AMSGrad, and Group AdaHessian, etc. We establish theoretically proven convergence guarantees in the stochastic convex setting, based on the primal-dual method. We evaluate the regularization effect of the new optimizers on three large-scale real-world ad click datasets with state-of-the-art deep learning models. The experimental results show that, compared with post-processing via magnitude pruning, model performance can be significantly improved at the same sparsity level. Furthermore, compared with the case without magnitude pruning, our methods can achieve extremely high sparsity with significantly better or highly competitive performance.
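For intuition, the standard proximal step of the sparse group lasso, applied group-wise after an adaptive update, is sketched below; the paper's primal-dual construction differs in how the regularizer interacts with the adaptive preconditioner, so this is only an illustrative approximation.

```python
import torch

def sparse_group_lasso_prox(w, lr, lam_l1, lam_group, eps=1e-12):
    """Hedged sketch of the standard sparse-group-lasso proximal operator for a
    weight tensor of shape (num_groups, group_dim), applied after an adaptive
    (e.g. Adam-style) gradient step."""
    # Element-wise L1 soft-thresholding
    w = torch.sign(w) * torch.clamp(w.abs() - lr * lam_l1, min=0.0)
    # Group-wise soft-thresholding: shrink each group's L2 norm toward zero,
    # zeroing out entire groups whose norm falls below the threshold
    norms = w.norm(dim=1, keepdim=True).clamp(min=eps)
    scale = torch.clamp(1.0 - lr * lam_group / norms, min=0.0)
    return w * scale
```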
To investigate whether the pleura, airways, and vessels surrounding a nodule on non-contrast computed tomography (CT) can discriminate benign from malignant pulmonary nodules. The LIDC-IDRI dataset, one of the largest publicly available CT databases, was studied. A total of 1556 nodules from 694 patients were involved in the statistical analysis, where nodules with average malignancy scores below 3 and above 3 were denoted as benign and malignant, respectively. In addition, 339 nodules from 113 patients with pathology-confirmed diagnoses were independently evaluated. Computer algorithms were developed to segment pulmonary structures and quantify the distances from the nodule to the pleural surface, airways, and vessels, as well as the counts and normalized volumes of airways and vessels near the nodule. Odds ratio (OR) and Chi-square (χ²) tests were performed to demonstrate the correlation between the features of the surrounding structures and nodule malignancy. Non-parametric receiver operating characteristic (ROC) analysis was performed in logistic regression to evaluate the discriminative power of each structure. For the benign and malignant groups, the average distances from nodules to the pleura, airways, and vessels were (6.56, 5.19), (37.08, 26.43), and (1.42, 17.07) mm, respectively. The correlations between nodules and the counts of airways and vessels were (OR = 22.96, χ² = 105.04) and (OR = 7.06, χ² = 290.11), respectively. The correlations between nodules and the volumes of airways and vessels were (OR = 9.19, χ² = 159.02) and (OR = 2.29, χ² = 55.89). The areas under the curve (AUC) for the pleura, airways, and vessels were 0.5202, 0.6943, and 0.6529, respectively. Our results show that, compared with benign nodules, malignant nodules are often surrounded by more pulmonary structures, suggesting that the features of these structures could be viewed as lung cancer biomarkers.
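The reported OR and χ² statistics can be computed from 2×2 contingency tables, for example as in this sketch; how each surrounding-structure feature is dichotomized is an assumption.

```python
import numpy as np
from scipy.stats import chi2_contingency

def or_and_chi2(table):
    """Sketch of the reported statistics from a 2x2 contingency table:
    [[malignant_with_feature, malignant_without],
     [benign_with_feature,    benign_without]]."""
    (a, b), (c, d) = np.asarray(table, dtype=float)
    odds_ratio = (a * d) / (b * c)
    chi2, p_value, _, _ = chi2_contingency(table, correction=False)
    return odds_ratio, chi2, p_value

# Usage with a hypothetical table:
print(or_and_chi2([[120, 30], [40, 80]]))
```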
The high computational cost of neural networks has prevented recent successes in RGB-D salient object detection (SOD) from benefiting real-world applications. Hence, this paper introduces a novel network, MobileSal, which focuses on efficient RGB-D SOD using mobile networks for deep feature extraction. However, mobile networks are less powerful in feature representation than cumbersome networks. To this end, we observe that the depth information accompanying color images can strengthen the feature representation related to SOD if leveraged properly. Therefore, we propose an implicit depth restoration (IDR) technique to strengthen the feature representation capability of mobile networks for RGB-D SOD. IDR is adopted only in the training phase and is omitted during testing, so it is computationally free at inference. Furthermore, we propose compact pyramid refinement (CPR) for efficient multi-level feature aggregation, deriving salient objects with clear boundaries. Incorporating IDR and CPR, MobileSal performs favorably against state-of-the-art methods on six challenging RGB-D SOD datasets, with much faster speed (450 fps for an input size of 320×320) and fewer parameters (6.5M). The code is released at https://mmcheng.net/mobilesal.
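An auxiliary head in the spirit of IDR might look like the sketch below: a small decoder that forces RGB-derived features to reconstruct the depth map during training and is discarded at test time. The layer sizes and the L1 loss are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class IDRHead(nn.Module):
    """Hedged sketch of an implicit-depth-restoration auxiliary head; dropped
    at inference, so it adds no test-time cost."""
    def __init__(self, in_channels):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
        )

    def forward(self, features, depth_gt):
        pred = self.decode(features)                  # (B, 1, h, w) restored depth
        pred = nn.functional.interpolate(
            pred, size=depth_gt.shape[-2:], mode="bilinear", align_corners=False)
        return nn.functional.l1_loss(pred, depth_gt)  # auxiliary training loss only
```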
Dataset distillation has emerged as a prominent technique for improving data efficiency when training machine learning models. It encapsulates the knowledge of a large dataset in a smaller synthetic dataset: a model trained on this distilled dataset can attain performance comparable to a model trained on the original training dataset. However, existing dataset distillation techniques mainly aim at achieving the best trade-off between resource-usage efficiency and model utility; the security risks stemming from them have not been explored. This study performs the first backdoor attack against models trained on data distilled by dataset distillation in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations across multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent them.
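A NAIVEATTACK-style trigger injection could be as simple as the following sketch, stamping a patch onto the raw images before distillation begins; the patch size, location, and value are assumptions.

```python
import numpy as np

def add_trigger(images, trigger_value=1.0, patch=4):
    """Hedged sketch of a NAIVEATTACK-style injection: stamp a small square
    trigger onto the bottom-right corner of the raw training images, which are
    then fed into the distillation procedure."""
    poisoned = images.copy()                          # images: (N, H, W, C) in [0, 1]
    poisoned[:, -patch:, -patch:, :] = trigger_value  # solid-color corner patch
    return poisoned

# Usage:
batch = np.random.rand(16, 32, 32, 3)
poisoned_batch = add_trigger(batch)
```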
Large training data and expensive model tweaking are standard features of deep learning for images. As a result, data owners often utilize cloud resources to develop large-scale complex models, which raises privacy concerns. Existing solutions are either too expensive to be practical or do not sufficiently protect the confidentiality of data and models. In this paper, we study and compare novel image disguising mechanisms, DisguisedNets and InstaHide, aiming to achieve a better trade-off among the level of protection for outsourced DNN model training, the expenses, and the utility of the data. DisguisedNets is a novel combination of image blocktization, block-level random permutation, and two block-level secure transformations: random multidimensional projection (RMT) and AES pixel-level encryption (AES). InstaHide is an image mixup and random pixel-flipping technique (Huang et al., 2020). We have analyzed and evaluated them under a multi-level threat model. Under Level-1 adversarial knowledge, RMT provides a better security guarantee than InstaHide while preserving model quality well. In contrast, AES provides a security guarantee under Level-2 adversarial knowledge, but it may affect model quality more. The unique features of image disguising also help protect models from model-targeted attacks. We have done an extensive experimental evaluation to understand how these methods work in different settings for different datasets.
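A much-simplified sketch of block-level disguising in the spirit of DisguisedNets is shown below; the exact RMT construction, key handling, and block geometry of the paper are not reproduced here, so every detail is an assumption.

```python
import numpy as np

def disguise_rmt(image, block=8, seed=0):
    """Hedged sketch of DisguisedNets-style disguising: blocktize the image,
    apply a secret block-level permutation, and multiply each block by a secret
    random projection matrix. Assumes H and W are multiples of `block`."""
    rng = np.random.default_rng(seed)          # the seed acts as the secret key
    H, W, C = image.shape
    blocks = [image[i:i + block, j:j + block].reshape(block, -1)
              for i in range(0, H, block) for j in range(0, W, block)]
    perm = rng.permutation(len(blocks))        # secret block permutation
    proj = rng.normal(size=(block, block))     # secret multidimensional projection
    return [proj @ blocks[k] for k in perm]    # disguised blocks for outsourcing

# Usage:
img = np.random.rand(32, 32, 3)
disguised = disguise_rmt(img)
```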