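On Riemannian manifolds, the Ricci flow is a partial differential equation that evolves the metric to make it more regular. We hope that the topological structures arising from such metrics can be used to assist machine learning tasks. However, this line of work is still missing. In this paper, we bridge the gap between the Ricci flow and deep neural networks through dynamically stable Poincaré embeddings. As a result, we prove that if the initial metric has an $L^2$-norm perturbation that deviates from the hyperbolic metric on the Poincaré ball, the scaled Ricci-DeTurck flow of such a metric converges smoothly to the hyperbolic metric. Specifically, the role of the Ricci flow is to evolve naturally toward the stable Poincaré ball, which is then mapped back to Euclidean space. For such dynamically stable neural manifolds under the Ricci flow, the convergence of neural networks embedded with these manifolds is not susceptible to perturbations. We show that such Ricci-flow-assisted neural networks outperform their all-Euclidean counterparts on image classification tasks (CIFAR datasets).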
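For reference, the two standard objects named in this abstract can be written out explicitly. The block below gives the textbook form of the unnormalized Ricci flow and of the Poincaré ball metric; it is not the scaled Ricci-DeTurck flow used in the paper, whose exact normalization is not stated here. The symbols $g(t)$, $g_0$, $\mathrm{Ric}$, $g^{E}$ (the Euclidean metric) and $\mathbb{B}^n$ are notation assumed for this sketch.

```latex
% Ricci flow: the metric evolves in the direction of minus twice its Ricci curvature.
\frac{\partial}{\partial t}\, g(t) = -2\,\mathrm{Ric}\bigl(g(t)\bigr), \qquad g(0) = g_0 .

% Poincare ball metric: a conformal rescaling of the Euclidean metric on the open unit ball.
g^{\mathbb{B}}_x = \left(\frac{2}{1 - \lVert x \rVert^{2}}\right)^{2} g^{E},
\qquad x \in \mathbb{B}^n = \{\, x \in \mathbb{R}^n : \lVert x \rVert < 1 \,\} .
```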
translated by Google Translate
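Blind image super-resolution (SR) is a long-standing task in computer vision that aims to restore low-resolution images suffering from unknown and complex distortions. Recent work has largely focused on adopting more complicated degradation models to emulate real-world degradations. The resulting models have made breakthroughs in perceptual loss and yield perceptually convincing results. However, the limitations imposed by current generative adversarial network structures remain significant: treating all pixels equally ignores the structural features of the image and leads to performance drawbacks such as twisted lines and over-sharpened or blurred backgrounds. In this paper, we present A-ESRGAN, a GAN model for blind SR tasks featuring an attention U-Net based, multi-scale discriminator that can be seamlessly integrated with other generators. To our knowledge, this is the first work to introduce the attention U-Net structure as the discriminator of a GAN for blind SR problems. The paper also gives an interpretation of the mechanism by which the multi-scale attention U-Net brings a performance breakthrough to the model. In comparison experiments with prior works, our model achieves state-of-the-art performance on the non-reference natural image quality evaluator metric. Our ablation studies show that, with our discriminator, the RRDB-based generator can leverage the structural features of an image at multiple scales and consequently yields more perceptually realistic high-resolution images than prior works.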
360-degree panoramic videos have gained considerable attention in recent years due to the rapid development of head-mounted displays (HMDs) and panoramic cameras. One major problem in streaming panoramic videos is that panoramic videos are much larger in size compared to traditional ones. Moreover, the user devices are often in a wireless environment, with limited battery, computation power, and bandwidth. To reduce resource consumption, researchers have proposed ways to predict the users' viewports so that only part of the entire video needs to be transmitted from the server. However, the robustness of such prediction approaches has been overlooked in the literature: it is usually assumed that only a few models, pre-trained on past users' experiences, are applied for prediction to all users. We observe that those pre-trained models can perform poorly for some users because they might have drastically different behaviors from the majority, and the pre-trained models cannot capture the features in unseen videos. In this work, we propose a novel meta learning based viewport prediction paradigm to alleviate the worst prediction performance and ensure the robustness of viewport prediction. This paradigm uses two machine learning models, where the first model predicts the viewing direction, and the second model predicts the minimum video prefetch size that can include the actual viewport. We first train two meta models so that they are sensitive to new training data, and then quickly adapt them to users while they are watching the videos. Evaluation results reveal that the meta models can adapt quickly to each user, and can significantly increase the prediction accuracy, especially for the worst-performing predictions.
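In the last decade, many deep learning models have been well trained and have achieved great success in various fields of machine intelligence, especially computer vision and natural language processing. To better leverage these well-trained models in intra-domain or cross-domain transfer learning settings, knowledge distillation (KD) and domain adaptation (DA) were proposed and have become research highlights. Both aim to transfer useful information from a well-trained model using the original training data. However, the original data is not always available because of privacy, copyright, or confidentiality. Recently, the data-free knowledge transfer paradigm has attracted considerable attention, as it distills valuable knowledge from well-trained models without requiring access to the training data. In particular, it mainly consists of data-free knowledge distillation (DFKD) and source data-free domain adaptation (SFDA). On the one hand, DFKD aims to transfer the intra-domain knowledge of the original data from a cumbersome teacher network to a compact student network for model compression and efficient inference. On the other hand, the goal of SFDA is to reuse the cross-domain knowledge stored in a well-trained source model and adapt it to a target domain. In this paper, we provide a comprehensive survey of data-free knowledge transfer from the perspectives of knowledge distillation and unsupervised domain adaptation, to help readers better understand the current state of research and its ideas. The applications and challenges of the two areas are briefly reviewed in turn. Furthermore, we offer some insights into topics for future research.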
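The teacher-to-student transfer described above is commonly trained with a temperature-softened divergence between the two networks' outputs. The following is a minimal sketch of that classic distillation loss, assuming logits are already available for a single sample; in the data-free setting those samples would have to be synthesized, which is outside this sketch. The function names and the default temperature are illustrative assumptions, not taken from the survey.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes moderate logit values so that no student probability underflows to zero.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0  # terms with zero teacher probability contribute nothing
    )
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits exactly and positive otherwise, so minimizing it pulls the compact student toward the teacher's output distribution.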
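Temporal modeling of objects is a key challenge in multiple object tracking (MOT). Existing methods track by associating detections through motion-based and appearance-based similarity heuristics. The post-processing nature of association prevents end-to-end exploitation of temporal variations in video sequences. In this paper, we propose MOTR, which extends DETR and introduces track queries to model the tracked instances across the entire video. Track queries are transferred and updated frame by frame to perform iterative prediction over time. We propose tracklet-aware label assignment to train track queries and newborn object queries. We further propose a temporal aggregation network and a collective average loss to enhance temporal relation modeling. Experimental results on DanceTrack show that MOTR significantly outperforms the state-of-the-art method, ByteTrack, by 6.5% on the HOTA metric. On MOT17, MOTR outperforms our concurrent works, TrackFormer and TransTrack, in association performance. MOTR can serve as a stronger baseline for future research on temporal modeling and Transformer-based trackers. Code is available at https://github.com/megvii-research/motr.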
We present a novel method for local image feature matching. Instead of performing image feature detection, description, and matching sequentially, we propose to first establish pixel-wise dense matches at a coarse level and later refine the good matches at a fine level. In contrast to dense methods that use a cost volume to search correspondences, we use self and cross attention layers in Transformer to obtain feature descriptors that are conditioned on both images. The global receptive field provided by Transformer enables our method to produce dense matches in low-texture areas, where feature detectors usually struggle to produce repeatable interest points. The experiments on indoor and outdoor datasets show that LoFTR outperforms state-of-the-art methods by a large margin. LoFTR also ranks first on two public benchmarks of visual localization among the published methods. Code is available at our project page: https://zju3dv.github.io/loftr/.
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
The development of social media user stance detection and bot detection methods relies heavily on large-scale, high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, which hinders graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain, together with user tweet features, as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when multiple relations are introduced. By analyzing the experimental results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
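Selecting the property features with the greatest information gain can be illustrated with a small sketch. It assumes discrete feature values (continuous properties would need binning first) and uses toy data; the feature names `verified` and `has_url`, the helper names, and the value of `k` are hypothetical and are not taken from MGTAB.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Names of the k feature columns with the highest information gain."""
    gains = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]


# Toy example: "verified" separates the classes perfectly, "has_url" not at all.
toy_labels = ["bot", "bot", "human", "human"]
toy_columns = {"verified": [0, 0, 1, 1], "has_url": [0, 1, 0, 1]}
selected = top_k_features(toy_columns, toy_labels, k=1)
```

Ranking by information gain keeps the properties that best separate the account classes on their own; it does not account for redundancy between the selected features.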
As one of the prevalent methods for building automation systems, Imitation Learning (IL) shows promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, research on the explainability of IL models is still limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explanation framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in the demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the environment returns, the conventional evaluation outcome, as coefficients to build an importance map. We also conduct experiments to investigate three major questions: whether frames are equally important, how effective the importance map is, and how importance maps from different IL models are related. The results show that R2RISE successfully distinguishes the important frames in the demonstrations.
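The masking-and-weighting idea described above can be sketched in a few lines. This is a RISE-style approximation under stated assumptions, not the R2RISE implementation: `evaluate_return` is a hypothetical stand-in for "retrain the IL model on the masked demonstrations and measure its environment return", and the per-frame normalization by how often a frame was kept is one possible choice.

```python
import random


def frame_importance(num_frames, evaluate_return, num_masks=500, keep_prob=0.5, seed=0):
    """Average return observed over the random masks that kept each frame."""
    rng = random.Random(seed)
    weighted = [0.0] * num_frames
    counts = [0] * num_frames
    for _ in range(num_masks):
        # 1 keeps a frame in the demonstrations, 0 masks it out.
        mask = [1 if rng.random() < keep_prob else 0 for _ in range(num_frames)]
        ret = evaluate_return(mask)
        for i, kept in enumerate(mask):
            if kept:
                weighted[i] += ret
                counts[i] += 1
    return [w / c if c else 0.0 for w, c in zip(weighted, counts)]


def toy_return(mask):
    """Hypothetical evaluator: only frames 2 and 5 contribute to the return."""
    return float(mask[2] + mask[5])


importance = frame_importance(8, toy_return)
```

In this toy setting, frames 2 and 5 receive higher scores than the rest, because masks that keep them are associated with higher returns. With a real IL model each call to the evaluator involves a full retraining run, so the number of masks is the main cost driver.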
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.