Over the past few years, developing a broad, universal, and general-purpose computer vision system has become a hot topic. A powerful universal system would be capable of solving diverse vision tasks simultaneously, without being restricted to a specific problem or data domain, which is of great importance in practical real-world computer vision applications. This study pushes this direction forward by concentrating on the million-scale multi-domain universal object detection problem. The problem is not trivial due to its complicated nature in terms of cross-dataset category label duplication, label conflicts, and hierarchical taxonomy handling. Moreover, how to utilize emerging large pre-trained vision models for million-scale cross-dataset object detection in a resource-efficient way remains an open challenge. This paper addresses these challenges by introducing our practices in label handling, hierarchy-aware loss design, and resource-efficient model training with a pre-trained large model. Our method ranked second in the object detection track of the Robust Vision Challenge 2022 (RVC 2022). We hope our detailed study serves as an alternative practice paradigm for similar problems in the community. The code is available at https://github.com/linfeng93/Large-UniDet.
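The abstract does not spell out the hierarchy-aware loss, so the following is only a minimal sketch of one common formulation, assuming a hypothetical `parent` array encoding the label taxonomy: a multi-label BCE objective in which every ancestor of an annotated class is also marked positive. The paper's actual design may differ.

```python
import torch
import torch.nn.functional as F

# Hypothetical taxonomy: parent[i] is the parent class of class i (-1 = root),
# e.g. "vehicle" -> {"car", "truck"} and "car" -> {"taxi"}.
parent = [-1, 0, 0, 1]

def ancestors(c, parent):
    """Collect class c and all of its ancestors in the hierarchy."""
    out = []
    while c != -1:
        out.append(c)
        c = parent[c]
    return out

def hierarchy_aware_bce(logits, labels, parent):
    """Multi-label BCE where every ancestor of the annotated class
    is also treated as a positive target."""
    targets = torch.zeros_like(logits)
    for row, c in enumerate(labels.tolist()):
        targets[row, ancestors(c, parent)] = 1.0
    return F.binary_cross_entropy_with_logits(logits, targets)

logits = torch.randn(2, 4)      # 2 detected boxes, 4 classes
labels = torch.tensor([3, 2])   # leaf-level annotations
loss = hierarchy_aware_bce(logits, labels, parent)
```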
In this study, an integrated detection model, Swin-Transformer-YOLOv5 or Swin-T-YOLOv5, is proposed for real-time wine grape bunch detection, inheriting the advantages of both YOLOv5 and Swin-Transformer. The research was conducted on two different grape varieties, Chardonnay (always white or white-mixed berry skin) and Merlot (white or white-mixed berry skin), from July to September 2019. The performance of Swin-T-YOLOv5 was compared against several commonly used/competitive object detectors, including Faster R-CNN, YOLOv3, YOLOv4, and YOLOv5. All models were evaluated under different test conditions, including two weather conditions (sunny and cloudy), two berry maturity levels (immature and mature), and three sunlight directions/intensities (morning, noon, and afternoon), for a comprehensive comparison. In addition, the number of grape bunches predicted by Swin-T-YOLOv5 was compared with ground-truth values, including in-field manual counting and manual labeling during the annotation process. Results showed that the proposed Swin-T-YOLOv5 outperformed all other studied models for grape bunch detection, with the highest mean average precision (mAP) of 97% and an F1-score of 0.89 when the weather was cloudy. This mAP was approximately 44%, 18%, 14%, and 4% greater than that of Faster R-CNN, YOLOv3, YOLOv4, and YOLOv5, respectively. Swin-T-YOLOv5 obtained its lowest mAP (90%) and F1-score (0.82) when detecting immature berries, where its mAP was still approximately 40%, 5%, 3%, and 1% greater than that of the same four detectors. Furthermore, when comparing predictions against the ground truth, Swin-T-YOLOv5 performed better on the Chardonnay variety, achieving up to an R2 of 0.91 and a root mean square error (RMSE) of 2.36. However, it performed worse on the Merlot variety, reaching only an R2 of 0.70 and an RMSE of 3.30.
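As a small aside on the count-level evaluation above, here is a minimal sketch of computing R2 and RMSE between detector-predicted and manually counted bunch numbers; the per-vine counts are hypothetical, not data from the paper.

```python
import numpy as np

def r2_rmse(pred, truth):
    """Coefficient of determination (R2) and root mean square error (RMSE)
    between predicted and ground-truth bunch counts."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    rmse = np.sqrt(np.mean((pred - truth) ** 2))
    ss_res = np.sum((truth - pred) ** 2)
    ss_tot = np.sum((truth - truth.mean()) ** 2)
    return 1.0 - ss_res / ss_tot, rmse

# Hypothetical per-vine counts: detector output vs. in-field manual counts.
r2, rmse = r2_rmse([12, 9, 15, 11], [13, 9, 14, 12])
```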
In this paper, we propose a multi-modal multi-correlation learning framework for the task of audio-visual speech separation. Although extensive previous efforts have been made to combine the audio and visual modalities, most of them only adopt a direct concatenation of audio and visual features. To exploit the genuinely useful information behind these two modalities, we define two key correlations: (1) identity correlation (between timbre and facial attributes); and (2) phonetic correlation (between phonemes and lip motion). Together, these two correlations cover the complete information needed to separate the target speaker's voice, especially in hard cases such as same-gender speakers or similar content. For implementation, contrastive learning or adversarial training is adopted to maximize the two correlations. Both perform well, while adversarial training shows its advantage by avoiding some limitations of contrastive learning. Compared with previous studies, our solution demonstrates clear improvement on experimental metrics without additional complexity. Further analysis reveals the effectiveness of the proposed architecture and its good potential for future extension.
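Since the abstract names contrastive learning as one way to maximize the correlations, below is a minimal sketch of a standard symmetric InfoNCE objective over batched audio/visual embeddings; the batch size, embedding dimension, and temperature are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def infonce(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE: matched audio/visual pairs are positives,
    all other pairings in the batch serve as negatives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                 # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# e.g. timbre embeddings vs. face-attribute embeddings (identity correlation)
loss = infonce(torch.randn(8, 128), torch.randn(8, 128))
```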
Gaussian process (GP) models are a flexible class of nonparametric models with rich representational power. By using a Gaussian process with additive structure, complex responses can be modeled while retaining interpretability. Previous work showed that additive Gaussian process models require high-dimensional interaction terms. We propose the orthogonal additive kernel (OAK), which imposes orthogonality constraints on the additive functions, enabling an identifiable, low-dimensional representation of the functional relationship. We connect the OAK kernel to the functional ANOVA decomposition and show convergence rates for sparse computation methods. With only a small number of additive low-dimensional terms, the OAK model achieves similar or better predictive performance compared with black-box models, while retaining interpretability.
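For intuition, the sketch below implements a plain first-order additive kernel (a sum of independent one-dimensional RBF kernels); OAK additionally constrains each additive component to be orthogonal (e.g. integrating to zero against the input measure), which is omitted here.

```python
import numpy as np

def rbf_1d(x, y, lengthscale=1.0):
    """One-dimensional RBF kernel on a single input feature."""
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / lengthscale ** 2)

def additive_kernel(X, Y, lengthscales):
    """First-order additive kernel: a sum of independent 1-D kernels,
    one per input dimension (OAK's orthogonality constraint omitted)."""
    K = np.zeros((X.shape[0], Y.shape[0]))
    for d, ls in enumerate(lengthscales):
        K += rbf_1d(X[:, d], Y[:, d], ls)
    return K

X = np.random.randn(5, 3)
K = additive_kernel(X, X, lengthscales=[1.0, 0.5, 2.0])
```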
This paper reviews the first NTIRE challenge on quality enhancement of compressed video, focusing on the proposed methods and results. In this challenge, the new Large-scale Diverse Video (LDV) dataset is employed. The challenge has three tracks. Tracks 1 and 2 aim at enhancing videos compressed by HEVC at a fixed QP, while Track 3 aims at enhancing videos compressed by x265 at a fixed bit-rate. Besides, Tracks 1 and 3 target improving fidelity (PSNR), while Track 2 targets improving perceptual quality. The three tracks attracted 482 registrations in total. In the test phase, 12 teams, 8 teams, and 11 teams submitted final results for Tracks 1, 2, and 3, respectively. The proposed methods and solutions gauge the state of the art in video quality enhancement. Homepage of the challenge: https://github.com/renyang-home/ntire21_venh
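For reference, the fidelity metric targeted by Tracks 1 and 3 can be computed as below; this is the standard definition of PSNR, not code from the challenge.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio between a reference frame and an
    enhanced frame, in dB (higher is better)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# e.g. compare an enhanced frame against the uncompressed original
score = psnr(np.random.randint(0, 256, (64, 64, 3)),
             np.random.randint(0, 256, (64, 64, 3)))
```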
One of the key challenges in deploying RL to real-world applications is adapting to variations in unknown environment contexts, such as changing terrains in robotic tasks and fluctuating bandwidth in congestion control. Existing works on adaptation to unknown environment contexts either assume the context is the same for the whole episode or assume the context variables are Markovian. However, in many real-world applications, the environment context usually stays stable for a stochastic period and then changes in an abrupt and unpredictable manner within an episode, resulting in a segment structure that existing works fail to address. To leverage the segment structure of piecewise-stable context in real-world applications, in this paper we propose a \textit{\textbf{Se}gmented \textbf{C}ontext \textbf{B}elief \textbf{A}ugmented \textbf{D}eep~(SeCBAD)} RL method. Our method jointly infers the belief distribution over the latent context and the posterior over segment length, performing more accurate belief context inference with the data observed within the current context segment. The inferred belief context can be leveraged to augment the state, leading to a policy that can adapt to abrupt variations in context. We demonstrate empirically that SeCBAD infers context segment lengths accurately and outperforms existing methods on a toy grid-world environment and MuJoCo tasks with piecewise-stable context.
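The abstract does not give SeCBAD's inference equations; as a rough illustration of reasoning over segment lengths, here is a generic run-length posterior update in the spirit of Bayesian online changepoint detection, assuming a fixed hazard rate for abrupt context resets. It is a stand-in, not the paper's method.

```python
import numpy as np

def update_run_length_posterior(post, pred_lik, hazard=0.05):
    """One step of a run-length (segment-length) posterior update.
    post[r]: belief that the current segment has lasted r steps.
    pred_lik[r]: likelihood of the newest observation under the context
    inferred from the last r steps. The context either continues the
    segment or resets abruptly with probability `hazard`."""
    growth = post * pred_lik * (1.0 - hazard)    # segment continues
    reset = np.sum(post * pred_lik * hazard)     # abrupt context change
    new_post = np.concatenate(([reset], growth))
    return new_post / new_post.sum()

post = np.array([1.0])        # a segment has just started
pred_lik = np.array([0.8])    # likelihood of the latest observation
post = update_run_length_posterior(post, pred_lik)
```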
Conversational recommender systems (CRSs) often utilize external knowledge graphs (KGs) to introduce rich semantic information and recommend relevant items through natural language dialogues. However, the original KGs employed in existing CRSs are often incomplete and sparse, which limits the reasoning capability in recommendation. Moreover, only a few existing studies exploit the dialogue context to dynamically refine knowledge from KGs for better recommendation. To address these issues, we propose the Variational Reasoning over Incomplete KGs Conversational Recommender (VRICR). Our key idea is to incorporate the large dialogue corpus that naturally accompanies CRSs to enhance the incomplete KGs, and to perform dynamic knowledge reasoning conditioned on the dialogue context. Specifically, we treat the dialogue-specific subgraphs of KGs as latent variables with categorical priors for adaptive knowledge graph refactoring. We propose a variational Bayesian method to approximate posterior distributions over dialogue-specific subgraphs, which not only leverages the dialogue corpus for restructuring missing entity relations but also dynamically selects knowledge based on the dialogue context. Finally, we infuse the dialogue-specific subgraphs to decode the recommendations and responses. We conduct experiments on two benchmark CRS datasets, and the experimental results confirm the effectiveness of our proposed method.
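The abstract treats dialogue-specific subgraphs as latent variables with categorical priors; one common way to keep such discrete selections differentiable is a Gumbel-softmax relaxation, sketched below. Whether VRICR uses this exact estimator is not stated, and the edge scores here are hypothetical.

```python
import torch
import torch.nn.functional as F

def sample_subgraph_edges(edge_logits, tau=0.5):
    """Relaxed categorical (Gumbel-softmax) sample of keep/drop decisions
    for candidate KG edges, keeping subgraph selection differentiable."""
    # edge_logits: (num_edges, 2) scores for [drop, keep] per candidate edge
    y = F.gumbel_softmax(edge_logits, tau=tau, hard=True)
    return y[:, 1]   # 1.0 where the edge is kept in the dialogue-specific subgraph

edge_logits = torch.randn(10, 2)   # hypothetical scores from a dialogue encoder
mask = sample_subgraph_edges(edge_logits)
```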
Facial expression recognition is a challenging problem in computer vision. The primary difficulties are class imbalance arising from data collection and uncertainty arising from inherent noise such as ambiguous facial expressions and inconsistent labels. However, current research has focused either on the problem of class imbalance or on the problem of uncertainty, ignoring how to address the two problems together. Therefore, in this paper, we propose a framework based on ResNet and attention to solve both problems. We design a weight for each class: through this penalty mechanism, our model pays more attention to under-represented classes during training, and the resulting drop in accuracy is recovered by a Convolutional Block Attention Module (CBAM). Meanwhile, our backbone network also learns an uncertainty feature for each sample. By mixing uncertainty features between samples, the model can better learn the features that are useful for classification, thus suppressing uncertainty. Experiments show that our method surpasses most baseline methods in terms of accuracy on facial expression datasets (e.g., AffectNet, RAF-DB) and also handles class imbalance well.
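A minimal sketch of the per-class weighting idea: inverse-frequency weights plugged into a standard cross-entropy loss so that rare expression classes contribute more per sample. The class counts are made up, and the paper's exact weighting scheme may differ.

```python
import torch
import torch.nn as nn

# Hypothetical class counts for an imbalanced 7-class expression dataset.
counts = torch.tensor([7400., 1200., 3800., 500., 2100., 900., 4100.])
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weights

# Rare classes receive larger weights, so their samples are penalized more.
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 7)
labels = torch.tensor([3, 0, 5, 1])
loss = criterion(logits, labels)
```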
Attention-based arbitrary style transfer studies have shown promising performance in synthesizing vivid local style details. They typically use the all-to-all attention mechanism: each position of the content features is fully matched to all positions of the style features. However, all-to-all attention tends to generate distorted style patterns and has quadratic complexity, which limits both the effectiveness and efficiency of arbitrary style transfer. In this paper, we rethink what kind of attention mechanism is more appropriate for arbitrary style transfer. Our answer is a novel all-to-key attention mechanism: each position of the content features is matched only to key positions of the style features. Specifically, it integrates two newly proposed attention forms: distributed attention, which assigns attention to multiple key positions, and progressive attention, which pays attention from coarse to fine. All-to-key attention promotes the matching of diverse and reasonable style patterns and has linear complexity. The resulting module, dubbed StyA2K, renders reasonable style textures while maintaining a consistent local structure. Qualitative and quantitative experiments demonstrate that our method achieves results superior to state-of-the-art approaches.
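The abstract does not specify how key positions are chosen; the sketch below uses a simple per-query top-k selection as a stand-in for distributed attention. Note that it still forms the full score matrix, so it illustrates the sparse aggregation rather than the paper's linear-complexity design; the tensor shapes and `num_keys` are assumptions.

```python
import torch
import torch.nn.functional as F

def all_to_key_attention(q, k, v, num_keys=8):
    """Each content (query) position attends only to its top-`num_keys`
    style (key) positions instead of all positions."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, Nq, Nk)
    topk, idx = scores.topk(num_keys, dim=-1)              # keep k keys per query
    attn = F.softmax(topk, dim=-1)                         # (B, Nq, num_keys)
    v_sel = torch.gather(
        v.unsqueeze(1).expand(-1, q.size(1), -1, -1),      # (B, Nq, Nk, C)
        2, idx.unsqueeze(-1).expand(-1, -1, -1, v.size(-1)))
    return (attn.unsqueeze(-1) * v_sel).sum(dim=2)         # (B, Nq, C)

out = all_to_key_attention(torch.randn(1, 64, 32),
                           torch.randn(1, 100, 32),
                           torch.randn(1, 100, 32))
```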
For low-level computer vision and image processing ML tasks, training on large datasets is critical for generalization. However, the standard practice of relying on real-world images, primarily from the Internet, comes with image quality, scalability, and privacy issues, especially in commercial contexts. To address this, we have developed a procedural synthetic data generation pipeline and dataset tailored to low-level vision tasks. Our Unreal Engine-based synthetic data pipeline populates large scenes algorithmically with a combination of random 3D objects, materials, and geometric transformations. We then calibrate camera noise profiles to synthesize noisy images. From this pipeline, we generated a fully synthetic image denoising dataset (FSID) consisting of 175,000 noisy/clean image pairs. We then trained and validated a CNN-based denoising model, and demonstrated that the model trained on this synthetic data alone can achieve competitive denoising results when evaluated on real-world noisy images captured with smartphone cameras.
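The abstract does not publish the calibrated noise model; a common choice for smartphone-like sensors is a heteroscedastic shot-plus-read Gaussian model, sketched below with made-up parameters in place of per-camera calibration.

```python
import numpy as np

def add_camera_noise(clean, shot_gain=0.012, read_std=0.004):
    """Apply a simple shot + read noise model to a clean image in [0, 1];
    real pipelines calibrate shot_gain/read_std per camera profile."""
    variance = shot_gain * clean + read_std ** 2   # signal-dependent variance
    noisy = clean + np.random.normal(0.0, np.sqrt(variance))
    return np.clip(noisy, 0.0, 1.0)

clean = np.random.rand(256, 256, 3).astype(np.float32)
noisy = add_camera_noise(clean)
```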