Flow-guided synthesis provides a common framework for frame interpolation, where optical flow is typically estimated by a pyramid network and then leveraged to guide a synthesis network that generates intermediate frames between input frames. In this paper, we present UPR-Net, a novel Unified Pyramid Recurrent Network for frame interpolation. Cast in a flexible pyramid framework, UPR-Net exploits lightweight recurrent modules for both bi-directional flow estimation and intermediate frame synthesis. At each pyramid level, it leverages the estimated bi-directional flow to generate forward-warped representations for frame synthesis; across pyramid levels, it enables iterative refinement of both the optical flow and the intermediate frame. In particular, we show that our iterative synthesis can significantly improve the robustness of frame interpolation in large-motion cases. Despite being extremely lightweight (1.7M parameters), UPR-Net achieves excellent performance on a wide range of benchmarks. Code will be available soon.
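To make the idea of a forward-warped representation concrete, the following is a minimal sketch of forward warping toward an intermediate time step. It is not UPR-Net's implementation: the function name `forward_warp`, the nearest-pixel rounding, and the averaging of colliding pixels are all assumptions chosen for brevity.

```python
def forward_warp(frame, flow, t):
    """Push each source pixel along t * flow and average any collisions.

    frame: 2D list of floats (H x W); flow: 2D list of (dx, dy) tuples.
    t: intermediate time in [0, 1]. Target pixels that receive no source
    pixel are returned as None (holes).
    Hypothetical helper: nearest-pixel splatting is an assumption.
    """
    h, w = len(frame), len(frame[0])
    acc = [[0.0] * w for _ in range(h)]   # sum of values landing on each pixel
    cnt = [[0] * w for _ in range(h)]     # number of values landing on each pixel
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            tx = round(x + t * dx)        # target column at time t
            ty = round(y + t * dy)        # target row at time t
            if 0 <= tx < w and 0 <= ty < h:
                acc[ty][tx] += frame[y][x]
                cnt[ty][tx] += 1
    return [
        [acc[y][x] / cnt[y][x] if cnt[y][x] else None for x in range(w)]
        for y in range(h)
    ]


# A one-row frame whose pixels all move 2 pixels to the right between frames;
# at t = 0.5 they have moved 1 pixel.
warped = forward_warp([[1.0, 2.0, 3.0, 4.0]], [[(2.0, 0.0)] * 4], 0.5)
```

Unlike backward warping, forward warping can leave holes and collisions, which is why the sketch tracks a per-pixel count.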
When training early deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupied most of the execution time. Accordingly, extensive research has been done to reduce the computational burden of the convolution or linear layers. In recent mobile-friendly DNNs, however, the relative number of operations involved in processing these layers has decreased significantly. As a result, the proportion of the execution time spent in other layers, such as batch normalization layers, has increased. Thus, in this work, we conduct a detailed analysis of the batch normalization layer to efficiently reduce the runtime overhead of the batch normalization process. Backed by this analysis, we present an extremely efficient batch normalization, named LightNorm, and its associated hardware module. In more detail, we fuse three approximation techniques: i) low bit-precision, ii) range batch normalization, and iii) block floating point. All these approximation techniques are carefully utilized not only to maintain the statistics of intermediate feature maps, but also to minimize off-chip memory accesses. By using the proposed LightNorm hardware, we can achieve significant area and energy savings during DNN training without hurting the training accuracy. This makes the proposed hardware a strong candidate for on-device training.
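As an illustration of one of the three techniques, the sketch below shows range batch normalization in its commonly cited form, where the standard deviation is approximated by the range of the centered values scaled by 1/sqrt(2 ln n). This is an assumption about the formulation and not the LightNorm hardware design; the function name `range_batch_norm` and the `eps` parameter are hypothetical.

```python
import math


def range_batch_norm(values, eps=1e-5):
    """Normalize a batch using its range instead of its standard deviation.

    Assumed formulation: sigma is approximated by
    C(n) * (max - min) with C(n) = 1 / sqrt(2 * ln(n)).
    """
    n = len(values)
    if n < 2:
        raise ValueError("range batch normalization needs at least 2 values")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    value_range = max(centered) - min(centered)
    # Range-based stand-in for the standard deviation (no squares or roots
    # over the whole batch are needed, only a max and a min).
    approx_sigma = value_range / math.sqrt(2.0 * math.log(n))
    return [c / (approx_sigma + eps) for c in centered]


normalized = range_batch_norm([1.0, 2.0, 3.0, 4.0])
```

The appeal for hardware is that the max and min can be tracked with comparators, which is cheaper than accumulating a variance.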
We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song. This is achieved with an encoder pre-trained with a contrastive objective to extract only audio-effects-related information from a reference music recording. All our models are trained in a self-supervised manner from an already-processed wet multitrack dataset, with an effective data preprocessing method that alleviates the scarcity of unprocessed dry data. We analyze the proposed encoder for its ability to disentangle audio effects and also validate its performance for mixing style transfer through both objective and subjective evaluations. From the results, we show that the proposed system not only converts the mixing style of multitrack audio close to a reference but is also robust for mixture-wise style transfer when a music source separation model is used.
Drowsiness on the road is a widespread problem with fatal consequences; thus, a multitude of systems and techniques have been proposed to detect it. Among existing methods, Ghoddoosian et al. utilized temporal blinking patterns to detect early signs of drowsiness, but their algorithm was tested only on a powerful desktop computer, which is not practical in a moving vehicle. In this paper, we propose an efficient platform to run the algorithm of Ghoddoosian et al., detail the performance tests we ran to select this platform, and explain our threshold optimization logic. After considering the Jetson Nano and the Beelink (Mini PC), we concluded that the Mini PC is the most efficient and practical platform for running our embedded system in a vehicle. To determine this, we ran communication speed tests and evaluated total processing times for inference operations. Based on our experiments, the average total processing time to run the drowsiness detection model was 94.27 ms for the Jetson Nano and 22.73 ms for the Beelink (Mini PC). Considering the portability and power efficiency of each device, along with the processing time results, the Beelink (Mini PC) was determined to be the most suitable. We also propose a threshold optimization algorithm, which determines whether the driver is drowsy or alert based on the trade-off between the sensitivity and specificity of the drowsiness detection model. Our study will serve as a crucial next step for drowsiness detection research and its application in vehicles. Through our experiments, we have determined a favorable platform that can run drowsiness detection algorithms in real time and can be used as a foundation to further advance drowsiness detection research. In doing so, we have bridged the gap between an existing embedded system and its actual implementation in vehicles, bringing drowsiness detection technology a step closer to widespread real-life use.
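The sensitivity/specificity trade-off behind a threshold choice can be sketched as follows. The abstract does not specify the selection criterion, so this example assumes Youden's J statistic (sensitivity + specificity - 1) as a stand-in; the function name `best_threshold` and the toy scores are hypothetical.

```python
def best_threshold(scores, labels):
    """Pick the score threshold that maximizes sensitivity + specificity - 1.

    scores: model outputs, higher means more drowsy.
    labels: 1 for drowsy, 0 for alert.
    Returns (threshold, j, sensitivity, specificity).
    Assumption: Youden's J is used here; the paper's criterion may differ.
    """
    positives = sum(1 for label in labels if label == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both drowsy and alert examples are required")
    best = None
    for th in sorted(set(scores)):
        # A sample is predicted drowsy when its score is >= th.
        tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= th)
        tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < th)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1.0
        if best is None or j > best[1]:
            best = (th, j, sensitivity, specificity)
    return best


threshold, j, sens, spec = best_threshold(
    [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9],
    [0, 0, 0, 1, 1, 1, 1],
)
```

In a vehicle setting one might weight sensitivity more heavily than specificity, since a missed drowsy episode is costlier than a false alarm; that weighting is a design choice outside this sketch.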
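Histopathology remains the gold standard for the diagnosis of various cancers. Recent advances in computer vision, particularly deep learning, have facilitated the analysis of histopathology images for a variety of tasks, including immune cell detection and microsatellite instability classification. The state of the art for each task often employs base architectures that have been pretrained for image classification. The standard approach to developing histopathology classifiers tends to focus on optimizing a model for a single task, rather than considering which aspects of modeling innovations improve generalization across tasks. Here we present ChampKit (Comprehensive Histopathology Assessment of Model Predictions toolKit): an extensible, fully reproducible benchmarking toolkit that consists of a broad collection of patch-level image classification tasks across different cancers. ChampKit provides a way to systematically document the performance impact of proposed improvements in models and methodology. ChampKit source code and data are freely accessible at https://github.com/kaczmarj/champkit.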
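Deep learning-based techniques have contributed to remarkable progress in the field of automatic image quality assessment (IQA). Existing IQA methods are designed to measure the quality of an image in terms of the Mean Opinion Score (MOS) at the image level (i.e., the whole image) or at the patch level (dividing the image into multiple units and measuring the quality of each patch). Some applications may require assessing quality at the pixel level (i.e., a MOS value for each pixel); however, this is not possible with existing techniques because spatial information is lost owing to their network structures. This paper proposes an IQA algorithm that can measure the MOS at the pixel level in addition to the image-level MOS. The proposed algorithm consists of three core parts: i) local IQA; ii) region of interest (ROI) prediction; and iii) high-level feature embedding. The local IQA part outputs the MOS at the pixel level, or pixel-by-pixel MOS, which we term "pMOS". The ROI prediction part outputs weights that characterize the relative importance of each region when calculating the image-level IQA. The high-level feature embedding part extracts high-level image features, which are then embedded into the local IQA part. In other words, the proposed algorithm yields three outputs: the pMOS, which represents the MOS of each pixel; the weights from the ROI, which indicate the relative importance of each region; and the image-level MOS, which is obtained as the weighted sum of the pMOS and ROI values. The image-level MOS obtained by using the pMOS and ROI weights shows superior performance compared with existing popular IQA techniques. In addition, visualization results indicate that the predicted pMOS and ROI outputs are reasonably aligned with the general principles of the human visual system (HVS).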
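The final aggregation step can be sketched as a weighted combination of the pixel-level scores. The abstract only says "weighted sum", so dividing by the total weight (a weighted average) is an assumption made here to keep the result on the MOS scale; the function name `image_level_mos` is hypothetical.

```python
def image_level_mos(pmos, roi_weights):
    """Combine a pMOS map and an ROI weight map into one image-level score.

    pmos, roi_weights: 2D lists of floats with the same shape.
    Assumption: weights are non-negative and the sum is normalized
    by the total weight.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for pmos_row, weight_row in zip(pmos, roi_weights):
        for score, weight in zip(pmos_row, weight_row):
            weighted_sum += score * weight
            total_weight += weight
    if total_weight == 0:
        raise ValueError("ROI weights must not all be zero")
    return weighted_sum / total_weight


# The bottom-right pixel has twice the weight of the top-row pixels,
# and the bottom-left pixel is ignored.
mos = image_level_mos(
    [[2.0, 4.0], [3.0, 5.0]],
    [[1.0, 1.0], [0.0, 2.0]],
)
```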
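Temporal Video Grounding (TVG) aims to localize time segments in an untrimmed video according to a natural language query. In this work, we present a new paradigm for TVG named Explore-and-Match, which seamlessly unifies two streams of TVG methods: proposal-free and proposal-based. The former explores the search space to find segments directly, and the latter matches predefined proposals with the ground truth. To achieve this, we view TVG as a set prediction problem and design an end-to-end trainable Language Video Transformer (LVTR) that exploits the architectural strengths of rich contextualization and parallel decoding for set prediction. The overall training schedule is balanced by two key losses that play different roles, namely a temporal localization loss and a set guidance loss. These two losses allow each proposal to regress the target segment and identify the target query. More specifically, LVTR first explores the search space to diversify the initial proposals and then matches the proposals to the corresponding targets to align them in a fine-grained manner. The Explore-and-Match scheme successfully combines the strengths of the two complementary methods without encoding prior knowledge (e.g., non-maximum suppression) into the TVG pipeline. As a result, LVTR sets new state-of-the-art results on two TVG benchmarks (ActivityCaptions and Charades-STA) with twice the inference speed. Code is available at https://github.com/sangminwoo/explore-and-match.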
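The matching step in a set prediction formulation can be illustrated with a small sketch that assigns proposals to targets one-to-one. It is not the LVTR training code: the cost (1 minus temporal IoU), the brute-force search over assignments, and the names `temporal_iou` and `match_proposals` are assumptions suited only to tiny examples.

```python
from itertools import permutations


def temporal_iou(a, b):
    """Intersection over union of two (start, end) time segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def match_proposals(proposals, targets):
    """Find the one-to-one assignment with the lowest total (1 - IoU) cost.

    Assumes len(proposals) >= len(targets). Returns a list of
    (proposal_index, target_index) pairs, one per target.
    Brute force is exponential; real systems typically use the
    Hungarian algorithm instead.
    """
    best_cost = float("inf")
    best_pairs = []
    for chosen in permutations(range(len(proposals)), len(targets)):
        cost = sum(
            1.0 - temporal_iou(proposals[p], targets[t])
            for t, p in enumerate(chosen)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(chosen)]
    return best_pairs


pairs = match_proposals(
    [(0.0, 5.0), (10.0, 20.0), (4.0, 9.0)],   # predicted segments (seconds)
    [(11.0, 19.0), (0.0, 4.0)],               # ground-truth segments
)
```

Because each target receives exactly one proposal, duplicate predictions are penalized during training, which is one reason set prediction pipelines can do without non-maximum suppression.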
