Athletes routinely undergo fitness evaluations to assess their training progress. Typically, these evaluations require a trained professional who uses specialized equipment such as force plates. For the assessment, athletes perform drop and squat jumps, and key variables are measured, e.g. velocity, flight time, and time to stabilization. However, amateur athletes may not have access to professionals or equipment that can provide these assessments. Here, we investigate the feasibility of estimating key variables using video recordings. We focus on jump velocity as a starting point because it is highly correlated with other key variables and is important for determining posture and lower-limb capacity. We find that velocity can be estimated with a high degree of precision across a range of athletes, with an average R-value of 0.71 (SD = 0.06).
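To make the reported R-value concrete, the sketch below computes a correlation between reference and video-based velocity estimates. It assumes the R-value is a Pearson correlation coefficient, which the abstract does not state; the function name and the sample velocities are hypothetical and invented for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical jump velocities in m/s (not data from the paper).
force_plate_velocity = [2.1, 2.4, 2.6, 2.9, 3.1]
video_velocity = [2.0, 2.5, 2.5, 3.0, 3.0]
r = pearson_r(force_plate_velocity, video_velocity)
print(f"r = {r:.3f}")
```

Under this reading, a per-athlete value would be computed over that athlete's jumps and then averaged across athletes to give a mean and standard deviation.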
translated by 谷歌翻译
Camera-based physiological measurement is a growing field, with neural models providing state-of-the-art performance. Prior research has explored various "end-to-end" models; however, these methods still require several preprocessing steps. These additional operations are often non-trivial to implement, making replication and deployment difficult, and can even have a higher computational budget than the "core" network itself. In this paper, we propose two novel and efficient neural models for camera-based physiological measurement called EfficientPhys that remove the need for face detection, segmentation, normalization, color space transformation, or any other preprocessing steps. Using an input of raw video frames, our models achieve strong performance on three public datasets. We show that this is the case whether using a transformer or convolutional backbone. We further evaluate the latency of the proposed networks and show that our most lightweight network also achieves a 33% improvement in efficiency.
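Car crashes rose by 20% compared with 2020 as a result of increased distraction and drowsiness. Drowsy and distracted driving are the cause of 45% of all car crashes. As a means of reducing drowsy and distracted driving, detection methods using computer vision can be designed to be low-cost, accurate, and minimally invasive. This work investigates the use of vision transformers to outperform the state-of-the-art accuracy of 3D-CNNs. Two separate transformers were trained, one for drowsiness and one for distraction. The drowsiness video transformer was trained on the National Tsing-Hua University Drowsy Driving Dataset (NTHU-DDD) with a Video Swin Transformer model for 10 epochs on two classes, drowsy and non-drowsy, simulated over 10.5 hours. The distraction video transformer was trained on the Driver Monitoring Dataset (DMD) with a Video Swin Transformer for 50 epochs over 9 distraction-related classes. The drowsiness model reached 44% accuracy with a high loss value on the test set, indicating overfitting and poor model performance. The overfitting points to limited training data and to an applied model architecture that lacked quantifiable parameters to learn. The distraction model outperformed state-of-the-art models on DMD, reaching 97.5%, which indicates that with sufficient data and a strong architecture, transformers are suitable for unfit-driving detection. Future research should use newer models such as TokenLearner to achieve higher accuracy and efficiency, merge existing datasets to extend detection to drunk driving and road rage and so create a comprehensive solution for preventing traffic crashes, and deploy a functioning prototype to revolutionize the automotive safety industry.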
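Cardiac magnetic resonance (CMR) imaging is commonly used to assess cardiac anatomy and function. Delineation of the left ventricular blood pool and the left ventricular myocardium is important for diagnosing cardiac diseases. Unfortunately, patient movement during the CMR acquisition procedure can cause motion artifacts to appear in the final image. Such artifacts reduce the diagnostic quality of CMR images and force the procedure to be repeated. In this paper, we present a multi-task Swin UNet Transformer network that simultaneously solves the two tasks of the CMRxMotion challenge: CMR segmentation and motion artifact classification. We treat segmentation and classification as a multi-task learning problem, which allows us to determine the diagnostic quality of CMR images and generate masks at the same time. CMR images are classified into three diagnostic quality classes, and all samples with non-severe motion artifacts are segmented. An ensemble of five networks trained using 5-fold cross-validation achieves a segmentation performance of 0.871 Dice coefficient and a classification accuracy of 0.595.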
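For readers unfamiliar with the segmentation metric quoted above, here is a minimal sketch of the Dice coefficient for two binary masks, in plain Python. The function name, the toy masks, and the convention for two empty masks are assumptions for illustration; this is not the challenge's evaluation code.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks of equal size."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 3x3 masks (hypothetical): 3 foreground pixels each, 2 of them shared.
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 0],
          [0, 1, 1],
          [0, 0, 0]]
score = dice_coefficient(pred, target)
print(f"Dice = {score:.3f}")
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixels.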
Correctly recognizing the behaviors of children with Autism Spectrum Disorder (ASD) is of vital importance for the diagnosis of Autism and timely early intervention. However, the observation and recording during the treatment from the parents of autistic children may not be accurate and objective. In such cases, automatic recognition systems based on computer vision and machine learning (in particular deep learning) technology can alleviate this issue to a large extent. Existing human action recognition models can now achieve persuasive performance on challenging activity datasets, e.g. daily activity, and sports activity. However, problem behaviors in children with ASD are very different from these general activities, and recognizing these problem behaviors via computer vision is less studied. In this paper, we first evaluate a strong baseline for action recognition, i.e. Video Swin Transformer, on two autism behaviors datasets (SSBD and ESBD) and show that it can achieve high accuracy and outperform the previous methods by a large margin, demonstrating the feasibility of vision-based problem behaviors recognition. Moreover, we propose language-assisted training to further enhance the action recognition performance. Specifically, we develop a two-branch multimodal deep learning framework by incorporating the "freely available" language description for each type of problem behavior. Experimental results demonstrate that incorporating additional language supervision can bring an obvious performance boost for the autism problem behaviors recognition task as compared to using the video information only (i.e. 3.49% improvement on ESBD and 1.46% on SSBD).
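Surgical captioning plays an important role in surgical instruction prediction and report generation. However, most captioning models still rely on computationally heavy object detectors or feature extractors to extract regional features. In addition, the detection model requires additional bounding box annotation, which is costly and needs skilled annotators. These factors lead to inference delay and limit the deployment of captioning models in real-time robotic surgery. To this end, we design an end-to-end detector-free and feature-extractor-free captioning model by using a patch-based shifted window technique. We propose the Shifted Window-Based Multi-Layer Perceptrons Transformer Captioning model (SwinMLP-TranCAP), which offers faster inference speed and less computation. SwinMLP-TranCAP replaces the multi-head attention module with a window-based multi-head MLP. Such designs have mainly been applied to image understanding tasks, and very few works have investigated the caption generation task. SwinMLP-TranCAP is also extended to a video version for video captioning tasks using 3D patches and windows. Compared with previous detector-based or feature-extractor-based models, our models greatly simplify the architecture design while maintaining performance on two surgical datasets. The code is publicly available at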
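Given the considerable growth in chest X-ray images used to diagnose various diseases, and the collection of extensive datasets, automated diagnosis procedures using deep neural networks have occupied the minds of experts. Most available methods in computer vision use a CNN backbone to obtain high accuracy on classification problems. Nevertheless, recent research shows that transformers, established as the de facto method in NLP, can also outperform many CNN-based models. This paper proposes a multi-label classification deep model that uses the Swin Transformer as the backbone to achieve state-of-the-art diagnosis classification. It uses a multi-layer perceptron (MLP) for the head architecture. We evaluate our model on one of the most widely used and largest X-ray datasets, "Chest X-ray14", which comprises more than 100,000 frontal/back-view images from over 30,000 patients with 14 well-known chest diseases. Our model was tested with several numbers of MLP layers in the head, each of which achieves a competitive AUC score on all classes. Comprehensive experiments on Chest X-ray14 show that a three-layer head attains state-of-the-art performance with an average AUC score of 0.810, compared with the former SOTA average AUC of 0.799. We propose an experimental setup for the fair benchmarking of existing methods, which can serve as a basis for future studies. Finally, we follow up our results by confirming that the proposed method attends to the pathologically relevant areas of the chest.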
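To deal with variant-length long videos, prior works extract multi-modal features and fuse them to predict students' engagement intensity. In this paper, we present a new end-to-end method, Class Attention in Video Transformer (CavT), which uses a single vector to process class embedding and performs end-to-end learning uniformly on variant-length long videos and fixed-length short videos. Furthermore, to address the lack of sufficient samples, we propose a binary-order representatives sampling method (BorS) that adds multiple video sequences of each video to augment the training set. BorS+CavT not only achieves the state-of-the-art MSE (0.0495) on the EmotiW-EP dataset, but also obtains the state-of-the-art MSE (0.0377) on the DAiSEE dataset. The code and models will be made publicly available at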
We introduce a Transformer based 6D Object Pose Estimation framework VideoPose, comprising an end-to-end attention based modelling architecture, that attends to previous frames in order to estimate accurate 6D Object Poses in videos. Our approach leverages the temporal information from a video sequence for pose refinement, along with being computationally efficient and robust. Compared to existing methods, our architecture is able to capture and reason from long-range dependencies efficiently, thus iteratively refining over video sequences. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with the state-of-the-art Transformer methods, and performs significantly better relative to CNN based approaches. Further, with a speed of 33 fps, it is also more efficient and therefore applicable to a variety of applications that require real-time object pose estimation. Training code and pretrained models are available at
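Global food insecurity is expected to worsen in the coming decades with the accelerating rate of climate change and the rapidly increasing population. In this vein, it is important to remove inefficiencies at every level of food production. Recent advances in deep learning can help reduce such inefficiencies, yet their application has not become mainstream across the industry, which induces economic costs at a massive scale. To this end, modern techniques such as CNNs (convolutional neural networks) have been applied to RPQD (raw produce quality detection) tasks. On the other hand, the Transformer's successful debut in vision, among other modalities, leads us to expect better performance from Transformer-based models in RPQD. In this work, we exclusively investigate the recent state-of-the-art Swin (Shifted Windows) Transformer, which computes self-attention in both an intra-window and an inter-window fashion. We compare the Swin Transformer against CNN models on four RPQD image datasets, each containing a different kind of raw produce: fruits and vegetables, fish, pork, and beef. We observe that the Swin Transformer not only achieves better or competitive performance but is also data- and compute-efficient, making it ideal for actual deployment in real-world settings. To the best of our knowledge, this is the first large-scale empirical study on the RPQD task, which we hope will gain more attention in future works.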
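While transformers have shown great potential on video recognition tasks thanks to their strong capability of capturing long-range dependencies, they often suffer from high computational costs induced by self-attention over the huge number of 3D tokens in a video. In this paper, we propose a new transformer architecture, termed DualFormer, which can effectively and efficiently perform space-time attention for video recognition. Specifically, DualFormer stratifies the full space-time attention into dual cascaded levels: it first learns fine-grained local space-time interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between the query token and the coarse-grained global pyramid contexts. Unlike existing methods that apply space-time factorization or restrict attention computation to local windows to improve efficiency, our local-global stratified strategy captures both short- and long-range spatiotemporal dependencies well, while greatly reducing the number of keys and values in the attention computation to boost efficiency. Experimental results show the superiority of DualFormer over existing methods on five video benchmarks. In particular, DualFormer sets a new state of the art of 82.9%/85.2% on Kinetics-400/600 with around 1000G inference FLOPs, which is at least 3.2 times fewer than existing methods with similar performance.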
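We implement Video Swin Transformer as a base architecture for the tasks of Point-of-No-Return temporal localization and Object State Change Classification. Our method achieves competitive performance on both challenges.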
In this work, we present a new computer vision task named video object of interest segmentation (VOIS). Given a video and a target image of interest, our objective is to simultaneously segment and track all objects in the video that are relevant to the target image. This problem combines the traditional video object segmentation task with an additional image indicating the content that users are concerned with. Since no existing dataset is perfectly suitable for this new task, we specifically construct a large-scale dataset called LiveVideos, which contains 2418 pairs of target images and live videos with instance-level annotations. In addition, we propose a transformer-based method for this task. We revisit Swin Transformer and design a dual-path structure to fuse video and image features. Then, a transformer decoder is employed to generate object proposals for segmentation and tracking from the fused features. Extensive experiments on LiveVideos dataset show the superiority of our proposed method.
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
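The windowing idea described above can be illustrated with a small, dependency-free sketch. It partitions a toy grid of token ids into non-overlapping windows, then repeats the partition after a cyclic shift so that each new window straddles several of the original ones. All function names and the 4x4 grid are hypothetical; this is not the released Swin Transformer code, and it omits the attention computation and the masking applied to tokens that wrap around the border.

```python
def cyclic_shift(grid, shift):
    """Roll a 2D grid up and left by `shift` cells, wrapping around the edges."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


def window_partition(grid, window):
    """Split a 2D grid into non-overlapping window x window blocks (row-major)."""
    h, w = len(grid), len(grid[0])
    if h % window or w % window:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, h, window):
        for left in range(0, w, window):
            windows.append([grid[r][c]
                            for r in range(top, top + window)
                            for c in range(left, left + window)])
    return windows


def attention_pairs(h, w, window=None):
    """Count query-key pairs for global attention or for windowed attention."""
    if window is None:
        return (h * w) ** 2
    num_windows = (h // window) * (w // window)
    return num_windows * (window * window) ** 2


# Toy 4x4 grid of token ids 0..15 (hypothetical), window size 2, shift 1.
tokens = [[r * 4 + c for c in range(4)] for r in range(4)]
regular = window_partition(tokens, 2)
shifted = window_partition(cyclic_shift(tokens, 1), 2)
print(regular[0], shifted[0])
print(attention_pairs(4, 4), attention_pairs(4, 4, window=2))
```

In this toy case the first shifted window holds one token from each of the four regular windows, which is the cross-window connection the abstract refers to. With a fixed window size, the pair count grows in proportion to the number of windows, and therefore linearly with image size, rather than quadratically as in global attention.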
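With the boom in convolutional neural networks (CNNs), CNNs such as VGG-16 and ResNet-50 are widely used as backbones in SAR ship detection. However, CNN-based backbones struggle to model long-range dependencies and leave the feature maps of shallow layers without enough high-quality semantic information, which leads to poor detection performance against complicated backgrounds and on small ships. To address these problems, we propose a SAR ship detection method based on the Swin Transformer and a Feature Enhancement Feature Pyramid Network (FEFPN). The Swin Transformer serves as the backbone to model long-range dependencies and generate hierarchical feature maps. FEFPN is proposed to further improve the quality of the feature maps by gradually enhancing the semantic information of feature maps at all levels, especially those in shallow layers. Experiments conducted on the SAR Ship Detection Dataset (SSDD) reveal the advantage of our proposed method.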