The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based either on multiple identical models (61%) or on heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
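The two strategies named above, patch-based processing of oversized samples and K-fold cross-validation, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`axis_starts`, `patch_origins`, `kfold_indices`) are hypothetical, the folds are contiguous and unshuffled, and the survey does not say how participants actually implemented either step.

```python
from itertools import product


def axis_starts(size, patch, stride):
    """Start offsets along one axis so that patches cover [0, size)."""
    if patch >= size:
        return [0]  # one patch spans the whole axis (padding is left to the caller)
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)  # extra patch shifted back to reach the border
    return starts


def patch_origins(shape, patch_shape, stride):
    """Origins (one coordinate tuple per patch) tiling an N-dimensional sample."""
    per_axis = [axis_starts(s, p, st) for s, p, st in zip(shape, patch_shape, stride)]
    return list(product(*per_axis))


def kfold_indices(n_samples, k):
    """Return k (train_indices, val_indices) pairs using contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return [
        ([i for j, fold in enumerate(folds) if j != m for i in fold], folds[m])
        for m in range(k)
    ]


if __name__ == "__main__":
    # A 10x10 sample cut into 4x4 patches with stride 4 needs 3 starts per axis.
    print(len(patch_origins((10, 10), (4, 4), (4, 4))))
    for train, val in kfold_indices(5, 2):
        print(train, val)
```

Shifting the last patch back to the border keeps every patch the same size at the cost of some overlap; padding the sample instead is an equally common choice.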
Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement into a "scanner" of the 3D world. Intuitively, human movement indicates the free space in a room, and human contact indicates surfaces or objects that support activities such as sitting, lying, or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), a generative model of indoor scenes that produces furniture layouts consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research at https://mime.is.tue.mpg.de.
Diverse data formats and ontologies of task-oriented dialogue (TOD) datasets hinder us from developing general dialogue models that perform well on many datasets and from studying knowledge transfer between datasets. To address this issue, we present ConvLab-3, a flexible dialogue system toolkit based on a unified TOD data format. In ConvLab-3, different datasets are transformed into one unified format and loaded by models in the same way. As a result, the cost of adapting a new model or dataset is significantly reduced. Compared to the previous releases of ConvLab (Lee et al., 2019b; Zhu et al., 2020b), ConvLab-3 allows developing dialogue systems with many more datasets and enhances the utility of the reinforcement learning (RL) toolkit for dialogue policies. To showcase the use of ConvLab-3 and inspire future work, we present a comprehensive study with various settings. We show the benefit of pre-training on other datasets for few-shot fine-tuning and RL, and encourage evaluating policy with diverse user simulators.
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
We apply computer vision pose estimation techniques developed expressly for the data-scarce infant domain to the study of torticollis, a common condition in infants for which early identification and treatment is critical. Specifically, we use a combination of facial landmark and body joint estimation techniques designed for infants to estimate a range of geometric measures pertaining to face and upper body symmetry, drawn from an array of sources in the physical therapy and ophthalmology research literature on torticollis. We gauge performance with a range of metrics and show that the estimates of most of these geometric measures are successful, yielding strong to very strong Spearman's $\rho$ correlation with ground truth values. Furthermore, we show that these estimates, derived from pose estimation neural networks designed for the infant domain, cleanly outperform estimates derived from more widely known networks designed for the adult domain.
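For readers unfamiliar with the evaluation metric, Spearman's $\rho$ is the Pearson correlation of the ranks of two samples, with tied values sharing their average rank. The following is a minimal standard-library sketch of that definition, not the authors' evaluation code; the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the computation, any strictly increasing relationship between estimates and ground truth gives $\rho = 1$, even if it is nonlinear.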
Generative Adversarial Networks (GANs) have paved the path towards entirely new media generation capabilities at the forefront of image, video, and audio synthesis. However, they can also be misused and abused to fabricate elaborate lies, capable of stirring up the public debate. The threat posed by GANs has sparked the need to discern between genuine and fabricated content. Previous studies have tackled this task by using classical machine learning techniques, such as k-nearest neighbours and eigenfaces, which unfortunately did not prove very effective. Subsequent methods have focused on leveraging frequency decompositions, i.e., discrete cosine transform, wavelets, and wavelet packets, to preprocess the input features for classifiers. However, existing approaches rely only on isotropic transformations. We argue that, since GANs primarily utilize isotropic convolutions to generate their output, they leave clear traces, their fingerprint, in the coefficient distribution on sub-bands extracted by anisotropic transformations. We employ the fully separable wavelet transform and multiwavelets to obtain the anisotropic features to feed to standard CNN classifiers. Lastly, we find the fully separable transform capable of improving the state-of-the-art.
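To make the term concrete: a fully separable 2D wavelet transform is usually understood as running the complete multi-level 1D transform along every row and then along every column, which produces sub-bands whose horizontal and vertical scales differ (hence anisotropic). The sketch below illustrates that construction under assumptions that are not from the abstract: it uses the orthonormal Haar filter, power-of-two image sizes, and hypothetical function names. The wavelets and implementation actually used by the authors are not specified here.

```python
from math import sqrt


def haar_step(signal):
    """One orthonormal Haar level: scaled pairwise sums, then pairwise differences."""
    s = sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_full_1d(signal):
    """Multi-level 1D Haar transform; the length must be a power of two."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        approx, detail = haar_step(coeffs[:n])
        coeffs[:n] = approx + detail  # keep refining only the approximation part
        n //= 2
    return coeffs


def fully_separable_2d(image):
    """Full 1D transform on every row, then on every column of the result."""
    rows = [haar_full_1d(row) for row in image]
    cols = [haar_full_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major


if __name__ == "__main__":
    image = [[1.0, 2.0, 3.0, 4.0],
             [2.0, 3.0, 4.0, 5.0],
             [0.0, 1.0, 0.0, 1.0],
             [5.0, 5.0, 5.0, 5.0]]
    for row in fully_separable_2d(image):
        print([round(c, 3) for c in row])
```

By contrast, the conventional isotropic 2D transform alternates a single row step and a single column step at each level, so every sub-band has the same scale in both directions.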
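Markerless human motion capture (mocap) from multiple RGB cameras is a widely studied problem. Existing methods either need calibrated cameras or calibrate them relative to a static camera, which acts as the reference frame for the mocap system. The calibration step has to be done a priori for every capture session, which is a tedious process, and re-calibration is required whenever cameras are intentionally or accidentally moved. In this paper, we propose a mocap method that uses multiple static and moving extrinsically uncalibrated RGB cameras. The key components of our method are as follows. First, since the cameras and the subject can move freely, we select the ground plane as a common reference to represent both the body and the camera motions, unlike existing methods that represent bodies in camera coordinates. Second, we learn a probability distribution of short human motion sequences ($\sim$1 sec) relative to the ground plane and leverage it to disambiguate between camera and human motion. Third, we use this distribution as a motion prior in a novel multi-stage optimization approach to fit the SMPL human body model and the camera poses to the human body keypoints in the images. Finally, we show that our method works on a variety of datasets, ranging from aerial cameras to smartphones. It also gives more accurate results on the task of monocular human mocap with a static camera. Our code is available for research at https://github.com/robot-ception-group/smartmocap.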
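Humans constantly interact with everyday objects to accomplish tasks. To understand such interactions, computers need to reconstruct them from cameras observing whole-body interaction with scenes. This is challenging due to occlusion between the body and objects, motion blur, depth/scale ambiguities, and the low image resolution of hands and graspable object parts. To make the problem tractable, the community focuses either on interacting hands, ignoring the body, or on interacting bodies, ignoring the hands. The GRAB dataset addresses dexterous whole-body interaction but uses marker-based MoCap and lacks images, while BEHAVE captures video of body-object interaction but lacks hand detail. We address the limitations of prior work with InterCap, a novel method that reconstructs interacting whole bodies and objects from multi-view RGB-D data, using the parametric whole-body model SMPL-X and known object meshes. To tackle the above challenges, InterCap uses two key observations: (i) Contact between the hand and the object can be used to improve the pose estimation of both. (ii) Azure Kinect sensors allow us to set up a simple multi-view RGB-D capture system that minimizes the effect of occlusion while providing reasonable inter-camera synchronization. With this method we capture the InterCap dataset, which contains 10 subjects (5 males and 5 females) interacting with 10 objects of various sizes and affordances, including contact with the hands or feet. In total, InterCap has 223 RGB-D videos, resulting in 67,357 multi-view frames, each containing 6 RGB-D images. Our method provides pseudo ground-truth body meshes and objects for each video frame. Our InterCap method and dataset fill an important gap in the literature and support many research directions. Our data and code are available for research purposes.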
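Since its inception in 2016, the Alexa Prize program has enabled hundreds of university students to explore and compete to develop conversational agents through the SocialBot Grand Challenge. The goal of the challenge is to build agents capable of conversing coherently and engagingly with humans on popular topics for 20 minutes, while achieving an average rating of at least 4.0/5.0. However, as conversational agents attempt to assist users with increasingly complex tasks, new conversational AI techniques and evaluation platforms are needed. The Alexa Prize TaskBot challenge, established in 2021, builds on the success of the SocialBot challenge by introducing the requirement of interactively assisting humans with real-world cooking and do-it-yourself tasks, while making use of both voice and visual modalities. This challenge requires the TaskBots to identify and understand the user's needs, identify and integrate task and domain knowledge into the interaction, and develop new ways of engaging the user without distracting them from the task at hand, among other challenges. This paper provides an overview of the TaskBot challenge, describes the infrastructure support provided to the teams with the CoBot Toolkit, and summarizes the approaches the participating teams took to overcome the research challenges. Finally, it analyzes the performance of the competing TaskBots during the first year of the competition.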
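Existing studies on neural architecture search (NAS) mainly focus on efficiently searching for network architectures with better performance. Little progress has been made in systematically understanding whether NAS-searched architectures are robust to privacy attacks, while abundant work has already shown that human-designed architectures are prone to privacy attacks. In this paper, we fill this gap and systematically measure the privacy risks of NAS architectures. Leveraging the insights from our measurement study, we further explore the cell patterns of cell-based NAS architectures and evaluate how the cell patterns affect the privacy risks of NAS-searched architectures. Through extensive experiments, we shed light on how to design NAS architectures that are robust against privacy attacks, and also offer a general methodology to understand the hidden correlation between NAS-searched architectures and other privacy risks.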