Active speaker detection in videos addresses the problem of associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information for deriving such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. Both approaches have limitations: audio-visual activity models are confused by other frequently occurring vocal activities, such as laughing and chewing, while speaker identity-based methods are limited to videos that contain enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework that guides the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker dataset (movies) and the Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances active speaker detection performance.
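As a rough illustration of what a late fusion of the two cues could look like, the sketch below blends a per-face audio-visual activity score with a per-face identity-association score using a convex combination. The function names, the `weight` parameter, and the assumption that both scores are already scaled to a comparable range are hypothetical and not taken from the abstract; the authors' actual fusion rule may differ.

```python
def late_fusion(activity_scores, identity_scores, weight=0.5):
    """Blend two per-face score lists with a convex combination.

    `weight` is a hypothetical mixing parameter: 1.0 trusts only the
    audio-visual activity scores, 0.0 trusts only the identity scores.
    Both score lists are assumed to be on a comparable scale.
    """
    if len(activity_scores) != len(identity_scores):
        raise ValueError("score lists must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * a + (1.0 - weight) * s
            for a, s in zip(activity_scores, identity_scores)]


def pick_active_face(activity_scores, identity_scores, weight=0.5):
    """Return the index of the candidate face with the highest fused score."""
    fused = late_fusion(activity_scores, identity_scores, weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

In this toy setting, a face that looks active (for example, because of laughter) but does not match the speaker's identity can be outvoted by a face with a strong identity match, which is the kind of complementarity the abstract describes.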
Running machine learning inference on tiny devices, known as TinyML, is an emerging research area. It requires generating inference code that uses memory frugally, something standard ML frameworks are ill-suited for. A deployment framework for TinyML must a) be parametric in the number representation to take advantage of emerging representations such as posits, b) carefully assign high precision to a few tensors so that most tensors can be kept in low precision while still maintaining model accuracy, and c) avoid memory fragmentation. We describe MinUn, the first TinyML framework that holistically addresses these issues to generate efficient code for ARM microcontrollers (e.g., Arduino Uno, Due and STM32H747) that outperforms the prior TinyML frameworks.
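Marker-based optical motion capture (OMC) systems and the associated musculoskeletal (MSK) modelling predictions offer insight into in vivo joint and muscle loading and aid clinical decision-making. However, an OMC system is lab-based, expensive, and requires a line of sight. A widely used alternative is the inertial motion capture (IMC) system, which is portable, user-friendly, and relatively low-cost, although it is not as accurate as an OMC system. Irrespective of the choice of motion capture technique, an MSK model is needed to obtain the kinematic and kinetic outputs, and this computationally expensive tool is increasingly approximated by machine learning (ML) methods. Here, we present an ML approach that maps IMC data to the human upper-extremity MSK outputs computed from OMC input data. Essentially, we attempt to predict high-quality MSK outputs from the relatively easy-to-obtain IMC data. We use OMC and IMC data collected simultaneously from the same subjects to train an ML model (a feed-forward multi-layer perceptron) that predicts OMC-based MSK outputs from IMC measurements. We demonstrate that our ML predictions agree closely with the desired OMC-based MSK estimates. This approach will therefore be instrumental in taking the technology from "lab to field", where OMC-based systems are infeasible.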
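We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies. Advances in machine learning have enabled impressive performance in identifying individuals from speech and facial images. We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task in which the active speaker's face and the underlying speech identify the same person (character). We express each speech segment in terms of its speaker identity distances from all other speech segments, capturing a relative identity structure for the video. We then assign an active speaker's face to each speech segment from the concurrently appearing faces, such that the resulting set of active speaker faces displays a similar relative identity structure. Furthermore, we propose a simple and effective approach to handle speech segments in which the speaker is off-screen. We evaluate the proposed system on three benchmark datasets (the Visual Person Clustering Dataset, the AVA Active Speaker dataset, and the Columbia dataset), consisting of videos from entertainment and broadcast media, and show performance competitive with state-of-the-art fully supervised methods.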
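The assignment idea described above can be sketched on toy data as follows. This is a minimal brute-force illustration under several assumptions that are not in the abstract: speech and face embeddings are plain coordinate tuples, identity distance is Euclidean, distances in the two modalities are on a comparable scale, and the structural mismatch is measured as a sum of squared differences between the two pairwise-distance matrices. All function names are hypothetical, and exhaustive search is only feasible for a handful of segments.

```python
import math
from itertools import product


def distance_matrix(vectors):
    """Pairwise Euclidean distances between embedding vectors."""
    return [[math.dist(u, v) for v in vectors] for u in vectors]


def structure_mismatch(speech_dist, face_dist):
    """Sum of squared differences over the upper triangle of both matrices."""
    n = len(speech_dist)
    return sum((speech_dist[i][j] - face_dist[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def assign_faces(speech_embs, candidate_faces):
    """Pick one visible face per speech segment.

    candidate_faces[i] lists the face embeddings visible during speech
    segment i. Every combination is tried, and the one whose face-distance
    matrix is closest to the speech-distance matrix is returned as a list
    of indices into each candidate list.
    """
    speech_dist = distance_matrix(speech_embs)
    best_choice, best_cost = None, math.inf
    for choice in product(*(range(len(c)) for c in candidate_faces)):
        faces = [candidate_faces[i][k] for i, k in enumerate(choice)]
        cost = structure_mismatch(speech_dist, distance_matrix(faces))
        if cost < best_cost:
            best_choice, best_cost = choice, cost
    return list(best_choice)
```

Note that matching distance structure alone is symmetric under swapping identities, so in this sketch a segment with a single visible face helps anchor which face belongs to which voice.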
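ML-as-a-service continues to grow, and so does the need for very strong privacy guarantees. Secure inference has emerged as a potential solution, in which cryptographic primitives allow inference without revealing the user's inputs to the model provider or the model's weights to the user. For instance, the model provider could be a diagnostics company that has trained a state-of-the-art DenseNet-121 model to interpret chest X-rays, and the user could be a patient at a hospital. While secure inference is feasible in principle for this setting, no existing technique makes it practical at scale. The CrypTFlow2 framework provides a potential solution with its ability to automatically and correctly translate clear-text inference into secure inference. However, the resulting secure inference from CrypTFlow2 is impractically expensive: interpreting a single X-ray on DenseNet-121 requires almost 3 TB of communication. In this paper, we address this outstanding challenge of inefficient secure inference with three contributions. First, we show that the primary bottlenecks in secure inference are large linear layers, which can be optimized through the choice of network backbone and the use of operators developed for efficient clear-text inference. This finding and emphasis deviate from many recent works, which focus on optimizing non-linear activation layers when performing secure inference on smaller networks. Second, based on an analysis of a bottlenecked convolution layer, we design an X-operator that is a more efficient drop-in replacement. Third, we show that the fast Winograd convolution algorithm further improves the efficiency of secure inference. In combination, these three optimizations prove highly effective for the problem of X-ray interpretation with models trained on the CheXpert dataset.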
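For readers unfamiliar with Winograd convolution, the sketch below shows the standard one-dimensional F(2,3) case in plain clear-text Python: two outputs of a 3-tap filter are computed with four multiplications instead of the six a direct computation uses. It only illustrates the arithmetic identity; it is not taken from the paper and says nothing about how the authors integrate the algorithm into a secure-inference protocol. The function names are hypothetical.

```python
def direct_conv_f23(d, g):
    """Direct computation: 2 outputs from a 4-element input tile and a
    3-tap filter, using 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs with 4 data-by-filter products."""
    # Filter transform; depends only on g, so it can be precomputed once.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # The four element-wise products between transformed data and filter.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]
```

Larger convolutions are usually handled by tiling the input and applying a two-dimensional analogue of this transform to each tile.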
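Graph neural networks (GNNs) exploit signals from node features and the input graph topology to improve performance on node classification tasks. However, these models tend to perform poorly on heterophilic graphs, where connected nodes have different labels. Recently proposed GNNs work across graphs with varying levels of homophily. Among these, models that rely on polynomial graph filters have shown promise. We observe that solutions to these polynomial graph filter models are also solutions to an overdetermined system of equations. This suggests that, in some instances, the model needs to learn a polynomial of reasonably high order. On investigation, we find that the proposed models are ineffective at learning such polynomials because of their design. To mitigate this issue, we perform an eigendecomposition of the graph and propose to learn multiple adaptive polynomial filters acting on different subsets of the spectrum. We demonstrate theoretically and empirically that our proposed model learns a better filter, thereby improving classification accuracy. We study various aspects of our proposed model, including its dependency on the number of eigencomponents used, the latent polynomial filters learned, and the performance of the individual polynomials on the node classification task. We further show that our model scales by evaluating it on large graphs. Our model achieves performance gains of up to 5% over state-of-the-art models and generally outperforms existing polynomial filter-based approaches.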
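Causal structures for observational survival data provide crucial information about the relationships between covariates and time-to-event. We draw motivation from the information-theoretic source coding argument and show that incorporating knowledge of the directed acyclic graph (DAG) can be beneficial if suitable source encoders are employed. As one possible source encoder in this context, we derive a variational inference-based conditional variational autoencoder for causally structured survival prediction, which we refer to as DAGSurv. We illustrate the performance of DAGSurv on low- and high-dimensional synthetic datasets, as well as on real-world datasets such as METABRIC. We demonstrate that the proposed method outperforms other survival analysis baselines such as Cox Proportional Hazards, DeepSurv, and DeepHit, which are oblivious to the underlying causal relationships between data entities.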
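An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to discern automatically who is talking, and when, how, and where, as well as who is not. Speaker activity can be discerned automatically from the rich multimodal information present in media content. This is, however, a challenging problem because of the vast variety and contextual variability of media content and the lack of labeled data. In this work, we present a cross-modal neural network for learning visual representations that carry implicit information about the spatial location of a speaker in the visual frames. To avoid the need for manual annotations of active speakers in visual frames, which are very expensive to acquire, we present a weakly supervised system for the task of localizing active speakers in movie content. We use the learned cross-modal visual representations and provide weak supervision from movie subtitles acting as a proxy for voice activity, thus requiring no manual annotations. We evaluate the performance of the proposed system on the AVA active speaker dataset and demonstrate the effectiveness of the cross-modal embeddings for localizing active speakers in comparison with fully supervised systems. We also demonstrate state-of-the-art performance on the task of voice activity detection in an audio-visual framework, especially when speech is accompanied by noise and music.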
Embedding words in vector space is a fundamental first step in state-of-the-art natural language processing (NLP). Typical NLP solutions employ pre-defined vector representations to improve generalization by co-locating similar words in vector space. For instance, Word2Vec is a self-supervised predictive model that captures the context of words using a neural network. Similarly, GloVe is a popular unsupervised model incorporating corpus-wide word co-occurrence statistics. Such word embeddings have significantly boosted important NLP tasks, including sentiment analysis, document classification, and machine translation. However, the embeddings are dense floating-point vectors, making them expensive to compute and difficult to interpret. In this paper, we instead propose to represent the semantics of words with a few defining words that are related using propositional logic. To produce such logical embeddings, we introduce a Tsetlin Machine-based autoencoder that learns logical clauses in a self-supervised manner. The clauses consist of contextual words like "black," "cup," and "hot" that define other words like "coffee," and are thus human-understandable. We evaluate our embedding approach on several intrinsic and extrinsic benchmarks, outperforming GloVe on six classification tasks. Furthermore, we investigate the interpretability of our embedding using the logical representations acquired during training. We also visualize word clusters in vector space, demonstrating how our logical embedding co-locates similar words.
Cashews are grown by over 3 million smallholders in more than 40 countries worldwide as a principal source of income. As the third largest cashew producer in Africa, Benin has nearly 200,000 smallholder cashew growers contributing 15% of the country's national export earnings. However, a lack of information on where and how cashew trees grow across the country hinders decision-making that could support increased cashew production and poverty alleviation. By leveraging 2.4-m Planet Basemaps and 0.5-m aerial imagery, newly developed deep learning algorithms, and large-scale ground truth datasets, we produced the first national map of cashew in Benin and characterized the expansion of cashew plantations between 2015 and 2021. In particular, we developed a SpatioTemporal Classification with Attention (STCA) model to map the distribution of cashew plantations, which can fully capture texture information from discriminative time steps during a growing season. We further developed a Clustering Augmented Self-supervised Temporal Classification (CASTC) model to distinguish high-density from low-density cashew plantations through automatic feature extraction and optimized clustering. Results show that the STCA model achieved an overall accuracy of 80% and the CASTC model an overall accuracy of 77.9%. We found that the cashew area in Benin doubled from 2015 to 2021, with 60% of new plantation development coming from cropland or fallow land, while encroachment of cashew plantations into protected areas increased by 70%. Only half of cashew plantations were high-density in 2021, suggesting high potential for intensification. Our study illustrates the power of combining high-resolution remote sensing imagery and state-of-the-art deep learning algorithms to better understand tree crops in the heterogeneous smallholder landscape.