Research has shown that climate change creates warmer temperatures and drier conditions, leading to longer wildfire seasons and increased wildfire risks in the United States. These factors have in turn led to increases in the frequency, extent, and severity of wildfires in recent years. Given the danger posed by wildland fires to people, property, wildlife, and the environment, there is an urgency to provide tools for effective wildfire management. Early detection of wildfires is essential to minimizing potentially catastrophic destruction. In this paper, we present our work on integrating multiple data sources in SmokeyNet, a deep learning model using spatio-temporal information to detect smoke from wildland fires. Camera image data is integrated with weather sensor measurements and processed by SmokeyNet to create a multimodal wildland fire smoke detection system. We present our results comparing performance in terms of both accuracy and time-to-detection for multimodal data vs. a single data source. With a time-to-detection of only a few minutes, SmokeyNet can serve as an automated early notification system, providing a useful tool in the fight against destructive wildfires.
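As a rough illustration of the kind of multimodal fusion the abstract describes, the sketch below concatenates features from an image branch with features from a weather-sensor branch before classification. The layer sizes, branch designs, and input dimensions are assumptions for illustration only, not SmokeyNet's actual architecture.

```python
# Hypothetical late-fusion sketch: image features + weather sensor features.
import torch
import torch.nn as nn

class MultimodalSmokeClassifier(nn.Module):
    def __init__(self, num_weather_features: int = 8):
        super().__init__()
        # Image branch: any CNN backbone mapping a frame to a feature vector.
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Weather branch: small MLP over sensor measurements
        # (e.g. temperature, humidity, wind speed).
        self.weather_branch = nn.Sequential(
            nn.Linear(num_weather_features, 32), nn.ReLU(),
        )
        # Fused head: concatenate both modalities, predict smoke / no smoke.
        self.head = nn.Linear(32 + 32, 2)

    def forward(self, image: torch.Tensor, weather: torch.Tensor) -> torch.Tensor:
        fused = torch.cat(
            [self.image_branch(image), self.weather_branch(weather)], dim=1)
        return self.head(fused)

model = MultimodalSmokeClassifier()
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 8))  # shape (4, 2)
```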
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will be publishing for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparisons. In this work, we present the details of data collection and annotation and the performance of the baseline models. We also perform sensitivity analysis of pipeline models' performance (speech recognizer + text model) to the speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.
Neural Radiance Fields (NeRFs) encode the radiance in a scene parameterized by the scene's plenoptic function. This is achieved by using an MLP together with a mapping to a higher-dimensional space, and has been proven to capture scenes with a great level of detail. Naturally, the same parameterization can be used to encode additional properties of the scene, beyond just its radiance. A particularly interesting property in this regard is the semantic decomposition of the scene. We introduce a novel technique for semantic soft decomposition of neural radiance fields (named SSDNeRF) which jointly encodes semantic signals in combination with radiance signals of a scene. Our approach provides a soft decomposition of the scene into semantic parts, enabling us to correctly encode multiple semantic classes blending along the same direction -- an impossible feat for existing methods. Not only does this lead to a detailed 3D semantic representation of the scene, but we also show that the regularizing effects of the MLP used for encoding help to improve the semantic representation. We show state-of-the-art segmentation and reconstruction results on a dataset of common objects and demonstrate how the proposed approach can be applied for high-quality, temporally consistent video editing and re-compositing on a dataset of casually captured selfie videos.
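The following sketch illustrates the general idea of jointly encoding radiance and semantics in a single MLP over positionally encoded points. The trunk width, output heads, and softmax-based soft decomposition are illustrative assumptions, not SSDNeRF's exact design.

```python
# Sketch: one NeRF-style MLP emitting density, radiance, and soft semantics.
import torch
import torch.nn as nn

def positional_encoding(x: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    # Map 3D points to a higher-dimensional space via sin/cos features.
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device)
    angles = x[..., None] * freqs                 # (..., 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)              # (..., 3 * 2 * num_freqs)

class SemanticNeRF(nn.Module):
    def __init__(self, num_classes: int, num_freqs: int = 10):
        super().__init__()
        in_dim = 3 * 2 * num_freqs
        self.trunk = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 256), nn.ReLU())
        self.sigma = nn.Linear(256, 1)                 # volume density
        self.rgb = nn.Linear(256, 3)                   # radiance
        # Per-class logits; softmax gives the soft decomposition weights,
        # so multiple classes can blend along one viewing direction.
        self.semantics = nn.Linear(256, num_classes)

    def forward(self, points: torch.Tensor):
        h = self.trunk(positional_encoding(points))
        return (self.sigma(h), torch.sigmoid(self.rgb(h)),
                self.semantics(h).softmax(dim=-1))

model = SemanticNeRF(num_classes=5)
sigma, rgb, sem = model(torch.randn(1024, 3))
```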
Memes are powerful means for effective communication on social media. Their effortless amalgamation of viral visuals and compelling messages can have far-reaching implications with proper marketing. Previous research on memes has primarily focused on characterizing their affective spectrum and detecting whether the meme's message insinuates any intended harm, such as hate, offense, racism, etc. However, memes often use abstraction, which can be elusive. Here, we introduce a novel task - EXCLAIM, generating explanations for visual semantic role labeling in memes. To this end, we curate ExHVV, a novel dataset that offers natural language explanations of connotative roles for three types of entities - heroes, villains, and victims, encompassing 4,680 entities present in 3K memes. We also benchmark ExHVV with several strong unimodal and multimodal baselines. Moreover, we posit LUMEN, a novel multimodal, multi-task learning framework that endeavors to address EXCLAIM optimally by jointly learning to predict the correct semantic roles and correspondingly to generate suitable natural language explanations. LUMEN distinctly outperforms the best baseline across 18 standard natural language generation evaluation metrics. Our systematic evaluation and analyses demonstrate that characteristic multimodal cues required for adjudicating semantic roles are also helpful for generating suitable explanations.
In this paper we address the solution of the popular Wordle puzzle, using new reinforcement learning methods, which apply more generally to adaptive control of dynamic systems and to classes of Partially Observable Markov Decision Process (POMDP) problems. These methods are based on approximation in value space and the rollout approach, admit a straightforward implementation, and provide improved performance over various heuristic approaches. For the Wordle puzzle, they yield on-line solution strategies that are very close to optimal at relatively modest computational cost. Our methods are viable for more complex versions of Wordle and related search problems, for which an optimal strategy would be impossible to compute. They are also applicable to a wide range of adaptive sequential decision problems that involve an unknown or frequently changing environment whose parameters are estimated on-line.
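A minimal sketch of the rollout idea follows, assuming a generic simulator and a heuristic base policy supplied by the caller; the belief is represented simply as a list of hidden states consistent with the observations so far. This is a conceptual illustration, not the paper's implementation.

```python
# One-step lookahead with rollout: score each candidate action by simulating
# the base heuristic to termination from states sampled from the belief.
import random

def rollout_policy(belief, candidates, simulate, base_heuristic,
                   num_samples: int = 50):
    """Pick the action with the lowest expected cost-to-go (e.g. the expected
    number of remaining guesses in a Wordle-like puzzle).

    belief        : list of hidden states consistent with observations so far
    candidates    : candidate actions to evaluate
    simulate      : simulate(hidden_state, action, base_heuristic) -> cost
    base_heuristic: heuristic policy followed after the first action
    """
    best_action, best_cost = None, float("inf")
    for action in candidates:
        total = 0.0
        for _ in range(num_samples):
            hidden_state = random.choice(belief)   # sample a consistent state
            total += simulate(hidden_state, action, base_heuristic)
        expected = total / num_samples
        if expected < best_cost:
            best_action, best_cost = action, expected
    return best_action
```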
Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition tasks, respectively.
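For context, a pipeline SLU model of the kind compared against end-to-end approaches chains a speech recognizer with a text model, so recognition errors propagate into the downstream prediction. Below is a minimal sketch using Hugging Face pipelines; the checkpoint names are illustrative defaults, not the models evaluated in the paper.

```python
# Sketch of a pipeline SLU model: ASR transcript fed to a text classifier.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")
sentiment = pipeline("text-classification",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

def pipeline_slu(audio_path: str):
    transcript = asr(audio_path)["text"]   # ASR errors propagate downstream
    return transcript, sentiment(transcript)[0]
```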
Cement is the most used construction material. The performance of cement hydrate depends on the constituent phases, viz. alite, belite, aluminate, and ferrites present in the cement clinker, both qualitatively and quantitatively. Traditionally, clinker phases are analyzed from optical images relying on a domain expert and simple image processing techniques. However, the non-uniformity of the images, variations in the geometry and size of the phases, and variabilities in the experimental approaches and imaging methods make it challenging to obtain the phases. Here, we present a machine learning (ML) approach to detect clinker microstructure phases automatically. To this end, we create the first annotated dataset of cement clinker by segmenting alite and belite particles. Further, we use supervised ML methods to train models for identifying alite and belite regions. Specifically, we finetune the image detection and segmentation model Detectron-2 on the cement microstructure to develop a model for detecting the cement phases, namely, Cementron. We demonstrate that Cementron, trained only on literature data, works remarkably well on new images obtained from our experiments, demonstrating its generalizability. We make Cementron available for public use.
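A hedged sketch of what fine-tuning Detectron2 for two clinker phases might look like is shown below; the dataset names, file paths, class count, and solver schedule are placeholders, not the configuration used for Cementron.

```python
# Sketch: fine-tune a COCO-pretrained Mask R-CNN on annotated clinker images.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the annotated clinker micrographs (COCO-format annotations).
register_coco_instances("clinker_train", {},
                        "annotations/train.json", "images/train")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")  # COCO weights
cfg.DATASETS.TRAIN = ("clinker_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2   # alite, belite
cfg.SOLVER.MAX_ITER = 3000            # illustrative schedule

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```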
Training deep reinforcement learning (DRL) locomotion policies often requires massive amounts of data to converge to the desired behavior. In this regard, simulators offer a cheap and abundant source. For successful sim-to-real transfer, exhaustively engineered approaches such as system identification, dynamics randomization, and domain adaptation are typically employed. As an alternative, we investigate a simple strategy of random force injection (RFI) to perturb the system dynamics during training. We show that the application of random forces allows us to emulate dynamics randomization, which enables us to obtain locomotion policies that are robust to variations in system dynamics. We further extend RFI by introducing an episodic actuation offset, referred to as extended random force injection (ERFI). We demonstrate that ERFI provides additional robustness to variations in system mass, offering on average a 61% improvement in performance over RFI. We also show that ERFI is sufficient to perform successful sim-to-real transfer on two different quadrupedal platforms, ANYmal C and Unitree A1, even for perceptive locomotion over uneven terrain in outdoor environments.
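A conceptual sketch of RFI and the episodic offset that distinguishes ERFI follows, assuming a simulator that exposes an external-force hook; the force magnitudes and the environment interface are assumptions for illustration.

```python
# Sketch: per-step random force injection (RFI) plus an episodic actuation
# offset (ERFI) applied during training rollouts in simulation.
import numpy as np

RFI_RANGE = 10.0      # per-step random force magnitude, illustrative
OFFSET_RANGE = 5.0    # episodic actuation-offset magnitude, illustrative

def run_episode(env, policy, action_dim, horizon=1000, use_erfi=True):
    """Collect one training episode. ERFI adds an actuation offset sampled
    once per episode on top of RFI's per-step random force.
    `env.apply_external_force` is an assumed simulator hook."""
    episodic_offset = (
        np.random.uniform(-OFFSET_RANGE, OFFSET_RANGE, size=action_dim)
        if use_erfi else np.zeros(action_dim))
    obs = env.reset()
    for _ in range(horizon):
        # RFI: perturb the dynamics with a random external force on the base.
        env.apply_external_force(
            np.random.uniform(-RFI_RANGE, RFI_RANGE, size=3))
        # ERFI: shift the commanded actuation by the episodic offset.
        obs, reward, done, info = env.step(policy(obs) + episodic_offset)
        if done:
            break
```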
We present SeRP, a framework for self-supervised learning of 3D point clouds. SeRP consists of an encoder-decoder architecture that takes a perturbed or corrupted point cloud as input and aims to reconstruct the original point cloud without the corruption. The encoder learns a high-level latent representation of the point cloud in a low-dimensional subspace, from which the original structure is recovered. In this work, we use both Transformer-based and PointNet-based autoencoders. The proposed framework also addresses some of the limitations of Transformer-based masked autoencoders, which are prone to leaking location information and to uneven information density. We trained our models on the complete ShapeNet dataset and evaluated them on downstream classification tasks. We show that the pretrained models achieve 0.5-1% higher classification accuracy than networks trained from scratch. Furthermore, we also propose VASP, a vector-quantized autoencoder for self-supervised representation learning of point clouds, which employs vector quantization for discrete representation learning with Transformer-based autoencoders.
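A minimal sketch of the perturb-and-reconstruct objective described above, with a simplified PointNet-style autoencoder standing in for the Transformer- and PointNet-based models; the jitter corruption and MSE loss are illustrative simplifications.

```python
# Sketch: corrupt a point cloud, reconstruct the clean version, backpropagate.
import torch
import torch.nn as nn

class PointAutoencoder(nn.Module):
    def __init__(self, num_points: int = 1024, latent_dim: int = 256):
        super().__init__()
        # Per-point MLP followed by max-pooling: a PointNet-style encoder.
        self.encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Linear(latent_dim, num_points * 3)
        self.num_points = num_points

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        latent = self.encoder(points).max(dim=1).values   # (B, latent_dim)
        return self.decoder(latent).view(-1, self.num_points, 3)

def perturb(points: torch.Tensor, sigma: float = 0.02) -> torch.Tensor:
    return points + sigma * torch.randn_like(points)      # corrupt the input

model = PointAutoencoder()
clean = torch.randn(8, 1024, 3)
recon = model(perturb(clean))
loss = nn.functional.mse_loss(recon, clean)  # Chamfer distance in practice
loss.backward()
```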
Procedure learning involves identifying the key-steps of a task and determining their logical order of execution. Existing approaches commonly use third-person videos to learn the procedure, in which the manipulated object appears small and is often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action. However, procedure learning from egocentric videos is challenging because (a) the camera view changes drastically due to the wearer's head motion, and (b) unrelated frames are present due to the unconstrained nature of the videos. As a result, the assumption made by current state-of-the-art methods, that actions occur at roughly the same time and have the same duration, does not hold. Instead, we propose to use the signal provided by the temporal correspondences between key-steps across videos. To this end, we present a novel self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC identifies and exploits the temporal correspondences between key-steps across multiple videos to learn the procedure. Our experiments show that CnC outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets by 5.2% and 6.3%, respectively. Furthermore, for procedure learning from egocentric videos, we propose the EgoProceL dataset, consisting of 62 hours of videos captured by 130 subjects performing 16 tasks. The source code and the dataset are available on the project page: https://sid2697.github.io/egoprocel/