Driving scene understanding is a key component of intelligent transportation systems. To operate in complex physical and social environments, such systems need to understand and learn how humans drive and interact with traffic scenes. We introduce the Honda Research Institute Driving Dataset (HDD), a challenging dataset for studying driver behavior learning in real-world settings. The dataset includes 104 hours of human driving in the San Francisco Bay Area, collected with an instrumented vehicle equipped with different sensors. We provide a detailed analysis of HDD and compare it with other driving datasets. A novel annotation methodology is introduced to enable research on driver behavior understanding from untrimmed data sequences. As a first step, baseline algorithms for driver behavior detection are trained and tested to demonstrate the feasibility of the proposed task.
Dense video captioning is a fine-grained video understanding task that involves two sub-problems: localizing distinct events in a long video stream, and generating captions for the localized events. We propose the Joint Event Detection and Description Network (JEDDi-Net), which solves the dense video captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and generates their captions. Proposal features are extracted within each proposal segment through 3D Segment-of-Interest pooling from the shared video feature encoding. In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context. On the large-scale ActivityNet Captions dataset, JEDDi-Net demonstrates improved results as measured by standard metrics. We also present the first dense captioning results on the TACoS-MultiLevel dataset.
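To illustrate the 3D Segment-of-Interest pooling idea mentioned in the abstract, here is a minimal sketch assuming PyTorch-style tensors; the function name soi_pool, the fixed output length, and the use of adaptive max pooling are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def soi_pool(feature_map: torch.Tensor, proposals: torch.Tensor,
             out_len: int = 4) -> torch.Tensor:
    """Pool variable-length temporal segments to a fixed size.

    feature_map: (C, T, H, W) shared 3D-CNN encoding of the full video.
    proposals:   (N, 2) integer (start, end) indices on the T axis.
    Returns:     (N, C, out_len, H, W) fixed-size proposal features.
    """
    pooled = []
    for start, end in proposals.tolist():
        segment = feature_map[:, start:end + 1]          # (C, t, H, W), t varies
        h, w = segment.shape[-2:]
        # Pool the temporal axis to out_len; keep spatial resolution unchanged.
        pooled.append(F.adaptive_max_pool3d(segment, (out_len, h, w)))
    return torch.stack(pooled)                           # (N, C, out_len, H, W)

# Toy usage: two proposals of different lengths over a 32-step feature map.
feats = torch.randn(512, 32, 7, 7)
props = torch.tensor([[2, 9], [10, 31]])
print(soi_pool(feats, props).shape)                      # torch.Size([2, 512, 4, 7, 7])
```

The point of the sketch is only that each variable-length proposal is mapped to a fixed-size feature tensor from the shared encoding, so a downstream captioning module can consume proposals of any duration.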