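We study the vision-and-language navigation (VLN) problem in the context of autonomous driving in outdoor environments. We address it by explicitly grounding the navigable regions that correspond to the textual command. At each timestamp, the model predicts a segmentation mask corresponding to the intermediate or final navigable region. Our work contrasts with existing work in VLN, which poses the task as a node selection problem and is given a discrete connected graph corresponding to the environment. We do not assume that such a discretized map is available. Our work moves toward a continuous action space, provides interpretability through visual feedback, and allows VLN on commands that require finer maneuvers, such as "park between the two cars". In addition, we propose a new meta-dataset, CARLA-NAV, to allow efficient training and validation. The dataset comprises pre-recorded training sequences and a live environment for validation and testing. We provide extensive qualitative and quantitative empirical results to validate the efficacy of the proposed approach.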
Advances in reinforcement learning have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces by employing a geometric lens. Central to our work is the idea that the transition dynamics induce a low-dimensional manifold of reachable states embedded in the high-dimensional nominal state space. We prove that, under certain conditions, the dimensionality of this manifold is at most the dimensionality of the action space plus one. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments. We further demonstrate the applicability of our result by learning a policy in this low-dimensional representation. To do so, we introduce an algorithm that learns a mapping to a low-dimensional representation, as a narrow hidden layer of a deep neural network, in tandem with the policy using DDPG. Our experiments show that a policy learnt this way performs on par or better for four MuJoCo control suite tasks.
Deep neural networks can approximate functions on different types of data, from images to graphs, with varied underlying structure. This underlying structure can be viewed as the geometry of the data manifold. By extending recent advances in the theoretical understanding of neural networks, we study how a randomly initialized neural network with piecewise linear activation splits the data manifold into regions where the neural network behaves as a linear function. We derive bounds on the density of the boundaries of linear regions and on the distance to these boundaries on the data manifold. This leads to insights into the expressivity of randomly initialized deep neural networks on non-Euclidean data sets. We empirically corroborate our theoretical results using a toy supervised learning problem. Our experiments demonstrate that the number of linear regions varies across manifolds and that the results hold with changing neural network architectures. We further demonstrate, using the MetFaces dataset, how the complexity of linear regions on the low-dimensional manifold of images differs from that in Euclidean space.
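As a rough illustration of what "linear regions" means here, the sketch below counts how often the ReLU activation pattern changes along a straight segment through input space. It is not the method of the abstract above: the function name `count_region_crossings`, the straight-line path, and the grid resolution `steps` are all assumptions made for this example, and a segment stands in for a curve on a data manifold.

```python
import numpy as np

def count_region_crossings(weights, biases, start, end, steps=1000):
    """Count activation-pattern changes of a ReLU MLP along a segment.

    weights[i] has shape (out_i, in_i); biases[i] has shape (out_i,).
    Each change of the on/off pattern marks a boundary between two
    linear regions (up to the resolution of the sampling grid).
    """
    ts = np.linspace(0.0, 1.0, steps)
    h = start + ts[:, None] * (end - start)   # points on the segment
    patterns = []
    for W, b in zip(weights, biases):
        pre = h @ W.T + b                     # pre-activations of this layer
        patterns.append(pre > 0)              # which units are active
        h = np.maximum(pre, 0.0)              # ReLU
    pattern = np.concatenate(patterns, axis=1)
    changed = np.any(pattern[1:] != pattern[:-1], axis=1)
    return int(changed.sum())

# Two hidden units with kinks at x = -0.5 and x = 0.5.
W = [np.array([[1.0], [1.0]])]
b = [np.array([0.5, -0.5])]
crossings = count_region_crossings(W, b, np.array([-1.0]), np.array([1.0]))
```

The number of linear regions met along the segment is the crossing count plus one, provided the grid is fine enough that no two boundaries fall between neighbouring sample points.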
Our aim is to build autonomous agents that can solve tasks in environments like Minecraft. To do so, we use an imitation learning-based approach. We formulate our control problem as a search problem over a dataset of experts' demonstrations, where the agent copies actions from a similar demonstration trajectory of image-action pairs. We perform a proximity search over the BASALT MineRL-dataset in the latent representation of a Video PreTraining model. The agent copies the actions from the expert trajectory as long as the distance between the state representations of the agent and the selected expert trajectory does not diverge; when it does, the proximity search is repeated. Our approach can effectively recover meaningful demonstration trajectories and show human-like behavior of an agent in the Minecraft environment.
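The search-and-copy loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: latents are plain NumPy vectors rather than outputs of a Video PreTraining model, the distance is Euclidean, "diverge" is modelled as exceeding a fixed `threshold`, and all demonstrations are stored as one flat sequence. The names `nearest_demo` and `copy_actions` are hypothetical.

```python
import numpy as np

def nearest_demo(agent_latent, demo_latents):
    """Index and distance of the expert frame closest to the agent's latent."""
    dists = np.linalg.norm(demo_latents - agent_latent, axis=1)
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

def copy_actions(agent_latents, demo_latents, demo_actions, threshold):
    """Copy expert actions; repeat the search when the latents drift apart."""
    actions, searches, ptr = [], 0, None
    for z in agent_latents:
        diverged = (
            ptr is None
            or ptr >= len(demo_latents)
            or np.linalg.norm(demo_latents[ptr] - z) > threshold
        )
        if diverged:
            ptr, _ = nearest_demo(z, demo_latents)  # new proximity search
            searches += 1
        actions.append(demo_actions[ptr])           # copy the expert action
        ptr += 1                                    # advance along the trajectory
    return actions, searches

demo_latents = np.array([[0, 0], [1, 0], [2, 0], [10, 10], [11, 10]], dtype=float)
demo_actions = ["a", "b", "c", "d", "e"]
agent_latents = np.array([[0, 0], [1.1, 0], [10, 10.2]])
actions, searches = copy_actions(agent_latents, demo_latents, demo_actions, 0.5)
```

In this toy run the agent follows the first trajectory for two steps, then drifts away from it and triggers a second search that lands on the frame near (10, 10).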
Maximum Inner Product Search (MIPS) is a popular problem in the machine learning literature due to its applicability in a wide array of applications, such as recommender systems. In high-dimensional settings, however, MIPS queries can become computationally expensive as most existing solutions do not scale well with data dimensionality. In this work, we present a state-of-the-art algorithm for the MIPS problem in high dimensions, dubbed BanditMIPS. BanditMIPS is a randomized algorithm that borrows techniques from multi-armed bandits to reduce the MIPS problem to a best-arm identification problem. BanditMIPS reduces the complexity of state-of-the-art algorithms from $O(\sqrt{d})$ to $O(\log d)$, where $d$ is the dimension of the problem data vectors. On high-dimensional real-world datasets, BanditMIPS runs approximately 12 times faster than existing approaches and returns the same solution. BanditMIPS requires no preprocessing of the data and includes a hyperparameter that practitioners may use to trade off accuracy and runtime. We also propose a variant of our algorithm, named BanditMIPS-$\alpha$, which employs non-uniform sampling across the data dimensions to provide further speedups.
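To make the reduction to best-arm identification concrete, here is a small successive-elimination sketch in which each candidate vector is an arm and each sampled coordinate gives one noisy observation of its normalized inner product with the query. It is an illustration only and should not be read as BanditMIPS itself: the function name, the confidence radius, the sub-Gaussian scale `sigma`, the error level `delta`, and the batch size are all assumptions.

```python
import numpy as np

def bandit_mips_sketch(atoms, query, batch=32, delta=0.01, sigma=1.0, seed=0):
    """Approximate argmax_i <atoms[i], query> by sampling coordinates.

    Arms whose upper confidence bound falls below the best lower bound
    are dropped. If every coordinate ends up sampled, the surviving
    sums are exact inner products.
    """
    rng = np.random.default_rng(seed)
    n, d = atoms.shape
    perm = rng.permutation(d)      # coordinates, sampled without replacement
    alive = np.arange(n)           # arms still in contention
    sums = np.zeros(n)             # running partial inner products
    used = 0
    while used < d and len(alive) > 1:
        coords = perm[used:used + batch]
        sums[alive] += atoms[np.ix_(alive, coords)] @ query[coords]
        used += len(coords)
        means = sums[alive] / used
        # Assumed confidence radius for sigma-sub-Gaussian coordinate products.
        radius = sigma * np.sqrt(2.0 * np.log(1.0 / delta) / used)
        alive = alive[means + radius >= means.max() - radius]
    return int(alive[np.argmax(sums[alive])])

atoms = np.zeros((5, 256))
atoms[3] = 1.0                     # one clearly dominant candidate
query = np.ones(256)
best = bandit_mips_sketch(atoms, query)
```

In this toy case the dominant candidate is isolated after 64 of the 256 coordinates. Here `delta` plays the role of an accuracy-versus-runtime knob: a larger value shrinks the radius and eliminates arms sooner, at a higher risk of discarding the true maximizer.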
Random forests are some of the most widely used machine learning models today, especially in domains that necessitate interpretability. We present an algorithm that accelerates the training of random forests and other popular tree-based learning methods. At the core of our algorithm is a novel node-splitting subroutine, dubbed MABSplit, used to efficiently find split points when constructing decision trees. Our algorithm borrows techniques from the multi-armed bandit literature to judiciously determine how to allocate samples and computational power across candidate split points. We provide theoretical guarantees that MABSplit improves the sample complexity of each node split from linear to logarithmic in the number of data points. In some settings, MABSplit leads to 100x faster training (a 99% reduction in training time) without any decrease in generalization performance. We demonstrate similar speedups when MABSplit is used across a variety of forest-based variants, such as Extremely Random Forests and Random Patches. We also show our algorithm can be used in both classification and regression tasks. Finally, we show that MABSplit outperforms existing methods in generalization performance and feature importance calculations under a fixed computational budget. All of our experimental results are reproducible via a one-line script at https://github.com/ThrunGroup/FastForest.
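The idea of spending samples adaptively across candidate split points can be sketched for a single feature as below. This is a simplified illustration and not the MABSplit subroutine: the function names are hypothetical, impurity is weighted Gini, and the radius `c / sqrt(used)` is a heuristic stand-in for a properly derived confidence interval.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a vector of class counts (0 for an empty side)."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts / total
    return 1.0 - float(np.sum(p * p))

def weighted_gini(side_counts):
    """Size-weighted Gini impurity over the left and right sides of a split."""
    total = side_counts.sum()
    if total == 0:
        return 0.0
    return sum(s.sum() * gini(s) for s in side_counts) / total

def mab_split_sketch(x, y, thresholds, batch=20, c=1.0, seed=0):
    """Pick the threshold with the lowest estimated impurity, dropping
    clearly worse candidates as more data points are sampled."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    counts = np.zeros((len(thresholds), 2, int(y.max()) + 1))
    alive = list(range(len(thresholds)))
    used = 0
    while used < len(x) and len(alive) > 1:
        idx = order[used:used + batch]
        used += len(idx)
        for t in alive:
            side = (x[idx] >= thresholds[t]).astype(int)
            np.add.at(counts[t], (side, y[idx]), 1)   # update class histograms
        est = {t: weighted_gini(counts[t]) for t in alive}
        radius = c / np.sqrt(used)                    # heuristic, assumed
        best = min(est.values())
        alive = [t for t in alive if est[t] - radius <= best + radius]
    return min(alive, key=lambda t: weighted_gini(counts[t]))

x = np.arange(100) / 100
y = (x >= 0.5).astype(int)
chosen = mab_split_sketch(x, y, [0.1, 0.5, 0.9])
```

Candidates that are eliminated early stop consuming updates, which is where the savings over evaluating every threshold on every data point would come from.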
Concept bottleneck models (CBMs) (Koh et al. 2020) are interpretable neural networks that first predict labels for human-interpretable concepts relevant to the prediction task, and then predict the final label based on the concept label predictions. We extend CBMs to interactive prediction settings where the model can query a human collaborator for the label to some concepts. We develop an interaction policy that, at prediction time, chooses which concepts to request a label for so as to maximally improve the final prediction. We demonstrate that a simple policy combining concept prediction uncertainty and influence of the concept on the final prediction achieves strong performance and outperforms a static approach proposed in Koh et al. (2020) as well as active feature acquisition methods proposed in the literature. We show that the interactive CBM can achieve accuracy gains of 5-10% with only 5 interactions over competitive baselines on the Caltech-UCSD Birds, CheXpert and OAI datasets.
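A policy of this kind can be sketched as a scoring rule over concepts. The abstract does not say how uncertainty and influence are combined, so the product used below is an assumption, as are the use of binary entropy for uncertainty, a precomputed `influence` vector, and the function name `select_concepts`.

```python
import numpy as np

def select_concepts(concept_probs, influence, k):
    """Return the k concept indices with the highest uncertainty x influence.

    concept_probs: predicted probability that each binary concept is present.
    influence:     assumed per-concept effect on the final prediction.
    """
    p = np.clip(np.asarray(concept_probs, dtype=float), 1e-12, 1 - 1e-12)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))   # binary entropy
    score = entropy * np.abs(np.asarray(influence, dtype=float))
    return np.argsort(-score)[:k].tolist()

# Concept 1 is influential but already confidently predicted, so it ranks last.
queries = select_concepts([0.5, 0.99, 0.6], [1.0, 5.0, 2.0], k=2)
```

The example shows why both terms matter: ranking by influence alone would query concept 1 first, even though a human label would rarely change its predicted value.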
Changes in real-world dynamic processes are often described in terms of differences in energies $\textbf{E}(\underline{\alpha})$ of a set of spectral-bands $\underline{\alpha}$. Given continuous spectra of two classes $A$ and $B$, or in general, two stochastic processes $S^{(A)}(f)$ and $S^{(B)}(f)$, $f \in \mathbb{R}^+$, we address the ubiquitous problem of identifying a subset of intervals of $f$ called spectral-bands $\underline{\alpha} \subset \mathbb{R}^+$ such that the energies $\textbf{E}(\underline{\alpha})$ of these bands can optimally discriminate between the two classes. We introduce EGO-MDA, an unsupervised method to identify optimal spectral-bands $\underline{\alpha}^*$ for given samples of spectra from two classes. EGO-MDA employs a statistical approach that iteratively minimizes an adjusted multinomial log-likelihood (deviance) criterion $\mathcal{D}(\underline{\alpha},\mathcal{M})$. Here, Mixture Discriminant Analysis (MDA) aims to derive the MLE of the parameters of two GMM distributions, i.e., $\mathcal{M}^* = \underset{\mathcal{M}}{\rm argmin}~\mathcal{D}(\underline{\alpha}, \mathcal{M})$, and to identify a classifier that optimally discriminates between the two classes for a given spectral representation. Efficient Global Optimization (EGO) finds the spectral-bands $\underline{\alpha}^* = \underset{\underline{\alpha}}{\rm argmin}~\mathcal{D}(\underline{\alpha},\mathcal{M})$ for given GMM parameters $\mathcal{M}$. For pathological cases of low separation between mixtures and model misspecification, we discuss the effect of the sample size and the number of iterations on the estimates of the parameters $\mathcal{M}$ and therefore on the classifier performance. A case study on a synthetic data set is provided. In an engineering application of optimal spectral-banding for anomaly tracking, EGO-MDA achieved at least 70% improvement in the median deviance relative to other methods tested.
Human activity recognition (HAR) using drone-mounted cameras has attracted considerable interest from the computer vision research community in recent years. A robust and efficient HAR system has a pivotal role in fields like video surveillance, crowd behavior analysis, sports analysis, and human-computer interaction. The task is challenging because of complex poses, varying viewpoints, and the environmental scenarios in which the action takes place. To address such complexities, in this paper, we propose a novel Sparse Weighted Temporal Attention (SWTA) module that uses sparsely sampled video frames to obtain global weighted temporal attention. The proposed SWTA comprises two parts: first, a temporal segment network that sparsely samples a given set of frames; second, weighted temporal attention, which fuses attention maps derived from optical flow with raw RGB images. This is followed by a basenet network, which comprises a convolutional neural network (CNN) module along with fully connected layers that perform the activity recognition. The SWTA network can be used as a plug-in module for existing deep CNN architectures, enabling them to learn temporal information without the need for a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama, MOD20, and Drone-Action. The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets, surpassing the previous state-of-the-art performances by margins of 25.26%, 18.56%, and 2.94%, respectively.
Drone-camera based human activity recognition (HAR) has received significant attention from the computer vision research community in the past few years. A robust and efficient HAR system has a pivotal role in fields like video surveillance, crowd behavior analysis, sports analysis, and human-computer interaction. The task is challenging because of complex poses, varying viewpoints, and the environmental scenarios in which the action takes place. To address such complexities, in this paper, we propose a novel Sparse Weighted Temporal Fusion (SWTF) module that uses sparsely sampled video frames to obtain a global weighted temporal fusion outcome. The proposed SWTF is divided into two components: first, a temporal segment network that sparsely samples a given set of frames; second, weighted temporal fusion, which fuses feature maps derived from optical flow with raw RGB images. This is followed by a base network, which comprises a convolutional neural network module along with fully connected layers that perform the activity recognition. The SWTF network can be used as a plug-in module for existing deep CNN architectures, enabling them to learn temporal information without the need for a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama, MOD20, and Drone-Action. The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets, surpassing the previous state-of-the-art performances by a significant margin.