translated by 谷歌翻译
We show how the inherent, but often neglected, properties of large-scale LiDAR point clouds can be exploited for effective self-supervised representation learning. To this end, we design a highly data-efficient feature pre-training backbone that significantly reduces the amount of tedious 3D annotation needed to train state-of-the-art object detectors. In particular, we propose a Masked AutoEncoder (MAELi) that intuitively exploits the sparsity of LiDAR point clouds in both the encoder and the decoder during reconstruction. This results in more expressive and useful features that are directly applicable to downstream perception tasks, such as 3D object detection for autonomous driving. In a novel reconstruction scheme, MAELi distinguishes between free and occluded space and leverages a new masking strategy that targets the LiDAR's inherent spherical projection. To demonstrate the potential of MAELi, we pre-train one of the most widespread 3D backbones in an end-to-end fashion and show the merit of our fully unsupervised pre-trained features on several 3D object detection architectures. Given only a tiny fraction of labeled frames to fine-tune such detectors, we achieve significant performance improvements. For example, with only $\sim 800$ labeled frames, MAELi features improve a SECOND model by +10.09 APH/LEVEL 2 on Waymo Vehicles.
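To make the idea of masking in the sensor's spherical projection concrete, the following is a minimal sketch, not the MAELi implementation. It assumes points are given as (x, y, z) in the sensor frame, bins them by azimuth and elevation, and masks whole angular cells at random. The function names, the 2-degree resolution, and the masking ratio are hypothetical choices that the abstract does not specify.

```python
import math
import random


def angular_cell(point, az_res_deg=2.0, el_res_deg=2.0):
    """Map an (x, y, z) point to an (azimuth, elevation) bin index.

    The resolutions are illustrative assumptions, not values from the paper.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("point at the sensor origin has no direction")
    azimuth = math.degrees(math.atan2(y, x))
    # Clamp to guard against tiny floating-point overshoot before asin.
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z / r))))
    return (math.floor(azimuth / az_res_deg), math.floor(elevation / el_res_deg))


def mask_by_angular_cells(points, mask_ratio=0.5, seed=0):
    """Split points into (kept, masked) by hiding a fraction of occupied cells."""
    rng = random.Random(seed)
    cells = {}
    for p in points:
        cells.setdefault(angular_cell(p), []).append(p)
    keys = sorted(cells)
    n_masked = round(mask_ratio * len(keys))
    masked_keys = set(rng.sample(keys, n_masked))
    kept = [p for k in keys if k not in masked_keys for p in cells[k]]
    masked = [p for k in keys if k in masked_keys for p in cells[k]]
    return kept, masked
```

Points that share a viewing direction fall into the same cell and are therefore kept or masked together, which is the property a projection-aware mask relies on.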
Existing Multiple Object Tracking (MOT) methods design complex architectures for better tracking performance. However, without a proper organization of input information, they still fail to perform tracking robustly and suffer from frequent identity switches. In this paper, we propose two novel methods together with a simple online Message Passing Network (MPN) to address these limitations. First, we explore different integration methods for the graph node and edge embeddings and put forward a new IoU (Intersection over Union) guided function, which improves long-term tracking and handles identity switches. Second, we introduce a hierarchical sampling strategy to construct sparser graphs, which allows training to focus on more difficult samples. Experimental results demonstrate that a simple online MPN with these two contributions can perform better than many state-of-the-art methods. In addition, our association method generalizes well and can also improve the results of private-detection-based methods.
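For reference, the IoU score behind the guided function is the standard box-overlap measure. The sketch below assumes axis-aligned boxes in (x1, y1, x2, y2) format, which the abstract does not specify, and it does not show how the score is combined with the node and edge embeddings.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero if disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.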
Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adapting on a single video sample per step. It consists of a feature distribution alignment technique that aligns online estimates of test set statistics towards the training statistics. We further enforce prediction consistency over temporally augmented views of the same test video sample. Evaluations on three benchmark action recognition datasets show that our proposed technique is architecture-agnostic and able to significantly boost the performance of both the state-of-the-art convolutional architecture TANet and the Video Swin Transformer. Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches, both when evaluated on a single distribution shift and in the challenging case of random distribution shifts. Code will be available at \url{https://github.com/wlin-at/ViTTA}.
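The alignment of online test statistics towards training statistics can be illustrated with a small sketch. It is not taken from the ViTTA code. It assumes per-channel mean and variance as the statistics, an exponential moving average as the online estimator, and an L1 distance as the alignment loss. The class name, the function name, and the momentum value are hypothetical.

```python
class OnlineStats:
    """Exponential moving average of per-channel feature mean and variance."""

    def __init__(self, dim, momentum=0.1):
        self.mean = [0.0] * dim
        self.var = [0.0] * dim
        self.momentum = momentum  # assumed value, not from the paper
        self.initialized = False

    def update(self, batch):
        """Fold the statistics of one batch (list of feature vectors) in."""
        n = len(batch)
        dim = len(self.mean)
        b_mean = [sum(x[j] for x in batch) / n for j in range(dim)]
        b_var = [sum((x[j] - b_mean[j]) ** 2 for x in batch) / n
                 for j in range(dim)]
        if not self.initialized:
            # The first batch seeds the running estimate directly.
            self.mean, self.var = b_mean, b_var
            self.initialized = True
            return
        m = self.momentum
        self.mean = [(1 - m) * old + m * new
                     for old, new in zip(self.mean, b_mean)]
        self.var = [(1 - m) * old + m * new
                    for old, new in zip(self.var, b_var)]


def alignment_loss(stats, train_mean, train_var):
    """L1 distance between running test statistics and training statistics."""
    mean_gap = sum(abs(a - b) for a, b in zip(stats.mean, train_mean))
    var_gap = sum(abs(a - b) for a, b in zip(stats.var, train_var))
    return mean_gap + var_gap
```

In an actual adaptation loop this loss would be minimized with respect to the model parameters through an automatic differentiation framework; the sketch only shows how the quantity being minimized could be formed.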
Keyless entry systems in cars are adopting neural networks for localizing their operators. Using test-time adversarial defences equips such systems with the ability to defend against adversarial attacks without prior training on adversarial samples. We propose a test-time adversarial example detector that detects adversarial inputs by quantifying the localized intermediate responses of a pre-trained neural network and the confidence scores of an auxiliary softmax layer. Furthermore, in order to make the network robust, we attenuate the non-relevant features by non-iterative input sample clipping. Using our approach, mean performance over 15 levels of adversarial perturbations is increased by 55.33% for the fast gradient sign method (FGSM) and 6.3% for both the basic iterative method (BIM) and the projected gradient method (PGD).
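As background on the attacks named above, FGSM perturbs an input by a step of size eps in the direction of the sign of the loss gradient with respect to that input. The sketch below applies it to a tiny logistic-regression model so that the gradient can be written in closed form. The model and all parameter values are illustrative assumptions and are unrelated to the localization network in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, eps):
    """One FGSM step against a logistic-regression classifier.

    For the binary cross-entropy loss with p = sigmoid(w . x + b),
    the gradient of the loss with respect to x is (p - y) * w.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    grad_x = [(p - y) * wi for wi in w]
    # Move each input coordinate by eps in the direction that raises the loss.
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]
```

BIM and PGD repeat steps of this kind several times while keeping the perturbed input within a bounded distance of the original.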
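Although action recognition has achieved impressive results in recent years, collecting and annotating video training data remains time-consuming and cost-intensive. Image-to-video adaptation has therefore been proposed to exploit label-free web image sources for adapting to unlabeled target videos. This poses two major challenges: (1) the spatial domain shift between web images and video frames; (2) the modality gap between image and video data. To address these challenges, we propose Cycle Domain Adaptation (CYCDA), a cycle-based approach for unsupervised image-to-video domain adaptation that, on the one hand, leverages the joint spatial information in images and videos and, on the other hand, trains an independent spatio-temporal model to bridge the modality gap. We alternate between spatial and spatio-temporal learning, with knowledge transfer between the two in each cycle. We evaluate our approach on benchmark datasets for image-to-video as well as mixed-source domain adaptation, achieving state-of-the-art results and demonstrating the benefits of our cyclic adaptation.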
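We propose protocols for obliviously evaluating finite-state machines, i.e., the evaluation is shared between the provider of the finite-state machine and the provider of the input string in such a way that neither party learns the other's input, and the visited states remain hidden. For alphabet size $|\Sigma|$, number of states $|Q|$, and input length $n$, previous solutions required either a number of rounds that grows with $n$ or communication $\Omega(n|\Sigma||Q|\log|Q|)$. Our solutions require 2 rounds with communication $O(n(|\Sigma| + |Q|\log|Q|))$. We present two different solutions to this problem: a two-party one and one for a setting with an untrusted but non-colluding helper.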
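To make the quantities above concrete, the sketch below shows the plaintext function that such a protocol computes, together with the two communication expressions with all constants dropped. It implements none of the cryptography, and the function names are hypothetical. Since one bound is a lower bound and the other an upper bound, the numbers only illustrate how the two expressions grow.

```python
import math


def run_fsm(transitions, start_state, input_string):
    """Evaluate a deterministic finite-state machine in the clear.

    transitions maps (state, symbol) to the next state. An oblivious
    protocol computes the same final state without revealing the
    machine, the input, or the visited states.
    """
    state = start_state
    for symbol in input_string:
        state = transitions[(state, symbol)]
    return state


def prior_communication(n, sigma, q):
    """n * |Sigma| * |Q| * log2|Q|, constants ignored."""
    return n * sigma * q * math.log2(q)


def proposed_communication(n, sigma, q):
    """n * (|Sigma| + |Q| * log2|Q|), constants ignored."""
    return n * (sigma + q * math.log2(q))
```

For example, with $n = 10$, $|\Sigma| = 2$, and $|Q| = 4$, the first expression gives 160 and the second gives 100; the gap widens as $|\Sigma|$ grows because the alphabet size no longer multiplies the $|Q|\log|Q|$ term.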