Robust estimation of correspondences between image pixels is an important problem in robotics, with applications in tracking, mapping, and recognition of objects, environments, and other agents. Correspondence estimation has long been the domain of hand-engineered features, but more recently deep learning techniques have provided powerful tools for learning features from raw data. The drawback of the latter approach is that a vast amount of (labelled, typically) training data is required for learning. This paper advocates a new approach to learning visual descriptors for dense correspondence estimation in which we harness the power of a strong 3D generative model to automatically label correspondences in RGB-D video data. A fully-convolutional network is trained using a contrastive loss to produce viewpoint-and lighting-invariant descriptors. As a proof of concept, we collected two datasets: the first depicts the upper torso and head of the same person in widely varied settings, and the second depicts an office as seen on multiple days with objects rearranged within. Our datasets focus on re-visitation of the same objects and environments, and we show that by training the CNN only from local tracking data, our learned visual descriptor generalizes towards identifying non-labelled correspondences across videos. We furthermore show that our approach to descriptor learning can be used to achieve state-of-the-art single-frame localization results on the MSR 7-scenes dataset without using any labels identifying correspondences between separate videos of the same scenes at training time.
translated by 谷歌翻译