In this paper, we address the task of natural language object retrieval: localizing a target object within a given image based on a natural language query describing the object. Natural language object retrieval differs from text-based image retrieval in that it involves spatial information about objects within the scene as well as global scene context. To address this, we propose a novel Spatial Context Recurrent ConvNet (SCRC) model as a scoring function on candidate boxes for object retrieval, integrating spatial configurations and global scene-level contextual information into the network. Our model processes the query text, local image descriptors, spatial configurations, and global context features through a recurrent network, outputs the probability of the query text conditioned on each candidate box as the box's score, and can transfer visual-linguistic knowledge from the image captioning domain to our task. Experimental results demonstrate that our method effectively utilizes both local and global information, significantly outperforming previous baseline methods on different datasets and scenarios, and can exploit large-scale vision and language datasets for knowledge transfer.
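The scoring scheme described above — encoding each candidate box's spatial configuration alongside local and global features, then ranking boxes by a query-conditioned score — can be sketched as follows. This is a hypothetical toy illustration, not the paper's implementation: the 8-dimensional spatial encoding and the dot-product scorer are assumptions standing in for the actual CNN descriptors and recurrent language model.

```python
# Hypothetical sketch of SCRC-style candidate-box scoring.
# The feature extractors and the scorer are toy stand-ins; the real
# model scores boxes with an LSTM over the query text conditioned on
# CNN image features.

def spatial_config(box, img_w, img_h):
    """Encode a box (x1, y1, x2, y2) as an 8-d spatial configuration
    normalized relative to the image size (an assumed encoding)."""
    x1, y1, x2, y2 = box
    xc, yc = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    norm = lambda v, s: 2.0 * v / s - 1.0  # map [0, s] to [-1, 1]
    return [norm(x1, img_w), norm(y1, img_h),
            norm(x2, img_w), norm(y2, img_h),
            norm(xc, img_w), norm(yc, img_h),
            2.0 * w / img_w, 2.0 * h / img_h]

def score_box(query_vec, local_vec, spatial_vec, global_vec):
    """Toy stand-in for the recurrent scorer: a dot product between a
    query embedding and the concatenated per-box features."""
    feats = local_vec + spatial_vec + global_vec
    return sum(q * f for q, f in zip(query_vec, feats))

def retrieve(query_vec, boxes, local_feats, global_vec, img_w, img_h):
    """Return the index of the highest-scoring candidate box."""
    scores = [score_box(query_vec, lf,
                        spatial_config(b, img_w, img_h), global_vec)
              for b, lf in zip(boxes, local_feats)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The key design point the sketch preserves is that each box is scored independently from the query plus three feature groups (local, spatial, global), so retrieval reduces to an argmax over candidate boxes.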