Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

Yabing Wang (Zhejiang Gongshang University, ORCID 0000-0001-7231-1260), Jianfeng Dong (Zhejiang Gongshang University, ORCID 0000-0001-5244-3274), Tianxiang Liang (Zhejiang Gongshang University), Minsong Zhang (Zhejiang Gongshang University), Rui Cai (Zhejiang Gongshang University) and Xun Wang (Zhejiang Gongshang University)

2022

Abstract.

Despite the recent developments in the field of cross-modal retrieval, there has been less research focusing on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, corrupting the textual embeddings and thereby compromising the retrieval performance. To alleviate this, we introduce a multi-view self-distillation method to learn noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets that provide direct supervision from a similarity-based view and a feature-based view. Besides, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancies between original sentences and back-translated sentences to further improve the noise robustness of the textual encoder. Extensive experiments are conducted on three video-text and image-text cross-modal retrieval benchmarks across different languages, and the results demonstrate that our method significantly improves the overall performance without using extra human-labeled data. In addition, equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at https://github.com/HuiGuanLab/nrccr.

Published in: Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), October 10–14, 2022, Lisboa, Portugal. Copyright: ACM. DOI: 10.1145/3503161.3548003. ISBN: 978-1-4503-9203-7/22/10. CCS concepts: Information systems → Multimedia and multimodal retrieval.

1. Introduction

With the rapid emergence of images and videos on the Internet, such as on Facebook and TikTok, retrieving multimedia content accurately has become a great challenge (Wei et al., 2019; Liu et al., 2022; Xiao et al., 2021; Liu et al., 2021b). Recently, the task of cross-modal retrieval (Li et al., 2021; Liu et al., 2021a; Miech et al., 2021; Cao et al., 2022; Gabeur et al., 2022; Ali et al., 2022; Lu et al., 2019; Wang et al., 2022a; Cui et al., 2019; Yang et al., 2020) has received growing research attention. The majority of research work on cross-modal retrieval is dedicated to the English language, due to the availability of a large quantity of human-labeled data. In this regard, cross-lingual cross-modal retrieval (CCR), especially transfer from a source language with rich resources (e.g., English) to a target language where human-labeled data is scarce or even unavailable, is of great importance. Figure 1 shows a pipeline of CCR, where human-labeled target-language data is not provided during training.

Figure 1. A pipeline of cross-lingual cross-modal retrieval (CCR). A CCR model is trained on a collection of videos/images associated with captions in the source language (English). During evaluation, a query in the target language (Chinese) is given to retrieve relevant videos.

Although many recent works (Yu et al., 2021; Elliott and Kádár, 2017; Wehrmann et al., 2019; Kim et al., 2020; Burns et al., 2020; Song et al., 2021a; Lei et al., 2021a; Li et al., 2016) have achieved remarkable progress in multilingual scenarios, their approaches cannot be applied to CCR directly, since they require large-scale labeled datasets covering all languages of interest. Unfortunately, existing human-labeled resources for low-resource languages (e.g., Czech) are rather limited, and it is extremely expensive and time-consuming to manually annotate videos (or images) with descriptions in multiple languages. However, research on MT has been growing steadily (Park et al., 2021; Philip et al., 2021; Cui et al., 2021) and has shown the potential to overcome the above problem. In this paper, we focus on cross-lingual cross-modal retrieval utilizing MT.

To the best of our knowledge, only a few recent works (Portaz et al., 2019; Aggarwal and Kale, 2020; Ni et al., 2021; Zhou et al., 2021; Huang et al., 2021a) have paid attention to cross-modal retrieval in the cross-lingual setting. Among them, Portaz et al. (Portaz et al., 2019) and Aggarwal et al. (Aggarwal and Kale, 2020) heavily rely on pre-trained multilingual word embeddings and pre-trained sentence encoders, respectively. Instead, MMP (Huang et al., 2021a), M3P (Ni et al., 2021) and UC $^{2}$ (Zhou et al., 2021) perform multilingual pre-training on their own, among which UC $^{2}$ constructs a multilingual vision and language (V+L) dataset with MT for pre-training and achieves the best performance. Although they have proved that multilingual pre-training is beneficial to CCR, they only use source-language data for fine-tuning in the cross-lingual setting, which hurts the cross-lingual ability of models. Besides, we empirically observe that outputs of MT are far from being perfect, and things get worse when input sentences are complicated, which has been ignored in previous works. As exemplified in Figure 1, the red Chinese words are the incorrect translation of underlined English words, where translation noise is introduced when translating the English caption into Chinese using Google Translate. Due to the existence of such noise, these machine-translated sentences are usually weakly-correlated with the corresponding visual content. Straightforward application of cross-modal matching to weakly-correlated data will result in corrupted representations and yield deteriorated performance, as neural models have a strong capacity to fit to the given (noisy) data.

To overcome the obstacle of exploiting machine-translated sentences, we propose noise-robust learning in this paper. Instead of revising the translated sentences to improve their quality, we introduce noise-robust representation learning based on multi-view self-distillation to generate soft pseudo-targets. Specifically, we employ the cross-attention in Transformer to gather information from tokens that are more likely to be correctly translated according to their source-language counterparts, and filter out the others. The output of the cross-attention is not only aligned with the source language, but also cleaner than a representation built from the translated sentences alone. Thus we use the representation in the cross-attention module to generate the soft pseudo-targets, which provide direct supervision for the target-language encoding. Furthermore, inspired by back-translation in unsupervised MT (Artetxe et al., 2017; Huang et al., 2021b), we utilize cycle semantic consistency to minimize the semantic discrepancies between source sentences and back-translated sentences, further improving the noise robustness of the textual encoder.

The main contributions of this paper are summarized in four aspects: 1) We propose noise-robust learning to solve the noise problem caused by MT for CCR. By resorting to MT, we conduct cross-lingual transfer without relying on manually annotated target-language data. To the best of our knowledge, this is the first work focusing on the noise problem caused by MT for CCR. 2) To reduce the impact of translation noise, we design a novel noise-robust learning framework that introduces multi-view distillation with cross-attention to generate soft pseudo-targets and provide direct supervision for learning robust vision-target language matching. Further, we utilize cycle semantic consistency to strengthen the capability of extracting semantic information from noisy sentences. 3) We construct a bilingual dataset, MSR-VTT-CN, a new cross-lingual video-text retrieval dataset that extends MSR-VTT with manually translated Chinese sentences for evaluation. 4) Extensive experiments are conducted on two cross-lingual video-text retrieval datasets, i.e., VATEX and MSR-VTT-CN, and a cross-lingual image-text retrieval dataset, Multi-30K, where our proposed model achieves a new state of the art.

2. Related Work

2.1. Cross-modal Retrieval

Video-Text Retrieval. The majority of recent works on video-text retrieval have focused on two types of approaches, namely concept-based approaches (Dalton et al., 2013; Chang et al., 2015; Markatopoulou et al., 2017; Habibian et al., 2016) and embedding-based approaches (Song and Soleymani, 2019; Gabeur et al., 2020; Han et al., 2021; Yang et al., 2021; Liu et al., 2021a; Wu et al., 2021; Luo et al., 2021b; Wei et al., 2021; Song et al., 2021b; Wang et al., 2022b). Among them, embedding-based approaches currently dominate video-text retrieval, which measure similarities between text and videos by projecting them into single or multiple common spaces. For instance, Gabeur et al. (Gabeur et al., 2020) design a multi-modal Transformer for video encoding and use BERT (Devlin et al., 2019) to obtain embeddings for text. Besides, some recent models (Huang et al., 2017; Zhu and Yang, 2020; Luo et al., 2021a; Lei et al., 2021b) convert the text-to-video retrieval into a binary classification task, where models predict if text and videos match or not based on the fusion of their embeddings.

Image-Text Retrieval. Similar to video-text retrieval, most existing methods for image-text retrieval (Liu et al., 2017; Wang et al., 2018; Li et al., 2019b; Zeng et al., 2021; Anwaar et al., 2021; He et al., 2021; Gan et al., 2020) aim to learn image and text features in a shared embedding space where similarities can be easily calculated. For example, Anwaar et al. (Anwaar et al., 2021) introduce an auto-encoder-based model to learn the composed representation of images and text queries, which are mapped to the target image space to learn a similarity metric. Recently, there are some pre-trained models (Hong et al., 2021; Li et al., 2020a, b) seeking to learn vision-language joint representations based on self-attention or cross-attention mechanism. These methods take a deep dive into the interaction between different modalities, and demonstrate advantageous retrieval performances.

Despite the performance improvements brought by the above methods, their models are designed and trained in a monolingual setting, and therefore lack cross-lingual ability.

Figure 2. Illustration of the proposed NRCCR method for CCR. The source-language branch $E^{S}$ and target-language branch $E^{T}$ make up the Basic Model, which aligns the video (or image) with source- and target-language input sentences, and a cross-attention module is adopted for noise-robust representation learning via self-distillation. The leftmost branch encodes the back-translated sentence, which is supervised by $E^{S}$ with cycle semantic consistency. All textual branches are built upon pre-trained multilingual BERT and are trained to learn language-agnostic representations in an adversarial fashion.

2.2. Cross-lingual Cross-modal Retrieval

Early works (Portaz et al., 2019; Aggarwal and Kale, 2020) on CCR adapt existing vision-language methods to the cross-lingual setting. Portaz et al. (Portaz et al., 2019) resort to non-contextualized multilingual word embeddings (BIVEC (Luong et al., 2015) and MUSE (Conneau et al., 2017)), which are pre-trained by aligning words in different languages. Aggarwal et al. (Aggarwal and Kale, 2020) rely on pre-trained sentence encoders (mUSE (Yang et al., 2019) and LASER (Artetxe and Schwenk, 2019)) to embed sentences from different languages into a common space. Recently, with the release of large-scale multi-modal datasets (Jia et al., 2021; Sharma et al., 2018), some V+L pre-training models (Ni et al., 2021; Huang et al., 2021a; Zhou et al., 2021; Fei et al., 2021) have been proposed to further narrow the gap between different languages and modalities. Fei et al. (Fei et al., 2021) propose a cross-lingual cross-modal pre-training framework, which can be used for the downstream monolingual cross-modal retrieval task by fine-tuning on manually annotated target-language data. Ni et al. (Ni et al., 2021) propose a Transformer-based encoder and two pre-training objectives to learn explicit alignments between vision and different languages, which use a multilingual-monomodal corpus and a monolingual-multimodal corpus to alleviate the lack of non-English labeled data. Following (Ni et al., 2021), Huang et al. (Huang et al., 2021a) design a Transformer-based pre-trained model, MMP, for learning contextualized multilingual multi-modal representations, which is pre-trained using Multi-HowTo100M. Since large-scale multi-modal datasets tend to be heavily skewed towards well-resourced languages, the scarcity of aligned multilingual data remains a great challenge. To solve this problem, Zhou et al. (Zhou et al., 2021) propose an MT-augmented cross-lingual cross-modal pre-training framework, UC$^{2}$, and construct a multilingual V+L corpus.
Although this work has made a breakthrough in CCR, it ignores the fact that translation results tend to be quite noisy, which inevitably hurts model performance on low-resource languages. To alleviate this, in this paper, we focus on improving the robustness of CCR models against the noise brought by MT.

3. Methods

In this section, we present our Noise-Robust Cross-lingual Cross-modal Retrieval (NRCCR) model. As shown in Figure 2, it includes a source-language branch ( $E^{S}$ ) and a target-language branch ( $E^{T}$ ) responsible for source- and target-language text encoding, respectively. Besides, our model is equipped with a cross-attention module, which plays a key role in filtering out translation noise. To further improve its robustness, we use an additional branch (the leftmost branch in Figure 2) to encode back-translated sentences and then perform cycle semantic consistency learning. Adversarial training is also introduced to extract language-agnostic features. In what follows, we first introduce the task definition and a basic model for CCR, followed by the description of our proposed model.

3.1. Task Definition

We first formally define the setting of cross-lingual cross-modal retrieval (CCR). It involves two languages, namely the source language $S$ and the target language $T$. For the source language $S$, we have a collection of human-labeled training data $D^{S} = \{d_{1}, d_{2}, \dots, d_{n}\}$, where each instance $d_{i}$ consists of a caption $s_{i}^{S}$ paired with an image or video $v_{i}$. As for the target language $T$, due to the scarcity of human-labeled data, we assume that no resources are available for it except some external tools, such as Google Translate. The core task of CCR is to obtain a model applicable to the target language $T$ without using any manually annotated target-language data.

3.2. A Basic Model for CCR

Input data. Since human-labeled training data is only available for the source language, we automatically translate the gold-standard source dataset $D^{S}$ into the target language. Concretely, for each instance in $D^{S}$, we translate the caption $s^{S}$ into a target-language sentence $s^{T}$ utilizing MT (in this paper, we use Google Translate, https://translate.google.cn). In this way, we construct a collection of triplets $D^{T} = \{I_{1}, I_{2}, \dots, I_{n}\}$, where each instance $I_{i} = (v_{i}, s_{i}^{S}, s_{i}^{T})$ consists of a video (or an image) $v_{i}$, a source-language caption $s_{i}^{S}$ and a translated target-language caption $s_{i}^{T}$.
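The construction of the triplet collection can be sketched as follows. This is only an illustrative sketch: the names `Triplet`, `build_pseudo_parallel` and `toy_translate` are hypothetical, and `toy_translate` is a placeholder standing in for a real MT system rather than a call to any actual translation API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One instance I_i = (v_i, s_i^S, s_i^T)."""
    visual_id: str        # identifier of the video or image v_i
    source_caption: str   # human-written caption s_i^S
    target_caption: str   # machine-translated caption s_i^T


def build_pseudo_parallel(
    source_data: List[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[Triplet]:
    """Translate each source caption and keep it with its visual item."""
    return [Triplet(vid, cap, translate(cap)) for vid, cap in source_data]


def toy_translate(sentence: str) -> str:
    """Hypothetical placeholder for an MT system; it only tags the input."""
    return "[target] " + sentence


# D^S: (visual id, source-language caption) pairs.
d_s = [("video_0001", "a man ropes a calf"), ("video_0002", "a dog runs")]
# D^T: triplets with the pseudo-parallel target caption attached.
d_t = build_pseudo_parallel(d_s, toy_translate)
```

Keeping the source caption inside each triplet matters later, because both the cross-attention module and the source-language branch consume it alongside the translated caption.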

Visual encoder. Given a video $v$, we first extract a sequence of representations $U = \{u_{1}, u_{2}, \dots, u_{l}\}$ using a pre-trained 2D CNN, where $u_{i} \in \mathbb{R}^{d_{u}}$ denotes the representation of the $i$-th frame. Then, we feed it into a Transformer block to enhance the inter-frame interaction and generate a video feature vector $v \in \mathbb{R}^{d_{v}}$:

(1)

$$v = f(\mathrm{Transformer}_{v}(U))$$

where $f (\cdot)$ denotes the operation of average pooling.

When the input is an image, similar to (Portaz et al., 2019; Aggarwal and Kale, 2020), we use the ResNet-152 (Xie et al., 2017) pre-trained on the ImageNet (Deng et al., 2009) dataset as our backbone, and keep it frozen. We feed the output of the last average pooling layer of ResNet-152 into a trainable FC layer to obtain the image feature vector $e \in \mathbb{R}^{d_{e}}$. In the following, for ease of description, we take video-text retrieval as an example to describe our methods in detail.

Textual encoder. We design a two-branch encoder for CCR, which consists of a source-language branch and a target-language branch. As illustrated in Figure 2, each branch includes a Transformer block built upon the pre-trained multilingual BERT (mBERT (Devlin et al., 2019)). To extract the information shared between different languages, the two branches share weights. Their inputs are pseudo-parallel sentence pairs $\{s^{S}, s^{T}\}$. Specifically, given a source-language sentence $s^{S}$ and a target-language sentence $s^{T}$, we use mBERT to generate sequences of multilingual representations $m^{S} \in \mathbb{R}^{N \times d_{w}}$ and $m^{T} \in \mathbb{R}^{M \times d_{w}}$ containing $N$ and $M$ tokens, respectively. Then, we feed $m^{S}$ and $m^{T}$ to a high-level Transformer block to further extract task-specific features:

(2)

$$c^{S} = f(\mathrm{Transformer}_{t}(m^{S})), \quad c^{T} = f(\mathrm{Transformer}_{t}(m^{T}))$$

where $c^{S}$ and $c^{T}$ denote the sentence embeddings of $s^{S}$ and $s^{T}$. For ease of reference, we term the source-language branch $E^{S}$ and the target-language branch $E^{T}$:

(3)

$$c^{S} = E^{S}(s^{S}), \quad c^{T} = E^{T}(s^{T})$$

Mapping into a common space. To promote alignments of the sentence in different languages with the corresponding video $v$ , we try to learn two mappings $g_{t} (\cdot)$ and $g_{v} (\cdot)$ to respectively project the different language representations and the video representation into a common space as follows:

(4)

$$\hat{c}^{S} = g_{t}(c^{S}), \quad \hat{c}^{T} = g_{t}(c^{T}), \quad \hat{v} = g_{v}(v)$$

where $\hat{c}^{S}$, $\hat{c}^{T}$ and $\hat{v}$ are the source-language sentence embedding, the target-language sentence embedding and the video embedding, respectively. Each mapping is implemented as a multi-layer perceptron. To pull relevant video-sentence pairs together and push irrelevant pairs apart in the common space, an improved triplet ranking loss (Faghri et al., 2018; Dong et al., 2019) is employed, which penalizes the model according to the hardest negative examples in the mini-batch:

(5)

$$\begin{aligned}
L^{S}(\hat{c}^{S}, \hat{v}) &= \max(0, \Delta + \mathrm{sim}(\hat{v}, \hat{c}^{S-}) - \mathrm{sim}(\hat{v}, \hat{c}^{S})) \\
&\quad + \max(0, \Delta + \mathrm{sim}(\hat{c}^{S}, \hat{v}^{-}) - \mathrm{sim}(\hat{c}^{S}, \hat{v})) \\
L^{T}(\hat{c}^{T}, \hat{v}) &= \max(0, \Delta + \mathrm{sim}(\hat{v}, \hat{c}^{T-}) - \mathrm{sim}(\hat{v}, \hat{c}^{T})) \\
&\quad + \max(0, \Delta + \mathrm{sim}(\hat{c}^{T}, \hat{v}^{-}) - \mathrm{sim}(\hat{c}^{T}, \hat{v}))
\end{aligned}$$

where $\Delta$ indicates a margin constant and $\mathrm{sim}(\cdot)$ denotes the similarity function, e.g., cosine similarity, while $\hat{c}^{S-}$, $\hat{c}^{T-}$ and $\hat{v}^{-}$ respectively indicate the embeddings of a hard negative source-language sentence sample, a hard negative target-language sentence sample and a negative video sample. The overall triplet loss is computed as $L_{tri} = L^{S} + \alpha L^{T}$, where $\alpha$ is a scalar whose value is set empirically.
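To make the hardest-negative mining in Equation 5 concrete, the following minimal sketch computes the loss for one language from a precomputed sentence-video similarity matrix. The function name, the in-batch selection of negatives and the averaging over the mini-batch are assumptions of this sketch, not details confirmed by the text.

```python
from typing import List


def hardest_negative_triplet_loss(sim: List[List[float]], margin: float) -> float:
    """Triplet ranking loss with in-batch hardest negatives.

    sim[i][j] is the similarity between sentence i and video j; the
    diagonal holds the matched pairs. Needs at least two pairs.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative sentence for video i (column i, off-diagonal).
        hard_sentence = max(sim[j][i] for j in range(n) if j != i)
        # Hardest negative video for sentence i (row i, off-diagonal).
        hard_video = max(sim[i][j] for j in range(n) if j != i)
        total += max(0.0, margin + hard_sentence - pos)
        total += max(0.0, margin + hard_video - pos)
    return total / n  # batch averaging is an assumption of this sketch


# Sentence 0 is more similar to video 1 than to its own video 0,
# so only that ranking violation contributes to the loss.
example_sim = [[0.5, 0.6], [0.1, 0.9]]
example_loss = hardest_negative_triplet_loss(example_sim, margin=0.2)
```

Calling the function once with source-language similarities and once with target-language similarities, then combining the two results as $L^{S} + \alpha L^{T}$, gives the overall triplet loss.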

Figure 3. An example of the cross-attention map. The red represents the incorrectly translated word of the underlined source-language word.

3.3. Our Proposed NRCCR

On the basis of the Basic Model, we additionally introduce noise-robust representation learning, cycle semantic consistency and language-agnostic representation learning, resulting in our proposed NRCCR.

3.3.1. Noise-Robust Representation Learning

In order to improve the noise robustness of our model, we devise a cross-attention module and multi-view self-distillation.

Cross-attention module. Since pseudo-parallel sentence pairs are constructed with MT, this process inevitably introduces translation noise, such as inaccurate translations and grammatical mistakes. However, in the Basic Model, no matter how noisy $s^{T}$ is, minimizing $L^{T}$ always reduces the distance between the embeddings of $s^{T}$ and $v$, which results in corrupted representations and thereby compromises the retrieval performance. To address this, we propose to generate relatively clean target-language representations to guide the training of the target-language branch via multi-view self-distillation. To this end, we introduce a cross-attention module $\mathrm{CrossATT}$ to collect information from tokens in $s^{T}$ that are more likely to be correctly translated according to their source-language counterparts, and to filter out the others. Cross-attention is derived from self-attention; the difference is that the inputs of the cross-attention module come from different languages. Specifically, given a parallel sentence pair $(s^{S}, s^{T})$, we first feed $s^{S}$ and $s^{T}$ into the pre-trained mBERT and obtain their token-level representations $m^{S}$ and $m^{T}$, respectively. The cross-attention module is computed as follows:

(6)

$$\begin{aligned}
h &= \mathrm{CrossATT}(m^{S}, m^{T}, m^{T}) \\
&= \mathrm{softmax}\left(\frac{W_{Q} m^{S} \cdot (W_{K} m^{T})^{\top}}{\sqrt{d_{w}}}\right) \cdot W_{V} m^{T} \\
h^{C} &= \mathrm{Norm}(\mathrm{FFN}(h))
\end{aligned}$$

where $W_{Q}$, $W_{K}$ and $W_{V}$ are three projection matrices for transforming the input features, and $\mathrm{FFN}$ and $\mathrm{Norm}$ indicate the feed-forward network and layer normalization operation in $\mathrm{Transformer}_{t}$, respectively. The output $h$ is calculated by multiplying the values by the attention weights, which come from the similarity between the corresponding query and all keys. As a result, among all tokens in $s^{T}$, a token that is more similar to a query from $s^{S}$ holds a larger attention weight and contributes more to the output, whereas dissimilar tokens, which are likely to be noisy, are assigned low weights. Figure 3 gives an example, where the verb "rope" is incorrectly translated as the noun "string" by Google Translate. Hence, the attention weights between the incorrect word (marked in red) and the source-language words are assigned low values. In this way, $\mathrm{CrossATT}$ automatically gathers relatively clean information from the target-language sentence $s^{T}$, improving the quality of the output $h^{C}$.
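The filtering behavior described above can be illustrated with a minimal scaled dot-product attention sketch. It assumes that queries, keys and values have already been projected by $W_{Q}$, $W_{K}$ and $W_{V}$, and it omits the feed-forward network and layer normalization; the toy vectors are invented for illustration and do not come from a real encoder.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q: Matrix, k: Matrix, v: Matrix) -> Tuple[Matrix, Matrix]:
    """Source-language queries (N x d) attend over target-language
    keys/values (M x d). Returns the output (N x d) and weights (N x M)."""
    d = len(q[0])
    weights = [
        softmax([sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k])
        for q_i in q
    ]
    out = [
        [sum(w * v_j[c] for w, v_j in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return out, weights


# Two source tokens; three target tokens, the last of which plays the
# role of a mistranslated token that matches no source token.
queries = [[4.0, 0.0], [0.0, 4.0]]
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]]
h, attn = cross_attention(queries, keys, keys)
```

In this toy case the third target token receives a near-zero weight from every source query, so it contributes almost nothing to `h`, which mirrors the low attention weights on the mistranslated word in Figure 3.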

Multi-view self-distillation module. To alleviate or even eliminate the influence of the noise generated by MT, we design a multi-view self-distillation loss, which guides the training of the target-language branch from a similarity-based view and a feature-based view, respectively.

Similarity-based view: The key to cross-modal retrieval is measuring the similarity between different modalities. Here we regard the cross-attention module as a teacher. Its outputs take part in the calculation of cross-modal similarities, which are then used as pseudo-targets to supervise the learning of the target-language branch. Specifically, we first project the representation $h^{C}$ into the common space, obtaining a sentence embedding expressed as $\hat{h}^{C} = g_{t}(f(h^{C}))$. Then, we calculate the softmax-normalized text-to-video similarity $q^{t2v}(\hat{h}^{C}, \hat{v})$ and video-to-text similarity $q^{v2t}(\hat{v}, \hat{h}^{C})$ as:

(7)

$$\begin{aligned}
q^{t2v}(\hat{h}^{C}, \hat{v}) &= \frac{\exp(\mathrm{sim}(\hat{h}^{C}, \hat{v}) / \tau)}{\sum_{j=1}^{B} \exp(\mathrm{sim}(\hat{h}^{C}, \hat{v}_{j}) / \tau)}, \\
q^{v2t}(\hat{v}, \hat{h}^{C}) &= \frac{\exp(\mathrm{sim}(\hat{v}, \hat{h}^{C}) / \tau)}{\sum_{j=1}^{B} \exp(\mathrm{sim}(\hat{v}, \hat{h}_{j}^{C}) / \tau)}
\end{aligned}$$

where $B$ denotes the size of a mini-batch from $D^{T}$, and $\tau$ is the temperature coefficient. For the target-language sentence $s^{T}$, we similarly compute the softmax-normalized text-to-video similarity $p^{t2v}(\hat{c}^{T}, \hat{v})$ and video-to-text similarity $p^{v2t}(\hat{v}, \hat{c}^{T})$. Finally, the Kullback-Leibler (KL) divergence is employed as the similarity-based distillation loss:

(8)

$$L_{sim} = \frac{1}{2} \left[ KL\left(q^{t2v}(\hat{h}^{C}, \hat{v}) \,\|\, p^{t2v}(\hat{c}^{T}, \hat{v})\right) + KL\left(q^{v2t}(\hat{v}, \hat{h}^{C}) \,\|\, p^{v2t}(\hat{v}, \hat{c}^{T})\right) \right]$$
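A minimal sketch of Equations 7 and 8 is given below. It takes teacher and student similarity matrices as plain lists, normalizes rows (text-to-video) and columns (video-to-text) with a temperature softmax, and averages the KL terms over the mini-batch. The function names, the small `eps` guard and the batch averaging are assumptions of this sketch.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax_t(scores: List[float], tau: float) -> List[float]:
    """Temperature-scaled softmax over one row of similarities."""
    peak = max(scores)
    exps = [math.exp((s - peak) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(q: List[float], p: List[float], eps: float = 1e-12) -> float:
    """KL(q || p), with the teacher distribution q as the reference."""
    return sum(qi * math.log((qi + eps) / (pi + eps)) for qi, pi in zip(q, p))


def columns(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def similarity_distillation_loss(teacher: Matrix, student: Matrix, tau: float) -> float:
    """teacher[i][j] = sim(h_hat_i^C, v_hat_j); student[i][j] = sim(c_hat_i^T, v_hat_j)."""
    n = len(teacher)
    t2v = sum(
        kl_divergence(softmax_t(t, tau), softmax_t(s, tau))
        for t, s in zip(teacher, student)
    )
    v2t = sum(
        kl_divergence(softmax_t(t, tau), softmax_t(s, tau))
        for t, s in zip(columns(teacher), columns(student))
    )
    return 0.5 * (t2v + v2t) / n


teacher_sim = [[0.9, 0.1], [0.2, 0.8]]
uninformed_student = [[0.5, 0.5], [0.5, 0.5]]
loss_matched = similarity_distillation_loss(teacher_sim, teacher_sim, tau=0.1)
loss_mismatched = similarity_distillation_loss(teacher_sim, uninformed_student, tau=0.1)
```

The loss vanishes when the student reproduces the teacher's similarity distribution and grows as the two distributions diverge, which is the sense in which the teacher's similarities act as soft pseudo-targets.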

Feature-based view: Another form of supervision from the cross-attention module operates on a feature-based view, where the output features of the teacher are utilized to guide the student's learning process. Here we use the feature-based loss $L_{feat}$ to transfer the implicit knowledge in the sentence embedding $\hat{h}^{C}$ to the encoding of the target-language sentence $s^{T}$. As shown in Equation 9, $L_{feat}$ is defined as the $L_{1}$ distance between $\hat{h}^{C}$ and $\hat{c}^{T}$:

(9)

$$L_{feat} = \| f(\hat{h}^{C}) - f(\hat{c}^{T}) \|_{1}$$
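For reference, the $L_{1}$ distance used by the feature-based view amounts to a sum of absolute coordinate differences. The sketch below operates on two already-pooled embedding vectors; the function name and the toy values are illustrative only.

```python
from typing import List


def l1_feature_loss(teacher: List[float], student: List[float]) -> float:
    """L1 distance between a teacher embedding and a student embedding."""
    if len(teacher) != len(student):
        raise ValueError("embeddings must have the same dimension")
    return sum(abs(t - s) for t, s in zip(teacher, student))


# Toy embeddings: the coordinates differ by 0.5, 1.0 and 0.0.
feat_loss = l1_feature_loss([1.0, -2.0, 0.5], [0.5, -1.0, 0.5])
```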

3.3.2. Cycle Semantic Consistency

As we use source-language sentences generated by MT as the source-branch input during inference, it is necessary to enhance the noise robustness of our model on the source language. To this end, inspired by the back-translation applied in unsupervised machine translation (Artetxe et al., 2017; Huang et al., 2021b), we propose to improve the cycle semantic consistency between sentences and their back-translated counterparts. Concretely, given a source-language sentence $s^{S}$, we first utilize MT to translate it into the target language and then back again to obtain a source-language sentence $s^{B}$, i.e., $s^{B} = \mathrm{BackTranslation}(s^{S})$. The source-language branch $E^{S}$ is expected to extract semantic information from the back-translated sentence $s^{B}$ that is consistent with that of the original sentence $s^{S}$ in the translation cycle, i.e., $E^{S}(s^{S}) \approx E^{S}(s^{B})$. Similar to Equation 5, we can incentivize this behavior using an improved triplet ranking loss:

(10)

$$\begin{aligned}
L_{cyc}(s^{S}, s^{B}) &= \max(0, m + \mathrm{sim}(\hat{c}^{B}, \hat{c}^{S-}) - \mathrm{sim}(\hat{c}^{B}, \hat{c}^{S})) \\
&\quad + \max(0, m + \mathrm{sim}(\hat{c}^{S}, \hat{c}^{B-}) - \mathrm{sim}(\hat{c}^{S}, \hat{c}^{B}))
\end{aligned}$$

where $\hat{c}^{B} = E^{S}(s^{B})$. Although $s^{S}$ and $s^{B}$ are in the same language, $s^{B}$ is even noisier than $s^{T}$, as translation noise tends to be amplified during back-translation. Therefore, minimizing $L_{cyc}$ can further improve the robustness of our model against noise and produce semantically consistent features. Besides, back-translation can also be thought of as a data-augmentation technique, which can improve the diversity of words.
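The cycle consistency objective can be sketched on toy sentence embeddings as follows. The sketch assumes cosine similarity, in-batch hardest negatives and averaging over the batch; the embeddings are invented for illustration and do not come from a trained encoder.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cycle_consistency_loss(orig: List[Vector], back: List[Vector], margin: float) -> float:
    """orig[i] embeds s_i^S and back[i] embeds its back-translation s_i^B."""
    n = len(orig)
    # sim[i][j] = similarity between back-translation i and original j.
    sim = [[cosine(b, o) for o in orig] for b in back]
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative original sentence for back-translation i.
        hard_orig = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative back-translation for original sentence i.
        hard_back = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hard_orig - pos)
        total += max(0.0, margin + hard_back - pos)
    return total / n  # batch averaging is an assumption of this sketch


originals = [[1.0, 0.0], [0.0, 1.0]]
faithful_back = [[0.9, 0.1], [0.1, 0.9]]  # round trip preserved the meaning
drifted_back = [[0.6, 0.5], [0.1, 0.9]]   # first sentence drifted toward the second
loss_faithful = cycle_consistency_loss(originals, faithful_back, margin=0.2)
loss_drifted = cycle_consistency_loss(originals, drifted_back, margin=0.2)
```

Only the drifted round trip is penalized, so minimizing this quantity pushes the encoder to map a noisy back-translation close to its original sentence and away from other sentences in the batch.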

3.3.3. Language-agnostic Representation Learning

Considering that language-specific features lack cross-lingual transfer ability, we train the textual encoder in an adversarial fashion to generate language-agnostic representations. Specifically, we construct a language classifier $F$ that acts as the discriminator and distinguishes the source and target languages. The language classifier consists of a multi-layer feed-forward neural network, and the discriminator loss in adversarial training is defined as:

(11)

$$L_{adv} = \frac{1}{|D^{T}|} \sum_{I_{i} \in D^{T}} \left[ \log F(f(m_{i}^{S})) + \log (1 - F(f(m_{i}^{T}))) \right]$$

The text encoders $E^{S}$ and $E^{T}$ aim to generate language-agnostic representations and confuse the discriminator $F$, while the discriminator $F$ aims to distinguish between target-language sentences and source-language sentences.
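The quantity in Equation 11 can be evaluated directly from the classifier's outputs, as in the sketch below. It assumes that $F$ returns the probability that a pooled representation comes from the source language; the function name and the `eps` guard against `log(0)` are additions of this sketch.

```python
import math
from typing import List


def adversarial_objective(
    p_source: List[float], p_target: List[float], eps: float = 1e-12
) -> float:
    """Mean of log F(f(m_i^S)) + log(1 - F(f(m_i^T))) over the batch.

    p_source[i] is F applied to the i-th source sentence and p_target[i]
    is F applied to its translated counterpart.
    """
    n = len(p_source)
    return sum(
        math.log(ps + eps) + math.log(1.0 - pt + eps)
        for ps, pt in zip(p_source, p_target)
    ) / n


# A discriminator that separates the two languages almost perfectly ...
value_confident = adversarial_objective([0.99, 0.98], [0.01, 0.02])
# ... versus one that cannot tell them apart at all.
value_confused = adversarial_objective([0.5, 0.5], [0.5, 0.5])
```

As written, the quantity is close to zero when $F$ is confident and correct, and drops to $2 \log 0.5$ when the representations of the two languages are indistinguishable, which is the state the text encoders try to reach.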

3.3.4. Training and Inference

Our model is trained by minimizing the combination of the above losses. To sum up, the total loss function is defined as:

(12)

where $λ_{1}, λ_{2}, λ_{3}, λ_{4}$ are all hyper-parameters to balance the importance of each loss.

After the model has been trained, given a sentence query in the target language, we sort candidate videos/images in descending order according to their cross-modal similarity with the query. To compute the similarity between a video (or an image) $v$ and a sentence query $s^{T}$ in the target language, we first translate $s^{T}$ into a source-language sentence $s^{S}$. The similarity between $v$ and $s^{T}$ is then computed as the weighted sum of their direct similarity and the similarity between $v$ and the translated sentence $s^{S}$. Formally, the final similarity is computed as:

(13)

$$\mathrm{Score}(v, s^{T}) = \beta \, \mathrm{sim}(\hat{v}, \hat{c}^{T}) + (1 - \beta) \, \mathrm{sim}(\hat{v}, \hat{c}^{S})$$

where $\beta$ is an empirically set scalar weight.
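The inference-time fusion in Equation 13 and the subsequent ranking can be sketched as follows. The candidate identifiers and similarity values are invented for illustration, and restricting $\beta$ to the interval [0, 1] is an assumption of this sketch.

```python
from typing import Dict, List


def fused_score(sim_target: float, sim_source: float, beta: float) -> float:
    """beta * sim(v, target query) + (1 - beta) * sim(v, translated query)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is assumed to lie in [0, 1]")
    return beta * sim_target + (1.0 - beta) * sim_source


def rank_candidates(
    sims_target: Dict[str, float], sims_source: Dict[str, float], beta: float
) -> List[str]:
    """Return candidate ids sorted by descending fused score."""
    scores = {
        cid: fused_score(sims_target[cid], sims_source[cid], beta)
        for cid in sims_target
    }
    return sorted(scores, key=scores.get, reverse=True)


# Similarities of three candidates to the target-language query and to
# its machine-translated source-language version.
sims_t = {"v1": 0.2, "v2": 0.6, "v3": 0.5}
sims_s = {"v1": 0.9, "v2": 0.3, "v3": 0.5}
ranking_balanced = rank_candidates(sims_t, sims_s, beta=0.5)
ranking_target_only = rank_candidates(sims_t, sims_s, beta=1.0)
```

With `beta=1.0` only the target-language branch decides the ranking, while smaller values let the translated query correct cases where the target-language similarity alone is misleading.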

4. Experiment

Method	Text-to-Video Retrieval					Video-to-Text Retrieval					SumR
Method	R@1	R@5	R@10	Med r	mAP	R@1	R@5	R@10	Med r	mAP	SumR
MMP w/o pre-train(Huang et al., 2021a)	23.9	55.1	67.8	-	-	-	-	-	-	-	-
MMP (Huang et al., 2021a)*	29.7	63.2	75.5	-	-	-	-	-	-	-	-
W2VV (Dong et al., 2018)	5.46	12.3	15.4	298	9.20	-	-	-	-	-	-
VSE++ (Faghri et al., 2018)	21.1	48.1	59.6	6.0	33.72	34.9	67.2	77.5	3.0	21.76	307.7
W2VV++ (Li et al., 2019a)	20.5	48.3	59.5	6.0	33.40	-	-	-	-	-	-
Miech et al. (Miech et al., 2019)	13.7	37.1	50.4	10.0	25.32	23.2	53.1	67.3	5.0	14.62	244.8
CE (Liu et al., 2019)	20.2	48.5	60.8	6.0	33.48	31	63.5	76.7	3.0	20.2	300.7
Dual Encoding (Dong et al., 2019)	23.1	52.1	62.6	5.0	36.32	35.6	67.9	79.3	3.0	23.98	320.5
HGR (Han et al., 2021)	14.3	37.0	47.5	12.0	25.10	27.8	61.0	72.9	4.0	15.5	261.0
GPO (Chen et al., 2021)	19.5	46.9	57.8	6.0	32.20	33.8	65.9	77.3	3.0	20.21	301.3
RIVRL (Dong et al., 2022)	26.8	56.5	67.3	4.0	40.35	38.3	71.1	80.9	2.0	26.62	340.9
NRCCR	30.4	65.0	75.1	3.0	45.64	40.6	72.7	80.9	2.0	32.40	364.7

Table 1. Performance comparison of cross-lingual video-text retrieval on VATEX (the source language is English and the target language is Chinese). An asterisk (*) indicates that the model is pre-trained on Multi-HowTo100M (Huang et al., 2021a).

4.1. Experimental Settings

Datasets. We conduct experiments on two public multilingual cross-modal retrieval datasets (VATEX (Wang et al., 2019) and Multi30K (Elliott et al., 2016)) and MSR-VTT-CN constructed by ourselves.

VATEX (Wang et al., 2019): VATEX is a bilingual video-caption dataset, which contains over 41,250 videos and 825,000 captions. Each video corresponds to 10 English sentences and 10 Chinese sentences describing the video content. Following (Dong et al., 2021; Chen et al., 2020), we use 25,991 video clips for training, 1,500 clips for validation and 1,500 clips for test.
Multi30K (Elliott et al., 2016): This dataset contains 31K images, which is built upon Flickr30K (Young et al., 2014). It provides five captions per image in English and German and one caption per image in French and Czech. Following Flickr30K (Young et al., 2014), we split the data 29000/1014/1000 as train/dev/test sets.
MSR-VTT-CN: MSR-VTT (Xu et al., 2016) is a monolingual dataset which has been widely used for video-text retrieval. We build MSR-VTT-CN by extending MSR-VTT to a bilingual version, following the partition of (Yu et al., 2018), which contains 9,000 and 1,000 videos for training and testing, respectively. For each video in the training set of MSR-VTT, we employ Google Translate to automatically translate its captions from English to Chinese. The MSR-VTT-CN test set is obtained by manually translating the test set of MSR-VTT into Chinese. To ensure the quality of the manually annotated data, similar to (Lan et al., 2017), we hire ten native Chinese speakers who are also proficient in English (having passed the National College English Test 6). Since some English captions are ambiguous, the translators also take the video content into consideration to aid translation.

Method	T2V		V2T		SumR
Method	R@10	mAP	R@10	mAP	SumR
W2VV (Dong et al., 2018)	22.9	1.50	-	-	-
VSE++ (Faghri et al., 2018)	54.0	29.57	55.0	29.86	230.8
Miech et al. (Miech et al., 2019)	52.9	25.95	48.9	25.31	203.0
Dual Encoding (Dong et al., 2019)	56.6	31.75	57.1	32.25	244.1
CE (Liu et al., 2019)	63.4	-	62.7	-	265.4
SEA (Li et al., 2020c)	61.1	33.80	42.4	21.90	215.8
HGR (Han et al., 2021)	53.3	27.18	53.3	27.79	219.5
GPO (Chen et al., 2021)	53.2	29.86	52.9	29.15	226.5
RIVRL (Dong et al., 2022)	63.0	37.03	63.5	35.46	273.7
NRCCR	67.4	41.93	67.2	43.00	307.3

Table 2. Performance comparison on MSR-VTT-CN.

Evaluation metrics. Following previous works (Dong et al., 2022; Li et al., 2019a), we use rank-based metrics, namely $R @ K$ ( $K = 1, 5, 10$ ), Median rank (Med r) and mean Average Precision (mAP), to evaluate the performance. $R @ K$ is the fraction of queries for which a desired item is retrieved in the top $K$ of the ranking list. Med r is the median rank of the first relevant item in the search results. Higher $R @ K$ and mAP and lower Med r mean better performance. For overall comparison, we report the Sum of all Recalls (SumR).
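The metrics above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the evaluation code used in the paper: the function names (`evaluate`, `sum_r`), the list-of-lists similarity matrix, and the choice to report $R @ K$ and mAP as percentages are all assumptions made for the example.

```python
from statistics import median
from typing import Dict, Sequence, Set


def ranking(scores: Sequence[float]) -> list:
    """Candidate indices sorted by descending similarity."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def rank_of_first_relevant(scores: Sequence[float], relevant: Set[int]) -> int:
    """1-based rank of the highest-scored relevant candidate."""
    for rank, idx in enumerate(ranking(scores), start=1):
        if idx in relevant:
            return rank
    raise ValueError("query has no relevant candidate")


def average_precision(scores: Sequence[float], relevant: Set[int]) -> float:
    """Mean of precision values taken at the rank of each relevant candidate."""
    hits, precisions = 0, []
    for rank, idx in enumerate(ranking(scores), start=1):
        if idx in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant)


def evaluate(sim, gt, ks=(1, 5, 10)) -> Dict[str, float]:
    """sim[q][c] is the similarity of query q and candidate c;
    gt[q] is the set of candidate indices relevant to query q."""
    ranks = [rank_of_first_relevant(row, rel) for row, rel in zip(sim, gt)]
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["Med r"] = median(ranks)
    metrics["mAP"] = 100.0 * sum(
        average_precision(row, rel) for row, rel in zip(sim, gt)
    ) / len(sim)
    return metrics


def sum_r(text_to_visual, visual_to_text, ks=(1, 5, 10)) -> float:
    """SumR: sum of all recalls over both retrieval directions."""
    return sum(text_to_visual[f"R@{k}"] + visual_to_text[f"R@{k}"] for k in ks)
```

Under these assumptions, `evaluate` is called once per retrieval direction (once on the text-to-video similarity matrix and once on its video-to-text counterpart), and `sum_r` adds the six recall values into a single overall score.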

Method	English → German			English → French			English → Czech		
Method	T2I R@10	I2T R@10	SumR	T2I R@10	I2T R@10	SumR	T2I R@10	I2T R@10	SumR
Backbone with frozen ResNet152:
ISUM (Portaz et al., 2019)	44.20	46.00	-	-	-	-	-	-	-
TZCIR (Aggarwal and Kale, 2020)	48.80	54.80	-	-	-	-	-	-	-
NRCCR	68.80	67.30	306.8	67.40	69.30	306.1	67.00	66.20	301.3
Pre-trained on large-scale V+L datasets:
MMP (Huang et al., 2021a)	72.60	78.20	-	-	-	-	-	-	-
M$^{3}$P (Ni et al., 2021)	-	-	351.0	-	-	276.0	-	-	220.8
UC$^{2}$ (Zhou et al., 2021)	-	-	449.4	-	-	444.0	-	-	407.4
CLIP (Radford et al., 2021) + Ours	88.90	89.20	450.8	89.2	89.7	452.9	87.9	87.8	438.7

Table 3. Performance comparison of cross-lingual image-text retrieval on Multi30K.

4.2. Performance Comparison

4.2.1. Cross-lingual Video-Text Retrieval

To the best of our knowledge, MMP (Huang et al., 2021a) is the only work specifically designed for cross-lingual video-text retrieval. To further demonstrate the effectiveness of our model, following (Zhou et al., 2021), we also compare with traditional monolingual video-text retrieval methods by translating the target-language test samples into the source language with MT; such models are typically regarded as baselines for cross-lingual cross-modal retrieval. Specifically, we compare with the following monolingual video-text retrieval methods, chosen because their source code is available: W2VV (Dong et al., 2018), VSE++ (Faghri et al., 2018), W2VV++ (Li et al., 2019a), Miech et al. (Miech et al., 2019), CE (Liu et al., 2019), Dual Encoding (Dong et al., 2019), HGR (Han et al., 2021), GPO (Chen et al., 2021), and RIVRL (Dong et al., 2022).

Experiments on VATEX. In the cross-lingual setting, we use the VATEX English training set as the source-language labeled data and the VATEX Chinese test set for target-language evaluation. As shown in Table 1, our method outperforms MMP without pre-training by a clear margin, and achieves results comparable to MMP pre-trained on the large-scale dataset Multi-HowTo100M (Huang et al., 2021a) (the multilingual version of HowTo100M (Miech et al., 2019)). The adapted monolingual models are all outperformed by our model. In addition, replacing Google MT with Baidu MT yields a better result (SumR=375.2). This result demonstrates the compatibility of our model with different MT systems.

Experiments on MSR-VTT-CN. Similarly, in Table 2, we report experimental results on MSR-VTT-CN. Although the compared models are all carefully designed and can perform well when the test input is clean, they still fall short of our model. Since the SOTA method MMP (Huang et al., 2021a) is evaluated only on VATEX and its code is not publicly available, its performance on MSR-VTT-CN is not reported.

4.2.2. Cross-lingual Image-Text Retrieval

To verify the cross-lingual transfer ability of our model for text-image retrieval, we conduct experiments on Multi30K. As can be seen in Table 3, SOTA methods are divided into two groups. The first group includes methods that use the same image backbone (frozen ResNet152). ISUM (Portaz et al., 2019) and TZCIR (Aggarwal and Kale, 2020) rely on multilingual word embeddings and sentence encoders to obtain multilingual sentence representations, which are then simply mapped into a common space with the corresponding image representations. Results in Table 3 show that NRCCR consistently outperforms them by a clear margin.

The second group includes models that perform large-scale V+L pre-training. Among them, MMP (Huang et al., 2021a) is pre-trained on multilingual HowTo100M, M$^{3}$P (Ni et al., 2021) is pre-trained on Wikipedia and Conceptual Captions (Sharma et al., 2018), and UC$^{2}$ (Zhou et al., 2021) is pre-trained on multilingual Conceptual Captions, which was extended by MT. We use the open-source pre-trained model CLIP (Radford et al., 2021), which is pre-trained on 400 million image-text pairs collected from the internet, as the image encoder. With the help of CLIP, our model achieves a significant improvement on all languages in this dataset. On French and Czech, our model significantly outperforms the recently proposed UC$^{2}$, which also resorts to MT but ignores the fact that translation results are noisy. These results not only verify the effectiveness of noise-robust learning on multiple languages, but also demonstrate its compatibility with stronger visual encoders.

4.3. Robustness Analysis

Due to the usage of MT during training and test, translation noise is introduced in both stages. In this section, we analyze the robustness of our model against translation noise during training and test, respectively. For a fair comparison, we compare our model to Basic Model, which can be regarded as our NRCCR without noise-robust learning modules.

Noise robustness during training. To further examine the robustness of our model against translation noise, we conduct experiments with varying degrees of training noise on VATEX. We evaluate models in two scenarios: (1) training with data obtained by English-to-Chinese MT (EN $\to$ ZH); (2) training with data obtained by applying MT two more times (EN $\to$ ZH $\to$ EN $\to$ ZH). For ease of reference, we term the former GoogleTrans and the latter GoogleTrans++. We argue that the training data in the latter scenario are noisier. As shown in Figure 4, after multiple rounds of translation, the performance of Basic Model drops severely. By contrast, the performance of our model is more stable, which verifies the effectiveness of our noise-robust representation learning.
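The two training scenarios differ only in how many MT hops each English caption passes through. The sketch below makes this explicit. It is an illustration under stated assumptions: `toy_translate` is a hypothetical stand-in that merely tags each hop (the experiments use Google Translate, whose client code is not shown here), and `translate_chain` and `build_training_captions` are names introduced for this example.

```python
from typing import Callable, List

# (text, source language, target language) -> translated text
Translator = Callable[[str, str, str], str]


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical stub: records the hop instead of calling a real MT system."""
    return f"{text}[{src}->{tgt}]"


def translate_chain(sentence: str, langs: List[str], translate: Translator) -> str:
    """Apply MT along consecutive language pairs, e.g. en -> zh -> en -> zh."""
    for src, tgt in zip(langs, langs[1:]):
        sentence = translate(sentence, src, tgt)
    return sentence


def build_training_captions(
    captions: List[str], translate: Translator, extra_round_trip: bool = False
) -> List[str]:
    """GoogleTrans uses one hop (EN -> ZH); GoogleTrans++ uses three
    (EN -> ZH -> EN -> ZH), so its pseudo-parallel captions are noisier."""
    chain = ["en", "zh", "en", "zh"] if extra_round_trip else ["en", "zh"]
    return [translate_chain(c, chain, translate) for c in captions]
```

Swapping `toy_translate` for a real MT client with the same three-argument signature would reproduce the data construction, assuming the client translates one sentence at a time.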

Figure 4. Performance comparison under various degrees of noise during training. GoogleTrans++ means applying Google Translate two more times, which yields noisier training data.

Figure 5. Performance of Basic Model and our NRCCR on queries of varied lengths. The number over each bin is the improvement of NRCCR over Basic Model. $d\%$ represents the percentage of sentences with the current length in VATEX.

$L_{sim}$	$L_{feat}$	$L_{cyc}$	$L_{adv}$	T2V		V2T		SumR
$L_{sim}$	$L_{feat}$	$L_{cyc}$	$L_{adv}$	R@10	mAP	R@10	mAP	SumR
				71.7	44.15	79.1	31.43	352.2
✓				72.8	44.52	79.9	31.10	354.6
	✓			73.8	44.30	80.4	31.50	356.4
✓	✓			74.2	44.85	80.0	32.74	360.2
✓	✓	✓		74.4	45.31	80.6	31.83	361.4
✓	✓	✓	✓	75.1	45.64	80.9	32.40	364.7

Table 4. Ablation studies on VATEX to investigate the effectiveness of each model component.

Noise robustness during test. In this experiment, we group target-language sentences in the VATEX test set into 6 sets according to their lengths and report the R@5 scores of text-to-video retrieval on each set. We assume that the length of the input sentence is directly proportional to the difficulty of translation, as translating long sentences is more likely to introduce noise. Figure 5 shows the performance of Basic Model and our model on queries with varied lengths. Our model outperforms Basic Model except on the first two sets, where the number of sentences is too small and the results lack statistical significance. Besides, we observe that the gap between our model and Basic Model widens when input sentences become longer, which further verifies the noise robustness of our model.
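The length-based analysis above amounts to bucketing queries by sentence length and computing R@5 within each bucket. A minimal sketch follows. The bucket edges, the unit of length (characters or words), and the function names are assumptions for illustration, since the exact grouping is not specified in the text; `ranks` holds the 1-based rank of the first relevant video for each query.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def bucket_index(length: int, edges: Sequence[int]) -> int:
    """Index of the first bucket whose inclusive upper edge covers `length`;
    lengths beyond the last edge fall into one extra, final bucket."""
    for i, upper in enumerate(edges):
        if length <= upper:
            return i
    return len(edges)


def recall_at_k_by_length(
    lengths: Sequence[int], ranks: Sequence[int], edges: Sequence[int], k: int = 5
) -> Dict[int, Tuple[float, int]]:
    """Map bucket index -> (R@k in percent, number of queries in the bucket)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for length, rank in zip(lengths, ranks):
        b = bucket_index(length, edges)
        totals[b] += 1
        hits[b] += rank <= k  # True counts as 1
    return {b: (100.0 * hits[b] / totals[b], totals[b]) for b in sorted(totals)}
```

Returning the bucket size alongside the recall makes it easy to spot buckets with too few queries to support a reliable comparison, as noted for the first two sets. Five edges give the six groups used in the analysis.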

4.4. Ablation Studies

In this section, we investigate the contribution of each component in NRCCR. The results on VATEX are summarized in Table 4, where the first row reports the performance of Basic Model. Compared to Basic Model, we empirically observe that self-distillation from the similarity-based view and the feature-based view improves SumR by 2.4 and 4.2, respectively. In addition, performance is further improved by using both of them (SumR increases from 352.2 to 360.2), which suggests that the two views are complementary. By additionally minimizing the cycle semantic consistency loss $L_{cyc}$ between source sentences and back-translated sentences, our model achieves an extra performance gain. Finally, performing adversarial training is also beneficial to the overall performance, indicating the importance of learning language-agnostic representations.
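The gains quoted above follow directly from the SumR column of Table 4. The short script below recomputes them. The SumR values are copied from Table 4, while the row labels (for example `+L_sim`) are shorthand introduced here for the checkmark patterns and are not names used in the paper.

```python
# SumR values from Table 4; keys are illustrative labels for each row.
SUMR = {
    "Basic Model": 352.2,
    "+L_sim": 354.6,
    "+L_feat": 356.4,
    "+L_sim+L_feat": 360.2,
    "+L_sim+L_feat+L_cyc": 361.4,
    "+L_sim+L_feat+L_cyc+L_adv": 364.7,
}


def gains_over_baseline(sumr: dict, baseline: str = "Basic Model") -> dict:
    """SumR improvement of every configuration over the baseline row."""
    base = sumr[baseline]
    return {
        name: round(value - base, 1)
        for name, value in sumr.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, gain in gains_over_baseline(SUMR).items():
        print(f"{name}: +{gain}")
```

The combined gain of the two self-distillation views (8.0) exceeds the sum of their individual gains (2.4 + 4.2 = 6.6), which is consistent with the two views being complementary.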

4.5. Analysis by t-SNE Visualization

To further analyze why our method yields better performance than Basic Model, we randomly select 20 samples from the Chinese test set of VATEX and visualize their representations produced by Basic Model (shown in Figure 6(a)) and NRCCR (shown in Figure 6(b)), respectively. Each sample consists of a video, 10 Chinese sentences and the 10 corresponding translated English sentences. Compared with Basic Model, the intra-class representations in NRCCR are more compact, which suggests that language-agnostic representations have been learned and that our model achieves a better cross-lingual cross-modal alignment. It is worth noting that the English sentences obtained by MT are noisy, but the English representations are still closely clustered with the Chinese representations in Figure 6(b). This result demonstrates, to some extent, that our method is able to restore the original semantic information from noisy English sentences; a possible reason is that our method has been optimized by cycle semantic consistency learning.

Figure 6. The t-SNE results produced by (a) Basic Model and (b) NRCCR.

5. Summary and Conclusions

This paper proposes NRCCR, a noise-robust learning framework for CCR. We resort to MT to achieve cross-lingual transfer, and alleviate the noise introduced by MT in both the training and testing stages via noise-robust learning. We conduct experiments on two multilingual video-text datasets and a multilingual image-text dataset, where our method outperforms previous CCR methods. The experimental results show that our method effectively enhances the robustness of the model and significantly improves the quality of cross-lingual cross-modal alignment. We hope our work brings insights into cross-lingual transfer. In future work, we would like to investigate the effectiveness of our model when transferring to multiple target languages simultaneously.

6. Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2018YFB1404102), NSFC (No. 61902347, 61976188), the Public Welfare Technology Research Project of Zhejiang Province (No. LGF21F020010), the Open Projects Program of the National Laboratory of Pattern Recognition, and the Fundamental Research Funds for the Provincial Universities of Zhejiang.

References

Aggarwal and Kale (2020) Pranav Aggarwal and Ajinkya Kale. 2020. Towards zero-shot Cross-lingual Image retrieval. arXiv preprint arXiv:2012.05107 (2020).
Ali et al. (2022) Ameen Ali, Idan Schwartz, Tamir Hazan, and Lior Wolf. 2022. Video and Text Matching With Conditioned Embeddings. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1565–1574.
Anwaar et al. (2021) Muhammad Umer Anwaar, Egor Labintcev, and Martin Kleinsteuber. 2021. Compositional Learning of Image-Text Query for Image Retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1140–1149.
Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. In Proceedings of the Sixth International Conference on Learning Representations.
Artetxe and Schwenk (2019) Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics 7 (2019), 597–610.
Burns et al. (2020) Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, and Bryan A Plummer. 2020. Learning to scale multilingual representations for vision-language tasks. In European Conference on Computer Vision. 197–213.
Cao et al. (2022) Shuqiang Cao, Bairui Wang, Wei Zhang, and Lin Ma. 2022. Visual Consensus Modeling for Video-Text Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence.
Carreira and Zisserman (2017) Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
Chang et al. (2015) Xiaojun Chang, Yi Yang, Alexander Hauptmann, Eric P Xing, and Yao-Liang Yu. 2015. Semantic concept discovery for large-scale zero-shot event detection. In Twenty-fourth international joint conference on artificial intelligence.
Chen et al. (2021) Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, and Changhu Wang. 2021. Learning the best pooling strategy for visual semantic embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15789–15798.
Chen et al. (2020) Shizhe Chen, Yida Zhao, Qin Jin, and Qi Wu. 2020. Fine-grained video-text retrieval with hierarchical graph reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10638–10647.
Conneau et al. (2017) Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087 (2017).
Cui et al. (2019) Hui Cui, Lei Zhu, Jingjing Li, Yang Yang, and Liqiang Nie. 2019. Scalable deep hashing for large-scale social image retrieval. IEEE Transactions on image processing 29 (2019), 1271–1284.
Cui et al. (2021) Qu Cui, Shujian Huang, Jiahuan Li, Xiang Geng, Zaixiang Zheng, Guoping Huang, and Jiajun Chen. 2021. Directqe: Direct pretraining for machine translation quality estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 12719–12727.
Dalton et al. (2013) Jeffrey Dalton, James Allan, and Pranav Mirajkar. 2013. Zero-shot video retrieval using content and concepts. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 1857–1860.
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 248–255.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota, 4171–4186.
Dong et al. (2018) Jianfeng Dong, Xirong Li, and Cees G. M. Snoek. 2018. Predicting Visual Features From Text for Image and Video Caption Retrieval. IEEE Transactions on Multimedia 20, 12 (2018), 3377–3388.
Dong et al. (2019) Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, and Xun Wang. 2019. Dual Encoding for Zero-Example Video Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Dong et al. (2021) Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, and Meng Wang. 2021. Dual encoding for video retrieval by text. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
Dong et al. (2022) Jianfeng Dong, Yabing Wang, Xianke Chen, Xiaoye Qu, Xirong Li, Yuan He, and Xun Wang. 2022. Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval. IEEE Transactions on Circuits and Systems for Video Technology (2022).
Elliott et al. (2016) Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia. 2016. Multi30k: Multilingual english-german image descriptions. In Proceedings of the 5th Workshop on Vision and Language. 70–74.
Elliott and Kádár (2017) Desmond Elliott and Akos Kádár. 2017. Imagination improves multimodal translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 130–141.
Faghri et al. (2018) Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. Vse++: Improving visual-semantic embeddings with hard negatives. Proceedings of the British Machine Vision Conference.
Fei et al. (2021) Hongliang Fei, Tan Yu, and Ping Li. 2021. Cross-lingual Cross-modal Pretraining for Multimodal Retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 3644–3650.
Gabeur et al. (2022) Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, and Cordelia Schmid. 2022. Masking Modalities for Cross-modal Video Retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1766–1775.
Gabeur et al. (2020) Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. 2020. Multi-modal transformer for video retrieval. In European Conference on Computer Vision. 214–229.
Gan et al. (2020) Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-scale adversarial training for vision-and-language representation learning. Advances in Neural Information Processing Systems 33 (2020), 6616–6628.
Habibian et al. (2016) Amirhossein Habibian, Thomas Mensink, and Cees GM Snoek. 2016. Video2vec embeddings recognize events when examples are scarce. IEEE transactions on pattern analysis and machine intelligence 39, 10 (2016), 2089–2103.
Han et al. (2021) Ning Han, Jingjing Chen, Guangyi Xiao, Hao Zhang, Yawen Zeng, and Hao Chen. 2021. Fine-grained Cross-modal Alignment Network for Text-Video Retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. 3826–3834.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
He et al. (2021) Yi He, Xin Liu, Yiu-Ming Cheung, Shu-Juan Peng, Jinhan Yi, and Wentao Fan. 2021. Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1865–1869.
Hong et al. (2021) Weixiang Hong, Kaixiang Ji, Jiajia Liu, Jian Wang, Jingdong Chen, and Wei Chu. 2021. GilBERT: Generative Vision-Language Pre-Training for Image-Text Retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1379–1388.
Huang et al. (2021b) Dandan Huang, Kun Wang, and Yue Zhang. 2021b. A Comparison between Pre-training and Large-scale Back-translation for Neural Machine Translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 1718–1732.
Huang et al. (2021a) Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, and Alexander G Hauptmann. 2021a. Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2443–2459.
Huang et al. (2017) Yan Huang, Wei Wang, and Liang Wang. 2017. Instance-aware image and sentence matching with selective multimodal lstm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2310–2318.
Jia et al. (2021) Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning. 4904–4916.
Kim et al. (2020) Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, and Bryan Plummer. 2020. Mule: Multimodal universal language embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11254–11261.
Lan et al. (2017) Weiyu Lan, Xirong Li, and Jianfeng Dong. 2017. Fluency-guided cross-lingual image captioning. In Proceedings of the 25th ACM international conference on Multimedia. 1549–1557.
Lei et al. (2021a) Jie Lei, Tamara L Berg, and Mohit Bansal. 2021a. mtvr: Multilingual moment retrieval in videos. arXiv preprint arXiv:2108.00061 (2021).
Lei et al. (2021b) Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. 2021b. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7331–7341.
Li et al. (2020a) Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020a. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11336–11344.
Li et al. (2019b) Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2019b. Visual Semantic Reasoning for Image-Text Matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Li et al. (2021) Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, et al. 2021. Value: A multi-task benchmark for video-and-language understanding evaluation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Vol. 1.
Li et al. (2016) Xirong Li, Weiyu Lan, Jianfeng Dong, and Hailong Liu. 2016. Adding chinese captions to images. In Proceedings of the 2016 ACM on international conference on multimedia retrieval. 271–275.
Li et al. (2019a) Xirong Li, Chaoxi Xu, Gang Yang, Zhineng Chen, and Jianfeng Dong. 2019a. W2vv++ fully deep learning for ad-hoc video search. In Proceedings of the 27th ACM International Conference on Multimedia. 1786–1794.
Li et al. (2020b) Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020b. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision. 121–137.
Li et al. (2020c) Xirong Li, Fangming Zhou, Chaoxi Xu, Jiaqi Ji, and Gang Yang. 2020c. Sea: Sentence encoder assembly for video retrieval by textual queries. IEEE Transactions on Multimedia 23 (2020), 4351–4362.
Liu et al. (2021b) An-An Liu, Heyu Zhou, Weizhi Nie, Zhenguang Liu, Wu Liu, Hongtao Xie, Zhendong Mao, Xuanya Li, and Dan Song. 2021b. Hierarchical multi-view context modelling for 3D object classification and retrieval. Information Sciences 547 (2021), 984–995.
Liu et al. (2021a) Hongying Liu, Ruyi Luo, Fanhua Shang, Mantang Niu, and Yuanyuan Liu. 2021a. Progressive Semantic Matching for Video-Text Retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. 5083–5091.
Liu et al. (2019) Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. 2019. Use what you have: Video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487 (2019).
Liu et al. (2017) Yu Liu, Yanming Guo, Erwin M Bakker, and Michael S Lew. 2017. Learning a recurrent residual fusion network for multimodal matching. In Proceedings of the IEEE International Conference on Computer Vision. 4107–4116.
Liu et al. (2022) Zhenguang Liu, Runyang Feng, Haoming Chen, Shuang Wu, Yixing Gao, Yunjun Gao, and Xiang Wang. 2022. Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11006–11016.
Lu et al. (2019) Xu Lu, Lei Zhu, Zhiyong Cheng, Liqiang Nie, and Huaxiang Zhang. 2019. Online multi-modal hashing with dynamic query-adaption. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval. 715–724.
Luo et al. (2021a) Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2021a. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860 (2021).
Luo et al. (2021b) Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Hongyang Chao, and Tao Mei. 2021b. CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising. In Proceedings of the 29th ACM International Conference on Multimedia. 5600–5608.
Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. 151–159.
Markatopoulou et al. (2017) Foteini Markatopoulou, Damianos Galanopoulos, Vasileios Mezaris, and Ioannis Patras. 2017. Query and keyframe representations for ad-hoc video search. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. 407–411.
Mettes et al. (2020) Pascal Mettes, Dennis C Koelma, and Cees GM Snoek. 2020. Shuffled imagenet banks for video event detection and search. ACM Transactions on Multimedia Computing, Communications, and Applications 16, 2 (2020), 1–21.
Miech et al. (2021) Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2021. Thinking Fast and Slow: Efficient Text-to-Visual Retrieval With Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9826–9836.
Miech et al. (2019) Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Ni et al. (2021) Minheng Ni, Haoyang Huang, Lin Su, Edward Cui, Taroon Bharti, Lijuan Wang, Dongdong Zhang, and Nan Duan. 2021. M3p: Learning universal representations via multitask multilingual multimodal pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3977–3986.
Park et al. (2021) Chanjun Park, Sugyeong Eo, Hyeonseok Moon, and Heui-Seok Lim. 2021. Should we find another model?: Improving neural machine translation performance with one-piece tokenization method without model modification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 97–104.
Philip et al. (2021) Jerin Philip, Shashank Siripragada, Vinay P Namboodiri, and CV Jawahar. 2021. Revisiting low resource status of indian languages in machine translation. In 8th ACM IKDD CODS and 26th COMAD. 178–187.
Portaz et al. (2019) Maxime Portaz, Hicham Randrianarivo, Adrien Nivaggioli, Estelle Maudet, Christophe Servan, and Sylvain Peyronnet. 2019. Image search using multilingual texts: a cross-modal learning approach between image and text. arXiv preprint arXiv:1903.11299 (2019).
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. 8748–8763.
Sharma et al. (2018) Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2556–2565.
Song et al. (2021b) Xue Song, Jingjing Chen, Zuxuan Wu, and Yu-Gang Jiang. 2021b. Spatial-temporal graphs for cross-modal text2video retrieval. IEEE Transactions on Multimedia (2021).
Song et al. (2021a) Yuqing Song, Shizhe Chen, Qin Jin, Wei Luo, Jun Xie, and Fei Huang. 2021a. Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training. In Proceedings of the 29th ACM International Conference on Multimedia. 2843–2852.
Song and Soleymani (2019) Yale Song and Mohammad Soleymani. 2019. Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1979–1988.
Wang et al. (2018) Liwei Wang, Yin Li, Jing Huang, and Svetlana Lazebnik. 2018. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2 (2018), 394–407.
Wang et al. (2022a) Wei Wang, Junyu Gao, Xiaoshan Yang, and Changsheng Xu. 2022a. Many Hands Make Light Work: Transferring Knowledge from Auxiliary Tasks for Video-Text Retrieval. IEEE Transactions on Multimedia (2022).
Wang et al. (2019) Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2019. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4581–4591.
Wang et al. (2022b) Yunxiao Wang, Meng Liu, Yinwei Wei, Zhiyong Cheng, Yinglong Wang, and Liqiang Nie. 2022b. Siamese Alignment Network for Weakly Supervised Video Moment Retrieval. IEEE Transactions on Multimedia (2022).
Wehrmann et al. (2019) Jonatas Wehrmann, Douglas M Souza, Mauricio A Lopes, and Rodrigo C Barros. 2019. Language-agnostic visual-semantic embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5804–5813.
Wei et al. (2021) Jiwei Wei, Xing Xu, Zheng Wang, and Guoqing Wang. 2021. Meta Self-Paced Learning for Cross-Modal Matching. In Proceedings of the 29th ACM International Conference on Multimedia. 3835–3843.
Wei et al. (2019) Yinwei Wei, Xiang Wang, Weili Guan, Liqiang Nie, Zhouchen Lin, and Baoquan Chen. 2019. Neural multimodal cooperative learning toward micro-video understanding. IEEE Transactions on Image Processing 29 (2019), 1–14.
Wu et al. (2021) Peng Wu, Xiangteng He, Mingqian Tang, Yiliang Lv, and Jing Liu. 2021. HANet: Hierarchical Alignment Networks for Video-Text Retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. 3518–3527.
Xiao et al. (2021) Shaoning Xiao, Long Chen, Songyang Zhang, Wei Ji, Jian Shao, Lu Ye, and Jun Xiao. 2021. Boundary proposal network for two-stage natural language video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 2986–2994.
Xie et al. (2017) Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1492–1500.
Xu et al. (2016) Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition. 5288–5296.
Yang et al. (2020) Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, and Tat-Seng Chua. 2020. Tree-augmented cross-modal encoding for complex-query video retrieval. In Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. 1339–1348.
Yang et al. (2021) Xun Yang, Fuli Feng, Wei Ji, Meng Wang, and Tat-Seng Chua. 2021. Deconfounded video moment retrieval with causal intervention. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1–10.
Yang et al. (2019) Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. 2019. Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307 (2019).
Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014), 67–78.
Yu et al. (2021) Tan Yu, Yi Yang, Hongliang Fei, Yi Li, Xiaodong Chen, and Ping Li. 2021. Assorted Attention Network for Cross-Lingual Language-to-Vision Retrieval. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2444–2454.
Yu et al. (2018) Youngjae Yu, Jongseok Kim, and Gunhee Kim. 2018. A joint sequence fusion model for video question answering and retrieval. In Proceedings of the European Conference on Computer Vision. 471–487.
Zeng et al. (2021) Pengpeng Zeng, Lianli Gao, Xinyu Lyu, Shuaiqi Jing, and Jingkuan Song. 2021. Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency for Image-Text Matching. In Proceedings of the 29th ACM International Conference on Multimedia. 2205–2213.
Zhou et al. (2021) Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, and Jingjing Liu. 2021. UC2: Universal cross-lingual cross-modal vision-and-language pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4155–4165.
Zhu and Yang (2020) Linchao Zhu and Yi Yang. 2020. ActBERT: Learning global-local video-text representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8746–8755.

Figure 7. Examples of cross-lingual text-to-video retrieval on VATEX. We present top 3 videos retrieved for each query in the target language (Chinese), given by Basic Model and NRCCR respectively.

Appendix A Cross-lingual Text-Video retrieval

A.1. Qualitative Experiment

To qualitatively illustrate model behavior, Figure 7 shows three Chinese queries selected from VATEX and the top 3 videos retrieved by Basic Model and NRCCR, respectively. For Query 1 and Query 2, NRCCR ranks the ground truth first, and its top 3 retrieved videos are all relevant to the given query to some extent. By contrast, Basic Model does not understand the query accurately and thus fails to answer these queries. The reason might be that some noisy sentences generated by MT do not correctly describe the corresponding video content, yet Basic Model is still forced to pull them closer to that content during training, which severely degrades the encoding of clean Chinese sentences. The failure of Basic Model in these cases reveals the importance of noise-robust learning for video-text retrieval in cross-lingual scenarios. For Query 3, although both models fail to answer the query, the content of the top 3 videos retrieved by NRCCR is still more closely related to the query than that of the videos retrieved by Basic Model. This failure case might suggest that NRCCR can capture the global semantic information contained in input sentences, but still has limitations in fine-grained semantic understanding.

Figure 8. Qualitative results of cross-lingual text-to-text retrieval on VATEX. For each query, the top 3 ranked translated English sentences and the ground-truth sentences are shown. If the ground-truth sentence is among the top 3, the fourth sentence is included as well.

Appendix B Cross-lingual Text-Text Retrieval

B.1. Quantitative Experiment

To further verify the robustness of our model for text encoding, we conduct a quantitative analysis of cross-lingual text-to-text retrieval on the Chinese test set of VATEX. Specifically, following the setting in the main text, we first train NRCCR and Basic Model on VATEX and translate the Chinese test set into English. Given a Chinese sentence, we treat an English sentence as relevant if it is translated from that Chinese sentence, and as irrelevant otherwise. During evaluation, for each Chinese sentence, we retrieve its relevant English sentence according to the cosine similarity between the sentence embeddings produced by NRCCR and Basic Model, respectively. mAP is used as the retrieval metric, and the results are shown in Table 5. We observe that NRCCR outperforms Basic Model by 19.44 points in mAP (81.28 vs. 61.84), indicating that NRCCR achieves a robust cross-lingual alignment through noise-robust learning.

Method	mAP
Basic Model	61.84
NRCCR	81.28

Table 5. The quantitative comparison of cross-lingual text-to-text retrieval (ZH-to-EN). The performance is reported in percentage (%).
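
The evaluation protocol of this experiment can be illustrated with a minimal sketch. It is not taken from the released code: the function names and the toy embeddings are hypothetical, and it assumes that each Chinese query has exactly one relevant English sentence (its own translation), stored at the same index. Under that assumption, the average precision of a query reduces to the reciprocal of the rank of its ground-truth sentence.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_average_precision(zh_emb, en_emb) -> float:
    """mAP for ZH-to-EN retrieval, assuming en_emb[i] is the only
    relevant candidate for the query zh_emb[i] (hypothetical layout)."""
    ap_sum = 0.0
    for i, query in enumerate(zh_emb):
        sims = [cosine(query, cand) for cand in en_emb]
        # Rank of the ground truth: 1 + number of candidates scored higher.
        rank = 1 + sum(1 for s in sims if s > sims[i])
        # With a single relevant item, average precision equals 1 / rank.
        ap_sum += 1.0 / rank
    return ap_sum / len(zh_emb)


# Toy 2-D embeddings standing in for sentence embeddings (made-up values).
zh = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
en = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
score = mean_average_precision(zh, en)
print(f"mAP = {100 * score:.2f}%")
```

In this toy example the ground-truth sentences are ranked second, first and third, so the score is the mean of 1/2, 1 and 1/3. Swapping in the sentence embeddings of either model would yield the kind of score reported in Table 5, provided the one-to-one relevance assumption holds.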

B.2. Qualitative Experiment

As shown in Figure 8, we qualitatively analyze the results of the cross-lingual text-to-text retrieval. For each Chinese query, only the English sentence machine translated from the query is regarded as the ground truth. For Query 1 and Query 2, the top retrieved sentences in Figure 8 are semantically similar to the given query to some extent. In Query 3, the blue word is a typo produced during manual annotation, and the typo is machine translated as "discount" in the ground truth without correction. Even given such a noisy query, the top retrieved sentences are all related to the correct meaning of the query. This supports the robustness of our model, which can extract proper global semantic information from noisy sentences. For Query 4, although our model fails to answer the query, the top retrieved sentences still have a certain semantic relevance to it. This might suggest that our model pays more attention to coarse-grained global information than to fine-grained information. As a result, it has limitations in capturing fine-grained information (e.g., objects and actions) in the input sentence, which has also been discussed in Section A.1.

Appendix C Implementation Details

For video features on MSR-VTT-CN, we concatenate frame-level ResNet-152 (He et al., 2016) features with frame-level ResNeXt-101 (Xie et al., 2017; Mettes et al., 2020) features to obtain a combined 4,096-dimensional feature. On VATEX, we adopt the officially provided I3D (Carreira and Zisserman, 2017) video feature. As for the model structure, we employ a 1-layer Transformer with 8 heads and 1024 hidden units as the visual encoder. For text encoding, we use the pre-trained mBERT-base as the backbone and freeze the layers below layer 7, as this setup empirically performs best. We set the number of Transformer heads to 8 and use only one block, as we find that a single block is able to achieve the expected performance. For optimization, we use PyTorch as our deep learning framework. The training batch size is 128. We use an Adam optimizer with an initial learning rate of 1e-4 and a learning rate adjustment schedule similar to (Dong et al., 2022). For the hyperparameters in Equation 12 (in the main text), we set $λ_{1} = 0.4$, $λ_{2} = 0.1$, $λ_{3} = 0.5$ and $λ_{4} = 10^{-3}$. The hyperparameter $α$ in the overall triplet loss is set to 0.6, and the weight $β$ used for inference is set to 0.8. Note that we will release our source code and data.
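
For convenience, the settings above can be gathered into a single configuration object. The following is only an illustrative sketch: the class name and all field names are hypothetical and are not taken from the released code, while the values are those stated in this appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NRCCRConfig:
    """Hypothetical container for the settings listed in Appendix C."""
    # Visual encoder: 1-layer Transformer, 8 heads, 1024 hidden units.
    visual_layers: int = 1
    visual_heads: int = 8
    visual_hidden: int = 1024
    # MSR-VTT-CN feature: ResNet-152 concatenated with ResNeXt-101.
    video_feature_dim: int = 4096
    # Text encoder: mBERT-base with the layers below layer 7 frozen.
    text_backbone: str = "mBERT-base"
    first_trainable_layer: int = 7
    # Optimization: Adam, batch size 128, initial learning rate 1e-4.
    batch_size: int = 128
    learning_rate: float = 1e-4
    # Loss weights of Equation 12 in the main text.
    lambda_1: float = 0.4
    lambda_2: float = 0.1
    lambda_3: float = 0.5
    lambda_4: float = 1e-3
    # alpha: hyperparameter of the overall triplet loss.
    alpha: float = 0.6
    # beta: weight used at inference time.
    beta: float = 0.8


cfg = NRCCRConfig()
print(cfg)
```

Declaring the dataclass as frozen prevents the values from being modified accidentally after construction, which is a common choice for experiment settings.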