Partially Relevant Video Retrieval

Jianfeng Dong , Xianke Chen Zhejiang Gongshang University , Minsong Zhang Zhejiang Gongshang University , Xun Yang University of Science and Technology of China , Shujie Chen Zhejiang Gongshang University , Xirong Li Key Lab of DEKE, Renmin University of China and Xun Wang Zhejiang Gongshang University

2022

Abstract.

Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning oriented datasets such as MSVD, MSR-VTT and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed with short duration, whilst the provided captions well describe the gist of the video content. Consequently, for a given paired video and caption, the video is supposed to be fully relevant to the caption. In reality, however, as queries are not known a priori, pre-trimmed video clips may not contain sufficient content to fully meet the query. This suggests a gap between the literature and the real world. To fill the gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An untrimmed video is considered to be partially relevant w.r.t. a given textual query if it contains a moment relevant to the query. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos. PRVR differs from single video moment retrieval and video corpus moment retrieval, as the latter two are to retrieve moments rather than untrimmed videos. We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames. Clips and frames represent video content at different time scales. We propose a Multi-Scale Similarity Learning (MS-SL) network that jointly learns clip-scale and frame-scale similarities for PRVR. Extensive experiments on three datasets (TVR, ActivityNet Captions, and Charades-STA) demonstrate the viability of the proposed method. We also show that our method can be used for improving video corpus moment retrieval.

Video-Text Retrieval, Partially Relevant, Multiple Instance Learning, Video Representation Learning

Published in: Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), October 10–14, 2022, Lisboa, Portugal. DOI: 10.1145/3503161.3547976. ISBN: 978-1-4503-9203-7/22/10. CCS concepts: Information systems → Video search.

Figure 1. Two textual queries partially relevant to a given video. Only a specific moment in the video is relevant to the corresponding query, while the other frames are irrelevant. We formulate the task of partially relevant video retrieval (PRVR) as a multiple instance learning problem, and propose a MS-SL network. MS-SL first detects a key clip that is most likely to be relevant to the query. Then, the importance of each frame is measured in a fine-grained temporal scale under the guidance of the key clip. The final similarity is computed by jointly considering the query’s similarities with the key clip and the frames.

1. Introduction

With the advent of the big data era, millions of videos are uploaded to the Internet every day. There is an increasing need to retrieve videos from such large collections. As common users prefer to express their information need by natural-language queries, research on text-to-video retrieval (T2VR) is important (Li et al., 2020; Yang et al., 2020; Croitoru et al., 2021). Given a query in the form of a natural language sentence, T2VR asks to retrieve videos that are semantically relevant to the given query from a gallery of videos. Current methods (Liu et al., 2019a; Jin et al., 2021; Chen et al., 2020b; Gabeur et al., 2020; Han et al., 2021; Luo et al., 2021) for T2VR are trained and tested on video-captioning oriented datasets such as MSVD (Chen and Dolan, 2011), MSR-VTT (Xu et al., 2016) and VATEX (Wang et al., 2019). A key property of these datasets is that videos are assumed to be temporally pre-trimmed with short duration, while the provided captions describe the gist of the video content well. Consequently, for a given paired video and caption, the video is supposed to be fully relevant to the caption. In reality, however, as queries are not known a priori, pre-trimmed video clips may not contain sufficient content to fully meet the query. This suggests a gap between the literature and the real world.

To fill the above gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An untrimmed video is considered to be partially relevant w.r.t. a given textual query as long as the video contains a (short) moment relevant w.r.t. the query, see Fig. 1. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos. Because where the relevant moment is localized and how long it lasts are both unknown (see Fig. 2), PRVR is more challenging than the conventional T2VR task.

Observing the connection between PRVR and Multiple Instance Learning (MIL) (Dietterich et al., 1997; Maron and Lozano-Pérez, 1997) at a high level, we tackle the new task by a multi-scale MIL approach. In the current context, a video is simultaneously viewed as a bag of video clips and a bag of video frames. Clips and frames represent video content at different temporal scales, which is helpful for handling moments of varying temporal lengths. Based on this multi-scale video representation, we propose a Multi-Scale Similarity Learning (MS-SL) network. In MS-SL, we regard the clip scale as the coarse temporal granularity, as a clip is typically of longer duration. The frame scale is regarded as the fine-grained temporal granularity, as frames usually reflect more detailed content of videos. The multi-scale similarity learning consists of a clip-scale SL branch built on clip-scale video representation and a frame-scale SL branch built on frame-scale video representation. They are jointly learned in a coarse-to-fine manner. Note that the two similarity learning branches are not independent. In clip-scale SL, a key clip that is most likely to be relevant to the query is detected. Then the clip-scale similarity is computed as the similarity between the key clip and the query. Additionally, the key clip is regarded as a guide for frame-scale SL to measure the importance of each frame in a fine-grained temporal scale. The frame-scale similarity is computed as the similarity between the weighted frames and the query. Finally, the clip-scale similarity and the frame-scale similarity are jointly used to measure the final video-text similarity.

It is worth noting that PRVR differs from Single Video Moment Retrieval (SVMR) (Yuan et al., 2019b; Zhang et al., 2020b; Xiao et al., 2021; Yang et al., 2021) and Video Corpus Moment Retrieval (VCMR) (Lei et al., 2020; Zhang et al., 2020a, 2021; Hou et al., 2021), as the latter two are to retrieve moments rather than untrimmed videos. Additionally, although our model is proposed for PRVR, it can also be used for improving VCMR. In sum, our main contributions are as follows:
• We propose a new T2VR subtask named PRVR, where an untrimmed video is considered to be partially relevant with respect to a given textual query if it contains a moment relevant to the query. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos.
• We formulate the PRVR subtask as a MIL problem, simultaneously viewing a video as a bag of video clips and a bag of video frames. Clips and frames represent video content at different temporal scales. Based on multi-scale video representation, we propose MS-SL to compute the relevance between videos and queries in a coarse-to-fine manner.
• Extensive experiments on three datasets (TVR (Lei et al., 2020), ActivityNet Captions (Krishna et al., 2017), and Charades-STA (Gao et al., 2017)) demonstrate the viability of the proposed method for PRVR. We also show that our method can be used for improving video corpus moment retrieval. Source code and datasets are available at http://danieljf24.github.io/prvr

Figure 2. Distribution of moment-to-video ratio on (a) TVR and (b) ActivityNet Captions. Moment-to-video ratio indicates the moment’s length ratio in the entire video. Moments show a large variance in their temporal lengths.

2. Related Work

T2VR. The T2VR task has gained much attention in recent years (Dong et al., 2018; Song and Soleymani, 2019; Chen et al., 2020b; Ging et al., 2020; Yang et al., 2020; Han et al., 2021; Liu et al., 2021b; Wang et al., 2020b; Hu et al., 2022), aiming to retrieve relevant videos by a given query from a set of pre-trimmed video clips. The retrieved clips are supposed to be fully relevant to the given query. A common solution for T2VR is to first encode videos and textual queries and then map them into common embedding spaces where the cross-modal similarity is measured. Therefore, current works mainly focus on video encoding (Liu et al., 2019a; Jin et al., 2021; Feng et al., 2021; Song et al., 2021), sentence encoding (Chen et al., 2020b; Li et al., 2020; Croitoru et al., 2021), and their cross-modal similarity learning (Yu et al., 2018; Dong et al., 2021; Gabeur et al., 2020; Wu et al., 2021). Different from the above works, we consider a more realistic scenario, where videos are supposed to be partially relevant to a specific query. We thus focus more on how to measure the partial relevance between textual queries and videos.

VMR. The VMR task is to retrieve moments semantically relevant to the given query from a given single untrimmed video or a large collection of untrimmed videos. The former is known as SVMR (Anne Hendricks et al., 2017; Liu et al., 2021a; Zheng et al., 2022; Qu et al., 2020; Liu et al., 2020; Yang et al., 2022; Wang et al., 2021), and the latter is known as VCMR (Escorcia et al., 2019; Paul et al., 2021; Zhang et al., 2020a; Wang et al., 2022). In SVMR, existing methods mainly concentrate on how to precisely localize the temporal boundaries of target moments, and can typically be classified as proposal-based methods (Chen et al., 2018; Yuan et al., 2019a; Zhang et al., 2019; Wang et al., 2020a; Gao and Xu, 2021) and proposal-free methods (Yuan et al., 2019b; Qu et al., 2020; Chen et al., 2020a). Proposal-based methods first generate multiple moment proposals, then match them with a query to determine the most relevant one from the proposals. Without generating moment proposals, proposal-free methods predict the start and end time points of the target moment based on the fused video-query feature. As the extension of SVMR, the VCMR task is to retrieve moments (or video segments) that are semantically relevant w.r.t. a given query from a collection of untrimmed videos. The state-of-the-art methods (e.g., ReLoCLNet (Zhang et al., 2021) and XML (Lei et al., 2020)) for VCMR have a two-stage workflow. The first stage is to retrieve a number of candidate videos which may contain the target moment, while the second stage is to retrieve moments from the candidate videos.

Different from video moment retrieval, which aims to retrieve moments, our proposed PRVR task aims to retrieve untrimmed videos. PRVR is similar to VCMR’s first stage, yet it requires no moment-level annotations, which are commonly needed for VCMR. Therefore, a method for PRVR can in principle be used to improve a two-stage method for VCMR. Although our proposed model is designed for PRVR, it can also be used for improving VCMR.

MIL. MIL (Dietterich et al., 1997; Maron and Lozano-Pérez, 1997) is a classical framework for learning from weakly annotated data, and is widely used for classification tasks (Li et al., 2021b, a). In MIL, a sample is defined as a bag of multiple instances, and a label is associated only with the bag rather than with each instance. A bag is positive if it contains at least one positive instance and negative if it contains no positive instance. Existing MIL methods can be roughly grouped into instance-based methods (Pinheiro and Collobert, 2015; Oquab et al., 2015; Feng and Zhou, 2017) and embedding-based methods (Ilse et al., 2018; Tu et al., 2019; Li et al., 2021a). The former typically predict a score for each instance in the bag and aggregate the scores to generate a bag score. The latter usually aggregate the embeddings of all instances into a bag embedding, then output a bag score based on the bag embedding. In this work, we formulate the PRVR task as a MIL problem. Different from previous MIL works that usually regard a sample as a specific bag of instances, in this work a video is simultaneously viewed as a bag of video clips and a bag of video frames. Moreover, we employ MIL for the retrieval task instead of the classification task.
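The bag-level definitions above can be illustrated with a minimal sketch. The function names are hypothetical, and the use of max and mean as aggregators is an assumption chosen for illustration; the cited MIL methods use a variety of aggregation schemes.

```python
def bag_is_positive(instance_labels):
    """A bag is positive if at least one of its instances is positive."""
    return any(instance_labels)


def instance_based_bag_score(instance_scores):
    """Instance-based MIL: score each instance, then aggregate the scores.

    Max is an assumed aggregator used here for illustration.
    """
    return max(instance_scores)


def embedding_based_bag_score(instance_embeddings, score_fn):
    """Embedding-based MIL: aggregate embeddings first, then score the bag.

    Mean pooling is an assumed aggregator; score_fn is a hypothetical
    stand-in for a learned bag-level scoring function.
    """
    n = len(instance_embeddings)
    d = len(instance_embeddings[0])
    bag = [sum(e[j] for e in instance_embeddings) / n for j in range(d)]
    return score_fn(bag)


positive_bag = bag_is_positive([False, True, False])
negative_bag = bag_is_positive([False, False])
max_score = instance_based_bag_score([0.1, 0.9, 0.3])
pooled_score = embedding_based_bag_score([[1.0, 3.0], [3.0, 5.0]], sum)
```

In the PRVR setting, the instances correspond to clips or frames of a video, and the bag label corresponds to whether the video is relevant to the query.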

Figure 3. The framework of our proposed model for partially relevant video retrieval. Here, $k$ denotes the size of a temporal sliding window applied with a stride of 1.

3. Our Method

We formulate PRVR as a MIL problem. As moments relevant to queries typically show large variations in their temporal lengths, we devise multi-scale video representation to represent videos at multiple temporal scales, obtaining a bag of video clips of varying lengths and a bag of video frames. Based on the two bags, we further propose multi-scale similarity learning to measure the partial query-video relevance, see Fig. 3.

3.1. Formulation of PRVR

Given a natural language query, the task of PRVR aims to retrieve videos containing a moment that is semantically relevant to the given query, from a large corpus of untrimmed videos. As the moment referred to by the query is typically a small part of a video, we argue that the query is partially relevant to the video. It is worth pointing out that PRVR is different from conventional T2V retrieval (Dong et al., 2019; Chen et al., 2020b; Han et al., 2021), where videos are pre-trimmed and much shorter, and queries are usually fully relevant to the whole video.

To build a PRVR model, a set of untrimmed videos are given for training, where each video is associated with multiple natural language sentences. Each sentence describes the content of a specific moment in the corresponding video. Note that we do not have access to the start/end time points of the moments (moment annotations) referred to by the sentences.

3.2. Sentence Representation

For sentence representation, we adopt the method by Lei et al. (Lei et al., 2020), considering its good performance on VCMR. Specifically, given a sentence consisting of $n_{q}$ words, a pre-trained RoBERTa (Liu et al., 2019b) is first employed to extract word features. Then a fully connected (FC) layer with a ReLU activation is utilized to map the word features into a lower-dimensional space. After adding the learned positional embedding to the mapped features, a standard Transformer layer (Vaswani et al., 2017) is further employed to obtain a sequence of $d$-dimensional contextualized word feature vectors $Q = \{q_{i}\}_{i = 1}^{n_{q}} \in R^{d \times n_{q}}$. In the Transformer, the features are fed to a multi-head attention layer followed by a feed-forward layer, and both layers are connected with a residual connection (He et al., 2016) and layer normalization (Ba et al., 2016). Finally, a sentence-level representation $q \in R^{d}$ is obtained by employing a simple attention on $Q$:

(1)

$q = \sum_{i = 1}^{n_{q}} α_{i}^{q} \times q_{i}, \quad α^{q} = \mathrm{Softmax}(w^{T} Q),$

where $\mathrm{Softmax}$ denotes the softmax layer, $w \in R^{d \times 1}$ is a trainable vector, and $α^{q} \in R^{1 \times n_{q}}$ indicates the attention vector.
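As an illustration of the attention pooling in Eq. 1, the following is a minimal pure-Python sketch. The function names are hypothetical, and the vector `w` is fixed by hand here, whereas in the model it is trainable.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(word_feats, w):
    """Pool word vectors q_i into one sentence vector q, following Eq. 1.

    word_feats: list of n_q word vectors, each of length d.
    w: vector of length d (trainable in the model; a fixed value here).
    """
    # One scalar score per word: w^T q_i.
    scores = [sum(wj * qj for wj, qj in zip(w, q_i)) for q_i in word_feats]
    alpha = softmax(scores)
    d = len(w)
    # Weighted sum of the word vectors.
    pooled = [sum(a * q_i[j] for a, q_i in zip(alpha, word_feats)) for j in range(d)]
    return pooled, alpha


words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q, alpha = attention_pool(words, w=[0.5, -0.5])
q_uniform, alpha_uniform = attention_pool(words, w=[0.0, 0.0])
```

With a zero vector `w`, all words receive equal weight and the pooling reduces to a plain mean over the word features.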

3.3. Multi-Scale Video Representation

Given an untrimmed video, we first represent it by a sequence of $d_{v}$ -dimensional feature vectors $V \in R^{d_{v} \times n_{v}}$ , where $n_{v}$ denotes the number of the vectors. The feature sequence is obtained by extracting frame-level features using a pre-trained 2D CNN, or extracting segment-level features using a pre-trained 3D CNN. For the ease of description, we regard $V$ as a sequence of frame-level features in the following. Based on $V$ , we construct multi-scale video representation, jointly using a clip-scale feature learning branch and a frame-scale feature learning branch.

3.3.1. Clip-scale video representation

Before constructing video clips, we first downsample the input in the temporal domain to reduce the length of the feature sequence, which helps reduce the computational complexity of the model. Specifically, given a sequence of frame feature vectors $V$ as the input, we downsample them into a fixed number of feature vectors, where each feature vector is obtained by mean pooling over the corresponding consecutive frame features. Then, the video is described by a sequence of new feature vectors $U \in R^{d_{v} \times n_{u}}$, where $n_{u}$ indicates the number of the corresponding feature vectors. In order to make the features more compact, we employ an FC layer with a ReLU activation. Moreover, we also use a standard Transformer (Vaswani et al., 2017) with a learned positional embedding to improve the temporal dependency of the features. Formally, through an FC layer and a one-layer Transformer, we obtain $U^{'} \in R^{d \times n_{u}}$:

(2)

$U^{'} = \{u_{1}, u_{2}, \ldots, u_{n_{u}}\} = \mathrm{Transformer}(\mathrm{FC}(U) + \mathrm{PE}),$

where $P E$ denotes the output of the positional embedding. The reduced feature array $U^{'}$ is further used for clip-scale video representation learning.
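The temporal downsampling that produces $U$ from $V$ can be sketched as follows. The text does not specify how frames are assigned to segments, so splitting the sequence into near-equal contiguous segments is an assumption, and the function name `downsample` is hypothetical. The FC layer and Transformer of Eq. 2 are not modeled here.

```python
def downsample(frames, n_u):
    """Mean-pool n_v frame vectors into n_u vectors over contiguous segments.

    Assumption: segment i covers frames [i * n_v // n_u, (i + 1) * n_v // n_u).
    """
    n_v = len(frames)
    if not 1 <= n_u <= n_v:
        raise ValueError("n_u must be between 1 and the number of frames")
    d = len(frames[0])
    pooled = []
    for i in range(n_u):
        start = i * n_v // n_u
        end = (i + 1) * n_v // n_u
        segment = frames[start:end]
        pooled.append([sum(f[j] for f in segment) / len(segment) for j in range(d)])
    return pooled


# Six 2-D frame features pooled down to three vectors.
frames = [[float(t), 1.0] for t in range(6)]
U = downsample(frames, n_u=3)
```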

For clip construction, we employ a multi-scale sliding window strategy to generate video clips, as illustrated in Fig. 3. Note that different from previous work (Zhang et al., 2020a) where video clips are of equal length and non-overlapping, our video clips are of varied lengths and overlapping. Concretely, we apply sliding windows of different sizes over $U^{'}$ along its temporal dimension with a stride of 1. Given a sliding window of size $k$, a clip feature is obtained by mean pooling over the features within the given window. The resultant feature sequence is denoted as $Φ_{k}$. Consequently, by jointly employing sliding windows of varied sizes $\{1, 2, \ldots, n_{u}\}$, we are able to obtain $\{Φ_{1}, Φ_{2}, \ldots, Φ_{n_{u}}\}$. Putting them together, a video can be represented as a sequence of video clips $C \in R^{d \times n_{c}}$:

(3)

$C = \{Φ_{1}, Φ_{2}, \ldots, Φ_{n_{u}}\} = \{c_{1}, c_{2}, \ldots, c_{n_{c}}\},$

where $c_{i} \in R^{d}$ denotes the feature representation of the $i$-th clip, and $n_{c}$ is the number of all generated clips, which satisfies $n_{c} = n_{u} (n_{u} + 1) / 2$ because a window of size $k$ with a stride of 1 yields $n_{u} - k + 1$ clips.
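The multi-scale sliding window construction of Eq. 3 can be sketched in a few lines. The function name `build_clips` is hypothetical, and the input is assumed to be the already re-encoded sequence $U^{'}$ given as a list of vectors.

```python
def build_clips(U):
    """Return all mean-pooled clips for window sizes 1..n_u with stride 1.

    Clips are ordered by window size (Phi_1 first, then Phi_2, and so on),
    and within each size by start position.
    """
    n_u = len(U)
    d = len(U[0])
    clips = []
    for k in range(1, n_u + 1):            # window size
        for start in range(n_u - k + 1):   # stride of 1
            window = U[start:start + k]
            clips.append([sum(u[j] for u in window) / k for j in range(d)])
    return clips


U = [[0.0], [2.0], [4.0]]
C = build_clips(U)
```

For $n_{u} = 3$ this yields $3 + 2 + 1 = 6$ clips, matching $n_{c} = n_{u} (n_{u} + 1) / 2$.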

3.3.2. Frame-scale video representation

As the initial frame features are extracted independently, they naturally lack temporal dependency. To bring such dependency back, we again utilize Transformers. Specifically, given the frame feature sequence $V$ , we first utilize an FC layer with ReLU activation to reduce the dimensionality of the input, followed by a standard Transformer with a positional embedding layer. The re-encoded frame features, denoted $F \in R^{d \times n_{v}}$ , are computed as:

(4)

$F = \{f_{1}, f_{2}, \ldots, f_{n_{v}}\} = \mathrm{Transformer}(\mathrm{FC}(V) + \mathrm{PE}).$

Note that the network structures of the Transformer, FC and PE are the same as those in the clip-scale branch, but their trainable parameters are not shared. This allows each branch to learn parameters suitable for its own scale.

Figure 4. The illustration of multi-scale similarity learning.

3.4. Multi-Scale Similarity Learning

As we have no prior knowledge of where the relevant content is localized in PRVR, it is challenging to directly compute video-text similarity on a fine-grained scale. Here, we propose multi-scale similarity learning, which computes the similarity in a coarse-to-fine manner. It first detects a key clip that is most likely to be relevant to the query. Then, the importance of each frame is measured in a fine-grained temporal scale under the guidance of the key clip. The final similarity is computed by jointly considering the query’s similarities with the key clip and the frames. The hypothesis here is that if a model roughly knows the coarse content relevant to the query, this knowledge will help it find relevant content more accurately on a more fine-grained scale. The framework of the multi-scale similarity learning is illustrated in Fig. 4.

3.4.1. Clip-scale Similarity

Given a video as a sequence of video clips, we first measure the cross-modal similarity of each video clip with the query, and then aggregate the instance similarities to obtain the clip-scale similarity. Specifically, given a sequence of video clips $C = \{c_{1}, c_{2}, \ldots, c_{n_{c}}\}$, we use cosine similarity between each instance representation and the query representation, followed by a max-pooling operator on the similarities. More formally, the clip-scale similarity is obtained as:

(5)

$S_{c} (v, q) = \max \{\cos (c_{1}, q), \cos (c_{2}, q), \ldots, \cos (c_{n_{c}}, q)\},$

where $\cos (\cdot)$ denotes the cosine similarity function. The max-pooling determines the clip with the highest similarity, and we then utilize its similarity with the query as the whole video’s similarity with the query. Besides, we select this clip as the key clip, which is used for the subsequent frame-scale similarity learning.
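A minimal sketch of the clip-scale similarity in Eq. 5 follows. The function names are hypothetical, and the sketch returns the index of the key clip alongside the similarity so that the clip can be reused as a guide at the frame scale.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def clip_scale_similarity(clips, query):
    """Max-pool cosine similarities over clips; also return the key clip index."""
    sims = [cosine(c, query) for c in clips]
    key_index = max(range(len(sims)), key=sims.__getitem__)
    return sims[key_index], key_index


clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 2.0]
s_c, key_index = clip_scale_similarity(clips, query)
```

Here the second clip points in the same direction as the query, so it is selected as the key clip with a similarity of 1.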

3.4.2. Frame-scale Similarity

To obtain the frame-scale similarity, we first aggregate a sequence of frame feature vectors into a single feature vector under the guidance of the key clip obtained in Section 3.4.1, and then compute its similarity with the query as the frame-scale similarity. Specifically, given a sequence of video frames $F = \{f_{1}, f_{2}, \ldots, f_{n_{v}}\}$, we devise a Key Clip Guided Attention (KCGA) to aggregate frame features. The implementation of KCGA borrows the idea of the multi-head self-attention (MHSA) mechanism in Transformer (Vaswani et al., 2017). MHSA first projects the input into queries, keys, and values, and then computes the output as a weighted sum of the values. The weight assigned to each value is computed by a compatibility function of the query with the corresponding key. Unlike MHSA, which utilizes the same input to construct queries, keys, and values, here we take the feature vector of the key clip as the query, and the video frame features as keys and values. Formally, the aggregated frame feature vector is obtained as:

(6)

$r = \mathrm{Softmax}(\tilde{c}^{T} K) Z^{T}, \quad K = W_{k} F, \quad Z = W_{v} F,$

where $\tilde{c} \in R^{d \times 1}$ indicates the feature vector of the key clip, and $W_{k} \in R^{d \times d}$ and $W_{v} \in R^{d \times d}$ are two trainable projection matrices. The dot product measures the similarity between frames and the key clip, resulting in larger values for frames that are more similar to the key clip. Therefore, frames that are more similar to the key clip will have greater attention weights.

Finally, the frame-scale similarity is measured as the cosine similarity between the aggregated frame feature vector $r$ and query feature vector $q$ , namely:

(7)

$S_{f} (v, q) = \cos (r, q).$
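The key clip guided aggregation and the resulting frame-scale similarity can be sketched as below. This is a single-head illustration without any scaling factor, written directly from the formula for $r$; the function names are hypothetical, and the projection matrices are set to the identity here, whereas in the model they are trainable.

```python
import math


def matvec(W, x):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def kcga(key_clip, frames, W_k, W_v):
    """Aggregate frame features with the key clip acting as the attention query."""
    keys = [matvec(W_k, f) for f in frames]      # columns of K = W_k F
    values = [matvec(W_v, f) for f in frames]    # columns of Z = W_v F
    scores = [sum(c * k for c, k in zip(key_clip, key)) for key in keys]
    weights = softmax(scores)
    d = len(values[0])
    r = [sum(w * z[j] for w, z in zip(weights, values)) for j in range(d)]
    return r, weights


identity = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for trainable W_k and W_v
frames = [[1.0, 0.0], [0.0, 1.0]]
key_clip = [5.0, 0.0]
r, weights = kcga(key_clip, frames, identity, identity)
s_f = cosine(r, [1.0, 0.0])           # frame-scale similarity with a toy query
```

The first frame is aligned with the key clip, so it receives almost all of the attention weight and dominates the aggregated vector `r`.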

3.4.3. Similarity Learning

In this section, we first introduce the definition of positive and negative pairs for similarity learning. Inspired by MIL (Dietterich et al., 1997; Maron and Lozano-Pérez, 1997), we define a query-video pair as positive if the video contains certain content that is relevant to the query, and as negative if the video contains no relevant content.

Based on the above definition, we jointly use the triplet ranking loss (Faghri et al., 2018; Dong et al., 2021) and the InfoNCE loss (Miech et al., 2020; Zhang et al., 2021), which are widely used in retrieval-related tasks and which we found to be complementary. Given a positive video-query pair $(v, q)$, the triplet ranking loss over the mini-batch $B$ is defined as:

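(8)

$L^{trip} = \frac{1}{n} \sum_{(q, v) \in B} \{ \max (0, m + S (q^{-}, v) - S (q, v)) + \max (0, m + S (q, v^{-}) - S (q, v)) \},$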

where $m$ is the margin constant, and $S (\cdot)$ denotes the similarity function, for which we can use either the clip-scale similarity $S_{c} (\cdot)$ or the frame-scale similarity $S_{f} (\cdot)$. Besides, $q^{-}$ and $v^{-}$ respectively indicate a negative sentence sample for $v$ and a negative video sample for $q$. The negative samples are randomly sampled from the mini-batch at the beginning of training, and are switched to the hardest negative samples after 20 epochs.
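A minimal sketch of a bidirectional hinge-based triplet ranking loss, using the symbols defined above, is given below. It assumes a square similarity matrix in which `sim[i][j]` holds $S(q_{i}, v_{j})$ and the positive pairs lie on the diagonal, which is a simplification since a video may have several queries in a batch. The function name and the margin value of 0.2 are hypothetical.

```python
import random


def triplet_ranking_loss(sim, margin=0.2, hardest=True, rng=None):
    """Bidirectional triplet ranking loss over a mini-batch.

    sim[i][j] = S(q_i, v_j); positives are assumed to be on the diagonal.
    hardest=False samples one random negative per direction, mirroring the
    early-training strategy; hardest=True uses the hardest in-batch negative.
    """
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two pairs to form negatives")
    rng = rng or random.Random(0)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        neg_videos = [sim[i][j] for j in range(n) if j != i]    # S(q, v-)
        neg_queries = [sim[j][i] for j in range(n) if j != i]   # S(q-, v)
        if hardest:
            s_q_vneg, s_qneg_v = max(neg_videos), max(neg_queries)
        else:
            s_q_vneg, s_qneg_v = rng.choice(neg_videos), rng.choice(neg_queries)
        total += max(0.0, margin + s_qneg_v - pos)
        total += max(0.0, margin + s_q_vneg - pos)
    return total / n


sim = [[0.9, 0.1],
       [0.8, 0.5]]
loss = triplet_ranking_loss(sim, margin=0.2)
```

In this toy batch, the first pair contributes 0.1 and the second contributes 0.5, giving an average loss of 0.3.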

Given a positive video-query pair $(v, q)$, the InfoNCE loss over the mini-batch $B$ is computed as:

(9)

$L^{nce} = - \frac{1}{n} \sum_{(q, v) \in B} \left[ \log \left( \frac{S (q, v)}{S (q, v) + \sum_{q_{i}^{-} \in N_{q}} S (q_{i}^{-}, v)} \right) + \log \left( \frac{S (q, v)}{S (q, v) + \sum_{v_{i}^{-} \in N_{v}} S (q, v_{i}^{-})} \right) \right],$

where $N_{q}$ denotes all negative queries of the video $v$ in the mini-batch, while $N_{v}$ denotes all negative videos of the query $q$ in the mini-batch.
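The InfoNCE loss above can be sketched as follows. The sketch takes a matrix of strictly positive scores with positive pairs on the diagonal, which is a simplifying assumption. The text does not state how $S$ is made positive; exponentiating a cosine similarity divided by a temperature, as done in the usage lines, is an assumption, and the temperature value of 0.1 is hypothetical.

```python
import math


def info_nce(scores):
    """InfoNCE loss for scores[i][j] = S(q_i, v_j) > 0, positives on the diagonal."""
    n = len(scores)
    total = 0.0
    for i in range(n):
        pos = scores[i][i]
        # Negative queries for video v_i: other rows in column i.
        neg_q = sum(scores[j][i] for j in range(n) if j != i)
        # Negative videos for query q_i: other columns in row i.
        neg_v = sum(scores[i][j] for j in range(n) if j != i)
        total += math.log(pos / (pos + neg_q)) + math.log(pos / (pos + neg_v))
    return -total / n


# Assumed way to obtain positive scores from cosine similarities.
temperature = 0.1  # hypothetical value
cos_sims = [[0.9, 0.1],
            [0.2, 0.8]]
scores = [[math.exp(s / temperature) for s in row] for row in cos_sims]
loss_matched = info_nce(scores)
loss_uninformative = info_nce([[1.0, 1.0], [1.0, 1.0]])
```

When all scores are equal the model cannot tell positives from negatives, and the loss for a batch of two equals $2 \log 2$; well-separated scores give a much smaller loss.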

As the previous work (Li et al., 2020) has concluded that using one loss per similarity function performs better than using one loss on the summation of multiple similarities, we employ the above two losses on both clip-scale similarity and frame-scale similarity, instead of their sum. Finally, our model is trained by minimizing the following overall training loss:

(10)

$L = L_{c}^{trip} + L_{f}^{trip} + λ_{1} L_{c}^{nce} + λ_{2} L_{f}^{nce},$

where $L_{c}^{trip}$ and $L_{f}^{trip}$ denote the triplet ranking loss using the clip-scale similarity $S_{c} (\cdot)$ and frame-scale similarity $S_{f} (\cdot)$ respectively, and accordingly for $L_{c}^{nce}$ and $L_{f}^{nce}$. $λ_{1}$ and $λ_{2}$ are hyper-parameters to balance the contribution of the InfoNCE loss.

3.5. Model Inference

After the model has been trained, the similarity between a video and a sentence query is computed as the weighted sum of their clip-scale similarity and frame-scale similarity, namely:

(11)

$S (v, q) = α S_{c} (v, q) + (1 - α) S_{f} (v, q),$

where $α$ is a hyper-parameter to balance the importance of the two similarities, ranging within [0, 1]. Given a query, we sort all videos from the video gallery in descending order according to their similarity with respect to the given query.
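The inference step can be sketched as a weighted fusion followed by a sort. The function name `rank_videos`, the video identifiers, and the similarity values are hypothetical and serve only to illustrate the fusion and ranking.

```python
def rank_videos(clip_sims, frame_sims, alpha=0.5):
    """Fuse clip-scale and frame-scale similarities and rank videos for one query.

    clip_sims, frame_sims: dicts mapping a video id to its similarity with the query.
    alpha: weight in [0, 1] on the clip-scale similarity.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    fused = {
        vid: alpha * clip_sims[vid] + (1.0 - alpha) * frame_sims[vid]
        for vid in clip_sims
    }
    # Descending order of fused similarity.
    return sorted(fused, key=fused.get, reverse=True)


clip_sims = {"a": 0.9, "b": 0.2, "c": 0.5}
frame_sims = {"a": 0.3, "b": 0.8, "c": 0.6}
ranking_balanced = rank_videos(clip_sims, frame_sims, alpha=0.5)
ranking_frame_only = rank_videos(clip_sims, frame_sims, alpha=0.0)
```

Setting `alpha` to 0 or 1 reduces the ranking to the frame-scale or clip-scale similarity alone, which is a convenient way to inspect each branch separately.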

4. Experiments

4.1. Experimental Setup

4.1.1. Datasets

In order to verify the viability of our proposed model for PRVR, queries that are partially relevant to videos are required. As videos in popular T2VR datasets such as MSR-VTT (Xu et al., 2016), MSVD (Chen and Dolan, 2011) and VATEX (Wang et al., 2019) are supposed to be fully relevant to the queries, they are not suited for our experiments. Here, we re-purpose three datasets commonly used for VCMR, i.e., TVR (Lei et al., 2020), ActivityNet Captions (Krishna et al., 2017), and Charades-STA (Gao et al., 2017), since their natural language queries are partially relevant to the corresponding videos (a query is typically associated with a specific moment in a video). Table 1 summarizes the brief statistics of these datasets, including average lengths of moments and videos, and the average moment length proportion in the whole video (moment-to-video ratio). Note that as we focus on retrieving videos, moment annotations provided by these datasets are not used in our proposed new PRVR task.

Dataset	Avg. moment length (s)	Avg. video length (s)	M/V ratio (min)	M/V ratio (max)	M/V ratio (mean)
TVR	9.1	76.2	0.48%	100%	11.9%
ActivityNet Captions	36.2	117.6	0.48%	100%	30.8%
Charades-STA	8.1	30.0	4.3%	100%	26.3%

Table 1. Brief statistics of three public datasets used in our experiments. The length is measured in seconds. M/V ratio denotes the moment-to-video ratio.

TV show Retrieval (TVR) (Lei et al., 2020) is a multimodal dataset originally for video corpus moment retrieval, where videos are paired with subtitles that are generated by automatic speech recognition. It contains 21.8K videos collected from 6 TV shows, and each video is associated with 5 natural language sentences that describe a specific moment in the video. As a moment is typically a part of a video, we assume that sentences are partially relevant to videos, and use them to evaluate our model. Following (Zhang et al., 2020a, 2021), we utilize 17,435 videos with 87,175 moments for training and 2,179 videos with 10,895 moments for testing.

ActivityNet Captions (Krishna et al., 2017) was originally developed for the dense video captioning task, and is now a popular dataset for single video moment retrieval. It contains around 20K videos from YouTube, and the average length of videos is the largest among the three datasets we used. On average, each video has around 3.7 moments with corresponding sentence descriptions. We use the popular data partition used in (Zhang et al., 2020a, 2021).

Charades-STA (Gao et al., 2017) is a dataset for single video moment retrieval. It contains 6,670 videos with 16,128 sentence descriptions. Each video has around 2.4 moments with corresponding sentence descriptions on average. We utilize the official data partition for model training and evaluation.

4.1.2. Evaluation Metrics

To evaluate PRVR models, we utilize the rank-based metrics, namely $R @ K$ ( $K = 1, 5, 10, 100$ ), which are commonly used for the conventional text-to-video retrieval (Wang et al., 2020b; Dong et al., 2021). $R @ K$ is the fraction of queries that correctly retrieve desired items in the top $K$ of the ranking list. The performance is reported in percentage (%). Higher $R @ K$ means better performance. For overall comparison, we also report the Sum of all Recalls (SumR).
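The metrics can be computed from the rank of the ground-truth video for each query, as in the sketch below. The function names and the example ranks are hypothetical; ranks are assumed to be 1-based, with one relevant video per query.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth video appears in the top k.

    ranks: 1-based rank of the ground-truth video for each query.
    """
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def sum_r(ranks, ks=(1, 5, 10, 100)):
    """Sum of all Recalls (SumR) over the given cut-offs."""
    return sum(recall_at_k(ranks, k) for k in ks)


ranks = [1, 3, 7, 50, 200]   # illustrative ranks for five queries
r1 = recall_at_k(ranks, 1)
r5 = recall_at_k(ranks, 5)
r10 = recall_at_k(ranks, 10)
r100 = recall_at_k(ranks, 100)
total = sum_r(ranks)
```

For these five queries the recalls are 20, 40, 60 and 80 percent at the four cut-offs, so SumR is 200.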

4.1.3. Implementation Details

We use PyTorch as our deep learning environment, and we will release our source code. For video features, on TVR, we utilize the feature provided by (Lei et al., 2020), that is, a 3,072-D visual feature obtained by the concatenation of the frame-level ResNet152 (He et al., 2016) feature and the segment-level I3D (Carreira and Zisserman, 2017) feature. For ease of reference, we refer to it as ResNet152-I3D. On ActivityNet Captions and Charades-STA, we only utilize the I3D features, which are respectively provided by (Zhang et al., 2020a) and (Mun et al., 2020). For sentence features, we use the 768-D RoBERTa feature provided by (Lei et al., 2020) on TVR, where RoBERTa is finetuned on the queries and subtitle sentences of TVR. On ActivityNet Captions and Charades-STA, we use the 1,024-D RoBERTa feature extracted by ourselves using the open RoBERTa toolkit (https://pytorch.org/hub/pytorch_fairseq_roberta/). Due to the limited space of the paper, we present further implementation details in the supplementary material.

4.2. Comparison with Baseline Methods

Model	R@1	R@5	R@10	R@100	SumR
T2VR models:
W2VV, TMM18 (Dong et al., 2018)	2.6	5.6	7.5	20.6	36.3
HGR, CVPR20 (Chen et al., 2020b)	1.7	4.9	8.3	35.2	50.1
HTM, ICCV19 (Miech et al., 2019)	3.8	12.0	19.1	63.2	98.2
CE, BMVC19 (Liu et al., 2019a)	3.7	12.8	20.1	64.5	101.1
W2VV++, MM19 (Li et al., 2019)	5.0	14.7	21.7	61.8	103.2
VSE++, BMVC19 (Faghri et al., 2018)	7.5	19.9	27.7	66.0	121.1
DE, CVPR19 (Dong et al., 2019)	7.6	20.1	28.1	67.6	123.4
DE++, TPAMI21 (Dong et al., 2021)	8.8	21.9	30.2	67.4	128.3
RIVRL, TCSVT22 (Dong et al., 2022)	9.4	23.4	32.2	70.6	135.6
VCMR models w/o moment localization:
XML, ECCV20 (Lei et al., 2020)	10.0	26.5	37.3	81.3	155.1
ReLoCLNet, SIGIR21(Zhang et al., 2021)	10.7	28.1	38.1	80.3	157.1
Ours	13.5	32.1	43.4	83.4	172.4

Table 2. Performance of PRVR on the TVR dataset. Models are sorted in ascending order in terms of their overall performance. Visual feature: ResNet152-I3D.

4.2.1. Baseline selection

As no models have been specifically designed for PRVR, we compare with models targeted at conventional T2VR and models developed for VCMR. Given the rich literature, we have to be selective, choosing open-source models for fair and reproducible comparison. In particular, we choose the following nine T2VR models, i.e., VSE++ (Faghri et al., 2018), W2VV (Dong et al., 2018), CE (Liu et al., 2019a), W2VV++ (Li et al., 2019), DE (Dong et al., 2019), HTM (Miech et al., 2019), HGR (Chen et al., 2020b), DE++ (Dong et al., 2021) and RIVRL (Dong et al., 2022), and the following two VCMR models, i.e., XML (Lei et al., 2020) and ReLoCLNet (Zhang et al., 2021). Both XML and ReLoCLNet are two-stage, where a first-stage module retrieves candidate videos and a second-stage module localizes specific moments in the candidate videos. As moment annotations are unavailable for PRVR, we have re-trained XML and ReLoCLNet (with their moment localization modules removed) using the same video features as ours.

Figure 5. Performance of different models on different types of queries. Queries are grouped according to their M/V.

4.2.2. Results on the TVR dataset

Table 2 summarizes the performance comparison on TVR. Our proposed model consistently outperforms all conventional T2VR models by a clear margin. Even compared with RIVRL, the best performing T2VR model, our model is better by 36.8 in terms of SumR. As these models focus on the similarity between whole videos and queries, the results allow us to conclude that such similarity modeling is sub-optimal for PRVR. The models of the second group, i.e., ReLoCLNet and XML, perform better than the conventional T2VR models, but are still worse than ours. ReLoCLNet and XML focus on retrieving moments, which to some extent models partial relevance, but they compute the similarity only at a single scale. By contrast, we compute the similarity at both the clip scale and the frame scale. The results demonstrate the effectiveness of our proposed multi-scale similarity learning for PRVR. Note that when using the extra subtitle feature provided by (Lei et al., 2020), our model obtains better performance (R@1 of 24.0 and SumR of 220.8).

Model	R@1	R@5	R@10	R@100	SumR
T2VR models:
W2VV (Dong et al., 2018)	2.2	9.5	16.6	45.5	73.8
HTM (Miech et al., 2019)	3.7	13.7	22.3	66.2	105.9
HGR (Chen et al., 2020b)	4.0	15.0	24.8	63.2	107.0
RIVRL (Dong et al., 2022)	5.2	18.0	28.2	66.4	117.8
VSE++ (Faghri et al., 2018)	4.9	17.7	28.2	67.1	117.9
DE++ (Dong et al., 2021)	5.3	18.4	29.2	68.0	121.0
DE (Dong et al., 2019)	5.6	18.8	29.4	67.8	121.7
W2VV++ (Li et al., 2019)	5.4	18.7	29.7	68.8	122.6
CE (Liu et al., 2019a)	5.5	19.1	29.9	71.1	125.6
VCMR models w/o moment localization:
ReLoCLNet (Zhang et al., 2021)	5.7	18.9	30.0	72.0	126.6
XML (Lei et al., 2020)	5.3	19.4	30.6	73.1	128.4
Ours	7.1	22.5	34.7	75.8	140.1

Table 3. Performance of PRVR on the ActivityNet Captions dataset. Visual feature: I3D.

Model	R@1	R@5	R@10	R@100	SumR
T2VR models:
W2VV (Dong et al., 2018)	0.5	2.9	4.7	24.5	32.6
VSE++ (Faghri et al., 2018)	0.8	3.9	7.2	31.7	43.6
W2VV++ (Li et al., 2019)	0.9	3.5	6.6	34.3	45.3
HGR (Chen et al., 2020b)	1.2	3.8	7.3	33.4	45.7
CE (Liu et al., 2019a)	1.3	4.5	7.3	36.0	49.1
DE (Dong et al., 2019)	1.5	5.7	9.5	36.9	53.7
DE++ (Dong et al., 2021)	1.7	5.6	9.6	37.1	54.1
RIVRL (Dong et al., 2022)	1.6	5.6	9.4	37.7	54.3
HTM (Miech et al., 2019)	1.2	5.4	9.2	44.2	60.0
VCMR models w/o moment localization:
ReLoCLNet (Zhang et al., 2021)	1.2	5.4	10.0	45.6	62.3
XML (Lei et al., 2020)	1.6	6.0	10.1	46.9	64.6
Ours	1.8	7.1	11.8	47.7	68.4

Table 4. Performance of PRVR on the Charades-STA dataset. Visual feature: I3D.

To gain a further understanding of the individual models, we define the moment-to-video ratio (M/V) of a query as the length of its corresponding moment divided by the length of the entire video. A smaller M/V means that the video contains less content relevant to the query and more irrelevant content. In other words, a smaller M/V to some extent means a lower relevance of a query to its corresponding video, while a larger one indicates a higher relevance. According to M/V, queries can be automatically classified into different groups, which enables a fine-grained analysis of how a specific model responds to different types of queries. On TVR, the 10,895 test queries are split according to their M/V into six groups, with the performance of each group shown in Fig. 5.
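The M/V computation and grouping can be sketched as follows. This is a minimal illustration, assuming each query comes with the start and end time of its moment and the duration of its video, all in seconds. The bin edges below are hypothetical placeholders; the actual group boundaries used for Fig. 5 are not stated in this section.

```python
def moment_to_video_ratio(start, end, duration):
    """Length of the query's moment divided by the length of the whole video."""
    if duration <= 0 or not (0 <= start <= end <= duration):
        raise ValueError("expected 0 <= start <= end <= duration and duration > 0")
    return (end - start) / duration


def group_by_mv(ratios, upper_edges):
    """Put each query index into the first group whose upper edge covers its M/V.

    `upper_edges` must be sorted in ascending order and end at 1.0.
    """
    groups = [[] for _ in upper_edges]
    for query_idx, ratio in enumerate(ratios):
        for group_idx, upper in enumerate(upper_edges):
            if ratio <= upper:
                groups[group_idx].append(query_idx)
                break
    return groups


# Hypothetical (start, end, duration) triples for three queries.
moments = [(2.0, 5.0, 60.0), (10.0, 40.0, 60.0), (0.0, 60.0, 60.0)]
ratios = [moment_to_video_ratio(s, e, d) for s, e, d in moments]
groups = group_by_mv(ratios, upper_edges=(0.1, 0.5, 1.0))  # hypothetical edges
```

Per-group SumR can then be obtained by evaluating a model only on the queries of each group.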

Unsurprisingly, our model consistently performs the best in all groups. Observing the figure from left to right, the average performance of the twelve compared models increases along with the M/V: 106.8, 114.2, 114.3, 118.6, 125.8 and 127.7. The average performance is lowest in the group with the smallest M/V and highest in the group with the largest M/V. The result allows us to conclude that current video retrieval baseline models better address queries of larger relevance to the corresponding video. By contrast, the performance we achieve is more balanced across all groups. This result shows that our proposed model is less sensitive to irrelevant content in videos.

4.2.3. Results on ActivityNet Captions and Charades-STA

The performance of different models on ActivityNet Captions and Charades-STA is summarized in Table 3 and Table 4, respectively. On both datasets, our model remains in the leading position. The results again verify the effectiveness of our model for measuring partial relevance between videos and queries. Interestingly, we observe that HTM performs badly on TVR and ActivityNet Captions, while on Charades-STA it achieves the best SumR score among the T2VR models. We speculate that this is because Charades-STA has the least training data among the three datasets. Besides, the model structure of HTM is very simple: it uses an FC layer with a gating mechanism to embed videos and sentences, respectively, into a common space, which is an advantage when training on small-scale data. Our proposed model consistently performs the best on the three datasets with varying numbers of training samples, which to some extent shows that it is not sensitive to the scale of training data.

	W2VV	HGR	HTM	CE	W2VV++	VSE++	DE	DE++	RIVRL	XML	ReLoCLNet	Ours
FLOPs (G)	0.42	2.96	0.06	0.06	0.4	0.20	5.24	5.30	8.64	0.80	0.96	1.22
Memory (MiB)	1231	8555	1225	1435	1281	1299	5837	3515	4809	2451	2673	5349

Table 5. Model comparison in terms of FLOPs and memory consumption.

4.3. Comparison on Model Complexity

Table 5 summarizes the model complexity comparison in terms of time complexity and memory consumption. For a specific method, its time complexity is measured as the FLOPs it takes to encode a given video-text pair. In terms of FLOPs, our model is at the mid-level: it is slightly more expensive than XML and ReLoCLNet, yet cheaper than HGR, DE, DE++ and RIVRL. In terms of memory consumption, our model requires more memory than the majority of compared models, which is mainly due to the usage of transformers and multi-scale video representations. However, we found that our model takes about 0.2 seconds to retrieve videos from 20,000 candidate untrimmed videos, given that the video embeddings are pre-computed. The retrieval speed is adequate for instant response.

Figure 6. Performance of XML and ReLoCLNet without/with our model as the first stage for VCMR. Panel (a): IoU=0.5.

4.4. PRVR for VCMR

Our PRVR model can also be used in the first stage of VCMR. To that end, we replace the first stage of two VCMR models, i.e., XML (Lei et al., 2020) and ReLoCLNet (Zhang et al., 2021), with our model. Both visual and subtitle features are used for video representation.

Fig. 6 shows the performance of the original models and the replaced ones on the TVR dataset. Here, we report SumR, the sum of R@1/R@5/R@10/R@100. Replacing the first stage with our model improves both XML and ReLoCLNet.

4.5. Ablation Studies

Model	R@1	R@5	R@10	R@100	SumR
Full setup	13.5	32.1	43.4	83.4	172.4
w/o frame-scale branch	12.3	30.5	41.5	82.3	166.6
w/o clip-scale branch	8.0	21.0	30.0	74.0	133.0
w/o key clip guide	12.2	30.6	41.0	82.4	166.3
w/o InfoNCE	11.3	29.1	40.1	81.3	161.8
w/o Triplet loss	11.2	29.2	40.4	81.9	162.6

Table 6. Ablation study on the TVR dataset.

4.5.1. The effectiveness of multi-scale branches

To examine the usefulness of the multi-scale branches, we compare with counterparts without the clip-scale branch or the frame-scale branch. As shown in Table 6, removing either branch results in clear performance degradation. The result not only demonstrates the effectiveness of the multi-scale solution, but also shows the complementarity of the clip-scale and the frame-scale branches.

4.5.2. The effectiveness of key clip guided attention

Additionally, we compare with the model w/o key clip guide, which is implemented by replacing the key clip guided attention with a simple attention. The simple attention is implemented as Eq. 1 without any guide. As Table 6 shows, our model with the full setup still performs better, which shows the importance of key clip guided attention for PRVR.

4.5.3. The effectiveness of the combination of triplet ranking loss and InfoNCE loss

To validate the joint use of the two losses, we compare with the results of using either the triplet ranking loss or the InfoNCE loss alone. As shown in Table 6, the triplet ranking loss and InfoNCE give comparable results when used alone, but both are much worse than the full setup that uses them jointly. The result demonstrates the benefit of using these two losses jointly.

4.5.4. The effect of $α$ on the retrieval performance.

The influence of the hyper-parameter $α$ in Eq. 11 is studied as follows. We vary $α$ from 0.1 to 0.9 with an interval of 0.1. As shown in Fig. 7, when $α$ is larger than 0.3, the performance (SumR) of using the multi-scale similarity is always over 170, consistently outperforming the counterparts that use the frame-scale or the clip-scale similarity alone.
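A minimal sketch of such a sweep is given below. It assumes that Eq. 11 combines the two similarities as a convex combination controlled by $α$; which of the two similarities $α$ multiplies is an assumption made here for illustration only, and the function name and similarity values are hypothetical.

```python
def fuse_similarity(clip_sim, frame_sim, alpha):
    """Weighted sum of clip-scale and frame-scale similarities.

    Assumption: alpha weights the clip-scale term and (1 - alpha) the
    frame-scale term; the paper's Eq. 11 may assign the weights the other way.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * clip_sim + (1.0 - alpha) * frame_sim


# The grid of values studied in the text: 0.1, 0.2, ..., 0.9.
alphas = [round(0.1 * i, 1) for i in range(1, 10)]

# Hypothetical clip-scale and frame-scale similarities for one video-query pair.
clip_sim, frame_sim = 0.8, 0.4
fused = {a: fuse_similarity(clip_sim, frame_sim, a) for a in alphas}
```

At the two extremes of the weight, the fused score reduces to a single-scale similarity, which corresponds to the single-branch counterparts mentioned in the text.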

5. Conclusions

In this paper, we have proposed a novel T2VR subtask termed PRVR. Different from conventional T2VR, where a query is usually fully relevant to the corresponding video, a query in PRVR is typically only partially relevant. Besides, videos in conventional T2VR are temporally pre-trimmed with short durations, while videos in PRVR are untrimmed, and a video is typically partially relevant to multiple sentences of different semantics. Additionally, PRVR differs from SVMR and VCMR, as the latter two are to retrieve moments rather than untrimmed videos. Towards PRVR, we have formulated it as a MIL problem and proposed MS-SL, which computes the similarity on both clip scale and frame scale in a coarse-to-fine manner. Extensive experiments on three datasets have verified the effectiveness of MS-SL for PRVR, and have shown that it can also be used for improving VCMR.

Acknowledgements. This work was supported by the National Key R&D Program of China (2018YFB1404102), NSFC (62172420, 61902347, 61976188, 62002323), the Public Welfare Technology Research Project of Zhejiang Province (LGF21F020010), the Open Projects Program of the National Laboratory of Pattern Recognition, the Fundamental Research Funds for the Provincial Universities of Zhejiang, and Public Computing Cloud of RUC.


References

L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell (2017) Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5803–5812. Cited by: §2.
J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.2.
J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §4.1.3.
D. Chen and W. B. Dolan (2011) Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 190–200. Cited by: §A.2, §1, §4.1.1.
J. Chen, X. Chen, L. Ma, Z. Jie, and T. Chua (2018) Temporally grounding natural sentence in video. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 162–171. Cited by: §2.
L. Chen, C. Lu, S. Tang, J. Xiao, D. Zhang, C. Tan, and X. Li (2020a) Rethinking the bottom-up framework for query-based video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 10551–10558. Cited by: §2.
S. Chen, Y. Zhao, Q. Jin, and Q. Wu (2020b) Fine-grained video-text retrieval with hierarchical graph reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10638–10647. Cited by: Table 9, §1, §2, §3.1, §4.2.1, Table 2, Table 3, Table 4.
I. Croitoru, S. Bogolin, Y. Liu, S. Albanie, M. Leordeanu, H. Jin, and A. Zisserman (2021) TEACHTEXT: crossmodal generalized distillation for text-video retrieval. Cited by: §1, §2.
T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez (1997) Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence 89 (1-2), pp. 31–71. Cited by: §1, §2, §3.4.3.
J. Dong, X. Li, and C. G. Snoek (2018) Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia 20 (12), pp. 3377–3388. Cited by: §2, §4.2.1, Table 2, Table 3, Table 4.
J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang (2019) Dual encoding for zero-example video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9346–9355. Cited by: Table 9, §3.1, §4.2.1, Table 2, Table 3, Table 4.
J. Dong, X. Li, C. Xu, X. Yang, G. Yang, X. Wang, and M. Wang (2021) Dual encoding for video retrieval by text. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §A.2, Table 9, §2, §3.4.3, §4.1.2, §4.2.1, Table 2, Table 3, Table 4.
J. Dong, Y. Wang, X. Chen, X. Qu, X. Li, Y. He, and X. Wang (2022) Reading-strategy inspired visual representation learning for text-to-video retrieval. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §A.2, Table 9, §4.2.1, Table 2, Table 3, Table 4.
V. Escorcia, M. Soldan, J. Sivic, B. Ghanem, and B. Russell (2019) Temporal localization of moments in video collections with natural language. arXiv preprint arXiv:1907.12763. Cited by: §2.
F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler (2018) VSE++: improving visual-semantic embeddings with hard negatives. In Proceedings of the British Machine Vision Conference, pp. 935–943. Cited by: Table 9, §3.4.3, §4.2.1, Table 2, Table 3, Table 4.
J. Feng and Z. Zhou (2017) Deep miml network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, pp. 14747–14350. Cited by: §2.
Z. Feng, Z. Zeng, C. Guo, and Z. Li (2021) Exploiting visual semantic reasoning for video-text retrieval. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 1005–1011. Cited by: §2.
V. Gabeur, C. Sun, K. Alahari, and C. Schmid (2020) Multi-modal transformer for video retrieval. In European Conference on Computer Vision, pp. 214–229. Cited by: §1, §2.
J. Gao, C. Sun, Z. Yang, and R. Nevatia (2017) Tall: temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5267–5275. Cited by: §1, §4.1.1, §4.1.1.
J. Gao and C. Xu (2021) Fast video moment retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1523–1532. Cited by: §2.
S. Ging, M. Zolfaghari, H. Pirsiavash, and T. Brox (2020) COOT: cooperative hierarchical transformer for video-text representation learning. Advances in Neural Information Processing Systems 33, pp. 22605–22618. Cited by: §2.
N. Han, J. Chen, G. Xiao, H. Zhang, Y. Zeng, and H. Chen (2021) Fine-grained cross-modal alignment network for text-video retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 3826–3834. Cited by: §1, §2, §3.1.
K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2, §4.1.3.
Z. Hou, C. Ngo, and W. K. Chan (2021) CONQUER: contextual query-aware ranking for video corpus moment retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 3900–3908. Cited by: §1.
F. Hu, A. Chen, Z. Wang, F. Zhou, J. Dong, and X. Li (2022) Lightweight attentional feature fusion: a new baseline for text-to-video retrieval. In Proceedings of the 17th European Conference on Computer Vision, Cited by: §2.
M. Ilse, J. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. In International Conference on Machine Learning, pp. 2127–2136. Cited by: §2.
W. Jin, Z. Zhao, P. Zhang, J. Zhu, X. He, and Y. Zhuang (2021) Hierarchical cross-modal graph consistency learning for video-text retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1114–1124. Cited by: §1, §2.
R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles (2017) Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 706–715. Cited by: §1, §4.1.1, §4.1.1.
J. Lei, L. Yu, T. L. Berg, and M. Bansal (2020) TVR: a large-scale dataset for video-subtitle moment retrieval. In European Conference on Computer Vision, pp. 447–463. Cited by: §A.1.1, §A.1.2, §A.3.2, Table 8, §1, §2, §3.2, §4.1.1, §4.1.1, §4.1.3, §4.2.1, §4.2.2, §4.4, Table 2, Table 3, Table 4.
B. Li, Y. Li, and K. W. Eliceiri (2021a) Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14318–14328. Cited by: §2.
X. Li, C. Xu, G. Yang, Z. Chen, and J. Dong (2019) W2vv++ fully deep learning for ad-hoc video search. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1786–1794. Cited by: §A.2, Table 9, §4.2.1, Table 2, Table 3, Table 4.
X. Li, F. Zhou, C. Xu, J. Ji, and G. Yang (2020) SEA: sentence encoder assembly for video retrieval by textual queries. IEEE Transactions on Multimedia 23, pp. 4351–4362. Cited by: Table 9, §1, §2, §3.4.3.
X. Li, Y. Zhou, J. Wang, H. Lin, J. Zhao, D. Ding, W. Yu, and Y. Chen (2021b) Multi-modal multi-instance learning for retinal disease recognition. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 2474–2482. Cited by: §2.
D. Liu, X. Qu, J. Dong, P. Zhou, Y. Cheng, W. Wei, Z. Xu, and Y. Xie (2021a) Context-aware biaffine localizing network for temporal sentence grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11235–11244. Cited by: §2.
D. Liu, X. Qu, X. Liu, J. Dong, P. Zhou, and Z. Xu (2020) Jointly cross-and self-modal graph attention network for query-based moment localization. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 4070–4078. Cited by: §2.
H. Liu, R. Luo, F. Shang, M. Niu, and Y. Liu (2021b) Progressive semantic matching for video-text retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 5083–5091. Cited by: §2.
Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman (2019a) Use what you have: video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487. Cited by: Table 9, §1, §2, §4.2.1, Table 2, Table 3, Table 4.
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019b) RoBERTa: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §3.2.
H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li (2021) CLIP4clip: an empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860. Cited by: §1.
O. Maron and T. Lozano-Pérez (1997) A framework for multiple-instance learning. Advances in Neural Information Processing Systems 10. Cited by: §1, §2, §3.4.3.
A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9879–9889. Cited by: §3.4.3.
A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2630–2640. Cited by: §4.2.1, Table 2, Table 3, Table 4.
J. Mun, M. Cho, and B. Han (2020) Local-global video-text interactions for temporal grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10810–10819. Cited by: §4.1.3.
M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2015) Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 685–694. Cited by: §2.
S. Paul, N. C. Mithun, and A. K. Roy-Chowdhury (2021) Text-based localization of moments in a video corpus. IEEE Transactions on Image Processing 30, pp. 8886–8899. Cited by: §2.
P. O. Pinheiro and R. Collobert (2015) From image-level to pixel-level labeling with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1713–1721. Cited by: §2.
X. Qu, P. Tang, Z. Zou, Y. Cheng, J. Dong, P. Zhou, and Z. Xu (2020) Fine-grained iterative attention network for temporal language localization in videos. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 4280–4288. Cited by: §2.
X. Song, J. Chen, Z. Wu, and Y. Jiang (2021) Spatial-temporal graphs for cross-modal text2video retrieval. IEEE Transactions on Multimedia. Cited by: §2.
Y. Song and M. Soleymani (2019) Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1979–1988. Cited by: §2.
M. Tu, J. Huang, X. He, and B. Zhou (2019) Multiple instance learning with graph neural networks. arXiv preprint arXiv:1906.04881. Cited by: §2.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §3.2, §3.3.1, §3.4.2.
J. Wang, L. Ma, and W. Jiang (2020a) Temporally grounding language queries in videos by contextual boundary-aware prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12168–12175. Cited by: §2.
W. Wang, J. Gao, X. Yang, and C. Xu (2020b) Learning coarse-to-fine graph neural networks for video-text retrieval. IEEE Transactions on Multimedia. Cited by: Table 9, §2, §4.1.2.
X. Wang, J. Wu, J. Chen, L. Li, Y. Wang, and W. Y. Wang (2019) VATEX: a large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4581–4591. Cited by: §1, §4.1.1.
Y. Wang, M. Liu, Y. Wei, Z. Cheng, Y. Wang, and L. Nie (2022) Siamese alignment network for weakly supervised video moment retrieval. IEEE Transactions on Multimedia. Cited by: §2.
Z. Wang, J. Chen, and Y. Jiang (2021) Visual co-occurrence alignment learning for weakly-supervised video moment retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 1459–1468. Cited by: §2.
P. Wu, X. He, M. Tang, Y. Lv, and J. Liu (2021) HANet: hierarchical alignment networks for video-text retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 3518–3527. Cited by: §2.
S. Xiao, L. Chen, S. Zhang, W. Ji, J. Shao, L. Ye, and J. Xiao (2021) Boundary proposal network for two-stage natural language video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 2986–2994. Cited by: §1.
J. Xu, T. Mei, T. Yao, and Y. Rui (2016) MSR-VTT: a large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5288–5296. Cited by: §A.2, §1, §4.1.1.
X. Yang, J. Dong, Y. Cao, X. Wang, M. Wang, and T. Chua (2020) Tree-augmented cross-modal encoding for complex-query video retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1339–1348. Cited by: §1, §2.
X. Yang, F. Feng, W. Ji, M. Wang, and T. Chua (2021) Deconfounded video moment retrieval with causal intervention. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1–10. Cited by: §1.
X. Yang, S. Wang, J. Dong, J. Dong, M. Wang, and T. Chua (2022) Video moment retrieval with cross-modal neural architecture search. IEEE Transactions on Image Processing 31, pp. 1204–1216. Cited by: §2.
Y. Yu, J. Kim, and G. Kim (2018) A joint sequence fusion model for video question answering and retrieval. In Proceedings of the European Conference on Computer Vision, pp. 471–487. Cited by: §2.
Y. Yuan, L. Ma, J. Wang, W. Liu, and W. Zhu (2019a) Semantic conditioned dynamic modulation for temporal sentence grounding in videos. Advances in Neural Information Processing Systems 32. Cited by: §2.
Y. Yuan, T. Mei, and W. Zhu (2019b) To find where you talk: temporal sentence localization in video with attention based location regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9159–9166. Cited by: §1, §2.
B. Zhang, H. Hu, J. Lee, M. Zhao, S. Chammas, V. Jain, E. Ie, and F. Sha (2020a) A hierarchical multi-modal encoder for moment localization in video corpus. arXiv preprint arXiv:2011.09046. Cited by: §1, §2, §3.3.1, §4.1.1, §4.1.1, §4.1.3.
D. Zhang, X. Dai, X. Wang, Y. Wang, and L. S. Davis (2019) Man: moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1247–1257. Cited by: §2.
H. Zhang, A. Sun, W. Jing, G. Nan, L. Zhen, J. T. Zhou, and R. S. M. Goh (2021) Video corpus moment retrieval with contrastive learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 685–695. Cited by: §A.1.2, Table 8, §1, §2, §3.4.3, §4.1.1, §4.1.1, §4.2.1, §4.4, Table 2, Table 3, Table 4.
S. Zhang, H. Peng, J. Fu, and J. Luo (2020b) Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12870–12877. Cited by: §1.
Q. Zheng, J. Dong, X. Qu, X. Yang, Y. Wang, P. Zhou, B. Liu, and X. Wang (2022) Progressive localization networks for language-based moment localization. ACM Transactions on Multimedia Computing, Communications, and Applications. Cited by: §2.

Appendix A

We report more experimental results and technical details that are not included in the paper due to the space limit:

More comparisons on the TVR dataset, including extra ablation studies to explore the significance of the transformer module, comparing with models using extra subtitles and comparing with conventional T2VR models using clips (Section A.1).
Performance comparison on pre-trimmed video datasets (Section A.2).
Distribution of moment-to-video ratios on Charades-STA (Section A.3.1).
More technical details of our method (Section A.3.2).

	R@1	R@5	R@10	R@100	SumR
ReLoCLNet (best baseline)	10.7	28.1	38.1	80.3	157.1
Ours (1D-CNN)	10.6	29.3	39.9	81.9	161.6
Ours (bi-LSTM)	11.2	29.0	40.2	81.9	162.3
Ours (bi-GRU)	12.4	31.6	42.9	83.3	170.2
Ours (Transformer)	13.5	32.1	43.4	83.4	172.4

Table 7. Ablations on the usage of the Transformer.

Method	R@1	R@5	R@10	R@100	SumR
XML(Lei et al., 2020)	17.4	39.3	51.5	89.1	197.3
ReLoCLNet(Zhang et al., 2021)	19.1	40.3	51.5	87.0	197.9
Ours	24.0	47.8	58.8	90.2	220.8

Table 8. Performance of PRVR using subtitle features. Visual feature: ResNet152-I3D. Subtitle feature: RoBERTa.

A.1. More Experiments on TVR

A.1.1. Ablations on the usage of the transformer

Among the compared methods on the TVR dataset, the top three ranked methods (ReLoCLNet, XML and RIVRL) use Transformer modules: ReLoCLNet and XML utilize Transformers for text and video representation, while RIVRL uses a Transformer for video representation. So we conduct an extra ablation study to explore whether the usage of the Transformer module is the main contributor to the significant performance improvement of our method. We replace all three Transformer modules in our model with 1D-CNN, bi-GRU and bi-LSTM, respectively. As shown in Table 7, our model in the Transformer-free setups obtains SumR of 161.6, 170.2 and 162.3, respectively, on the TVR (Lei et al., 2020) dataset. While their performance is worse than that of the counterpart using Transformer, they are still better than the best baseline, i.e., ReLoCLNet with SumR of 157.1.

A.1.2. Comparison with models using extra subtitles

As the TVR dataset is a multimodal dataset where each video is additionally associated with subtitle (dialogue) texts, in this experiment we compare with models using extra subtitles. Here, we do not compare with conventional T2VR models, as they do not support using extra subtitles. We compare with XML (Lei et al., 2020) and ReLoCLNet (Zhang et al., 2021), and use the same 768-D subtitle features provided by (Lei et al., 2020) as extra video features. The results are shown in Table 8, and our proposed model again performs the best.

A.1.3. Comparison with conventional T2VR models using clips

As the conventional T2VR models are typically designed for retrieving video clips, in this experiment we employ conventional T2VR models using clips to explore their potential for PRVR. Specifically, during the inference, we first split videos into multiple clips, and then compute the similarity of each clip with the query. The maximum similarity is regarded as the final similarity between the video and the query.
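The inference procedure above can be sketched as follows. This is a minimal illustration, assuming each clip and the query have already been embedded into a common space by some T2VR model. Cosine similarity is used here only as a stand-in, since each baseline defines its own similarity function; the function names and embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def video_query_similarity(clip_embeddings, query_embedding):
    """Video-level score: the maximum clip-query similarity over all clips."""
    return max(cosine(clip, query_embedding) for clip in clip_embeddings)


# Hypothetical 2-D embeddings of three clips from one video, and one query.
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 2.0]
score = video_query_similarity(clips, query)
```

Taking the maximum means a video is scored by its best matching clip, so irrelevant clips in the same video do not lower the score.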

Figure 8. Performance of four conventional T2VR models using clips generated by (a) content-agnostic strategy and (b) content-aware strategy on the TVR dataset. Their performance is still much worse than our proposed model which achieves a SumR score of 172.4.

Besides, we adopt two strategies to generate clips from a video, i.e., a content-agnostic strategy and a content-aware strategy. The content-agnostic strategy first splits the video evenly into $N$ video units, then constructs video clips by using a single video unit or concatenating adjacent video units. A larger $N$ generates more video clips, and $N = 1$ indicates using the whole video for inference. The content-aware strategy generates video clips with a scene detector toolkit (https://github.com/Breakthrough/PySceneDetect/) and a user-specified threshold $θ$; the toolkit automatically splits the video into individual clips according to their content change. A smaller threshold generates more video clips.
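The content-agnostic strategy can be sketched as below. The sketch assumes that clips are built from every run of adjacent units, which is one reading of the description above; the function name and the frame-index representation are hypothetical.

```python
def content_agnostic_clips(num_frames, n_units):
    """Return (start, end) frame ranges, end exclusive, for all runs of adjacent units.

    The video is first split evenly into `n_units` units; each clip is either a
    single unit or a concatenation of adjacent units. Assumption: every such run
    is used as a clip.
    """
    if n_units < 1 or num_frames < n_units:
        raise ValueError("need 1 <= n_units <= num_frames")
    # Unit boundaries, computed with integer arithmetic so that they cover all frames.
    bounds = [i * num_frames // n_units for i in range(n_units + 1)]
    clips = []
    for first in range(n_units):
        for last in range(first, n_units):
            clips.append((bounds[first], bounds[last + 1]))
    return clips


whole_video = content_agnostic_clips(90, 1)   # N = 1: the whole video
three_units = content_agnostic_clips(90, 3)   # N = 3
```

Under this reading, $N$ units yield $N(N+1)/2$ clips, so the number of clips grows quadratically with $N$.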

We conduct experiments with the top-4 performing T2VR models on TVR, i.e., VSE++, DE, DE++ and RIVRL, and the results are shown in Fig. 8. As shown in Fig. 8 (a), all the models achieve performance gains when $N$ is larger than 1, and obtain the best performance when $N = 3$ . Recall that $N = 1$ indicates using the whole video for inference. The results allow us to conclude that the T2VR models could be improved by splitting the video into multiple clips with the content-agnostic strategy. However, their performance is still much worse than our proposed model, which achieves a SumR score of 172.4 on TVR. Fig. 8 (b) shows the results when the content-aware strategy is used. Note that $θ = \mathrm{inf}$ indicates using the whole video without splitting. We found that splitting the video into multiple clips by the content-aware strategy results in a relative performance degradation, which shows that the scene detector is not suitable for PRVR. We speculate that this is because the content of a moment may contain scene changes, so the scene detector is likely to split a moment into multiple parts.

Method	R@1	R@5	R@10	SumR
On MSR-VTT:
CE, BMVC19(Liu et al., 2019a)	7.9	23.6	34.6	66.1
VSE++, BMVC18(Faghri et al., 2018)	8.7	24.3	34.1	67.1
DE, CVPR19(Dong et al., 2019)	11.1	29.4	40.3	80.8
W2VV++, MM19(Li et al., 2019)	11.1	29.6	40.5	81.2
DE++, TPAMI21(Dong et al., 2021)	11.6	30.3	41.3	83.2
HGR, CVPR20(Chen et al., 2020b)	11.1	30.5	42.1	83.7
SEA, TMM21(Li et al., 2020)	12.4	32.1	43.3	87.8
RIVRL, TCSVT22(Dong et al., 2022)	13.0	33.4	44.8	91.2
Ours	11.3	30.4	42.2	83.9
On MSVD:
DE, CVPR19(Dong et al., 2019)	20.3	46.8	59.7	126.8
CF-GNN, TMM21(Wang et al., 2020b)	22.8	50.9	63.6	137.3
W2VV++, MM19(Li et al., 2019)	22.4	51.6	64.8	138.8
SEA, TMM21(Li et al., 2020)	24.6	55.0	67.9	147.5
Ours	22.0	52.6	67.2	141.8

Table 9. Performance comparison on the MSR-VTT and MSVD datasets. Visual feature: ResNeXt-101 + ResNet-152.
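In every row of Table 9, the SumR value equals the sum of R@1, R@5, and R@10. The sketch below shows how such numbers could be computed from per-query ranks. It assumes the standard definition of recall at $K$ (the percentage of queries whose ground-truth video is ranked within the top $K$), which this section does not restate; the helper names `recall_at_k` and `sum_r` are hypothetical.

```python
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose ground-truth video has 1-based rank <= k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


def sum_r(ranks: Sequence[int], ks: Sequence[int] = (1, 5, 10)) -> float:
    """Sum of the recall values at the given cut-offs."""
    return sum(recall_at_k(ranks, k) for k in ks)


if __name__ == "__main__":
    # Toy ranks for ten queries (illustrative values, not from the paper).
    toy_ranks = [1, 3, 7, 12, 1, 50, 2, 9, 5, 100]
    for k in (1, 5, 10):
        print(f"R@{k}: {recall_at_k(toy_ranks, k):.1f}")
    print(f"SumR: {sum_r(toy_ranks):.1f}")
```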

a.2. Results on Pre-trimmed Datasets

Although our proposed model is designed for untrimmed videos, it can also be used for retrieving pre-trimmed videos by text. Therefore, we conduct experiments on MSR-VTT (Xu et al., 2016) and MSVD (Chen and Dolan, 2011), two commonly used pre-trimmed datasets for T2VR. For MSR-VTT, we follow the official partition, with 6,513 video clips for training, 497 video clips for validation, and the remaining 2,990 video clips for testing. For MSVD, we also follow the official partition, where 1,200 video clips are used for training, 100 video clips for validation, and 670 video clips for testing. Following previous works (Dong et al., 2021, 2022; Li et al., 2019), we use the concatenation of 2,048-dim ResNeXt-101 and 2,048-dim ResNet-152 features as the video feature. For text representation, we use the open-source RoBERTa toolkit to extract a 1,024-dim sentence feature.

The results are shown in Table 9, where all the methods use the same video feature. Note that not all the compared methods report their performance on both datasets. As expected, our model is not on par with the state-of-the-art models on the two pre-trimmed datasets. Recall that the rationale for our proposed model is to first detect a key clip that is most likely to be relevant to the query, and then measure the importance of other frames at a fine-grained temporal scale under the guidance of the key clip. For pre-trimmed videos in MSR-VTT and MSVD, the majority of the frames are actually relevant w.r.t. the associated descriptions, making key clip detection unnecessary. Our method is thus suboptimal for text-to-video retrieval on MSR-VTT and MSVD.

a.3. Others

a.3.1. Distribution of Moment-to-Video Ratios

Fig. 9 shows the distribution of the moment-to-video ratio on Charades-STA. The moment-to-video ratio is the ratio of a moment's length to the length of the entire video. Moments on Charades-STA show a large variance in their temporal lengths.

Figure 9. Distribution of the moment-to-video ratio on Charades-STA. The moment-to-video ratio is the ratio of a moment's length to the length of the entire video.
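As a minimal sketch of how a distribution like the one in Fig. 9 could be obtained, the code below computes the moment-to-video ratio from a moment's start and end timestamps and the video duration, then counts ratios in equal-width bins. The function names, the clipping of the end time to the video duration, and the number of bins are assumptions made for illustration.

```python
from typing import List, Sequence


def moment_to_video_ratio(start: float, end: float, duration: float) -> float:
    """Length of the moment divided by the length of the whole video."""
    if duration <= 0 or not (0 <= start < end) or start >= duration:
        raise ValueError("invalid moment or duration")
    # Assumption: an annotated end time past the video end is clipped.
    return (min(end, duration) - start) / duration


def ratio_histogram(ratios: Sequence[float], num_bins: int = 10) -> List[int]:
    """Count ratios in num_bins equal-width bins over [0, 1]."""
    counts = [0] * num_bins
    for r in ratios:
        # A ratio of exactly 1.0 is placed in the last bin.
        counts[min(int(r * num_bins), num_bins - 1)] += 1
    return counts


if __name__ == "__main__":
    # Toy annotations: (start, end, duration) in seconds, not real data.
    toy = [(5.0, 12.5, 30.0), (0.0, 3.0, 30.0), (10.0, 40.0, 40.0)]
    ratios = [moment_to_video_ratio(s, e, d) for s, e, d in toy]
    print(ratios)
    print(ratio_histogram(ratios, num_bins=4))
```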

a.3.2. More Implementation Details

For the video representation module, we set the fixed number $n_{u}$ to 32 in the downsampling strategy. Besides, we set the maximum number of frames $n_{v}$ to 128. If the number of frames exceeds $n_{v}$, the frame sequence is downsampled to $n_{v}$ frames. For sentences, we set the maximum query length $n_{q}$ to 30 on TVR and Charades-STA and to 64 on ActivityNet Captions; words beyond the maximum length are simply discarded. For the Transformer module used in our model, we set its hidden size to $d = 384$ and employ 4 attention heads. For the hyper-parameters in the loss functions, we empirically set $λ_{1} = 0.02$ and $λ_{2} = 0.04$ so that all loss terms have similar values at the beginning of model training. For model training, we use an Adam optimizer with a mini-batch size of 128. The initial learning rate is set to 0.00025, and we adopt a learning rate adjustment schedule similar to that of (Lei et al., 2020). Early stopping occurs if the validation performance does not improve in ten consecutive epochs. The maximal number of epochs is set to 100. Note that we will release our source code and data.
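The settings above can be collected in a small configuration sketch. The numeric values come from the text, while the class and function names are hypothetical, and the uniform index sampling used for frame downsampling is an assumption, since the text does not specify how frames are reduced to $n_{v}$.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TrainConfig:
    n_u: int = 32           # fixed number used by the downsampling strategy
    n_v: int = 128          # maximum number of frames per video
    n_q: int = 30           # 30 on TVR / Charades-STA, 64 on ActivityNet Captions
    hidden_size: int = 384  # Transformer hidden size d
    num_heads: int = 4
    lambda_1: float = 0.02
    lambda_2: float = 0.04
    batch_size: int = 128
    lr: float = 0.00025     # initial learning rate for Adam
    patience: int = 10      # epochs without improvement before stopping
    max_epochs: int = 100


def truncate_query(tokens: Sequence[str], n_q: int) -> List[str]:
    """Discard words beyond the maximum query length."""
    return list(tokens[:n_q])


def downsample_indices(num_frames: int, n_v: int) -> List[int]:
    """Frame indices to keep. Assumption: uniform sampling when too long."""
    if num_frames <= n_v:
        return list(range(num_frames))
    return [i * num_frames // n_v for i in range(n_v)]


def epochs_run(val_scores: Sequence[float], cfg: TrainConfig) -> int:
    """Number of epochs executed before early stopping or the epoch cap."""
    best, stale = float("-inf"), 0
    for epoch, score in enumerate(val_scores[: cfg.max_epochs], start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
        if stale >= cfg.patience:
            return epoch
    return min(len(val_scores), cfg.max_epochs)


if __name__ == "__main__":
    cfg = TrainConfig()
    print(len(downsample_indices(300, cfg.n_v)))
    # Best score at epoch 1, then ten stale epochs trigger early stopping.
    print(epochs_run([1.0] + [0.5] * 20, cfg))
```

For ActivityNet Captions, the same sketch would be instantiated as `TrainConfig(n_q=64)`.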