Descriptive video service (DVS) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed DVS that is temporally aligned to full-length HD movies. In addition, we also collected the aligned movie scripts which have been used in prior work, and compare the two different sources of descriptions. In total, the Movie Description dataset contains a parallel corpus of over 54,000 sentences and video snippets from 72 HD movies. We characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing DVS to scripts, we find that DVS is far more visual and describes precisely what is shown rather than what should happen according to the scripts created prior to movie production.