We present NusaCrowd, a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark for Indonesian and its local languages. Our work is intended to help advance natural language processing research in under-represented languages.
Although sketch-to-photo retrieval has a wide range of applications, it is costly to obtain paired and richly labeled ground truth. In contrast, photo retrieval data is easier to acquire. Therefore, previous works pre-train their models on richly labeled photo retrieval data (i.e., the source domain) and then fine-tune them on the limited-labeled sketch-to-photo retrieval data (i.e., the target domain). However, without co-training on source and target data, source-domain knowledge may be forgotten during fine-tuning, while simply co-training them may cause negative transfer due to domain gaps. Moreover, the identity label spaces of the source and target data are generally disjoint, so conventional category-level Domain Adaptation (DA) is not directly applicable. To address these issues, we propose an Instance-level Heterogeneous Domain Adaptation (IHDA) framework. We apply the fine-tuning strategy for identity label learning, aiming to transfer instance-level knowledge in an inductive transfer manner. Meanwhile, labeled attributes from the source data are selected to form a shared label space for the source and target domains. Guided by these shared attributes, DA is utilized to bridge cross-dataset and heterogeneous domain gaps, transferring instance-level knowledge in a transductive transfer manner. Experiments show that our method sets a new state of the art on three sketch-to-photo image retrieval benchmarks without extra annotations, opening the door to training more effective models for limited-labeled heterogeneous image retrieval tasks. Related code is available at https://github.com/fandulu/IHDA.
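The abstract above combines an inductive identity-learning branch with a transductive attribute-guided alignment branch. The following is a minimal sketch of how such a two-term objective could look; the function name, the cross-entropy identity loss, and the mean-feature (MMD-like) alignment term are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ihda_style_loss(id_logits, id_labels, src_attr_feats, tgt_attr_feats, lam=0.5):
    """Hedged sketch of an IHDA-style objective (names hypothetical).

    Combines an instance-level identity loss (inductive transfer, computed
    on target identities) with a domain-alignment penalty over the shared
    attribute feature space (transductive transfer). The alignment term
    here is a simple distance between mean features; the paper's exact
    loss may differ.
    """
    # Cross-entropy over identity logits (the fine-tuning branch).
    shifted = id_logits - id_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    id_loss = -log_probs[np.arange(len(id_labels)), id_labels].mean()

    # Align source and target distributions of the shared attribute features.
    align_loss = np.linalg.norm(
        src_attr_feats.mean(axis=0) - tgt_attr_feats.mean(axis=0)
    )
    return id_loss + lam * align_loss
```

The key design point this illustrates is that the two branches transfer different things: identities are disjoint across domains, so only the attribute space is shared and only it participates in the alignment term.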
The success of deep learning in video Action Recognition (AR) has motivated researchers to progressively push related tasks from the coarse level to the fine-grained level. Compared with conventional AR, which only predicts an action label for an entire video, Temporal Action Detection (TAD) has been studied to estimate the start and end times of each action in a video. Taking TAD a step further, Spatiotemporal Action Detection (SAD) has been studied to localize actions both spatially and temporally in a video. However, the person performing the action is usually ignored in SAD, while identifying the actor is also important. To this end, we propose a new task, Actor-identified Spatiotemporal Action Detection (ASAD), to bridge the gap between SAD and actor identification. In ASAD, we not only detect the spatiotemporal boundaries of instance-level actions but also assign a unique ID to each actor. To approach ASAD, Multiple Object Tracking (MOT) and Action Classification (AC) are two fundamental elements. Using MOT, the spatiotemporal boundary of each actor is obtained and assigned a unique actor identity. Using AC, the action class is estimated within the corresponding spatiotemporal boundary. Since ASAD is a new task, it poses many new challenges that cannot be solved by existing methods: i) no dataset is specifically created for ASAD, ii) no evaluation metrics are designed for ASAD, and iii) current MOT performance is the bottleneck to obtaining satisfactory ASAD results. To address these problems, we contribute by i) annotating a new ASAD dataset, ii) proposing ASAD evaluation metrics that consider multi-label actions and actor identification, and iii) improving the data association strategy to boost MOT performance for better ASAD results. The code is available at https://github.com/fandulu/asad.
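The two-stage decomposition described above (MOT supplies identities and spatiotemporal boundaries, AC labels the actions inside them) can be sketched as follows; the data types and function names are illustrative, not the repository's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Tracklet:
    """Output of the MOT stage: one actor's identity plus its
    spatiotemporal boundary as per-frame boxes."""
    actor_id: int
    boxes: List[Tuple[int, int, int, int, int]]  # (frame, x1, y1, x2, y2)

def asad_pipeline(
    tracklets: List[Tracklet],
    action_classifier: Callable[[List[Tuple[int, int, int, int, int]]], List[str]],
):
    """Hedged sketch of the ASAD decomposition: MOT is assumed already
    done, so each actor has a unique ID; the AC stage then predicts the
    action(s) inside each boundary. Multi-label output is allowed,
    matching the proposed evaluation metric."""
    results = []
    for t in tracklets:
        actions = action_classifier(t.boxes)  # e.g. ["walk", "wave"]
        results.append({"actor_id": t.actor_id, "boxes": t.boxes, "actions": actions})
    return results
```

This also makes the stated bottleneck concrete: if MOT fragments or swaps identities, every downstream action record inherits the wrong `actor_id`, which is why the paper invests in a better data-association strategy.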
Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches a non-optimal size or use neural architecture search but often suffer from high training costs. We present Nix-TTS, a lightweight TTS obtained via knowledge distillation from a high-quality yet large, non-autoregressive, end-to-end (vocoder-free) TTS teacher model. Specifically, we offer module-wise distillation, enabling flexible and independent distillation of the encoder and decoder modules. The resulting Nix-TTS inherits the advantageous non-autoregressive and end-to-end properties of the teacher, yet is significantly smaller, with only 5.23M parameters, up to an 89.34% reduction relative to the teacher model; it also achieves over 3.04x and 8.36x inference speedup on an Intel i7 CPU and a Raspberry Pi 3B respectively, while retaining fair voice naturalness and intelligibility compared to the teacher model. We provide pretrained models and audio samples of Nix-TTS.
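The module-wise distillation idea above (distill the encoder and decoder independently, rather than through a single end-to-end loss) can be sketched as a weighted sum of per-module matching terms. The L2 feature-matching form and the weights are assumptions for illustration; the paper's actual distillation losses may differ.

```python
import numpy as np

def modulewise_distill_loss(t_enc, s_enc, t_dec, s_dec, w_enc=1.0, w_dec=1.0):
    """Hedged sketch of module-wise knowledge distillation.

    Each student module matches the corresponding teacher module's output
    independently, so the encoder and decoder can be distilled flexibly,
    e.g. separately or with different weights.
    """
    enc_loss = np.mean((t_enc - s_enc) ** 2)  # encoder feature matching
    dec_loss = np.mean((t_dec - s_dec) ** 2)  # decoder output matching
    return w_enc * enc_loss + w_dec * dec_loss
```

Because the two terms are decoupled, either module can be swapped for a smaller student and retrained on its own, which is what makes the distillation "flexible and independent" in the abstract's sense.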