Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
Translated by Google Translate
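The style-aware adaptation described in the talking-face abstract above, where an encoded style code adjusts the weights of feed-forward layers, can be illustrated with a minimal sketch. It assumes one possible realization: the layer keeps K candidate weight matrices and blends them with softmax coefficients computed from the style code. The function names, the gating matrix, and the toy weights are all hypothetical and are not taken from the paper or its repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def style_adaptive_linear(x, style_code, candidates, gate):
    """Apply a linear layer whose weights are a style-dependent blend.

    candidates: K weight matrices of identical shape (hypothetical).
    gate: K x len(style_code) matrix mapping the style code to K scores.
    Returns the layer output and the K mixing coefficients.
    """
    # The mixing coefficients depend only on the style code, not on x.
    mix = softmax(matvec(gate, style_code))
    rows, cols = len(candidates[0]), len(candidates[0][0])
    blended = [
        [sum(a * w[r][c] for a, w in zip(mix, candidates)) for c in range(cols)]
        for r in range(rows)
    ]
    return matvec(blended, x), mix


# Toy setup: two candidate weight matrices and an identity gate.
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
gate = [[1.0, 0.0], [0.0, 1.0]]

neutral_out, neutral_mix = style_adaptive_linear(
    [1.0, 2.0], [0.0, 0.0], [identity, doubled], gate
)
styled_out, styled_mix = style_adaptive_linear(
    [1.0, 2.0], [0.0, 5.0], [identity, doubled], gate
)
```

With the neutral style code the two candidates are averaged, while the second style code pushes almost all of the weight onto the `doubled` matrix, so the same input produces a different output depending on style alone.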
To facilitate research on text generation, this paper presents a comprehensive and unified library, TextBox 2.0, focusing on the use of pre-trained language models (PLMs). To be comprehensive, our library covers $13$ common text generation tasks and their corresponding $83$ datasets and further incorporates $45$ PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs. We also implement $4$ efficient training strategies and provide $4$ generation objectives for pre-training new PLMs from scratch. To be unified, we design the interfaces to support the entire research pipeline (from data loading to training and evaluation), ensuring that each step can be fulfilled in a unified way. Despite the rich functionality, it is easy to use our library, either through the friendly Python API or command line. To validate the effectiveness of our library, we conduct extensive experiments and exemplify four types of research scenarios. The project is released at the link: https://github.com/RUCAIBox/TextBox.
Determining the causal effects of multiple interventions over time assists decision-making. Because of time-varying bias, selection bias, and interactions among multiple interventions, few methods disentangle and estimate multiple treatment effects from individual temporal data. To tackle these challenges, we propose a comprehensive framework of temporal counterfactual forecasting from an individual multiple treatment perspective (TCFimt). TCFimt constructs adversarial tasks in a seq2seq framework to alleviate selection and time-varying bias, and designs a contrastive learning-based block to decouple a mixed treatment effect into separate main treatment effects and causal interactions, which further improves estimation accuracy. In experiments on two real-world datasets from distinct fields, the proposed method performs better than state-of-the-art methods in predicting future outcomes under specific treatments and in choosing the optimal treatment type and timing.
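Despite advances in generating fluent text, existing pre-trained models tend to attach incoherent event sequences to the entities involved when generating narratives such as stories and news. We conjecture that these issues result from representing entities as static embeddings of surface words while neglecting to model their ever-changing states, i.e., the information they carry, as the text unfolds. Therefore, we extend the Transformer model to dynamically perform entity state updates and sentence realization for narrative generation. We propose a contrastive framework to learn state representations in a discrete space and insert additional attention layers into the decoder to better exploit these states. Experiments on two narrative datasets show that, guided by meaningful entity states, our model generates more coherent and diverse narratives.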
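We introduce a novel backbone architecture that improves the target-perception ability of feature representations for tracking. Specifically, we observe that de facto frameworks perform feature matching simply on the outputs of the backbone for target localization, so there is no direct feedback from the matching module to the backbone network, especially its shallow layers. More concretely, only the matching module can directly access the target information (in the reference frame), while representation learning for the candidate frame is blind to the reference target. As a result, the accumulated effect of target-irrelevant interference in the shallow stages may degrade the feature quality of deeper layers. In this paper, we approach the problem from a different angle by conducting multiple branch-wise interactions inside Siamese-like backbone networks (InBN). At the core of InBN is a general interaction modeler (GIM) that injects prior knowledge of the reference image into different stages of the backbone network, leading to better target perception and robust distractor resistance in the candidate feature representation at negligible computational cost. The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNN and Transformer, as evidenced by our extensive experiments on multiple benchmarks. In particular, the CNN version (based on SiamCAR) improves the baseline by 3.2/6.9 absolute SUC on LaSOT/TNL2K, respectively. The Transformer version obtains SUC scores of 25.7/52.0 on LaSOT/TNL2K, which are on par with the recent state of the art. Code and models will be released.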
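One-shot multi-object tracking, which integrates object detection and ID embedding extraction into a unified network, has achieved groundbreaking results in recent years. However, current one-shot trackers rely solely on single-frame detections to predict candidate bounding boxes, which may be unreliable under severe visual degradation such as motion blur and occlusion. Once the detector mistakenly classifies a target bounding box as background, the temporal consistency of its corresponding tracklet is no longer maintained. In this paper, we set out to restore the bounding boxes misclassified as "fake background" by proposing a re-check network. The re-check network innovatively expands the role of the ID embedding from data association to motion forecasting by effectively propagating previous tracklets to the current frame with a small overhead. Note that the propagation results are produced by an independent and efficient embedding search, which prevents the model from over-relying on detection results. Ultimately, this helps to reload the "fake background" and repair broken tracklets. Building on the strong baseline CSTrack, we construct a new one-shot tracker that achieves favorable gains of 70.7 $\rightarrow$ 76.4 and 70.6 $\rightarrow$ 76.3 on MOT16 and MOT17, respectively. It also reaches new state-of-the-art MOTA and IDF1 performance. Code is released at https://github.com/judasdie/sots.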
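The embedding search mentioned in the tracking abstract above can be sketched as a similarity lookup: the ID embedding of a previous tracklet is compared with the embedding at every location of the current frame, and the best-matching location is taken as the propagated position. This is only an illustration under stated assumptions. Cosine similarity, the grid layout, and the function names are hypothetical choices and are not confirmed by the source.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def embedding_search(tracklet_embedding, frame_embeddings):
    """Find the grid cell whose embedding best matches a previous tracklet.

    frame_embeddings: grid[row][col] -> embedding vector (assumed layout).
    Returns (best similarity, (row, col)); a caller would typically apply
    a threshold before trusting the match.
    """
    best_score, best_cell = float("-inf"), None
    for r, row in enumerate(frame_embeddings):
        for c, embedding in enumerate(row):
            score = cosine(tracklet_embedding, embedding)
            if score > best_score:
                best_score, best_cell = score, (r, c)
    return best_score, best_cell


# Toy 2 x 2 embedding map for the current frame.
frame = [
    [[0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.1], [-1.0, 0.0]],
]
score, cell = embedding_search([1.0, 0.0], frame)
```

Because the lookup uses only stored embeddings and the current embedding map, it does not depend on whether the detector fired at that location, which is the property the abstract emphasizes.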
Mitosis nuclei count is one of the important indicators for the pathological diagnosis of breast cancer. Manual annotation requires experienced pathologists and is time-consuming and inefficient. With the development of deep learning methods, some models with good performance have emerged, but their generalization ability needs further strengthening. In this paper, we propose a two-stage mitosis segmentation and classification method, named SCMitosis. First, the proposed depthwise separable convolution residual block and channel-spatial attention gate achieve segmentation with a high recall rate. Then, a cascaded classification network further improves the detection performance for mitosis nuclei. The proposed model is verified on the ICPR 2012 dataset, where it obtains an F-score of 0.8687, the highest among the compared state-of-the-art algorithms. In addition, the model also achieves good performance on the GZMH dataset, which was prepared by our group and will be released for the first time with the publication of this paper. The code will be available at: https://github.com/antifen/mitosis-nuclei-segmentation.
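As background for the depthwise separable convolution named in the abstract above, the following sketch compares weight counts for a standard convolution and a depthwise separable one. The layer sizes are illustrative assumptions, not values from the paper, and biases are ignored.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(kernel, in_channels, out_channels):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = kernel * kernel * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels     # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
```

For this hypothetical layer the standard convolution needs 73,728 weights and the separable version 8,768, which is why such blocks are often chosen when a lighter network is wanted.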
Remote sensing technologies have developed rapidly and gained significant attention due to their ability to accurately localize, classify, and segment objects in aerial images. These technologies are commonly used in unmanned aerial vehicles (UAVs) equipped with high-resolution cameras or sensors to capture data over large areas. This data is useful for various applications, such as monitoring and inspecting cities, towns, and terrains. In this paper, we present a method for classifying and segmenting dashed traffic lines on city roads from aerial images using deep learning models such as U-Net and SegNet. Annotated data is used to train these models, which then classify and segment each aerial image into two classes: dashed lines and non-dashed lines. However, a deep learning model may not identify all dashed lines because of poor painting or occlusion by trees or shadows. To address this issue, we propose a method that adds missed lines to the segmentation output. We also extract the x and y coordinates of each dashed line from the segmentation output, which city planners can use to construct a CAD file for digital visualization of the roads.
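The coordinate extraction step in the abstract above can be sketched as connected-component labeling on the binary segmentation mask, followed by a centroid per component. This is a minimal illustration, assuming that each dash forms one 4-connected blob and that a centroid is an acceptable coordinate. The function name and the toy mask are hypothetical, and the paper's actual procedure may differ.

```python
from collections import deque


def dash_centroids(mask):
    """Return (x, y) centroids of 4-connected regions of 1s in a binary mask.

    mask: list of rows, where 1 marks a dashed-line pixel and 0 anything else.
    Components are reported in row-major order of their first pixel.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    centroids = []
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill collects all pixels of this dash.
            queue = deque([(x, y)])
            seen[y][x] = True
            xs, ys = [], []
            while queue:
                cx, cy = queue.popleft()
                xs.append(cx)
                ys.append(cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return centroids


# Toy mask with two short dashes.
toy_mask = [
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
coords = dash_centroids(toy_mask)
```

The resulting pixel coordinates would still need to be mapped to real-world coordinates, for example with the image's georeferencing, before they could be placed in a CAD file.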
3D object detection with surround-view images is an essential task for autonomous driving. In this work, we propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images. We design a novel projective cross-attention mechanism for query-image interaction to address the limitations of existing methods in terms of geometric cue exploitation and information loss for cross-view objects. In addition, we introduce a heatmap generation technique that bridges 3D and 2D spaces efficiently via query initialization. Furthermore, unlike the common practice of fusing intermediate spatial features for temporal aggregation, we provide a new perspective by introducing a novel hybrid approach that performs cross-frame fusion over past object queries and image features, enabling efficient and robust modeling of temporal information. Extensive experiments on the nuScenes dataset demonstrate the effectiveness and efficiency of the proposed DETR4D.
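The query-image interaction in the abstract above relies on relating 3D locations to 2D image features. As a generic illustration only, the sketch below projects a 3D point, already expressed in each camera's coordinate frame, through a pinhole model and reports which views actually see it. The camera names, the intrinsics, and the function name are hypothetical, and this is not the paper's projective cross-attention itself.

```python
def project_to_view(point_cam, fx, fy, cx, cy, width, height):
    """Project a 3D point in camera coordinates to pixel coordinates.

    Assumes x points right, y points down, and z points forward.
    Returns (u, v), or None if the point is behind the camera or
    falls outside the image.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None


# The same object expressed in two hypothetical camera frames.
points = {
    "front": (0.5, -0.25, 2.0),
    "back": (-0.5, -0.25, -2.0),
}
hits = {
    name: project_to_view(p, 100.0, 100.0, 50.0, 40.0, 100, 80)
    for name, p in points.items()
}
```

A point near a view boundary can project into two adjacent cameras at once, which is the cross-view situation the abstract refers to. Features would then be sampled at every valid projection.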
Visual localization plays an important role in intelligent robots and autonomous driving, especially when GNSS is unreliable. Recently, camera localization in LiDAR maps has attracted increasing attention for its low cost and potential robustness to illumination and weather changes. However, the commonly used pinhole camera has a narrow field of view and therefore provides limited information compared with omnidirectional LiDAR data. To overcome this limitation, we focus on correlating the information in 360° equirectangular images with point clouds, and propose an end-to-end learnable network that performs cross-modal visual localization by establishing similarity in a high-dimensional feature space. Inspired by the attention mechanism, we optimize the network to capture the salient features for comparing images and point clouds. We construct several sequences containing 360° equirectangular images and corresponding point clouds based on the KITTI-360 dataset and conduct extensive experiments. The results demonstrate the effectiveness of our approach.
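To make the relation between point clouds and equirectangular images in the abstract above concrete, the sketch below maps a 3D point to pixel coordinates in an equirectangular panorama. The axis convention (x forward, y left, z up), the image size, and the function name are assumptions chosen for illustration. Real datasets define their own conventions.

```python
import math


def project_equirectangular(point, width, height):
    """Map a 3D point to (u, v) pixel coordinates in an equirectangular image.

    Assumed convention: x forward, y left, z up. Longitude 0 lands at the
    image center, and latitude +90 degrees lands on the top row.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0:
        raise ValueError("the origin has no direction to project")
    lon = math.atan2(y, x)                       # in [-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, z / r)))  # in [-pi/2, pi/2]
    u = ((0.5 - lon / (2 * math.pi)) * width) % width  # wrap at the seam
    v = (0.5 - lat / math.pi) * height
    return u, v


# Hypothetical 2048 x 1024 panorama.
forward = project_equirectangular((1.0, 0.0, 0.0), 2048, 1024)
left = project_equirectangular((0.0, 1.0, 0.0), 2048, 1024)
up = project_equirectangular((0.0, 0.0, 1.0), 2048, 1024)
```

Unlike a pinhole projection, every direction around the sensor maps to some pixel, so no LiDAR point is discarded for lying outside the field of view.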