Learning an explainable classifier often results in a low-accuracy model or a huge rule set, whereas learning a deep model is usually more capable of handling noisy data at scale, at the cost of results that are hard to explain and weak generalization. To bridge this gap, we propose an end-to-end deep explainable learning approach that combines the noise-handling advantage of deep models with the interpretability of expert rules. Specifically, we learn a deep data-assessing model that represents the data as a graph capturing the correlations among different observations; its output is used to extract key data features. The key features are then fed into a rule network constructed from predefined noisy expert rules with trainable parameters. Since these models are coupled, we propose an end-to-end training framework that uses the rule classification loss to optimize the rule learning model and the data assessing model simultaneously. Because rule-based computation is non-differentiable, we propose a gradient linking search module to carry gradient information from the rule learning model to the data assessing model. The proposed method is tested in an industrial production system and shows comparable prediction accuracy, much higher generalization stability, and better interpretability than a strong deep ensemble baseline, along with much better fitting power than a pure rule-based approach.
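To make the combination concrete, below is a minimal sketch of how a predefined expert rule such as "IF x[0] > t0 AND x[2] > t1 THEN fire" could be relaxed into a differentiable form with trainable thresholds, so that a rule classification loss can back-propagate through the rule network into an upstream model. The class names, parameterization, and soft AND/OR choices here are illustrative assumptions, not the paper's exact gradient linking search module.

```python
import torch
import torch.nn as nn

class SoftRule(nn.Module):
    """Differentiable relaxation of one expert rule of the form
    'IF x[i] > threshold_i for all i in the rule THEN fire'.
    Thresholds and sharpness are trainable; the hard AND is replaced
    by a product of sigmoids (a soft conjunction). All names here are
    illustrative, not from the paper."""

    def __init__(self, feature_idx, init_thresholds):
        super().__init__()
        self.feature_idx = feature_idx               # which key features the rule reads
        self.thresholds = nn.Parameter(torch.tensor(init_thresholds))
        self.sharpness = nn.Parameter(torch.ones(len(feature_idx)))

    def forward(self, x):                            # x: (batch, num_features)
        lits = torch.sigmoid(self.sharpness * (x[:, self.feature_idx] - self.thresholds))
        return lits.prod(dim=-1)                     # soft AND, in (0, 1)

# A rule set becomes a small network: rule firings are combined
# (here, a soft OR via max) into a class score.
rules = nn.ModuleList([SoftRule([0, 2], [0.5, 1.0]), SoftRule([1], [0.0])])
x = torch.randn(8, 4, requires_grad=True)
score = torch.stack([r(x) for r in rules], dim=-1).max(dim=-1).values
score.sum().backward()    # gradients flow back through the soft rules to x
```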
Ultra-high resolution image segmentation has attracted increasing interest in recent years due to its real-world applications. In this paper, we improve upon the widely used high-resolution image segmentation pipeline, in which an ultra-high resolution image is partitioned into regular patches for local segmentation and the local results are then merged into a high-resolution semantic mask. In particular, we introduce a novel locality-aware context fusion based segmentation model to process local patches, in which the relevance between a local patch and its various contexts is exploited jointly and complementarily to handle semantic regions with large variations. Additionally, we present an alternating local enhancement module that restricts the negative impact of redundant information introduced from the contexts, and is thus able to refine the locality-aware features to produce improved results. Furthermore, comprehensive experiments demonstrate that our model outperforms other state-of-the-art methods on public benchmarks. Our code is available at: https://github.com/liqiokkk/FCtL.
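As a rough illustration of what a locality-aware context fusion step might look like, the sketch below lets features of a high-resolution local patch attend to features of its larger, downsampled context via cross-attention. The module name, dimensions, and single-context simplification are our assumptions rather than the released FCtL implementation.

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Minimal sketch (not the paper's exact module): local patch
    features attend to features extracted from a larger, downsampled
    context crop, so each local pixel can borrow surrounding semantics."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_feat, context_feat):
        # local_feat:   (B, C, h, w)  features of the high-res patch
        # context_feat: (B, C, H, W)  features of its downsampled context
        B, C, h, w = local_feat.shape
        q = local_feat.flatten(2).transpose(1, 2)        # (B, h*w, C)
        kv = context_feat.flatten(2).transpose(1, 2)     # (B, H*W, C)
        fused, _ = self.attn(q, kv, kv)                  # cross-attention
        fused = self.norm(q + fused)                     # residual + norm
        return fused.transpose(1, 2).reshape(B, C, h, w)

local = torch.randn(2, 256, 32, 32)
context = torch.randn(2, 256, 16, 16)
out = ContextFusion()(local, context)                    # (2, 256, 32, 32)
```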
Crowd counting plays an important role in risk perception and early warning, traffic control, and scene statistical analysis. The challenges of crowd counting in highly dense and complex scenes lie in the mutual occlusion of human body parts, the large variation of body scales, and the complexity of imaging conditions. Deep learning based head detection is a promising approach to crowd counting. However, mainstream object detection networks cannot be readily applied in this field, for two main reasons. First, most existing head detection datasets are annotated only with center points instead of the bounding boxes that canonical detectors require. Second, sample imbalance has not yet been overcome in highly dense and complex scenes, because existing loss functions compute the positive loss at a single key point or over the entire target area with uniform weight. To address these problems, we propose a novel loss function, called Mask Focal Loss, which unifies the loss functions based on heatmap ground truth (GT) and binary feature map GT. Mask Focal Loss redefines the weight of each pixel's loss contribution according to its value in the Gaussian-kernel heatmap. For better evaluation and comparison, we release a new synthetic dataset, GTA_Head, comprising 35 sequences, 5,096 images, and 1,732,043 head labels with bounding boxes. Experimental results show superior performance and demonstrate that the proposed Mask Focal Loss is applicable to all canonical detectors and to various datasets with different forms of GT. This provides a strong basis for surpassing crowd counting methods based on density estimation.
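One plausible reading of such a heatmap-weighted focal loss, in the spirit of the Gaussian focal loss popularized by CornerNet-style detectors, is sketched below; the exact weighting in Mask Focal Loss may differ, and the function name and hyperparameters here are assumptions.

```python
import torch

def mask_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Sketch of a heatmap focal loss in the spirit the abstract
    describes (the paper's exact weighting may differ). pred and gt are
    heatmaps in [0, 1]; gt may be a binary mask or a Gaussian-splatted
    heatmap, so one loss covers both kinds of ground truth.
    pred, gt: (B, 1, H, W)."""
    pos = gt.eq(1.0).float()                   # peak pixels count as positives
    neg = 1.0 - pos
    # negatives near a peak (high gt value) are down-weighted by (1 - gt)^beta
    pos_loss = pos * (1 - pred).pow(alpha) * torch.log(pred.clamp(min=eps))
    neg_loss = neg * (1 - gt).pow(beta) * pred.pow(alpha) * torch.log((1 - pred).clamp(min=eps))
    num_pos = pos.sum().clamp(min=1.0)
    return -(pos_loss + neg_loss).sum() / num_pos

pred = torch.rand(2, 1, 64, 64)
gt = torch.zeros(2, 1, 64, 64)
gt[:, :, 32, 32] = 1.0                         # toy single-head peak
loss = mask_focal_loss(pred, gt)
```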
Semantic segmentation is a high-level computer vision task that assigns a label to each pixel of an image. It is challenging to deal with extremely imbalanced data in which the ratio of target pixels to background pixels is lower than 1:1000, since such severe input imbalance leads to output imbalance and hence poor model training. This paper addresses three issues of extremely imbalanced data: inspired by region-based losses, an implicit measure of the output imbalance is proposed, and an adaptive algorithm is designed to guide the selection of the output-imbalance hyperparameter; the measure is then generalized to distribution-based losses for dealing with output imbalance; and finally, a compound loss with our adaptive hyperparameter selection algorithm keeps training and inference consistent, harmonizing the output imbalance. With four popular deep architectures on our private dataset (at three input imbalance scales) and three public datasets, extensive experiments demonstrate the competitive and promising performance of the proposed method.
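Below is a hedged sketch of such a compound loss for binary segmentation, pairing a region-based term (soft Dice) with a distribution-based term (cross-entropy), plus one simple proxy for the output imbalance that an adaptive scheme could monitor; the paper's precise measure and adaptive hyperparameter selection algorithm are not reproduced here.

```python
import torch
import torch.nn.functional as F

def compound_loss(logits, target, lam=0.5, eps=1e-6):
    """Hedged sketch of a region + distribution compound loss for
    extremely imbalanced binary segmentation (the paper's exact terms
    and adaptive schedule for lam are not reproduced).
    logits: (B, 1, H, W), target: (B, 1, H, W) in {0, 1}."""
    prob = torch.sigmoid(logits)
    # region-based term: soft Dice, insensitive to the background majority
    inter = (prob * target).sum()
    dice = 1.0 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    # distribution-based term: pixel-wise binary cross-entropy
    bce = F.binary_cross_entropy_with_logits(logits, target.float())
    return lam * dice + (1 - lam) * bce

def output_imbalance(logits, thresh=0.5):
    """One simple proxy for output imbalance: the predicted foreground
    ratio, which an adaptive scheme could compare against the ground-truth
    foreground ratio when tuning hyperparameters."""
    return (torch.sigmoid(logits) > thresh).float().mean()
```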
Conventional domain generalization aims to learn domain-invariant representations from multiple domains, which requires accurate annotations. In realistic application scenarios, however, collecting and annotating large amounts of data is too cumbersome or even infeasible. Web data, on the other hand, offers a free lunch: access to a large amount of unlabeled data with rich style information that can be exploited to enhance domain generalization. In this paper, we introduce a new task, termed semi-supervised domain generalization, to study how to leverage both labeled and unlabeled domains, and we establish two benchmarks, including a web-crawled dataset that poses a novel yet realistic challenge pushing the limits of existing techniques. To solve this task, a straightforward solution is to propagate class information from the labeled to the unlabeled domains via pseudo-labeling combined with domain-confusion training. Observing that narrowing the domain gap can improve the quality of pseudo labels and further advance domain-invariant feature learning for generalization, we propose a cyclic learning framework that encourages positive feedback between label propagation and domain generalization, favoring an evolving intermediate domain that bridges the labeled and unlabeled domains in a curriculum learning manner. Experiments are conducted to validate the effectiveness of our framework. Notably, our results show that web-crawled data benefits domain generalization. Our code will be made available later.
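The label-propagation half of such a cycle can be sketched as a confidence-thresholded pseudo-labeling step, shown below; the domain-confusion term, the evolving intermediate domain, and the curriculum schedule are omitted, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(model, labeled_batch, unlabeled_batch, optimizer, tau=0.95):
    """Minimal sketch of one training step that propagates labels from
    the labeled domains to unlabeled web data (illustrative, not the
    paper's full cyclic framework)."""
    x_l, y_l = labeled_batch
    x_u = unlabeled_batch

    loss = F.cross_entropy(model(x_l), y_l)    # supervised term

    with torch.no_grad():                      # generate pseudo labels
        probs = F.softmax(model(x_u), dim=1)
        conf, pseudo = probs.max(dim=1)
        keep = conf > tau                      # keep only confident predictions

    if keep.any():                             # propagate labels to web data
        loss = loss + F.cross_entropy(model(x_u[keep]), pseudo[keep])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```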
Quantum autoencoders, which aim to compress quantum information into a low-dimensional latent space, lie at the heart of automatic data compression in the field of quantum information. In this paper, we establish an upper bound on the compression rate for a given quantum autoencoder and propose a learning control approach for training the autoencoder to achieve the maximal compression rate. The upper bound on the compression rate is proven theoretically using eigen-decomposition and matrix differentiation, and it depends on the eigenvalues of the density-matrix representation of the input states. Numerical results on 2-qubit and 3-qubit systems are presented to demonstrate how to train a quantum autoencoder to achieve the theoretically maximal compression, and the training performance of different machine learning algorithms is compared. Experimental results of a quantum autoencoder implemented with a quantum optical system illustrate compressing two 2-qubit states into two 1-qubit states.
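As a hedged numerical illustration of the kind of eigenvalue-dependent bound described, the snippet below sums the largest eigenvalues of the input density matrix, keeping as many as the latent space can represent; the paper's exact statement, proof conditions, and constants may differ.

```python
import numpy as np

def compression_upper_bound(rho, n_latent_qubits):
    """Hedged sketch: bound the achievable compression fidelity for input
    density matrix rho by its largest eigenvalues, keeping as many as the
    latent space can represent (2**n_latent_qubits of them). This is an
    assumption about the bound's form, not the paper's exact theorem."""
    eigvals = np.linalg.eigvalsh(rho)            # real eigenvalues, ascending
    k = 2 ** n_latent_qubits
    return np.sort(eigvals)[::-1][:k].sum()      # sum of the k largest

# Toy example: a mixed 2-qubit state compressed to 1 qubit.
rho = np.diag([0.7, 0.2, 0.08, 0.02])            # valid density matrix (trace 1)
print(compression_upper_bound(rho, n_latent_qubits=1))   # 0.9
```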
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains highly robust even when the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
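A very reduced sketch of the token-level idea, as we read it: image tokens and point tokens, each tagged with a position encoding derived from 3D coordinates, are concatenated into one sequence that learned object queries decode into boxes. The encoders, position encodings, and head design below are simplified assumptions, not the released CMT code.

```python
import torch
import torch.nn as nn

class CrossModalDecoder(nn.Module):
    """Reduced sketch of cross-modal token decoding: both modalities are
    concatenated into one memory sequence that object queries attend to."""

    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.box_head = nn.Linear(dim, 7)        # (x, y, z, w, l, h, yaw)

    def forward(self, img_tokens, pts_tokens, img_pos, pts_pos):
        # all token tensors: (B, N_modality, dim); *_pos are 3D-derived
        # position encodings that implicitly align the two modalities
        memory = torch.cat([img_tokens + img_pos, pts_tokens + pts_pos], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(q, memory)             # queries attend to both modalities
        return self.box_head(hs)                 # (B, num_queries, 7)
```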
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain performance comparable to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility, and the security risks stemming from them have not been explored. This study performs the first backdoor attack against models trained on data distilled by dataset distillation in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, which is where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
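A NAIVEATTACK-style trigger injection can be sketched in a few lines, stamping a fixed patch into images before distillation begins (DOORPING, by contrast, re-optimizes the trigger throughout distillation); the patch location, size, and value below are arbitrary illustrative choices.

```python
import torch

def add_trigger(images, size=3, value=1.0):
    """Sketch of a fixed backdoor trigger applied to raw images before
    the distillation procedure starts. images: (B, C, H, W) in [0, 1]."""
    poisoned = images.clone()
    poisoned[:, :, -size:, -size:] = value     # bottom-right square trigger
    return poisoned
```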
Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while a model that is not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, and evaluating drum grooves, for which there is little precedent in the literature, is even more so. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 against those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
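For intuition, here is one possible (not the paper's) text encoding of a drum groove, turning time steps into tokens that a language model could be fine-tuned on; the token vocabulary and grid resolution are illustrative assumptions.

```python
def encode_groove(hits, steps_per_bar=16):
    """Illustrative text encoding of a drum groove: one token per
    sixteenth-note step, naming the drums struck at that step.
    hits: dict mapping step index -> list of drum names."""
    tokens = []
    for step in range(steps_per_bar):
        struck = hits.get(step, [])
        tokens.append("+".join(sorted(struck)) if struck else "rest")
    return " ".join(tokens)

# A basic rock bar: kick on 1 and 3, snare on 2 and 4, eighth-note hats.
groove = {0: ["kick", "hat"], 2: ["hat"], 4: ["snare", "hat"], 6: ["hat"],
          8: ["kick", "hat"], 10: ["hat"], 12: ["snare", "hat"], 14: ["hat"]}
print(encode_groove(groove))
# hat+kick rest hat rest hat+snare rest hat rest hat+kick rest hat rest hat+snare rest hat rest
```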
Few-Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features within a Transformer-like framework. Our key insights are twofold: first, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features; second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After these steps, performance on the novel classes improves significantly over our strong baseline. Additionally, our framework can easily be extended to incremental FSIS with minor modification. Benchmarking on the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shot counts, e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few-Shot Object Detection. Code and model will be available.
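The first reference step, as we read it, can be sketched as masked average pooling of support features into a class center that then re-weights query features channel-wise; the function names and the sigmoid gating below are our assumptions, not the paper's exact module.

```python
import torch

def dynamic_class_center(support_feat, support_mask):
    """Sketch: support masks pool support features into one per-class
    center vector (illustrative reading of the mask-based weighting idea).
    support_feat: (S, C, H, W), support_mask: (S, 1, H, W) in {0, 1}."""
    masked = support_feat * support_mask
    return masked.sum(dim=(0, 2, 3)) / support_mask.sum().clamp(min=1.0)  # (C,)

def reweight_query(query_feat, center):
    # query_feat: (B, C, H, W); scale each channel by the class center
    w = torch.sigmoid(center).view(1, -1, 1, 1)
    return query_feat * w

support_feat = torch.randn(5, 256, 32, 32)
support_mask = (torch.rand(5, 1, 32, 32) > 0.5).float()
center = dynamic_class_center(support_feat, support_mask)
query = reweight_query(torch.randn(2, 256, 32, 32), center)
```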