Image token removal is an efficient augmentation strategy for reducing the cost of computing image features. However, it has been found to adversely affect the accuracy of CLIP-based training. We hypothesize that removing a large portion of image tokens may improperly discard the semantic content associated with a given text description, thus constituting an incorrect pairing target in CLIP training. To address this issue, we propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description. The correlation scores are computed in an online fashion using the EMA version of the visual encoder. Our experiments show that the proposed attentive masking approach performs better than the previous method of random token removal for CLIP training. The approach also makes it efficient to apply multiple augmentation views to the image and to introduce instance contrastive learning tasks between these views into the CLIP framework. Compared to other CLIP improvements that combine different pre-training targets, such as SLIP and MaskCLIP, our method is not only more effective but also much more efficient. Specifically, using ViT-B and the YFCC-15M dataset, our approach achieves $43.9\%$ top-1 accuracy on ImageNet-1K zero-shot classification, as well as $62.7/42.1$ and $38.0/23.2$ I2T/T2I retrieval accuracy on Flickr30K and MS COCO, which are $+1.1\%$, $+5.5/+0.9$, and $+4.4/+1.3$ higher than the SLIP method, while being $2.30\times$ faster. An efficient version of our approach, running $1.16\times$ faster than the plain CLIP model, achieves significant gains of $+5.3\%$, $+11.3/+8.0$, and $+9.5/+4.9$ on these benchmarks.
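The token-selection step described above can be illustrated with a minimal sketch. It assumes each image token already has a text-relevance score (in the abstract these come from the EMA visual encoder; here they are plain floats supplied by the caller) and keeps the highest-scoring fraction. The function name `select_attentive_tokens` and the `keep_ratio` parameter are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def select_attentive_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the tokens to keep, in their original order.

    scores: one relevance score per image token (assumed precomputed).
    keep_ratio: fraction of tokens to retain, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, math.ceil(len(scores) * keep_ratio))
    # Rank token indices by score, highest first; ties keep the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore positional order so the kept tokens stay spatially ordered.
    return sorted(ranked[:num_keep])


if __name__ == "__main__":
    toy_scores = [0.10, 0.80, 0.05, 0.60, 0.30, 0.90]
    print(select_attentive_tokens(toy_scores, keep_ratio=0.5))
```

Random token removal would instead sample the kept indices uniformly; the only difference in this sketch is that the ranking is driven by the relevance scores.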
translated by 谷歌翻译
While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming. Previous studies on automatic speech quality assessment address the problem by predicting human evaluation scores with machine learning models. However, they rely on supervised learning and thus suffer from high annotation costs and domain-shift problems. We propose SpeechLMScore, an unsupervised metric to evaluate generated speech using a speech-language model. SpeechLMScore computes the average log-probability of a speech signal by mapping it into discrete tokens and measures the average probability of generating the sequence of tokens. Therefore, it does not require human annotation and is a highly scalable framework. Evaluation results demonstrate that the proposed metric shows a promising correlation with human evaluation scores on different speech generation tasks including voice conversion, text-to-speech, and speech enhancement.
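The scoring rule can be written as the mean log-probability of each token given its predecessors. The sketch below assumes the speech has already been mapped to discrete tokens and that a language model exposes next-token probabilities. The `next_token_probs` callable and the toy uniform model are hypothetical stand-ins, not the paper's tokenizer or model.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical interface: given the token prefix, return P(next token | prefix).
NextTokenProbs = Callable[[Sequence[int]], Dict[int, float]]


def average_log_prob(tokens: Sequence[int], next_token_probs: NextTokenProbs) -> float:
    """Mean of log p(token_i | tokens before i) over the sequence."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    total = 0.0
    for i, token in enumerate(tokens):
        probs = next_token_probs(tokens[:i])
        total += math.log(probs[token])
    return total / len(tokens)


def uniform_model(prefix: Sequence[int]) -> Dict[int, float]:
    """Toy model: four possible tokens, each equally likely."""
    return {t: 0.25 for t in range(4)}


if __name__ == "__main__":
    # Under the uniform toy model every token contributes log(0.25).
    print(average_log_prob([0, 3, 1, 1], uniform_model))
```

Because the score is a per-token average, sequences of different lengths can be compared directly, and no human ratings are needed to compute it.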
Objective: We aim to develop an open-source natural language processing (NLP) package, SODA (i.e., SOcial DeterminAnts), with pre-trained transformer models to extract social determinants of health (SDoH) for cancer patients, examine the generalizability of SODA to a new disease domain (i.e., opioid use), and evaluate the extraction rate of SDoH using cancer populations. Methods: We identified SDoH categories and attributes and developed an SDoH corpus using clinical notes from a general cancer cohort. We compared four transformer-based NLP models to extract SDoH, examined the generalizability of NLP models to a cohort of patients prescribed opioids, and explored customization strategies to improve performance. We applied the best NLP model to extract 19 categories of SDoH from the breast (n=7,971), lung (n=11,804), and colorectal cancer (n=6,240) cohorts. Results and Conclusion: We developed a corpus of 629 cancer patients' notes with annotations of 13,193 SDoH concepts/attributes from 19 categories of SDoH. The Bidirectional Encoder Representations from Transformers (BERT) model achieved the best strict/lenient F1 scores of 0.9216 and 0.9441 for SDoH concept extraction, and 0.9617 and 0.9626 for linking attributes to SDoH concepts. Fine-tuning the NLP models using new annotations from opioid use patients improved the strict/lenient F1 scores from 0.8172/0.8502 to 0.8312/0.8679. The extraction rates among the 19 categories of SDoH varied greatly: 10 SDoH could be extracted from >70% of cancer patients, but 9 SDoH had a low extraction rate (<70% of cancer patients). The SODA package with pre-trained transformer models is publicly available at https://github.com/uf-hobiinformatics-lab/SDoH_SODA.
Deep neural networks (DNNs) have rapidly become a \textit{de facto} choice for medical image understanding tasks. However, DNNs are notoriously fragile to class imbalance in image classification. We further point out that such imbalance fragility can be amplified in more sophisticated tasks such as pathology localization, where imbalances can take highly complex and often implicit forms. For example, different pathologies can have different sizes or colors (w.r.t. the background), different underlying demographic distributions, and in general different difficulty levels to recognize, even in a meticulously curated balanced distribution of training data. In this paper, we propose to use pruning to automatically and adaptively identify \textit{hard-to-learn} (HTL) training samples, and to improve pathology localization by attending to them explicitly during training in \textit{supervised, semi-supervised, and weakly-supervised} settings. Our main inspiration is drawn from the recent finding that deep classification models have difficult-to-memorize samples and that those may be effectively exposed through network pruning \cite{hooker2019compressed} - and we extend this observation beyond classification for the first time. We also present a demographic analysis which illustrates the ability of HTLs to capture complex demographic imbalances. Our extensive experiments on the Skin Lesion Localization task in multiple training settings show that paying additional attention to HTLs improves localization performance significantly, by $\sim$2-3\%.
Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies have achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LMs) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LMs and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition tasks, respectively.
AI-powered medical imaging has recently attracted enormous attention due to its ability to provide fast-paced healthcare diagnoses. However, it usually suffers from a lack of high-quality datasets due to high annotation cost, inter-observer variability, human annotator error, and errors in computer-generated labels. Deep learning models trained on datasets with noisy labels are sensitive to the noise type and generalize poorly to unseen samples. To address this challenge, we propose a Robust Stochastic Knowledge Distillation (RoS-KD) framework, which mimics the notion of learning a topic from multiple sources to deter the learning of noisy information. More specifically, RoS-KD learns a smooth, well-informed, and robust student manifold by distilling knowledge from multiple teachers trained on overlapping subsets of training data. Our extensive experiments on popular medical imaging classification tasks (cardiopulmonary disease and lesion classification) using real-world datasets show the performance benefit of RoS-KD, its ability to distill knowledge from many popular large networks (ResNet-50, DenseNet-121, MobileNet-V2) into a comparatively small network, and its robustness to adversarial attacks (PGD, FGSM). More specifically, RoS-KD achieves >2% and >4% improvement in F1-score for lesion classification and cardiopulmonary disease classification tasks, respectively, over a recent competitive knowledge distillation baseline when the underlying student is ResNet-18. Additionally, on the cardiopulmonary disease classification task, RoS-KD outperforms most of the SOTA baselines by a ~1% gain in AUC score.
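Simultaneous localization and mapping (SLAM) based on laser sensors has been widely adopted by mobile robots and autonomous vehicles. These SLAM systems are required to support accurate localization with limited computational resources. In particular, point cloud registration, i.e., the process of matching and aligning multiple LiDAR scans collected at multiple locations in a global coordinate frame, is regarded as the bottleneck step in SLAM. In this paper, we propose a feature filtering algorithm, PFilter, that can filter out invalid features and thus greatly alleviate this bottleneck. Meanwhile, the overall registration accuracy is also improved thanks to the carefully curated feature points. We integrate PFilter into the well-established scan-to-map LiDAR odometry framework F-LOAM and evaluate its performance on the KITTI dataset. The experimental results show that PFilter can remove about 48.4% of the points in the local feature map and reduce the feature points in each scan by 19.3% on average, saving 20.9% of the processing time per frame. At the same time, we improve the accuracy by 9.4%.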
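Imaging exams, such as chest radiography, yield a small set of common findings and a set of rare findings. While a trained radiologist can learn the visual presentation of rare conditions by studying a few representative examples, teaching a machine to learn from such a "long-tailed" distribution is much more difficult, as standard methods are easily biased toward the most frequent classes. In this paper, we present a comprehensive benchmark study of the long-tailed learning problem in the specific domain of thorax diseases on chest X-rays. We focus on learning from naturally distributed chest X-ray data, optimizing classification accuracy over not only the common "head" classes but also the rare yet critical "tail" classes. To this end, we introduce a challenging new long-tailed chest X-ray benchmark to facilitate the development of long-tailed learning methods for medical image classification. The benchmark consists of two chest X-ray datasets for 19- and 20-way thorax disease classification, with classes containing as many as 53,000 and as few as 7 labeled training images. We evaluate both standard and state-of-the-art long-tailed learning methods on this new benchmark, analyze which aspects of these methods are most beneficial for long-tailed medical image classification, and summarize insights for future algorithm design. The datasets, trained models, and code are available at https://github.com/vita-group/longtailcxr.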
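We devise neuro-dynamic state estimation (Neuro-DSE), a learning-based dynamic state estimation (DSE) algorithm for networked microgrids (NMs) under unknown subsystems. Our contributions include: 1) a data-driven Neuro-DSE algorithm for NMs DSE with partially unidentified dynamic models, which incorporates neural ordinary differential equations (ODE-Net) into Kalman filters; 2) a self-refining Neuro-DSE algorithm (Neuro-DSE+), which enables data-driven DSE under limited and noisy measurements by establishing an automatic filtering, augmenting, and correcting framework; 3) a Neuro-KalmanNet-DSE algorithm, which further integrates KalmanNet with Neuro-DSE to relieve the model mismatch of both neural- and physics-based dynamic models; and 4) an augmented Neuro-DSE for joint estimation of NMs states and unknown parameters (e.g., inertia). Extensive case studies demonstrate the efficacy of Neuro-DSE and its variants under different noise levels, control modes, power sources, observabilities, and model knowledge.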
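This paper proposes a novel Unified Feature Optimization (UFO) paradigm for training and deploying deep models under real-world and large-scale scenarios, which require a collection of multiple AI functions. UFO aims to benefit each individual task through large-scale pretraining on all tasks. Compared with the well-known foundation models, UFO has two different points of emphasis, i.e., relatively smaller model size and no adaptation cost: 1) UFO squeezes a wide range of tasks into a moderate-sized unified model in a multi-task learning manner and further trims the model size when transferred to downstream tasks. 2) UFO does not emphasize transfer to novel tasks. Instead, it aims to make the trimmed model dedicated to one or more already-seen tasks. With these two characteristics, UFO provides great convenience for flexible deployment while maintaining the benefits of large-scale pretraining. A key merit of UFO is that the trimming process not only reduces the model size and inference cost but also improves the accuracy on certain tasks. Specifically, UFO considers multi-task training, which has a two-fold impact on the unified model: some closely related tasks benefit each other, while some tasks conflict with each other. UFO manages to reduce the conflicts and preserve the mutual benefits through a novel Network Architecture Search (NAS) method. Experiments on a wide range of deep representation learning tasks (i.e., face recognition, person re-identification, vehicle re-identification, and product retrieval) show that the model trimmed from UFO achieves higher accuracy than its single-task-trained counterpart while having a smaller model size, validating the concept of UFO. In addition, UFO also supported the release of a 17-billion-parameter computer vision (CV) foundation model, which is the largest CV model in the industry.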