Temporal Action Localization (TAL) methods typically operate on feature sequences from a frozen snippet encoder pretrained on the Trimmed Action Classification (TAC) task, which results in a task discrepancy problem. Existing TAL methods mitigate this issue either by retraining the encoder with a pretext task or by end-to-end fine-tuning, but both approaches commonly incur high memory and computation costs. In this work, we introduce the Soft-Landing (SoLa) strategy, an efficient yet effective framework that bridges the transferability gap between the pretrained encoder and downstream tasks by placing a lightweight neural network, i.e., a SoLa module, on top of the frozen encoder. We also propose an unsupervised training scheme for the SoLa module: it learns through inter-frame Similarity Matching, which uses the frame interval as its supervisory signal and thus eliminates the need for temporal annotations. Experimental evaluation on various benchmarks for downstream TAL tasks shows that our method effectively alleviates the task discrepancy problem with remarkable computational efficiency.
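To make the idea of using the frame interval as a supervisory signal concrete, the following minimal sketch builds a target similarity matrix that depends only on the distance between frame indices and scores frame features against it. The linear decay, the `window` parameter, the squared-error loss, and all function names are assumptions for illustration; the abstract does not specify any of them.

```python
import math


def target_similarity(num_frames, window):
    """Target similarity that decays linearly with the frame interval (assumed form)."""
    return [
        [max(0.0, 1.0 - abs(i - j) / window) for j in range(num_frames)]
        for i in range(num_frames)
    ]


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def similarity_matching_loss(features, window):
    """Mean squared gap between feature similarities and interval-based targets.

    No temporal annotation is used: the targets come from frame indices alone.
    """
    n = len(features)
    target = target_similarity(n, window)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += (cosine(features[i], features[j]) - target[i][j]) ** 2
    return total / (n * n)
```

In such a setup, a lightweight module on top of the frozen encoder would be trained to lower this loss, so that features of nearby frames end up more similar than features of distant frames.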
translated by Google Translate
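The MINSU (Mobile Inventory and Scanning Unit) algorithm uses computer vision analysis to record the residual quantity, or fullness, of a cabinet. To do so, it follows a five-step method: object detection, foreground subtraction, K-means clustering, percentage estimation, and counting. The input image first passes through object detection, which locates the cabinet in terms of coordinates. Foreground subtraction then removes the background so that the image focuses on the cabinet itself (some manual work may be required, such as selecting parts that the algorithm failed to cut out). In the K-means clustering step, the multi-colored image is reduced to a three-color image for faster and more accurate analysis. Finally, the image goes through percentage estimation and counting. In these two steps, the proportion of material inside the cabinet is computed as a percentage, which is then used to approximate the quantity of material inside. If the project succeeds, residual quantity management can solve the problem raised earlier in the introduction.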
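The last three steps (K-means clustering, percentage estimation, and counting) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the farthest-point initialization, the choice of which cluster represents the material, and the `capacity` parameter are all hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_colors(pixels, k=3, iterations=10):
    """Reduce RGB pixels to k colors with plain Lloyd iterations."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [pixels[0]]
    while len(centers) < k:
        centers.append(max(pixels, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(pixels)
    for _ in range(iterations):
        # Assignment step: each pixel joins its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in pixels]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(pixels, labels) if label == c]
            if members:
                centers[c] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return centers, labels


def fill_percentage(labels, material_label):
    """Percentage of pixels assigned to the cluster assumed to be the material."""
    return 100.0 * sum(1 for label in labels if label == material_label) / len(labels)


def estimate_count(percentage, capacity):
    """Approximate item count from the fill percentage and a hypothetical full capacity."""
    return round(percentage / 100.0 * capacity)
```

For example, if a quarter of the cropped cabinet pixels fall into the material cluster and the cabinet is assumed to hold 40 items when full, `estimate_count(25.0, 40)` gives 10.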
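Semi-supervised domain adaptation (SSDA) adapts a learner to a new domain using only a small set of labeled samples from that domain, given a labeled dataset on a source domain. In this paper, we propose a pair-based SSDA method that adapts a model to the target domain using self-distillation with sample pairs. Each sample pair consists of a teacher sample from a labeled dataset (i.e., source or labeled target) and a student sample from an unlabeled dataset (i.e., unlabeled target). Our method generates an assistant feature by transferring an intermediate style between the teacher and the student, and then trains the model by minimizing the output discrepancy between the student and the assistant. During training, the assistants gradually bridge the discrepancy between the two domains, allowing the student to learn easily from the teacher. Experimental evaluation on standard benchmarks shows that our method effectively reduces both inter-domain and intra-domain discrepancies, achieving significant improvements over recent methods.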
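Conditional Generative Adversarial Networks (cGANs) generate realistic images by incorporating class information into a GAN. While one of the most popular cGANs is the auxiliary classifier GAN (ACGAN), it is widely known that training ACGAN becomes challenging as the number of classes in the dataset increases. ACGAN also tends to generate easily classifiable samples that lack diversity. In this paper, we introduce two cures for ACGAN. First, we identify that gradient exploding in the classifier can cause an undesirable collapse in early training, and that projecting input vectors onto a unit hypersphere resolves the problem. Second, we propose the Data-to-Data Cross-Entropy loss (D2D-CE) to exploit relational information in the class-labeled dataset. On this foundation, we propose the Rebooted Auxiliary Classifier Generative Adversarial Network (ReACGAN). Experimental results show that ReACGAN achieves state-of-the-art generation results on the CIFAR10, Tiny-ImageNet, CUB200, and ImageNet datasets. We also verify that ReACGAN benefits from differentiable augmentations and that D2D-CE harmonizes with the StyleGAN2 architecture. Model weights and a software package that provides implementations of representative cGANs and all experiments in our paper are available at https://github.com/postech-cvlab/pytorch-studiogan.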
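The first cure, projecting vectors onto a unit hypersphere, amounts to L2 normalization. The sketch below shows why this bounds the classifier's inputs regardless of their original scale. The function names and the cosine-style logits against per-class vectors are illustrative assumptions, not the paper's exact formulation, and D2D-CE itself is not reproduced here.

```python
import math


def project_to_unit_hypersphere(v, eps=1e-12):
    """Scale a vector to unit L2 norm so that its magnitude cannot blow up."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def cosine_logits(embedding, class_vectors):
    """Logits as dot products of normalized vectors (assumed classifier form).

    Each logit lies in [-1, 1] whatever the scale of the raw embedding.
    """
    e = project_to_unit_hypersphere(embedding)
    return [
        sum(a * b for a, b in zip(e, project_to_unit_hypersphere(w)))
        for w in class_vectors
    ]
```

An embedding of `[300.0, 400.0]` and one of `[3.0, 4.0]` produce the same logits here, which illustrates how normalization removes the norm as a source of large gradients.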
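We propose a novel high-fidelity expressive speech synthesis model, UniTTS, that learns and controls overlapping style attributes while avoiding interference. UniTTS represents multiple style attributes in a single unified embedding space as the residuals between the phoneme embeddings before and after the attributes are applied. The proposed method is especially effective in controlling multiple attributes that are difficult to separate cleanly, such as speaker ID and emotion, because it minimizes redundancy when adding the variance of speaker ID and emotion and, in addition, predicts duration, pitch, and energy based on speaker ID and emotion. In experiments, the visualization results show that the proposed method learns multiple attributes in a manner that allows them to be separated easily. UniTTS also synthesizes high-fidelity speech signals while controlling multiple style attributes. The synthesized speech samples are available at https://jackson-kang.github.io/pake_works/unitts/demos.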
An unbiased scene graph generation (SGG) algorithm, referred to as Skew Class-balanced Re-weighting (SCR), is proposed to address the biased predicate predictions caused by the long-tailed distribution. Prior works focus mainly on alleviating the deteriorating performance of minority predicate predictions, but they show drastically dropping recall scores, i.e., they lose majority predicate performance. The trade-off between majority and minority predicate performance on the limited SGG datasets has not yet been correctly analyzed. In this paper, to alleviate the issue, the SCR loss function is considered for unbiased SGG models. Leveraging the skewness of biased predicate predictions, SCR estimates the target predicate weight coefficient and then assigns more weight to the biased predicates for a better trade-off between the majority and minority predicates. Extensive experiments on the standard Visual Genome dataset and Open Image V4 & V6 show the performance and generality of SCR with traditional SGG models.
In the field of cross-modal retrieval, single encoder models tend to perform better than dual encoder models, but they suffer from high latency and low throughput. In this paper, we present a dual encoder model called BagFormer that uses a cross-modal interaction mechanism to improve recall performance without sacrificing latency and throughput. BagFormer achieves this through bag-wise interactions, which transform text to a more appropriate granularity and incorporate entity knowledge into the model. Our experiments demonstrate that BagFormer achieves results comparable to state-of-the-art single encoder models in cross-modal retrieval tasks, while also offering efficient training and inference with 20.72 times lower latency and 25.74 times higher throughput.
Deep learning has been widely used for protein engineering. However, it is limited by the lack of sufficient experimental data to train an accurate model for predicting the functional fitness of high-order mutants. Here, we develop SESNet, a supervised deep-learning model that predicts the fitness of protein mutants by leveraging both sequence and structure information and by exploiting an attention mechanism. Our model integrates the local evolutionary context from homologous sequences, the global evolutionary context encoding rich semantics from the universal protein sequence space, and the structure information accounting for the microenvironment around each residue in a protein. We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship on 26 deep mutational scanning datasets. More importantly, we propose a data augmentation strategy that leverages data from unsupervised models to pre-train our model. After that, our model achieves strikingly high accuracy in predicting the fitness of protein mutants, especially for higher-order variants (>4 mutation sites), when fine-tuned using only a small number of experimental mutation data points (<50). The proposed strategy is of great practical value because the required experimental effort, i.e., producing a few tens of experimental mutation data points on a given protein, is generally affordable for an ordinary biochemical group and can be applied to almost any protein.
Three-dimensional (3D) ultrasound imaging has been applied to scoliosis assessment, but the current assessment method uses only the coronal projection image and cannot illustrate the 3D deformity and vertebra rotation. Vertebra detection is essential to reveal 3D spine information, but the detection task is challenging due to complex data and limited annotations. We propose VertMatch, a two-step framework that detects vertebral structures in 3D ultrasound volumes by utilizing unlabeled data in a semi-supervised manner. The first step detects the possible positions of structures on each transverse slice globally, and local patches are then cropped based on the detected positions. The second step determines whether the patches contain real vertebral structures and screens the predicted positions from the first step. VertMatch develops three novel components for semi-supervised learning. For position detection in the first step, (1) an anatomical prior is used to screen pseudo labels generated by the confidence threshold method, and (2) multi-slice consistency is used to exploit more unlabeled data by inputting multiple adjacent slices. For patch identification in the second step, (3) the categories are rebalanced in each batch to solve the imbalance problem. Experimental results demonstrate that VertMatch detects vertebrae accurately in ultrasound volumes and outperforms state-of-the-art methods. VertMatch is also validated in a clinical application on forty ultrasound scans, and it can be a promising approach for 3D assessment of scoliosis.
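Two of the components named above, confidence-threshold pseudo labels and per-batch category rebalancing, can be illustrated with a short sketch. The threshold value, the equal positive/negative split, and the function names are assumptions; the anatomical prior and multi-slice consistency are not modeled here.

```python
import random


def select_pseudo_labels(predictions, threshold=0.9):
    """Keep (index, position) pairs whose confidence meets the threshold.

    `predictions` is a list of (position, confidence) tuples for unlabeled slices.
    """
    return [
        (i, position)
        for i, (position, confidence) in enumerate(predictions)
        if confidence >= threshold
    ]


def balanced_batch(positives, negatives, batch_size, rng):
    """Sample a batch with an equal share of each category (assumed 50/50 split)."""
    half = batch_size // 2
    batch = [rng.choice(positives) for _ in range(half)]
    batch += [rng.choice(negatives) for _ in range(batch_size - half)]
    rng.shuffle(batch)
    return batch
```

Sampling with replacement lets the rarer category fill its share of the batch even when it has far fewer patches than the other.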
Image captioning is one of the straightforward tasks that can take advantage of large-scale web-crawled data, which provides rich knowledge about the visual world for a captioning model. However, since web-crawled data contains image-text pairs that are aligned at different levels, the inherent noise (e.g., misaligned pairs) makes it difficult to learn a precise captioning model. While a filtering strategy can effectively remove noisy data, it reduces the amount of learnable knowledge and sometimes introduces a new problem of data deficiency. To get the best of both worlds, we propose a noise-aware learning framework that learns rich knowledge from the whole web-crawled dataset while being less affected by the noise. This is achieved by the proposed quality-controllable model, which is trained using the alignment levels of the image-text pairs as an additional control signal. The alignment-conditioned training allows the model to generate high-quality, well-aligned captions by simply setting the control signal to the desired alignment level at inference time. Through in-depth analysis, we show that our controllable captioning model is effective in handling noise. In addition, on two tasks, zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we demonstrate that our model produces high-quality captions in terms of descriptiveness and distinctiveness. Code is available at https://github.com/kakaobrain/noc.
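One way to picture alignment-conditioned training is to bucket a per-pair alignment score into discrete levels and attach the level to each training caption as a control token. The sketch below assumes a precomputed score in [0, 1], uniform buckets, and a hypothetical `<align_k>` token format; none of these details come from the abstract.

```python
def alignment_level(score, num_levels=4):
    """Map an alignment score in [0, 1] to a discrete level (uniform buckets assumed)."""
    score = min(max(score, 0.0), 1.0)
    return min(int(score * num_levels), num_levels - 1)


def build_training_example(caption, score, num_levels=4):
    """Prefix the caption with its alignment level as a control token."""
    level = alignment_level(score, num_levels)
    return f"<align_{level}> {caption}"


def inference_prefix(num_levels=4):
    """At inference time, request the highest alignment level."""
    return f"<align_{num_levels - 1}>"
```

With this scheme no pair is filtered out: noisy pairs still contribute to training but are tagged with a low level, while generation is conditioned on the top level.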