Student evaluations of teaching (SETs) are widely used in colleges and universities. Typically, SET results are summarized for instructors in a static PDF report. The report often includes summary statistics for quantitative ratings and an unsorted list of open-ended student comments. The lack of organization and summarization of the raw comments hinders those reading the reports from fully utilizing the informative feedback, making accurate inferences, and designing appropriate instructional improvements. In this work, we introduce a novel system, SETSum, that leverages sentiment analysis, aspect extraction, summarization, and visualization techniques to provide organized illustrations of SET findings to instructors and other reviewers. Ten university professors from diverse departments served as evaluators of the system, and all agreed that SETSum helps them interpret SET results more efficiently; six out of ten instructors preferred our system over the standard static PDF report (while the remaining four would like to have both). This suggests that our work has the potential to reform SET reporting conventions in the future. Our code is available at https://github.com/evahuyn/setsum
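The comment-organization step described above can be approximated with off-the-shelf NLP components. Below is a minimal sketch, not the authors' SETSum pipeline: the sentiment model is whatever the Hugging Face default resolves to, and the keyword-based aspect lists are purely illustrative assumptions.

```python
from collections import defaultdict
from transformers import pipeline

# Illustrative sentiment classifier; SETSum's actual models may differ.
sentiment = pipeline("sentiment-analysis")

# Hypothetical keyword-based aspect lists; a real system would use a learned aspect extractor.
ASPECT_KEYWORDS = {
    "lectures": ["lecture", "slides", "explanation"],
    "assignments": ["homework", "assignment", "project"],
    "grading": ["grade", "grading", "exam"],
}

def organize_comments(comments):
    """Group open-ended SET comments by aspect and attach a sentiment label."""
    grouped = defaultdict(list)
    for text in comments:
        label = sentiment(text)[0]["label"]  # e.g. "POSITIVE" / "NEGATIVE"
        matched = [a for a, kws in ASPECT_KEYWORDS.items()
                   if any(k in text.lower() for k in kws)] or ["other"]
        for aspect in matched:
            grouped[aspect].append((label, text))
    return grouped

comments = ["The lectures were clear and engaging.",
            "Homework took far too long every week."]
for aspect, items in organize_comments(comments).items():
    print(aspect, items)
```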
We propose the tensorizing flow method for estimating high-dimensional probability density functions from the observed data. The method is based on tensor-train and flow-based generative modeling. Our method first efficiently constructs an approximate density in the tensor-train form via solving the tensor cores from a linear system based on the kernel density estimators of low-dimensional marginals. We then train a continuous-time flow model from this tensor-train density to the observed empirical distribution by performing a maximum likelihood estimation. The proposed method combines the optimization-less feature of the tensor-train with the flexibility of the flow-based generative models. Numerical results are included to demonstrate the performance of the proposed method.
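The maximum-likelihood step can be made concrete with the standard continuous-time flow identity; the sketch below is the generic formulation, with the tensor-train density written as p_TT in place of the usual Gaussian base, and is not necessarily the paper's exact notation.

```latex
% Continuous-time flow \dot{z}(t) = f_\theta(z(t), t), with z(t_0) distributed
% according to the tensor-train density p_{TT} and z(t_1) = x the observed sample.
\begin{aligned}
\log p_\theta(x) &= \log p_{\mathrm{TT}}\!\bigl(z(t_0)\bigr)
  - \int_{t_0}^{t_1} \operatorname{tr}\!\left(\frac{\partial f_\theta}{\partial z(t)}\right) dt, \\
\hat{\theta} &= \arg\max_{\theta}\ \frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i).
\end{aligned}
```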
We present a new lighting estimation and editing framework to generate high-dynamic-range (HDR) indoor panorama lighting from a single limited field-of-view (LFOV) image captured by a low-dynamic-range (LDR) camera. Existing lighting estimation methods either directly regress lighting representation parameters or decompose this problem into LFOV-to-panorama and LDR-to-HDR lighting generation sub-tasks. However, due to partial observation, the high-dynamic-range lighting, and the intrinsic ambiguity of a scene, lighting estimation remains a challenging task. To address this problem, we propose a coupled dual-StyleGAN panorama synthesis network (StyleLight) that integrates LDR and HDR panorama synthesis into a unified framework. The LDR and HDR panorama synthesis share a similar generator but have separate discriminators. During inference, given an LDR LFOV image, we propose a focal-masked GAN inversion method to find its latent code via the LDR panorama synthesis branch and then synthesize the HDR panorama via the HDR panorama synthesis branch. StyleLight brings LFOV-to-panorama and LDR-to-HDR lighting generation into a unified framework and thus greatly improves lighting estimation. Extensive experiments demonstrate that our framework achieves superior performance over state-of-the-art methods on indoor lighting estimation. Notably, StyleLight also enables intuitive lighting editing on indoor HDR panoramas, which is suitable for real-world applications. Code is available at https://style-light.github.io.
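The focal-masked GAN inversion described here amounts to optimizing a latent code so that the synthesized LDR panorama matches the observed LFOV region. A minimal PyTorch sketch under assumptions: `generator`, its `latent_dim` attribute, the warped input, and the loss choice are hypothetical placeholders rather than StyleLight's actual interface.

```python
import torch
import torch.nn.functional as F

def invert_lfov(generator, lfov_panorama, mask, steps=500, lr=0.05):
    """Find a latent code whose LDR panorama matches the observed LFOV region.

    generator     -- hypothetical LDR panorama synthesis branch, latent -> panorama
    lfov_panorama -- observed LFOV image warped onto the panorama canvas
    mask          -- binary mask of the observed (focal) region of the panorama
    """
    latent = torch.zeros(1, generator.latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        pred = generator(latent)                             # synthesized LDR panorama
        loss = F.l1_loss(pred * mask, lfov_panorama * mask)  # focal-masked reconstruction
        loss.backward()
        optimizer.step()
    return latent.detach()

# The recovered latent is then fed to the HDR synthesis branch to obtain HDR lighting.
```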
Curriculum learning has begun to thrive in the speech enhancement area, decoupling the original spectrum estimation task into multiple easier sub-tasks to achieve better performance. Motivated by this, we propose a dual-branch attention-in-attention transformer, dubbed DB-AIAT, to handle both coarse- and fine-grained regions of the spectrum in parallel. From a complementary perspective, a magnitude masking branch is proposed to coarsely estimate the overall magnitude spectrum, while a complex refining branch is elaborately designed in parallel to compensate for the missing spectral details and implicitly derive phase information. Within each branch, we propose a novel attention-in-attention transformer-based module to replace the conventional RNNs and temporal convolutional networks for temporal sequence modeling. Specifically, the proposed attention-in-attention transformer consists of adaptive time-frequency attention transformer blocks and an adaptive hierarchical attention module, aiming to capture long-term time-frequency dependencies and further aggregate global hierarchical contextual information. Experimental results on VoiceBank + DEMAND demonstrate that DB-AIAT yields state-of-the-art performance (e.g., 3.31 PESQ, 95.6% STOI and 10.79 dB SSNR) over previous advanced systems with a relatively small model size (2.81M parameters).
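The fusion of the two branches' outputs in such decoupling-style systems can be sketched as follows; the tensor layout and the exact combination rule are assumptions based on the description above, not DB-AIAT's reference code.

```python
import torch

def fuse_branches(noisy_spec, mag_mask, complex_refine):
    """Combine a coarse magnitude estimate with a complex refinement.

    noisy_spec     -- complex STFT of the noisy speech, shape (B, F, T)
    mag_mask       -- magnitude mask from the masking branch, real, same shape
    complex_refine -- residual complex spectrum from the refining branch, complex
    """
    # Coarse estimate: masked magnitude coupled with the noisy phase.
    coarse = torch.polar(mag_mask * noisy_spec.abs(), noisy_spec.angle())
    # Fine estimate: add the implicitly learned complex residual.
    return coarse + complex_refine
```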
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) Using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over from-scratch MIM pre-training on ImageNet-1K classification for the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
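Finding 1), distilling token relations, can be illustrated with a minimal loss over query-key relation maps; the temperature and the KL direction below are illustrative assumptions, not TinyMIM's exact recipe.

```python
import torch
import torch.nn.functional as F

def token_relation_kd_loss(q_s, k_s, q_t, k_t, tau=1.0):
    """KL divergence between student and teacher token-to-token relation maps.

    q_*, k_* -- query/key token features, shape (B, heads, N, d); student and
    teacher may use different d, but must share the token count N.
    """
    rel_s = torch.softmax(q_s @ k_s.transpose(-2, -1) / (q_s.shape[-1] ** 0.5 * tau), dim=-1)
    rel_t = torch.softmax(q_t @ k_t.transpose(-2, -1) / (q_t.shape[-1] ** 0.5 * tau), dim=-1)
    # KL(teacher || student) over each token's relation distribution.
    return F.kl_div(rel_s.clamp_min(1e-8).log(), rel_t, reduction="batchmean")
```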
The recent increase in public and academic interest in preserving biodiversity has led to the growth of the field of conservation technology. This field involves designing and constructing tools that utilize technology to aid in the conservation of wildlife. In this article, we will use case studies to demonstrate the importance of designing conservation tools with human-wildlife interaction in mind and provide a framework for creating successful tools. These case studies include a range of complexities, from simple cat collars to machine learning and game theory methodologies. Our goal is to introduce and inform current and future researchers in the field of conservation technology and provide references for educating the next generation of conservation technologists. Conservation technology not only has the potential to benefit biodiversity but also has broader impacts on fields such as sustainability and environmental protection. By using innovative technologies to address conservation challenges, we can find more effective and efficient solutions to protect and preserve our planet's resources.
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
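The style-aware adaptation idea, a style code adjusting the feed-forward computation, can be sketched with a simplified FiLM-style modulation; StyleTalk's actual adapter combines several weight sets, so treat the module below as an assumption-laden stand-in rather than the paper's design.

```python
import torch
import torch.nn as nn

class StyleAdaptiveFeedForward(nn.Module):
    """Feed-forward block whose hidden activations are modulated by a style code.

    Simplified stand-in: the style code predicts a per-channel scale and shift
    for the hidden layer instead of re-weighting multiple weight sets.
    """
    def __init__(self, dim, hidden_dim, style_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.to_scale_shift = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x, style_code):
        # x: (B, N, dim) token features; style_code: (B, style_dim)
        scale, shift = self.to_scale_shift(style_code).chunk(2, dim=-1)
        h = torch.relu(self.fc1(x))
        h = h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # style modulation
        return self.fc2(h)
```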
Decompilation aims to transform a low-level programming language (LPL) (e.g., a binary file) into its functionally equivalent high-level programming language (HPL) (e.g., C/C++). It is a core technology in software security, especially in vulnerability discovery and malware analysis. In recent years, with the successful application of neural machine translation (NMT) models in natural language processing (NLP), researchers have tried to build neural decompilers by borrowing the idea of NMT. They formulate the decompilation process as a translation problem between LPL and HPL, aiming to reduce the human cost required to develop decompilation tools and improve their generalizability. However, state-of-the-art learning-based decompilers do not cope well with compiler-optimized binaries. Since real-world binaries are mostly compiler-optimized, decompilers that do not consider optimized binaries have limited practical significance. In this paper, we propose a novel learning-based approach named NeurDP that targets compiler-optimized binaries. NeurDP uses a graph neural network (GNN) model to convert LPL to an intermediate representation (IR), which bridges the gap between source code and optimized binary. We also design an Optimized Translation Unit (OTU) to split functions into smaller code fragments for better translation performance. Evaluation results on datasets containing various types of statements show that NeurDP can decompile optimized binaries with 45.21% higher accuracy than state-of-the-art neural decompilation frameworks.
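The GNN component operates on a graph built over the binary's instructions; a generic message-passing layer of the kind such a model might use is sketched below (the node features, edge construction, and layer design are assumptions, not NeurDP's architecture).

```python
import torch
import torch.nn as nn

class InstructionGNNLayer(nn.Module):
    """One round of message passing over a binary's instruction graph."""
    def __init__(self, dim):
        super().__init__()
        self.self_proj = nn.Linear(dim, dim)
        self.neigh_proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x:   (N, dim) instruction embeddings
        # adj: (N, N) adjacency, e.g. control/data-flow edges between instructions
        deg = adj.sum(dim=-1, keepdim=True).clamp_min(1.0)
        neigh = (adj @ x) / deg  # mean over neighboring instructions
        return torch.relu(self.self_proj(x) + self.neigh_proj(neigh))
```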
Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
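The GRN layer admits a compact implementation: aggregate a global per-channel L2 response, normalize it divisively across channels, then calibrate with learnable gamma/beta and a residual. The sketch below follows that description and should be read as an approximation rather than the reference code.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization over channel-last features (N, H, W, C)."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)    # global response per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)      # divisive channel normalization
        return self.gamma * (x * nx) + self.beta + x          # calibration + residual
```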
In this paper, we propose a novel framework dubbed peer learning to deal with the problem of biased scene graph generation (SGG). This framework uses predicate sampling and consensus voting (PSCV) to encourage different peers to learn from each other, improving model diversity and mitigating bias in SGG. To address the heavily long-tailed distribution of predicate classes, we propose to use predicate sampling to divide and conquer this issue. As a result, the model is less biased and makes more balanced predicate predictions. Specifically, one peer may not be sufficiently diverse to discriminate between different levels of predicate distributions. Therefore, we sample the data distribution based on frequency of predicates into sub-distributions, selecting head, body, and tail classes to combine and feed to different peers as complementary predicate knowledge during the training process. The complementary predicate knowledge of these peers is then ensembled utilizing a consensus voting strategy, which simulates a civilized voting process in our society that emphasizes the majority opinion and diminishes the minority opinion. This approach ensures that the learned representations of each peer are optimally adapted to the various data distributions. Extensive experiments on the Visual Genome dataset demonstrate that PSCV outperforms previous methods. We have established a new state-of-the-art (SOTA) on the SGCls task by achieving a mean of 31.6.
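The consensus voting step, ensembling complementary peers while emphasizing the majority opinion, could look roughly like the sketch below; the specific vote weighting is an assumption, not PSCV's published rule.

```python
import torch

def consensus_vote(peer_logits):
    """Ensemble predicate predictions from several peers by consensus.

    peer_logits -- list of (B, num_predicates) logit tensors, one per peer
    """
    probs = torch.stack([torch.softmax(l, dim=-1) for l in peer_logits])  # (P, B, C)
    votes = torch.nn.functional.one_hot(probs.argmax(dim=-1),
                                        num_classes=probs.shape[-1]).float()
    # Emphasize predicates that most peers agree on, weighted by average confidence.
    consensus = votes.mean(dim=0) * probs.mean(dim=0)
    return consensus.argmax(dim=-1)
```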