Generalist models, which can perform diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Although they are a promising route toward general-purpose AI, existing generalist models are still at an early stage, with limited modality and task coverage. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively, even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly diverse example tasks in OFASys, with which we also develop a first-of-its-kind single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves, on average, 95% of the performance of 15 task-finetuned models with only 16% of their parameters, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys
translated by 谷歌翻译
In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech and text data, where speech-pseudo-codes pairs and phoneme-text pairs are a supplement to the supervised speech-text pairs. To train the encoder to learn better speech representation, we introduce self-supervised masked speech prediction (MSP) and supervised phoneme prediction (PP) tasks to learn to map speech into phonemes. Besides, we directly add the downstream supervised speech-to-text (S2T) task into the pre-training process, which can further improve the pre-training performance and achieve better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
Diffusion models, which learn to reverse a signal destruction process to generate new data, typically require the signal at each step to have the same dimension. We argue that, given the spatial redundancy in image signals, there is no need to maintain a high dimensionality throughout the evolution process, especially in the early generation phase. To this end, we make a theoretical generalization of the forward diffusion process via signal decomposition. Concretely, we decompose an image into multiple orthogonal components and control the attenuation of each component when perturbing the image. In this way, as the noise strength increases, we can diminish the inconsequential components and thus use a lower-dimensional signal to represent the source while losing barely any information. Such a reformulation allows the dimensionality to vary in both training and inference of diffusion models. Extensive experiments on a range of datasets suggest that our approach substantially reduces the computational cost and achieves on-par or even better synthesis performance compared to baseline methods. We also show that our strategy facilitates high-resolution image synthesis and improves the FID of a diffusion model trained on FFHQ at $1024\times1024$ resolution from 52.40 to 10.46. Code and models will be made publicly available.
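The decomposition idea in this abstract can be illustrated with a toy example. The sketch below is not the authors' method: it assumes a 1-D signal, a block-mean/residual split as the orthogonal decomposition, and made-up attenuation schedules. All function names and schedules are hypothetical.

```python
import random

def decompose(signal, block=2):
    """Split a 1-D signal into a block-mean (coarse) part and a residual (detail) part.

    The two parts are orthogonal: the residual sums to zero inside every block,
    while the coarse part is constant inside every block.
    """
    coarse, detail = [], []
    for start in range(0, len(signal), block):
        chunk = signal[start:start + block]
        mean = sum(chunk) / len(chunk)
        coarse.extend([mean] * len(chunk))
        detail.extend([value - mean for value in chunk])
    return coarse, detail

def perturb(signal, t, block=2, rng=None):
    """Toy forward step at time t in [0, 1] with per-component attenuation.

    Assumed schedules (illustrative only): the detail component fades out by
    t = 0.5, the coarse component fades more slowly, and the noise scale is t.
    """
    rng = rng or random.Random(0)
    coarse, detail = decompose(signal, block)
    keep_coarse = 1.0 - 0.5 * t
    keep_detail = max(0.0, 1.0 - 2.0 * t)
    sigma = t
    return [keep_coarse * c + keep_detail * d + sigma * rng.gauss(0.0, 1.0)
            for c, d in zip(coarse, detail)]

coarse, detail = decompose([1.0, 3.0, 2.0, 6.0])
inner_product = sum(c * d for c, d in zip(coarse, detail))
```

Once the detail weight reaches zero, the clean part of the signal is fully described by one value per block, which is the sense in which a lower-dimensional representation suffices at high noise levels.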
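On-device machine learning enables lightweight deployment of recommendation models on local clients, which reduces the burden on cloud-based recommenders and simultaneously incorporates more real-time user features. Nevertheless, cloud-based recommendation remains very important in industry, given its powerful model capacity and its efficient candidate generation from billion-scale item pools. Previous attempts to integrate the merits of both paradigms mainly resort to a sequential mechanism, which builds the on-device recommender on top of the cloud-based recommendation. However, such a design is inflexible when user interests change dramatically: the on-device model is stuck with a limited item cache, while the cloud-based recommender, which draws on the large item pool, does not respond without new refresh feedback. To overcome this issue, we propose a meta controller that dynamically manages the collaboration between the on-device recommender and the cloud-based recommender, and we introduce a novel, efficient sample construction from a causal perspective to solve the meta controller's dataset absence problem. Building on the counterfactual samples and extended training, extensive experiments in industrial recommendation scenarios show the promise of the meta controller in device-cloud collaboration.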
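In the past few years, transformer-based pre-trained language models have achieved astounding success in both industry and academia. However, their large model size and high run-time latency are serious obstacles to applying them in practice, especially on mobile phones and Internet of Things (IoT) devices. To compress such models, a considerable body of literature has recently grown up around the theme of knowledge distillation (KD). Nevertheless, how KD works in transformer-based models is still unclear. We tease apart the components of KD and propose a unified KD framework. Through this framework, systematic and extensive experiments consuming more than 23,000 GPU hours analyze KD from the perspectives of knowledge types, matching strategies, the width-depth trade-off, initialization, model size, and more. On pre-trained language models, the results show a relatively significant improvement over the previous state of the art (SOTA). Finally, we provide a best-practice guideline for KD in transformer-based models.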
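For readers unfamiliar with knowledge distillation, the classic soft-label objective can be sketched as below. This is the generic temperature-scaled formulation, shown only as one example of a "knowledge type"; it is not the unified framework proposed in the abstract, and the function names and default temperature are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual convention that keeps gradient magnitudes
    comparable across temperatures.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0)
    return temperature ** 2 * kl

# The loss is zero when the student matches the teacher and positive otherwise.
matched = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
mismatched = distillation_loss([0.0, 0.0, 0.0], [2.0, 0.5, -1.0])
```

In transformer distillation this output-level term is often combined with losses on hidden states or attention maps; which combination works best is the kind of question the abstract's experiments address.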
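The incredible development of federated learning (FL) has benefited various tasks in computer vision and natural language processing, and existing frameworks such as TFF and FATE have made deployment in real-world applications easy. However, federated graph learning (FGL) has not been well supported, even though graph data are prevalent, because of its unique characteristics and requirements. The lack of an FGL-related framework increases the effort needed to carry out reproducible research and to deploy FGL in real-world applications. In this paper, we first discuss the challenges in creating an easy-to-use FGL package and accordingly present our implemented package, FederatedScope-GNN (FS-G), which provides (1) a unified view for modularizing and expressing FGL algorithms; (2) comprehensive data and model collections for out-of-the-box FGL capability; (3) an efficient model auto-tuning component; and (4) off-the-shelf privacy attack and defense abilities. We validate the effectiveness of FS-G through extensive experiments, which also yield many valuable insights about FGL. Moreover, we employ FS-G to serve FGL applications in real-world e-commerce scenarios, where the attained improvements indicate great potential business benefits. We publicly release FS-G, as a submodule of FederatedScope, at https://github.com/alibaba/federatedscope to promote FGL research and to enable broad applications that would otherwise be infeasible for lack of a dedicated package.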
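Although existing federated learning (FL) platforms have made remarkable progress in providing infrastructure for development, they may not cope well with the challenges brought by various kinds of heterogeneity, including heterogeneity in participants' local data, resources, behaviors, and learning goals. To fill this gap, in this paper we propose a novel FL platform named FederatedScope, which employs an event-driven architecture to give users great flexibility to describe the behaviors of different participants independently. Such a design makes it easy for users to describe participants with various local training processes, learning goals, and backends, and to coordinate them into an FL course with synchronous or asynchronous training strategies. Toward an easy-to-use and flexible platform, FederatedScope offers rich types of plug-in operations and components for efficient further development, and we have implemented several important components to better help users with privacy protection, attack simulation, and auto-tuning. We have released FederatedScope at https://github.com/alibaba/federatedscope to promote academic research and industrial deployment of federated learning in a wide range of scenarios.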
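The event-driven idea can be illustrated with a small, self-contained sketch in which each participant registers its own handler per event type. This is not FederatedScope's API: the class, the event names (`model_para`, `updates_ready`), and the toy "training" rule are all hypothetical.

```python
class Participant:
    """Dispatches incoming events to handlers registered per event type."""

    def __init__(self, name, state=None):
        self.name = name
        self.state = state or {}
        self.handlers = {}

    def register(self, event, handler):
        """Attach a behavior to an event; each participant may choose its own."""
        self.handlers[event] = handler

    def receive(self, event, payload):
        handler = self.handlers.get(event)
        if handler is None:
            raise KeyError(f"{self.name} has no handler for event {event!r}")
        return handler(self, payload)

def local_update(participant, global_weight):
    """Stand-in for local training: move halfway toward the local target."""
    return global_weight + 0.5 * (participant.state["target"] - global_weight)

def aggregate(participant, updates):
    """Stand-in for server aggregation: plain average of client updates."""
    return sum(updates) / len(updates)

server = Participant("server")
server.register("updates_ready", aggregate)

clients = []
for name, target in [("client_a", 1.0), ("client_b", 3.0)]:
    client = Participant(name, state={"target": target})
    client.register("model_para", local_update)
    clients.append(client)

# One synchronous round: broadcast the model, collect updates, aggregate.
global_weight = 0.0
updates = [client.receive("model_para", global_weight) for client in clients]
global_weight = server.receive("updates_ready", updates)
```

Because behavior lives in per-participant handlers, a client with a different training procedure or learning goal only needs to register a different function for the same event.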
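We present Baihe, a SysML framework for AI-driven databases. Using Baihe, an existing relational database system can be retrofitted to use learned components for query optimization or for other common tasks, such as learned index structures. To ensure Baihe's practicality and real-world applicability, its high-level architecture is based on the following requirements: separation from the core system, minimal third-party dependencies, robustness, stability, and fault tolerance, as well as configurability. Based on this high-level architecture, we describe a concrete implementation of Baihe for PostgreSQL and present example use cases for learned query optimizers. To serve practitioners as well as researchers in the DB and AI4DB communities, Baihe for PostgreSQL will be released under an open-source license.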
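Cardinality estimation (CardEst), a central component of the query optimizer, plays a significant role in generating high-quality query plans in a DBMS. The CardEst problem has been studied extensively over the past several decades, using both traditional and ML-enhanced methods. However, the hardest problem in CardEst, namely how to estimate the size of a join query over multiple tables, has not been thoroughly solved. Current methods either rely on independence assumptions or apply heavyweight techniques, and their performance is still far from satisfactory. Even worse, existing CardEst methods are usually designed to optimize a single goal, either inference speed or estimation accuracy, and so cannot adapt to different situations. In this paper, we propose a very general framework, called GLUE, to tackle these challenges. Its key idea is to elegantly decouple the correlations across different tables and losslessly merge single-table CardEst results to estimate the join query size. GLUE supports obtaining the single-table CardEst results with any existing CardEst method and can process any complex join schema. It therefore adapts easily to scenarios with different performance requirements, for example OLTP, which needs fast estimation, or OLAP, which needs high estimation accuracy. Meanwhile, we show that GLUE can be seamlessly integrated into the plan search process and can support counting the number of distinct values. All of these properties demonstrate the potential benefits of deploying GLUE in a real-world DBMS.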
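To make concrete why join-size estimation is hard, the sketch below compares the exact size of an equi-join with a textbook estimate built on simplifying assumptions (here, uniformly distributed join keys). It illustrates the kind of baseline the abstract criticizes and is not the GLUE algorithm; the function names and sample data are made up.

```python
from collections import Counter

def true_join_size(left_keys, right_keys):
    """Exact size of an equi-join: sum over keys of the product of frequencies."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    return sum(count * right_counts[key] for key, count in left_counts.items())

def uniform_join_estimate(left_keys, right_keys):
    """Textbook estimate |R| * |S| / max(distinct(R.a), distinct(S.a)).

    Assumes every distinct key is equally frequent, which skewed data violates.
    """
    distinct = max(len(set(left_keys)), len(set(right_keys)))
    return len(left_keys) * len(right_keys) / distinct

# Skewed keys: value 1 dominates both tables, so the estimate is too low.
skewed_left = [1, 1, 1, 1, 2, 3]
skewed_right = [1, 1, 1, 2, 3, 3]
skewed_true = true_join_size(skewed_left, skewed_right)
skewed_estimate = uniform_join_estimate(skewed_left, skewed_right)

# Uniform keys: the assumption holds and the estimate is exact.
uniform_true = true_join_size([1, 2, 3], [1, 2, 3])
uniform_estimate = uniform_join_estimate([1, 2, 3], [1, 2, 3])
```

The error grows with skew and compounds across multi-table joins, which is why methods that capture cross-table correlation are attractive.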
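Influenced by the great success of deep learning via cloud computing and by the rapid development of edge chips, research in artificial intelligence (AI) has shifted toward both computing paradigms, namely cloud computing and edge computing. In recent years, we have witnessed significant progress in developing more advanced AI models on cloud servers that surpass traditional deep learning models, owing to model innovations (e.g., Transformers and pre-trained model families), the explosion of training data, and soaring computing power. However, edge computing, and especially edge-cloud collaborative computing, is still in its infancy, because resource-constrained IoT scenarios allow only very limited algorithms to be deployed. In this survey, we conduct a systematic review of both cloud and edge AI. Specifically, we are the first to set up the collaborative learning mechanism for cloud and edge modeling, with a thorough review of the architectures that enable such a mechanism. We also discuss the potential of, and practical experience with, several ongoing advanced edge AI topics, including pre-trained models, graph neural networks, and reinforcement learning. Finally, we discuss promising directions and challenges in this field.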