临床数据通常由于其高度机密性而无法自由分发,这阻碍了医疗保健领域的机器学习的发展。缓解此问题的一种方法是使用生成对抗网络(GAN)生成现实的合成数据集。然而,已知甘恩会遭受模式崩溃的困扰,从而产生低脱水量的输出。在本文中,我们扩展了经典的GAN设置,并具有外部内存,以重播真实样品的功能。使用抗逆转录病毒治疗作为人类免疫缺陷病毒(艾滋病毒的ART)作为案例研究,我们表明我们的扩展设置增加了收敛性,更重要的是,它有效地捕获了现实世界中临床数据常见的严重类别不平衡分布。
translated by 谷歌翻译
Electronic Health Records (EHRs) are a valuable asset to facilitate clinical research and point of care applications; however, many challenges such as data privacy concerns impede its optimal utilization. Deep generative models, particularly, Generative Adversarial Networks (GANs) show great promise in generating synthetic EHR data by learning underlying data distributions while achieving excellent performance and addressing these challenges. This work aims to review the major developments in various applications of GANs for EHRs and provides an overview of the proposed methodologies. For this purpose, we combine perspectives from healthcare applications and machine learning techniques in terms of source datasets and the fidelity and privacy evaluation of the generated synthetic datasets. We also compile a list of the metrics and datasets used by the reviewed works, which can be utilized as benchmarks for future research in the field. We conclude by discussing challenges in GANs for EHRs development and proposing recommended practices. We hope that this work motivates novel research development directions in the intersection of healthcare and machine learning.
translated by 谷歌翻译
尽管在文本,图像和视频上生成的对抗网络(GAN)取得了显着的成功,但由于一些独特的挑战,例如捕获不平衡数据中的依赖性,因此仍在开发中,生成高质量的表格数据仍在开发中,从而优化了合成患者数据的质量。保留隐私。在本文中,我们提出了DP-CGAN,这是一个由数据转换,采样,条件和网络培训组成的差异私有条件GAN框架,以生成现实且具有隐私性的表格数据。 DP-Cgans区分分类和连续变量,并将它们分别转换为潜在空间。然后,我们将条件矢量构建为附加输入,不仅在不平衡数据中介绍少数族裔类,还可以捕获变量之间的依赖性。我们将统计噪声注入DP-CGAN的网络训练过程中的梯度,以提供差异隐私保证。我们通过统计相似性,机器学习绩效和隐私测量值在三个公共数据集和两个现实世界中的个人健康数据集上使用最先进的生成模型广泛评估了我们的模型。我们证明,我们的模型优于其他可比模型,尤其是在捕获变量之间的依赖性时。最后,我们在合成数据生成中介绍了数据实用性与隐私之间的平衡,考虑到现实世界数据集的不同数据结构和特征,例如不平衡变量,异常分布和数据的稀疏性。
translated by 谷歌翻译
In data-driven systems, data exploration is imperative for making real-time decisions. However, big data is stored in massive databases that are difficult to retrieve. Approximate Query Processing (AQP) is a technique for providing approximate answers to aggregate queries based on a summary of the data (synopsis) that closely replicates the behavior of the actual data, which can be useful where an approximate answer to the queries would be acceptable in a fraction of the real execution time. In this paper, we discuss the use of Generative Adversarial Networks (GANs) for generating tabular data that can be employed in AQP for synopsis construction. We first discuss the challenges associated with constructing synopses in relational databases and then introduce solutions to those challenges. Following that, we organized statistical metrics to evaluate the quality of the generated synopses. We conclude that tabular data complexity makes it difficult for algorithms to understand relational database semantics during training, and improved versions of tabular GANs are capable of constructing synopses to revolutionize data-driven decision-making systems.
translated by 谷歌翻译
与CNN的分类,分割或对象检测相比,生成网络的目标和方法根本不同。最初,它们不是作为图像分析工具,而是生成自然看起来的图像。已经提出了对抗性训练范式来稳定生成方法,并已被证明是非常成功的 - 尽管绝不是第一次尝试。本章对生成对抗网络(GAN)的动机进行了基本介绍,并通​​过抽象基本任务和工作机制并得出了早期实用方法的困难来追溯其成功的道路。将显示进行更稳定的训练方法,也将显示出不良收敛及其原因的典型迹象。尽管本章侧重于用于图像生成和图像分析的gan,但对抗性训练范式本身并非特定于图像,并且在图像分析中也概括了任务。在将GAN与最近进入场景的进一步生成建模方法进行对比之前,将闻名图像语义分割和异常检测的架构示例。这将允许对限制的上下文化观点,但也可以对gans有好处。
translated by 谷歌翻译
数据质量是发展医疗保健中值得信赖的AI的关键因素。大量具有控制混杂因素的策划数据集可以帮助提高下游AI算法的准确性,鲁棒性和隐私性。但是,访问高质量的数据集受数据获取的技术难度的限制,并且严格的道德限制阻碍了医疗保健数据的大规模共享。数据合成算法生成具有与真实临床数据相似的分布的数据,可以作为解决可信度AI的发展过程中缺乏优质数据的潜在解决方案。然而,最新的数据合成算法,尤其是深度学习算法,更多地集中于成像数据,同时忽略了非成像医疗保健数据的综合,包括临床测量,医疗信号和波形以及电子保健记录(EHRS)(EHRS) 。因此,在本文中,我们将回顾合成算法,尤其是对于非成像医学数据,目的是在该领域提供可信赖的AI。本教程风格的审查论文将对包括算法,评估,局限性和未来研究方向在内的各个方面进行全面描述。
translated by 谷歌翻译
Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a distribution of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.
translated by 谷歌翻译
合成健康数据在共享数据以支持生物医学研究和创新医疗保健应用的发展时有可能减轻隐私问题。基于机器学习,尤其是生成对抗网络(GAN)方法的现代方法生成的现代方法继续发展并表现出巨大的潜力。然而,缺乏系统的评估框架来基准测试方法,并确定哪些方法最合适。在这项工作中,我们引入了一个可推广的基准测试框架,以评估综合健康数据的关键特征在实用性和隐私指标方面。我们将框架应用框架来评估来自两个大型学术医疗中心的电子健康记录(EHRS)数据的合成数据生成方法。结果表明,共享合成EHR数据存在公用事业私人关系权衡。结果进一步表明,在每个用例中,在所有标准上都没有明确的方法是最好的,这使得为什么需要在上下文中评估合成数据生成方法。
translated by 谷歌翻译
从文本描述中综合现实图像是计算机视觉中的主要挑战。当前对图像合成方法的文本缺乏产生代表文本描述符的高分辨率图像。大多数现有的研究都依赖于生成的对抗网络(GAN)或变异自动编码器(VAE)。甘斯具有产生更清晰的图像的能力,但缺乏输出的多样性,而VAE擅长生产各种输出,但是产生的图像通常是模糊的。考虑到gan和vaes的相对优势,我们提出了一个新的有条件VAE(CVAE)和条件gan(CGAN)网络架构,用于合成以文本描述为条件的图像。这项研究使用条件VAE作为初始发电机来生成文本描述符的高级草图。这款来自第一阶段的高级草图输出和文本描述符被用作条件GAN网络的输入。第二阶段GAN产生256x256高分辨率图像。所提出的体系结构受益于条件加强和有条件的GAN网络的残留块,以实现结果。使用CUB和Oxford-102数据集进行了多个实验,并将所提出方法的结果与Stackgan等最新技术进行了比较。实验表明,所提出的方法生成了以文本描述为条件的高分辨率图像,并使用两个数据集基于Inception和Frechet Inception评分产生竞争结果
translated by 谷歌翻译
时间序列数据生成近年来越来越受到关注。已经提出了几种生成的对抗网络(GaN)的方法通常是假设目标时间序列数据良好格式化并完成的假设来解决问题。然而,现实世界时间序列(RTS)数据远离该乌托邦,例如,具有可变长度的长序列和信息缺失数据,用于设计强大的发电算法的棘手挑战。在本文中,我们向RTS数据提出了一种新的生成框架 - RTSGAN来解决上述挑战。 RTSGAN首先学习编码器 - 解码器模块,该模块提供时间序列实例和固定维度潜在载体之间的映射,然后学习生成模块以在同一潜在空间中生成vectors。通过组合发电机和解码器,RTSGAN能够生成尊重原始特征分布和时间动态的RTS。为了生成具有缺失值的时间序列,我们进一步用观察嵌入层和决定和生成解码器装备了RTSGAN,以更好地利用信息缺失模式。四个RTS数据集上的实验表明,该框架在用于下游分类和预测任务的合成数据实用程序方面优于前一代方法。
translated by 谷歌翻译
为生成模型设计域和模型不合稳定的评估指标是一个重要且尚未解决的问题。大多数仅根据图像合成设置量身定制的指标表现出有限的能力,可以诊断跨更广泛的应用域的生成模型的不同模式。在本文中,我们介绍了三维评估度量标准($ \ alpha $ - precision,$ \ beta $ - recall,autherticity),其特征是任何生成模型中任何生成模型的保真度,多样性和概括性的表征。我们的度量标准通过精确重新分析统一统计差异度量,从而实现了模型保真度和多样性的样本和分布级诊断。我们将概括作为额外的独立维度(对忠诚度多样性权衡取舍),该概括量化了模型复制培训数据的程度 - 在对敏感数据建模具有隐私要求的敏感数据时,这是至关重要的绩效指标。这三个度量组件对应于(可解释的)概率数量,并通过样品级二进制分类估算。我们指标的样本级别的性质激发了一种新颖的用例,我们称之为模型审核,其中我们判断(Black-Box)模型生成的单个样品的质量,丢弃了低质量样品,从而改善了整体模型性能事后方式。
translated by 谷歌翻译
Modeling lies at the core of both the financial and the insurance industry for a wide variety of tasks. The rise and development of machine learning and deep learning models have created many opportunities to improve our modeling toolbox. Breakthroughs in these fields often come with the requirement of large amounts of data. Such large datasets are often not publicly available in finance and insurance, mainly due to privacy and ethics concerns. This lack of data is currently one of the main hurdles in developing better models. One possible option to alleviating this issue is generative modeling. Generative models are capable of simulating fake but realistic-looking data, also referred to as synthetic data, that can be shared more freely. Generative Adversarial Networks (GANs) is such a model that increases our capacity to fit very high-dimensional distributions of data. While research on GANs is an active topic in fields like computer vision, they have found limited adoption within the human sciences, like economics and insurance. Reason for this is that in these fields, most questions are inherently about identification of causal effects, while to this day neural networks, which are at the center of the GAN framework, focus mostly on high-dimensional correlations. In this paper we study the causal preservation capabilities of GANs and whether the produced synthetic data can reliably be used to answer causal questions. This is done by performing causal analyses on the synthetic data, produced by a GAN, with increasingly more lenient assumptions. We consider the cross-sectional case, the time series case and the case with a complete structural model. It is shown that in the simple cross-sectional scenario where correlation equals causation the GAN preserves causality, but that challenges arise for more advanced analyses.
translated by 谷歌翻译
Generating multivariate time series is a promising approach for sharing sensitive data in many medical, financial, and IoT applications. A common type of multivariate time series originates from a single source such as the biometric measurements from a medical patient. This leads to complex dynamical patterns between individual time series that are hard to learn by typical generation models such as GANs. There is valuable information in those patterns that machine learning models can use to better classify, predict or perform other downstream tasks. We propose a novel framework that takes time series' common origin into account and favors channel/feature relationships preservation. The two key points of our method are: 1) the individual time series are generated from a common point in latent space and 2) a central discriminator favors the preservation of inter-channel/feature dynamics. We demonstrate empirically that our method helps preserve channel/feature correlations and that our synthetic data performs very well in downstream tasks with medical and financial data.
translated by 谷歌翻译
我们提出了一种具有多个鉴别器的生成的对抗性网络,其中每个鉴别者都专门用于区分真实数据集的子集。这种方法有助于学习与底层数据分布重合的发电机,从而减轻慢性模式崩溃问题。从多项选择学习的灵感来看,我们引导每个判别者在整个数据的子集中具有专业知识,并允许发电机在没有监督训练示例和鉴别者的数量的情况下自动找到潜伏和真实数据空间之间的合理对应关系。尽管使用多种鉴别器,但骨干网络在鉴别器中共享,并且培训成本的增加最小化。我们使用多个评估指标展示了我们算法在标准数据集中的有效性。
translated by 谷歌翻译
最近在时间序列域中的合成数据生成的工作集中在使用生成的对抗网络。我们提出了一种用于综合生成时间序列数据的新型架构,使用变分自动编码器(VAES)。拟议的架构具有多种不同的特性:可解释性,编码域知识的能力,以及减少培训时间。我们通过对四个多变量数据集的相似性和可预测性评估数据生成质量。我们试验不同尺寸的培训数据,以测量数据可用性对我们VAE方法的产生质量的影响以及几种最先进的数据生成方法。我们对相似​​性测试的结果表明,VAE方法能够准确地代表原始数据的时间属性。在使用生成数据的下一步预测任务上,所提出的VAE架构一致地满足或超过最先进的数据生成方法的性能。虽然降噪可能导致所生成的数据偏离原始数据,但是我们演示了所产生的去噪数据可以使用生成的数据显着提高下一步预测的性能。最后,所提出的架构可以包含域特定的时间模式,例如多项式趋势和季节性,以提供可解释的输出。这种解释性在需要模型输出的透明度的应用中可以是非常有利的,或者用户希望将时间序列模式的先验知识注入到生成模型中。
translated by 谷歌翻译
最近,诸如Interovae和S-Introvae之类的内省模型在图像生成和重建任务方面表现出色。内省模型的主要特征是对VAE的对抗性学习,编码器试图区分真实和假(即合成)图像。但是,由于有效度量标准无法评估真实图像和假图像之间的差异,因此后塌陷和消失的梯度问题仍然存在,从而降低了合成图像的保真度。在本文中,我们提出了一种称为对抗性相似性距离内省变化自动编码器(AS-Introvae)的新变体。我们理论上分析了消失的梯度问题,并使用2-Wasserstein距离和内核技巧构建了新的对抗相似性距离(AS-cantance)。随着重量退火,AS-Introvae能够产生稳定和高质量的图像。通过每批次尝试转换图像,以使其更好地适合潜在空间中的先前分布,从而解决了后塌陷问题。与每个图像方法相比,该策略促进了潜在空间中更多样化的分布,从而使我们的模型能够产生巨大的多样性图像。基准数据集的全面实验证明了AS-Introvae对图像生成和重建任务的有效性。
translated by 谷歌翻译
We define and address the problem of unsupervised learning of disentangled representations on data generated from independent factors of variation. We propose FactorVAE, a method that disentangles by encouraging the distribution of representations to be factorial and hence independent across the dimensions. We show that it improves upon β-VAE by providing a better trade-off between disentanglement and reconstruction quality. Moreover, we highlight the problems of a commonly used disentanglement metric and introduce a new metric that does not suffer from them.
translated by 谷歌翻译
Generative adversarial networks (GANs) provide a way to learn deep representations without extensively annotated training data. They achieve this through deriving backpropagation signals through a competitive process involving a pair of networks. The representations that can be learned by GANs may be used in a variety of applications, including image synthesis, semantic image editing, style transfer, image super-resolution and classification. The aim of this review paper is to provide an overview of GANs for the signal processing community, drawing on familiar analogies and concepts where possible. In addition to identifying different methods for training and constructing GANs, we also point to remaining challenges in their theory and application.
translated by 谷歌翻译
医疗数据集通常由噪声和缺失数据损坏。这些缺失的模式通常被认为是完全随机的,而是在医学场景中,现实是,这些模式由于在一些时间或数据被收集的不alaled的不均匀方式中被收集的传感器而发生突发。本文建议使用异构数据类型和使用顺序变化自动码器(VAES)来模拟医疗数据记录和突发的缺失数据。特别是,我们提出了一种新的方法,SHI-VAE,其扩展了VAE的能力,使VAE的顺序数据流缺失了观察。我们将我们的模型与精密护理单元数据库(ICU)中的最先进的解决方案进行比较和被动人类监测的数据集。此外,我们发现诸如RMSE的标准错误指标不能得出足够的决定性,以评估时间模型,并包括在我们分析地面真理和算中信号之间的互相关。我们表明Shi-VAE在使用两个指标方面实现了最佳性能,而不是GP-VAE模型的计算复杂性较低,这是用于医疗记录的最先进的方法。
translated by 谷歌翻译
生成的对抗网络(GANS)正在增加对综合数据的手段的关注。到目前为止,这项工作已被应用于在数据机密域之外的用例,具有共同的应用程序作为人工图像的生产。在这里,我们考虑了GAN的潜在应用,以产生合成人口普查Microdata。我们使用电池电量和披露风险指标(目标正确的归因概率),以比较用使用正统数据合成方法生产的表格GAN产生的数据。
translated by 谷歌翻译