从移动设备收集的位置数据代表个人和社会水平的移动性行为。这些数据具有从运输计划到疫情建模的重要应用。但是,必须克服最佳服务的问题:数据通常代表有限的人口样本和数据危害隐私的数据。为了解决这些问题,我们展示并评估用于使用在实际位置数据上培训的深频复制神经网络(RNN)来生成合成移动数据的系统。该系统将群体分发作为输入,为相应的合成群生成移动性跟踪。相关的生成方法尚未解决在较长时间内捕获个人移动行为中的模式和变异性的挑战,同时还平衡了具有隐私的现实数据的产生。我们的系统利用RNNS的能力生成复杂和新序列的能力,同时保留训练数据的模式。此外,该模型引入了用于校准各个级别的合成和实际数据之间的变化的随机性。这是捕获人类移动性的可变性,并保护用户隐私。基于位置的服务(LBS)来自22,700多种移动设备的数据用于实用程序和隐私度量的实验评估。我们示出了生成的移动数据保留了实际数据的特征,同时从个人级别的实际数据变化,并且在此变化量匹配真实数据内的变化。
translated by 谷歌翻译
“轨迹”是指由地理空间中的移动物体产生的迹线,通常由一系列按时间顺序排列的点表示,其中每个点由地理空间坐标集和时间戳组成。位置感应和无线通信技术的快速进步使我们能够收集和存储大量的轨迹数据。因此,许多研究人员使用轨迹数据来分析各种移动物体的移动性。在本文中,我们专注于“城市车辆轨迹”,这是指城市交通网络中车辆的轨迹,我们专注于“城市车辆轨迹分析”。城市车辆轨迹分析提供了前所未有的机会,可以了解城市交通网络中的车辆运动模式,包括以用户为中心的旅行经验和系统范围的时空模式。城市车辆轨迹数据的时空特征在结构上相互关联,因此,许多先前的研究人员使用了各种方法来理解这种结构。特别是,由于其强大的函数近似和特征表示能力,深度学习模型是由于许多研究人员的注意。因此,本文的目的是开发基于深度学习的城市车辆轨迹分析模型,以更好地了解城市交通网络的移动模式。特别是,本文重点介绍了两项研究主题,具有很高的必要性,重要性和适用性:下一个位置预测,以及合成轨迹生成。在这项研究中,我们向城市车辆轨迹分析提供了各种新型模型,使用深度学习。
translated by 谷歌翻译
Efficient energy consumption is crucial for achieving sustainable energy goals in the era of climate change and grid modernization. Thus, it is vital to understand how energy is consumed at finer resolutions such as household in order to plan demand-response events or analyze the impacts of weather, electricity prices, electric vehicles, solar, and occupancy schedules on energy consumption. However, availability and access to detailed energy-use data, which would enable detailed studies, has been rare. In this paper, we release a unique, large-scale, synthetic, residential energy-use dataset for the residential sector across the contiguous United States covering millions of households. The data comprise of hourly energy use profiles for synthetic households, disaggregated into Thermostatically Controlled Loads (TCL) and appliance use. The underlying framework is constructed using a bottom-up approach. Diverse open-source surveys and first principles models are used for end-use modeling. Extensive validation of the synthetic dataset has been conducted through comparisons with reported energy-use data. We present a detailed, open, high-resolution, residential energy-use dataset for the United States.
translated by 谷歌翻译
随着移动设备和基于位置的服务越来越多地在不同的智能城市场景和应用程序中开发,由于数据收集和共享,许多意外的隐私泄漏已经出现。当与云辅助应用程序共享地理位置数据时,用户重新识别和其他敏感的推论是主要的隐私威胁。值得注意的是,四个时空点足以唯一地识别95%的个人,这加剧了个人信息泄漏。为了解决诸如用户重新识别之类的恶意目的,我们提出了一种基于LSTM的对抗机制,具有代表性学习,以实现原始地理位置数据(即移动性数据)的隐私权特征表示,以共享目的。这些表示旨在以最小的公用事业预算(即损失)最大程度地减少用户重新识别和完整数据重建的机会。我们通过量化轨迹重建风险,用户重新识别风险和移动性可预测性来量化移动性数据集的隐私性权衡权衡来训练该机制。我们报告了探索性分析,使用户能够通过特定的损失功能及其权重参数评估此权衡。四个代表性移动数据集的广泛比较结果证明了我们提出的在移动性隐私保护方面的架构的优越性以及提议的隐私权提取器提取器的效率。我们表明,流动痕迹的隐私能够以边际移动公用事业为代价获得体面的保护。我们的结果还表明,通过探索帕累托最佳设置,我们可以同时增加隐私(45%)和实用程序(32%)。
translated by 谷歌翻译
尽管在文本,图像和视频上生成的对抗网络(GAN)取得了显着的成功,但由于一些独特的挑战,例如捕获不平衡数据中的依赖性,因此仍在开发中,生成高质量的表格数据仍在开发中,从而优化了合成患者数据的质量。保留隐私。在本文中,我们提出了DP-CGAN,这是一个由数据转换,采样,条件和网络培训组成的差异私有条件GAN框架,以生成现实且具有隐私性的表格数据。 DP-Cgans区分分类和连续变量,并将它们分别转换为潜在空间。然后,我们将条件矢量构建为附加输入,不仅在不平衡数据中介绍少数族裔类,还可以捕获变量之间的依赖性。我们将统计噪声注入DP-CGAN的网络训练过程中的梯度,以提供差异隐私保证。我们通过统计相似性,机器学习绩效和隐私测量值在三个公共数据集和两个现实世界中的个人健康数据集上使用最先进的生成模型广泛评估了我们的模型。我们证明,我们的模型优于其他可比模型,尤其是在捕获变量之间的依赖性时。最后,我们在合成数据生成中介绍了数据实用性与隐私之间的平衡,考虑到现实世界数据集的不同数据结构和特征,例如不平衡变量,异常分布和数据的稀疏性。
translated by 谷歌翻译
Accurate activity location prediction is a crucial component of many mobility applications and is particularly required to develop personalized, sustainable transportation systems. Despite the widespread adoption of deep learning models, next location prediction models lack a comprehensive discussion and integration of mobility-related spatio-temporal contexts. Here, we utilize a multi-head self-attentional (MHSA) neural network that learns location transition patterns from historical location visits, their visit time and activity duration, as well as their surrounding land use functions, to infer an individual's next location. Specifically, we adopt point-of-interest data and latent Dirichlet allocation for representing locations' land use contexts at multiple spatial scales, generate embedding vectors of the spatio-temporal features, and learn to predict the next location with an MHSA network. Through experiments on two large-scale GNSS tracking datasets, we demonstrate that the proposed model outperforms other state-of-the-art prediction models, and reveal the contribution of various spatio-temporal contexts to the model's performance. Moreover, we find that the model trained on population data achieves higher prediction performance with fewer parameters than individual-level models due to learning from collective movement patterns. We also reveal mobility conducted in the recent past and one week before has the largest influence on the current prediction, showing that learning from a subset of the historical mobility is sufficient to obtain an accurate location prediction result. We believe that the proposed model is vital for context-aware mobility prediction. The gained insights will help to understand location prediction models and promote their implementation for mobility applications.
translated by 谷歌翻译
合成健康数据在共享数据以支持生物医学研究和创新医疗保健应用的发展时有可能减轻隐私问题。基于机器学习,尤其是生成对抗网络(GAN)方法的现代方法生成的现代方法继续发展并表现出巨大的潜力。然而,缺乏系统的评估框架来基准测试方法,并确定哪些方法最合适。在这项工作中,我们引入了一个可推广的基准测试框架,以评估综合健康数据的关键特征在实用性和隐私指标方面。我们将框架应用框架来评估来自两个大型学术医疗中心的电子健康记录(EHRS)数据的合成数据生成方法。结果表明,共享合成EHR数据存在公用事业私人关系权衡。结果进一步表明,在每个用例中,在所有标准上都没有明确的方法是最好的,这使得为什么需要在上下文中评估合成数据生成方法。
translated by 谷歌翻译
蜂窝提供商和数据聚合公司从用户设备中占群体的Celluar信号强度测量以生成信号映射,可用于提高网络性能。认识到这种数据收集可能与越来越多的隐私问题的认识可能存在赔率,我们考虑在数据离开移动设备之前混淆这些数据。目标是提高隐私,使得难以从混淆的数据(例如用户ID和用户行踪)中恢复敏感功能,同时仍然允许网络提供商使用用于改进网络服务的数据(即创建准确的信号映射)。要检查本隐私实用程序权衡,我们识别适用于信号强度测量的隐私和公用事业度量和威胁模型。然后,我们使用几种卓越的技术,跨越差异隐私,生成的对抗性隐私和信息隐私技术进行了衡量测量,以便基准,以基准获得各种有前景的混淆方法,并为真实世界的工程师提供指导,这些工程师是负责构建信号映射的现实工程师在不伤害效用的情况下保护隐私。我们的评估结果基于多个不同的现实世界信号映射数据集,展示了同时实现了充足的隐私和实用程序的可行性,并使用了使用该结构和预期使用数据集的策略以及目标平均案例的策略,而不是最坏的情况,保证。
translated by 谷歌翻译
The proliferation of smartphones has accelerated mobility studies by largely increasing the type and volume of mobility data available. One such source of mobility data is from GPS technology, which is becoming increasingly common and helps the research community understand mobility patterns of people. However, there lacks a standardized framework for studying the different mobility patterns created by the non-Work, non-Home locations of Working and Nonworking users on Workdays and Offdays using machine learning methods. We propose a new mobility metric, Daily Characteristic Distance, and use it to generate features for each user together with Origin-Destination matrix features. We then use those features with an unsupervised machine learning method, $k$-means clustering, and obtain three clusters of users for each type of day (Workday and Offday). Finally, we propose two new metrics for the analysis of the clustering results, namely User Commonality and Average Frequency. By using the proposed metrics, interesting user behaviors can be discerned and it helps us to better understand the mobility patterns of the users.
translated by 谷歌翻译
Electronic Health Records (EHRs) are a valuable asset to facilitate clinical research and point of care applications; however, many challenges such as data privacy concerns impede its optimal utilization. Deep generative models, particularly, Generative Adversarial Networks (GANs) show great promise in generating synthetic EHR data by learning underlying data distributions while achieving excellent performance and addressing these challenges. This work aims to review the major developments in various applications of GANs for EHRs and provides an overview of the proposed methodologies. For this purpose, we combine perspectives from healthcare applications and machine learning techniques in terms of source datasets and the fidelity and privacy evaluation of the generated synthetic datasets. We also compile a list of the metrics and datasets used by the reviewed works, which can be utilized as benchmarks for future research in the field. We conclude by discussing challenges in GANs for EHRs development and proposing recommended practices. We hope that this work motivates novel research development directions in the intersection of healthcare and machine learning.
translated by 谷歌翻译
在本文中,我们使用机器学习,概率和基于重力的方法的组合来提出一种用于为更大的墨尔本地区创建合成群体的算法。我们将这些技术与三个主要创新的混合模型相结合:1。分配活动模式时,我们为每个代理商生成各个活动链,对其队列量身定制; 2.选择目的地时,我们的目标是在旅行长度和目的地的基于活动的景点之间取得平衡; 3.我们考虑到代理人剩余的旅行数量,以确保他们不选择不合理的目的地以退回家庭。我们的方法是完全打开和可复制的,只需要公开的数据来生成与常用代理的建模软件兼容的合成代理商,例如Matsim。在各种人口尺寸的距离分布,模式选择和目的地选择方面,发现合成群是准确的。
translated by 谷歌翻译
数据质量是发展医疗保健中值得信赖的AI的关键因素。大量具有控制混杂因素的策划数据集可以帮助提高下游AI算法的准确性,鲁棒性和隐私性。但是,访问高质量的数据集受数据获取的技术难度的限制,并且严格的道德限制阻碍了医疗保健数据的大规模共享。数据合成算法生成具有与真实临床数据相似的分布的数据,可以作为解决可信度AI的发展过程中缺乏优质数据的潜在解决方案。然而,最新的数据合成算法,尤其是深度学习算法,更多地集中于成像数据,同时忽略了非成像医疗保健数据的综合,包括临床测量,医疗信号和波形以及电子保健记录(EHRS)(EHRS) 。因此,在本文中,我们将回顾合成算法,尤其是对于非成像医学数据,目的是在该领域提供可信赖的AI。本教程风格的审查论文将对包括算法,评估,局限性和未来研究方向在内的各个方面进行全面描述。
translated by 谷歌翻译
我们考虑从多个移动设备收集的测量预测蜂窝网络性能(信号映射)的问题。我们制定在线联合学习框架内的问题:(i)联合学习(FL)使用户能够协作培训模型,同时保持其培训数据; (ii)由于用户移动随着时间的推移,并且用于以在线方式用于本地培训,因此收集测量。我们考虑一个诚实但很好的服务器,他们使用梯度(DLG)类型的攻击深泄漏来观察来自目标用户的更新,并使用深度泄漏(DLG)类型的攻击,最初开发的是重建DNN图像分类器的训练数据。我们使应用于我们的设置的DLG攻击的关键观察,Infers Infers Infers批次的本地数据的平均位置,因此可以用于以粗糙粒度重建目标用户的轨迹。我们表明,已经通过梯度的平均来提供适度的隐私保护,这是联合平均所固有的。此外,我们提出了一种算法,该算法可以在本地应用,以策划用于本地更新的批次,以便在不伤害实用程序的情况下有效保护其位置隐私。最后,我们表明,参与FL的多个用户的效果取决于其轨迹的相似性。据我们所知,这是第一次研究DLG攻击在众群时空数据的环境中。
translated by 谷歌翻译
We quantitatively investigate how machine learning models leak information about the individual data records on which they were trained. We focus on the basic membership inference attack: given a data record and black-box access to a model, determine if the record was in the model's training dataset. To perform membership inference against a target model, we make adversarial use of machine learning and train our own inference model to recognize differences in the target model's predictions on the inputs that it trained on versus the inputs that it did not train on.We empirically evaluate our inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon. Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks. We then investigate the factors that influence this leakage and evaluate mitigation strategies.
translated by 谷歌翻译
机器学习的最新进展使其在不同领域的广泛应用程序,最令人兴奋的应用程序之一是自动驾驶汽车(AV),这鼓励了从感知到预测到计划的许多ML算法的开发。但是,培训AV通常需要从不同驾驶环境(例如城市)以及不同类型的个人信息(例如工作时间和路线)收集的大量培训数据。这种收集的大数据被视为以数据为中心的AI时代的ML新油,通常包含大量对隐私敏感的信息,这些信息很难删除甚至审核。尽管现有的隐私保护方法已经取得了某些理论和经验成功,但将它们应用于自动驾驶汽车等现实世界应用时仍存在差距。例如,当培训AVS时,不仅可以单独识别的信息揭示对隐私敏感的信息,还可以揭示人口级别的信息,例如城市内的道路建设以及AVS的专有商业秘密。因此,重新审视AV中隐私风险和相应保护方法的前沿以弥合这一差距至关重要。遵循这一目标,在这项工作中,我们为AVS中的隐私风险和保护方法提供了新的分类法,并将AV中的隐私分为三个层面:个人,人口和专有。我们明确列出了保护每个级别的隐私级别,总结这些挑战的现有解决方案,讨论课程和结论,并为研究人员和从业者提供潜在的未来方向和机会。我们认为,这项工作将有助于塑造AV中的隐私研究,并指导隐私保护技术设计。
translated by 谷歌翻译
Individual-level data (microdata) that characterizes a population, is essential for studying many real-world problems. However, acquiring such data is not straightforward due to cost and privacy constraints, and access is often limited to aggregated data (macro data) sources. In this study, we examine synthetic data generation as a tool to extrapolate difficult-to-obtain high-resolution data by combining information from multiple easier-to-obtain lower-resolution data sources. In particular, we introduce a framework that uses a combination of univariate and multivariate frequency tables from a given target geographical location in combination with frequency tables from other auxiliary locations to generate synthetic microdata for individuals in the target location. Our method combines the estimation of a dependency graph and conditional probabilities from the target location with the use of a Gaussian copula to leverage the available information from the auxiliary locations. We perform extensive testing on two real-world datasets and demonstrate that our approach outperforms prior approaches in preserving the overall dependency structure of the data while also satisfying the constraints defined on the different variables.
translated by 谷歌翻译
深度神经网络在人类分析中已经普遍存在,增强了应用的性能,例如生物识别识别,动作识别以及人重新识别。但是,此类网络的性能通过可用的培训数据缩放。在人类分析中,对大规模数据集的需求构成了严重的挑战,因为数据收集乏味,廉价,昂贵,并且必须遵守数据保护法。当前的研究研究了\ textit {合成数据}的生成,作为在现场收集真实数据的有效且具有隐私性的替代方案。这项调查介绍了基本定义和方法,在生成和采用合成数据进行人类分析时必不可少。我们进行了一项调查,总结了当前的最新方法以及使用合成数据的主要好处。我们还提供了公开可用的合成数据集和生成模型的概述。最后,我们讨论了该领域的局限性以及开放研究问题。这项调查旨在为人类分析领域的研究人员和从业人员提供。
translated by 谷歌翻译
In data-driven systems, data exploration is imperative for making real-time decisions. However, big data is stored in massive databases that are difficult to retrieve. Approximate Query Processing (AQP) is a technique for providing approximate answers to aggregate queries based on a summary of the data (synopsis) that closely replicates the behavior of the actual data, which can be useful where an approximate answer to the queries would be acceptable in a fraction of the real execution time. In this paper, we discuss the use of Generative Adversarial Networks (GANs) for generating tabular data that can be employed in AQP for synopsis construction. We first discuss the challenges associated with constructing synopses in relational databases and then introduce solutions to those challenges. Following that, we organized statistical metrics to evaluate the quality of the generated synopses. We conclude that tabular data complexity makes it difficult for algorithms to understand relational database semantics during training, and improved versions of tabular GANs are capable of constructing synopses to revolutionize data-driven decision-making systems.
translated by 谷歌翻译
This paper describes a testing methodology for quantitatively assessing the risk that rare or unique training-data sequences are unintentionally memorized by generative sequence models-a common type of machine-learning model. Because such models are sometimes trained on sensitive data (e.g., the text of users' private messages), this methodology can benefit privacy by allowing deep-learning practitioners to select means of training that minimize such memorization.In experiments, we show that unintended memorization is a persistent, hard-to-avoid issue that can have serious consequences. Specifically, for models trained without consideration of memorization, we describe new, efficient procedures that can extract unique, secret sequences, such as credit card numbers. We show that our testing strategy is a practical and easy-to-use first line of defense, e.g., by describing its application to quantitatively limit data exposure in Google's Smart Compose, a commercial text-completion neural network trained on millions of users' email messages.
translated by 谷歌翻译
机构收集了大量的学习痕迹,但他们可能不会出于隐私问题披露它。合成数据生成为教育研究开辟了新的机会。在本文中,我们提出了一个可以保留参与者隐私的教育数据的生成模型,以及比较合成数据生成器的评估框架。我们展示了幼稚的假名如何导致重新识别威胁并提出保证隐私的技术。我们评估了现有大规模教育开放数据集的方法。
translated by 谷歌翻译