With water quality management processes, identifying and interpreting relationships between features, such as location and weather variable tuples, and water quality variables, such as levels of bacteria, is key to gaining insights and identifying areas where interventions should be made. There is a need for a search process to identify the locations and types of phenomena that are influencing water quality and a need to explain why the quality is being affected and which factors are most relevant. This paper addresses both of these issues through the development of a process for collecting data for features that represent a variety of variables over a spatial region, which are used for training and inference, and analysing the performance of the features using the model and Shapley values. Shapley values originated in cooperative game theory and can be used to aid in the interpretation of machine learning results. Evaluations are performed using several machine learning algorithms and water quality data from the Dublin Grand Canal basin.
translated by 谷歌翻译
PV power forecasting models are predominantly based on machine learning algorithms which do not provide any insight into or explanation about their predictions (black boxes). Therefore, their direct implementation in environments where transparency is required, and the trust associated with their predictions may be questioned. To this end, we propose a two stage probabilistic forecasting framework able to generate highly accurate, reliable, and sharp forecasts yet offering full transparency on both the point forecasts and the prediction intervals (PIs). In the first stage, we exploit natural gradient boosting (NGBoost) for yielding probabilistic forecasts, while in the second stage, we calculate the Shapley additive explanation (SHAP) values in order to fully comprehend why a prediction was made. To highlight the performance and the applicability of the proposed framework, real data from two PV parks located in Southern Germany are employed. Comparative results with two state-of-the-art algorithms, namely Gaussian process and lower upper bound estimation, manifest a significant increase in the point forecast accuracy and in the overall probabilistic performance. Most importantly, a detailed analysis of the model's complex nonlinear relationships and interaction effects between the various features is presented. This allows interpreting the model, identifying some learned physical properties, explaining individual predictions, reducing the computational requirements for the training without jeopardizing the model accuracy, detecting possible bugs, and gaining trust in the model. Finally, we conclude that the model was able to develop complex nonlinear relationships which follow known physical properties as well as human logic and intuition.
translated by 谷歌翻译
预测经济的短期动态 - 对经济代理商决策过程的重要意见 - 经常在线性模型中使用滞后指标。这通常在正常时期就足够了,但在危机期间可能不足。本文旨在证明,在非线性机器学习方法的帮助下,非传统和及时的数据(例如零售和批发付款)可以为决策者提供复杂的模型,以准确地估算几乎实时的关键宏观经济指标。此外,我们提供了一组计量经济学工具,以减轻机器学习模型中的过度拟合和解释性挑战,以提高其政策使用的有效性。我们的模型具有付款数据,非线性方法和量身定制的交叉验证方法,有助于提高宏观经济的启示准确性高达40 \% - 在COVID-19期间的增长较高。我们观察到,付款数据对经济预测的贡献很小,在低和正常增长期间是线性的。但是,在强年或正增长期间,付款数据的贡献很大,不对称和非线性。
translated by 谷歌翻译
基于Shapley值的功能归因在解释机器学习模型中很受欢迎。但是,从理论和计算的角度来看,它们的估计是复杂的。我们将这种复杂性分解为两个因素:(1)〜删除特征信息的方法,以及(2)〜可拖动估计策略。这两个因素提供了一种天然镜头,我们可以更好地理解和比较24种不同的算法。基于各种特征删除方法,我们描述了多种类型的Shapley值特征属性和计算每个类型的方法。然后,基于可进行的估计策略,我们表征了两个不同的方法家族:模型 - 不合时宜的和模型特定的近似值。对于模型 - 不合稳定的近似值,我们基准了广泛的估计方法,并将其与Shapley值的替代性但等效的特征联系起来。对于特定于模型的近似值,我们阐明了对每种方法的线性,树和深模型的障碍至关重要的假设。最后,我们确定了文献中的差距以及有希望的未来研究方向。
translated by 谷歌翻译
提出了一种使用天气数据实时太阳生成预测的新方法,同时提出了既有空间结构依赖性的依赖。随着时间的推移,观察到的网络被预测到较低维度的表示,在该表示的情况下,在推理阶段使用天气预报时,使用各种天气测量来训练结构化回归模型。从国家太阳辐射数据库获得的德克萨斯州圣安东尼奥地区的288个地点进行了实验。该模型预测具有良好精度的太阳辐照度(夏季R2 0.91,冬季为0.85,全球模型为0.89)。随机森林回归者获得了最佳准确性。进行了多个实验来表征缺失数据的影响和不同的时间范围的影响,这些范围提供了证据表明,新算法不仅在随机的情况下,而且在机制是空间和时间上都丢失的数据是可靠的。
translated by 谷歌翻译
Shap是一种衡量机器学习模型中可变重要性的流行方法。在本文中,我们研究了用于估计外形评分的算法,并表明它是功能性方差分析分解的转换。我们使用此连接表明,在Shap近似中的挑战主要与选择功能分布的选择以及估计的$ 2^p $ ANOVA条款的数量有关。我们认为,在这种情况下,机器学习解释性和敏感性分析之间的联系是有照明的,但是直接的实际后果并不明显,因为这两个领域面临着不同的约束。机器学习的解释性问题模型可评估,但通常具有数百个(即使不是数千个)功能。敏感性分析通常处理物理或工程的模型,这些模型可能非常耗时,但在相对较小的输入空间上运行。
translated by 谷歌翻译
As of 2022, greenhouse gases (GHG) emissions reporting and auditing are not yet compulsory for all companies and methodologies of measurement and estimation are not unified. We propose a machine learning-based model to estimate scope 1 and scope 2 GHG emissions of companies not reporting them yet. Our model, specifically designed to be transparent and completely adapted to this use case, is able to estimate emissions for a large universe of companies. It shows good out-of-sample global performances as well as good out-of-sample granular performances when evaluating it by sectors, by countries or by revenues buckets. We also compare our results to those of other providers and find our estimates to be more accurate. Thanks to the proposed explainability tools using Shapley values, our model is fully interpretable, the user being able to understand which factors split explain the GHG emissions for each particular company.
translated by 谷歌翻译
Building an accurate model of travel behaviour based on individuals' characteristics and built environment attributes is of importance for policy-making and transportation planning. Recent experiments with big data and Machine Learning (ML) algorithms toward a better travel behaviour analysis have mainly overlooked socially disadvantaged groups. Accordingly, in this study, we explore the travel behaviour responses of low-income individuals to transit investments in the Greater Toronto and Hamilton Area, Canada, using statistical and ML models. We first investigate how the model choice affects the prediction of transit use by the low-income group. This step includes comparing the predictive performance of traditional and ML algorithms and then evaluating a transit investment policy by contrasting the predicted activities and the spatial distribution of transit trips generated by vulnerable households after improving accessibility. We also empirically investigate the proposed transit investment by each algorithm and compare it with the city of Brampton's future transportation plan. While, unsurprisingly, the ML algorithms outperform classical models, there are still doubts about using them due to interpretability concerns. Hence, we adopt recent local and global model-agnostic interpretation tools to interpret how the model arrives at its predictions. Our findings reveal the great potential of ML algorithms for enhanced travel behaviour predictions for low-income strata without considerably sacrificing interpretability.
translated by 谷歌翻译
As ride-hailing services become increasingly popular, being able to accurately predict demand for such services can help operators efficiently allocate drivers to customers, and reduce idle time, improve congestion, and enhance the passenger experience. This paper proposes UberNet, a deep learning Convolutional Neural Network for short-term prediction of demand for ride-hailing services. UberNet empploys a multivariate framework that utilises a number of temporal and spatial features that have been found in the literature to explain demand for ride-hailing services. The proposed model includes two sub-networks that aim to encode the source series of various features and decode the predicting series, respectively. To assess the performance and effectiveness of UberNet, we use 9 months of Uber pickup data in 2014 and 28 spatial and temporal features from New York City. By comparing the performance of UberNet with several other approaches, we show that the prediction quality of the model is highly competitive. Further, Ubernet's prediction performance is better when using economic, social and built environment features. This suggests that Ubernet is more naturally suited to including complex motivators in making real-time passenger demand predictions for ride-hailing services.
translated by 谷歌翻译
与经典的统计学习方法相比,机器和深度学习生存模型表现出相似甚至改进事件的预测能力,但太复杂了,无法被人类解释。有几种模型不合时宜的解释可以克服这个问题。但是,没有一个直接解释生存函数预测。在本文中,我们介绍了Survhap(t),这是第一个允许解释生存黑盒模型的解释。它基于Shapley添加性解释,其理论基础稳定,并在机器学习从业人员中广泛采用。拟议的方法旨在增强精确诊断和支持领域的专家做出决策。关于合成和医学数据的实验证实,survhap(t)可以检测具有时间依赖性效果的变量,并且其聚集是对变量对预测的重要性的决定因素,而不是存活。 survhap(t)是模型不可屈服的,可以应用于具有功能输出的所有型号。我们在http://github.com/mi2datalab/survshap中提供了python中时间相关解释的可访问实现。
translated by 谷歌翻译
COVID-19的大流行提出了对多个领域决策者的流行预测的重要性,从公共卫生到整个经济。虽然预测流行进展经常被概念化为类似于天气预测,但是它具有一些关键的差异,并且仍然是一项非平凡的任务。疾病的传播受到人类行为,病原体动态,天气和环境条件的多种混杂因素的影响。由于政府公共卫生和资助机构的倡议,捕获以前无法观察到的方面的丰富数据来源的可用性增加了研究的兴趣。这尤其是在“以数据为中心”的解决方案上进行的一系列工作,这些解决方案通过利用非传统数据源以及AI和机器学习的最新创新来增强我们的预测能力的潜力。这项调查研究了各种数据驱动的方法论和实践进步,并介绍了一个概念框架来导航它们。首先,我们列举了与流行病预测相关的大量流行病学数据集和新的数据流,捕获了各种因素,例如有症状的在线调查,零售和商业,流动性,基因组学数据等。接下来,我们将讨论关注最近基于数据驱动的统计和深度学习方法的方法和建模范式,以及将机械模型知识域知识与统计方法的有效性和灵活性相结合的新型混合模型类别。我们还讨论了这些预测系统的现实部署中出现的经验和挑战,包括预测信息。最后,我们重点介绍了整个预测管道中发现的一些挑战和开放问题。
translated by 谷歌翻译
评估能源转型和能源市场自由化对资源充足性的影响是一种越来越重要和苛刻的任务。能量系统的上升复杂性需要足够的能量系统建模方法,从而提高计算要求。此外,随着复杂性,同样调用概率评估和场景分析同样增加不确定性。为了充分和高效地解决这些各种要求,需要来自数据科学领域的新方法来加速当前方法。通过我们的系统文献综述,我们希望缩小三个学科之间的差距(1)电力供应安全性评估,(2)人工智能和(3)实验设计。为此,我们对所选应用领域进行大规模的定量审查,并制作彼此不同学科的合成。在其他发现之外,我们使用基于AI的方法和应用程序的AI方法和应用来确定电力供应模型的复杂安全性的元素,并作为未充分涵盖的应用领域的储存调度和(非)可用性。我们结束了推出了一种新的方法管道,以便在评估电力供应安全评估时充分有效地解决当前和即将到来的挑战。
translated by 谷歌翻译
我们基于技能评分,对确定性太阳预测进行了首次全面的荟萃分析,筛选了Google Scholar的1,447篇论文,并审查了320篇论文的全文以进行数据提取。用多元自适应回归样条模型,部分依赖图和线性回归构建和分析了4,758点的数据库。值得注意的是,分析说明了数据中最重要的非线性关系和交互项。我们量化了对重要变量的预测准确性的影响,例如预测范围,分辨率,气候条件,区域的年度太阳辐照度水平,电力系统大小和容量,预测模型,火车和测试集以及使用不同的技术和投入。通过控制预测之间的关键差异,包括位置变量,可以在全球应用分析的发现。还提供了该领域科学进步的概述。
translated by 谷歌翻译
预测组合在预测社区中蓬勃发展,近年来,已经成为预测研究和活动主流的一部分。现在,由单个(目标)系列产生的多个预测组合通过整合来自不同来源收集的信息,从而提高准确性,从而减轻了识别单个“最佳”预测的风险。组合方案已从没有估计的简单组合方法演变为涉及时间变化的权重,非线性组合,组件之间的相关性和交叉学习的复杂方法。它们包括结合点预测和结合概率预测。本文提供了有关预测组合的广泛文献的最新评论,并参考可用的开源软件实施。我们讨论了各种方法的潜在和局限性,并突出了这些思想如何随着时间的推移而发展。还调查了有关预测组合实用性的一些重要问题。最后,我们以当前的研究差距和未来研究的潜在见解得出结论。
translated by 谷歌翻译
本文研究了与可解释的AI(XAI)实践有关的两个不同但相关的问题。机器学习(ML)在金融服务中越来越重要,例如预批准,信用承销,投资以及各种前端和后端活动。机器学习可以自动检测培训数据中的非线性和相互作用,从而促进更快,更准确的信用决策。但是,机器学习模型是不透明的,难以解释,这是建立可靠技术所需的关键要素。该研究比较了各种机器学习模型,包括单个分类器(逻辑回归,决策树,LDA,QDA),异质集合(Adaboost,随机森林)和顺序神经网络。结果表明,整体分类器和神经网络的表现优于表现。此外,使用基于美国P2P贷款平台Lending Club提供的开放式访问数据集评估了两种先进的事后不可解释能力 - 石灰和外形来评估基于ML的信用评分模型。对于这项研究,我们还使用机器学习算法来开发新的投资模型,并探索可以最大化盈利能力同时最大程度地降低风险的投资组合策略。
translated by 谷歌翻译
我们介绍了数据科学预测生命周期中各个阶段开发和采用自动化的技术和文化挑战的说明概述,从而将重点限制为使用结构化数据集的监督学习。此外,我们回顾了流行的开源Python工具,这些工具实施了针对自动化挑战的通用解决方案模式,并突出了我们认为进步仍然需要的差距。
translated by 谷歌翻译
地下水位预测是一个应用时间序列预测任务,具有重要的社会影响,以优化水管理以及防止某些自然灾害:例如,洪水或严重的干旱。在文献中已经报告了机器学习方法以实现这项任务,但它们仅专注于单个位置的地下水水平的预测。一种全球预测方法旨在利用从各个位置的地下水级时序列序列,一次在一个地方或一次在几个地方产生预测。鉴于全球预测方法在著名的竞争中取得了成功,因此在地下水级别的预测上进行评估并查看它们与本地方法的比较是有意义的。在这项工作中,我们创建了一个1026地下水级时序列的数据集。每个时间序列都是由每日测量地下水水平和两个外源变量,降雨和蒸散量制成的。该数据集可向社区提供可重现性和进一步评估。为了确定最佳的配置,可以有效地预测完整的时间序列的地下水水平,我们比较了包括本地和全球时间序列预测方法在内的不同预测因子。我们评估了外源变量的影响。我们的结果分析表明,通过训练过去的地下水位和降雨数据的全球方法获得最佳预测。
translated by 谷歌翻译
A well-performing prediction model is vital for a recommendation system suggesting actions for energy-efficient consumer behavior. However, reliable and accurate predictions depend on informative features and a suitable model design to perform well and robustly across different households and appliances. Moreover, customers' unjustifiably high expectations of accurate predictions may discourage them from using the system in the long term. In this paper, we design a three-step forecasting framework to assess predictability, engineering features, and deep learning architectures to forecast 24 hourly load values. First, our predictability analysis provides a tool for expectation management to cushion customers' anticipations. Second, we design several new weather-, time- and appliance-related parameters for the modeling procedure and test their contribution to the model's prediction performance. Third, we examine six deep learning techniques and compare them to tree- and support vector regression benchmarks. We develop a robust and accurate model for the appliance-level load prediction based on four datasets from four different regions (US, UK, Austria, and Canada) with an equal set of appliances. The empirical results show that cyclical encoding of time features and weather indicators alongside a long-short term memory (LSTM) model offer the optimal performance.
translated by 谷歌翻译
在收获前的作物产量的准确预测对于世界各地的作物物流,市场计划和食物分配至关重要。产量预测需要在延长的时间段内监测物候和气候特征,以模拟农作物发育中涉及的复杂关系。绕过世界各种卫星提供的遥感卫星图像是获取数据预测数据的廉价且可靠的方法。目前,收益率预测的领域由深度学习方法主导。尽管使用这些方法达到的精度是有希望的,但所需的数据量和``Black-Box''性质可以限制深度学习方法的应用。可以通过提出一条管道将遥感图像处理为基于特征的表示形式来克服局限性,该图像允许使用极端梯度提升(XGBoost)进行产量预测。与基于深度学习的最先进的收益率预测系统相比,对美国大豆产量预测的比较评估显示出了有希望的预测准确性。特征重要性将近红外光谱视为我们模型中的重要特征。报告的结果暗示了XGBoost进行产量预测的能力,并鼓励将来对XGBoost进行XGBoost的实验,以对世界各地的其他农作物进行产量预测。
translated by 谷歌翻译
天然气管道中的泄漏检测是石油和天然气行业的一个重要且持续的问题。这尤其重要,因为管道是运输天然气的最常见方法。这项研究旨在研究数据驱动的智能模型使用基本操作参数检测天然气管道的小泄漏的能力,然后使用现有的性能指标比较智能模型。该项目应用观察者设计技术,使用回归分类层次模型来检测天然气管道中的泄漏,其中智能模型充当回归器,并且修改后的逻辑回归模型充当分类器。该项目使用四个星期的管道数据流研究了五个智能模型(梯度提升,决策树,随机森林,支持向量机和人工神经网络)。结果表明,虽然支持向量机和人工神经网络比其他网络更好,但由于其内部复杂性和所使用的数据量,它们并未提供最佳的泄漏检测结果。随机森林和决策树模型是最敏感的,因为它们可以在大约2小时内检测到标称流量的0.1%的泄漏。所有智能模型在测试阶段中具有高可靠性,错误警报率为零。将所有智能模型泄漏检测的平均时间与文献中的实时短暂模型进行了比较。结果表明,智能模型在泄漏检测问题中的表现相对较好。该结果表明,可以与实时瞬态模型一起使用智能模型,以显着改善泄漏检测结果。
translated by 谷歌翻译