
Does Deep Learning REALLY Outperform Non-deep Machine Learning for Clinical Prediction on Physiological Time Series?

Ke Liao, Wei Wang, Armagan Elibol, Lingzhong Meng, Xu Zhao, Nak Young Chong

Categories: Machine Learning | Artificial Intelligence

2022-11-11

Machine learning has been widely used in healthcare applications to approximate complex models for clinical diagnosis, prognosis, and treatment. Although deep learning has an outstanding ability to extract information from time series, its true capabilities on sparse, irregularly sampled, multivariate, and imbalanced physiological data are not yet fully explored. In this paper, we systematically examine the performance of machine learning models on a clinical prediction task based on EHR data, especially physiological time series. We use the public PhysioNet 2019 Challenge dataset to predict sepsis outcomes in the ICU. Ten baseline machine learning models commonly used in the clinical prediction domain are compared: 3 deep learning methods and 7 non-deep learning methods. Nine evaluation metrics with specific clinical implications are used to assess model performance. In addition, we sub-sample the training dataset and fit learning curves to investigate the impact of training dataset size on model performance. We also propose a general pre-processing method for physiological time-series data and use Dice loss to deal with the dataset imbalance problem. The results show that deep learning does outperform non-deep learning, but only under certain conditions: first, when evaluated with particular metrics (AUROC, AUPRC, sensitivity, and FNR), but not others; second, when the training dataset is large enough (estimated to be on the order of thousands).
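As an illustration of the Dice loss mentioned in the abstract, here is a minimal sketch of a soft Dice loss for binary labels in Python with NumPy. The abstract does not give the exact formulation, so the function name `soft_dice_loss`, the smoothing term `eps`, and the probability-based ("soft") form are assumptions made for this example.

```python
import numpy as np

def soft_dice_loss(probs, targets, eps=1.0):
    """Soft Dice loss: 1 - (2*sum(p*y) + eps) / (sum(p) + sum(y) + eps).

    probs   -- predicted probabilities of the positive class, in [0, 1]
    targets -- binary ground-truth labels (0 or 1)
    eps     -- smoothing term (an assumption) that avoids division by zero
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Overlap between predicted positives and true positives.
    intersection = np.sum(probs * targets)
    dice = (2.0 * intersection + eps) / (np.sum(probs) + np.sum(targets) + eps)
    return 1.0 - dice

# Hypothetical imbalanced batch: one positive among 100 labels.
labels = np.zeros(100)
labels[0] = 1.0
print(soft_dice_loss(np.zeros(100), labels))  # misses the only positive
print(soft_dice_loss(labels, labels))         # perfect prediction
```

In this form, correctly predicted negatives contribute nothing to either the numerator or the denominator, so a model cannot lower the loss simply by predicting the majority class.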

Translated by Google Translate

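Scientists increasingly rely on Python tools to perform scalable distributed-memory array operations using rich, NumPy-like expressions. However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often run into memory and network bandwidth bottlenecks caused by sub-optimal data and operator placement decisions. Tools built on the Message Passing Interface (MPI), such as ScaLAPACK and SLATE, have better scaling properties, but these solutions require specialized knowledge to use. In this work, we present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). LSHS is a local search method that optimizes operator placement by minimizing the maximum memory and network load on any given node in a distributed system. Coupled with a heuristic for load-balanced data layouts, our approach is able to attain communication lower bounds on some common numerical operations. Our empirical study shows that LSHS enhances performance on Ray by decreasing network load by a factor of 2, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem. On terabyte-scale data, NumS achieves competitive performance on DGEMM, up to a 20x speedup over Dask on a key operation for tensor factorization, and a 2x speedup on logistic regression compared with Dask ML and Spark's MLlib.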
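The idea of simulating load before placing an operator can be illustrated with a toy greedy sketch. This is not the NumS implementation of LSHS: the function name `place_operations`, the `(output_size, input_node, input_size)` tuple format, and the cost (peak memory plus peak network load) are all assumptions chosen to keep the example short.

```python
def place_operations(ops, num_nodes):
    """Greedily place each operation on the node that minimizes simulated peak load.

    ops -- list of (output_size, input_node, input_size) tuples (hypothetical format):
           the operation reads input_size units stored on input_node and
           produces output_size units on whichever node it runs on.
    Returns (placement, mem_load, net_load).
    """
    mem = [0.0] * num_nodes  # memory load per node
    net = [0.0] * num_nodes  # network load per node (bytes sent plus received)
    placement = []
    for output_size, input_node, input_size in ops:
        best_node, best_cost = 0, float("inf")
        for node in range(num_nodes):
            # Simulate the load that would result from running on `node`.
            sim_mem, sim_net = mem.copy(), net.copy()
            sim_mem[node] += output_size
            if node != input_node:
                sim_net[node] += input_size        # receiver
                sim_net[input_node] += input_size  # sender
            cost = max(sim_mem) + max(sim_net)     # assumed objective
            if cost < best_cost:
                best_node, best_cost = node, cost
        # Commit the best simulated option.
        mem[best_node] += output_size
        if best_node != input_node:
            net[best_node] += input_size
            net[input_node] += input_size
        placement.append(best_node)
    return placement, mem, net

# Inputs alternate between two nodes; each op stays next to its data.
ops = [(1.0, i % 2, 1.0) for i in range(4)]
print(place_operations(ops, num_nodes=2))
```

When every input lives on one node and outputs are large, the same rule starts moving operations away from that node, trading a little network traffic for a lower memory peak.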


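In this study, we propose a hybrid estimation of distribution algorithm (HEDA) to solve the joint stratification and sample allocation problem. This is a complex problem in which the quality of each stratification, out of the set of all possible stratifications, is measured by its optimal sample allocation. Estimation of distribution algorithms (EDAs) are stochastic black-box optimization algorithms that estimate, build, and sample probability models in the search for an optimal stratification. In this paper, we enhance the exploitation properties of the EDA by adding a simulated annealing algorithm, making it a hybrid EDA. Empirical comparisons on atomic and continuous strata show that the HEDA attains the best results when compared with benchmark tests on the same data using a grouping genetic algorithm, a simulated annealing algorithm, or a hill-climbing algorithm. However, the execution times and total execution are higher for the HEDA.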
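The estimate, build, and sample loop of an EDA can be sketched on a toy grouping problem. This is a simplified univariate EDA, not the authors' HEDA: the simulated annealing refinement is omitted, and the cost used here (within-group sum of squared deviations) is a stand-in for the sample allocation cost. All function and parameter names are hypothetical.

```python
import random

def within_group_sse(values, assign, k):
    """Toy cost: sum of squared deviations from each group's mean."""
    cost = 0.0
    for g in range(k):
        members = [x for x, a in zip(values, assign) if a == g]
        if members:
            mean = sum(members) / len(members)
            cost += sum((x - mean) ** 2 for x in members)
    return cost

def eda_partition(values, k, pop_size=60, elite=15, generations=40, seed=0):
    """Univariate EDA: one categorical distribution over groups per item."""
    rng = random.Random(seed)
    m = len(values)
    probs = [[1.0 / k] * k for _ in range(m)]  # start from a uniform model
    best_assign, best_cost = None, float("inf")
    for _ in range(generations):
        # Sample a population of assignments from the current model.
        population = [
            [rng.choices(range(k), weights=probs[i])[0] for i in range(m)]
            for _ in range(pop_size)
        ]
        population.sort(key=lambda a: within_group_sse(values, a, k))
        top_cost = within_group_sse(values, population[0], k)
        if top_cost < best_cost:
            best_assign, best_cost = population[0][:], top_cost
        # Re-estimate the model from the elite samples (with add-one smoothing).
        for i in range(m):
            counts = [1.0] * k
            for a in population[:elite]:
                counts[a[i]] += 1.0
            total = sum(counts)
            probs[i] = [c / total for c in counts]
    return best_assign, best_cost

values = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(eda_partition(values, k=2))
```

A hybrid variant would apply a local search such as simulated annealing to the sampled or best solutions in each generation before the model is re-estimated.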


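This study combines simulated annealing with delta evaluation to solve the joint stratification and sample allocation problem. In this problem, atomic strata are partitioned into mutually exclusive and collectively exhaustive strata. Each partition of the atomic strata is a possible solution to the stratification problem, and its quality is measured by its cost. The number of possible solutions, given by the Bell number, is enormous even for a moderate number of atomic strata, and the evaluation time of each solution adds a further layer of complexity. Many larger combinatorial optimization problems cannot be solved to optimality, because the search for an optimal solution requires a prohibitive amount of computation time. A number of local search heuristic algorithms have been designed for this problem, but these can become trapped in local minima, which prevents further improvement. We add to the existing suite of local search algorithms a simulated annealing algorithm that allows escape from local minima and uses delta evaluation to exploit the similarity between consecutive solutions, thereby reducing evaluation time. We compared the simulated annealing algorithm with two recent algorithms. In both cases, the simulated annealing algorithm attained solutions of comparable quality in considerably less computation time.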
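Delta evaluation can be illustrated with a small sketch of simulated annealing over group assignments. The cost here is a toy stand-in (within-group sum of squared deviations), not the sample allocation cost from the study, and the names `anneal_partition`, `group_sse`, and `full_cost`, along with the cooling schedule, are assumptions. The point is that moving one item between two groups changes only those two groups, so the cost change is computed from running totals instead of re-evaluating the whole partition.

```python
import math
import random

def group_sse(n, s, q):
    """SSE of a group from its count n, sum s, and sum of squares q."""
    return q - s * s / n if n > 0 else 0.0

def full_cost(values, assign, k):
    """Full re-evaluation, used only to initialise and to check the delta updates."""
    n, s, q = [0] * k, [0.0] * k, [0.0] * k
    for x, g in zip(values, assign):
        n[g] += 1; s[g] += x; q[g] += x * x
    return sum(group_sse(n[g], s[g], q[g]) for g in range(k))

def anneal_partition(values, k, steps=5000, t0=1.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in values]
    n, s, q = [0] * k, [0.0] * k, [0.0] * k
    for x, g in zip(values, assign):
        n[g] += 1; s[g] += x; q[g] += x * x
    cost = full_cost(values, assign, k)
    best_assign, best_cost = assign.copy(), cost
    t = t0
    for _ in range(steps):
        i, b = rng.randrange(len(values)), rng.randrange(k)
        a, x = assign[i], values[i]
        if a != b:
            # Delta evaluation: only groups a and b are affected by the move.
            old = group_sse(n[a], s[a], q[a]) + group_sse(n[b], s[b], q[b])
            new = (group_sse(n[a] - 1, s[a] - x, q[a] - x * x)
                   + group_sse(n[b] + 1, s[b] + x, q[b] + x * x))
            delta = new - old
            # Accept improvements always, worse moves with probability exp(-delta/t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                n[a] -= 1; s[a] -= x; q[a] -= x * x
                n[b] += 1; s[b] += x; q[b] += x * x
                assign[i] = b
                cost += delta
                if cost < best_cost:
                    best_assign, best_cost = assign.copy(), cost
        t *= cooling  # geometric cooling (an assumption)
    return best_assign, best_cost

values = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(anneal_partition(values, k=2))
```

Accepting some cost-increasing moves while the temperature is high is what lets the search leave a local minimum; as the temperature falls, the procedure behaves more and more like plain hill climbing.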
