Machine learning is transitioning from an art and science into a technology available to every developer. In the near future, every application on every platform will incorporate trained models to encode data-based decisions that would be impossible for developers to author. This presents a significant engineering challenge, since currently data science and modeling are largely decoupled from standard software development processes. This separation makes incorporating machine learning capabilities inside applications unnecessarily costly and difficult, and furthermore discourages developers from embracing ML in the first place. In this paper we present ML.NET, a framework developed at Microsoft over the last decade in response to the challenge of making it easy to ship machine learning models in large software applications. We present its architecture and illuminate the application demands that shaped it. Specifically, we introduce DataView, the core data abstraction of ML.NET, which allows it to capture full predictive pipelines efficiently and consistently across training and inference lifecycles. We close the paper with a surprisingly favorable performance study of ML.NET compared to more recent entrants, and a discussion of some lessons learned.
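ML.NET itself is a C# framework, so what follows is only a language-neutral sketch of the DataView idea under stated assumptions: a lazily evaluated, composable view over rows, where transforms describe the pipeline and nothing executes until rows are pulled, identically at training and inference time. The names here are illustrative, not ML.NET's API.

```python
# Conceptual sketch of a DataView-style lazy pipeline (illustrative only;
# ML.NET's actual DataView abstraction is defined in C#).

class DataView:
    """A lazily evaluated view over rows; transforms compose without executing."""
    def __init__(self, source):
        self._source = source  # callable returning a fresh iterator of dict rows

    def transform(self, fn):
        # Composing builds a new view; no rows are materialized yet.
        return DataView(lambda: (fn(row) for row in self._source()))

    def rows(self):
        # Execution happens only when rows are pulled, the same whether
        # the consumer is a trainer or an inference call.
        return self._source()

raw = DataView(lambda: iter([{"text": "good movie"}, {"text": "bad movie"}]))
featurized = raw.transform(lambda r: {**r, "tokens": r["text"].split()})
for row in featurized.rows():
    print(row)
```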
Under several emerging application scenarios, such as in smart cities, operational monitoring of large infrastructure, wearable assistance, and the Internet of Things, continuous data streams must be processed under very short delays. Several solutions, including multiple software engines, have been developed for processing unbounded data streams in a scalable and efficient manner. More recently, architectures have been proposed to use edge computing for data stream processing. This paper surveys the state of the art on stream processing engines and mechanisms for exploiting resource elasticity features of cloud computing in stream processing. Resource elasticity allows an application or service to scale out/in according to fluctuating demands. Although such features have been extensively investigated for enterprise applications, stream processing poses challenges for achieving elastic systems that can make efficient resource management decisions based on current load. Elasticity becomes even more challenging in highly distributed environments comprising edge and cloud computing resources. This work examines some of these challenges and discusses solutions proposed in the literature to address them.
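To make the scale-out/in decisions discussed above concrete, a minimal threshold-based elasticity policy might look like the sketch below; the thresholds and parameter names are illustrative assumptions, not taken from any surveyed engine.

```python
# Toy threshold-based elasticity policy: scale out when utilization exceeds
# a high watermark, scale in when it falls below a low watermark.

def scaling_decision(input_rate, replicas, per_replica_capacity,
                     high=0.8, low=0.3):
    utilization = input_rate / (replicas * per_replica_capacity)
    if utilization > high:
        return replicas + 1   # scale out
    if utilization < low and replicas > 1:
        return replicas - 1   # scale in
    return replicas

# 900 events/s across 2 replicas of capacity 500 -> 90% utilized -> scale out
print(scaling_decision(input_rate=900.0, replicas=2, per_replica_capacity=500.0))
```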
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms, such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs), requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that is competitive with state-of-the-art, hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as an FPGA-based generic deep learning accelerator. The system is open source and in production use inside several major companies.
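For a flavor of how TVM separates the declared computation from its schedule, a minimal vector-add in the tensor-expression (te) Python API might look like this; the sketch assumes the te API of the TVM 0.x releases, which has shifted in newer versions.

```python
# Minimal TVM sketch: declare a computation, attach a schedule, compile for
# a CPU target. (te API as of TVM ~0.8; details differ in newer releases.)
import numpy as np
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)              # default schedule; tunable per back-end
fadd = tvm.build(s, [A, B, C], "llvm")    # swap the target for other back-ends

dev = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
b = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
c = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
fadd(a, b, c)
```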
The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray, a distributed system to address them. Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed, fault-tolerant store to manage the system's control state. In our experiments, we demonstrate scaling beyond 1.8 million tasks per second and better performance than existing specialized systems for several challenging reinforcement learning applications.
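The unified task/actor interface is easiest to see in code; the snippet below uses Ray's public Python API.

```python
# Ray's unified interface: stateless tasks and stateful actors, both executed
# by the same dynamic execution engine.
import ray

ray.init()

@ray.remote
def square(x):              # a task: scheduled anywhere in the cluster
    return x * x

@ray.remote
class Counter:              # an actor: stateful, method calls run serially
    def __init__(self):
        self.n = 0
    def incr(self):
        self.n += 1
        return self.n

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))                 # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.incr.remote()))   # 1
```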
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
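The paper's hello-world job conveys what the declarative specification looks like; BCL itself is Google-internal, so it is rendered here as a Python dict purely for illustration, with fields mirroring the paper's example.

```python
# A Python rendering of the Borg paper's hello-world job specification
# (the real language, BCL, is Google-internal; field names follow the paper).
job = {
    "name": "hello_world",
    "runtime": {"cell": "ic"},                # which cell (cluster) to run in
    "binary": "../hello_world_webserver",     # program to run
    "args": {"port": "%port%"},               # placeholder filled in by Borg
    "requirements": {"ram": "100M", "disk": "100M", "cpu": 0.1},
    "replicas": 10000,
}
```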
Machine learning workflow development is a process of trial and error: developers iterate on workflows by testing out small modifications until the desired accuracy is achieved. Unfortunately, existing machine learning systems focus narrowly on model training, which accounts for only a small fraction of the overall development time, and neglect to address iterative development. We propose Helix, a machine learning system that optimizes execution across iterations by intelligently caching and reusing, or recomputing, intermediates as appropriate. Helix captures a wide variety of application needs within its Scala DSL, with succinct syntax defining unified processes for data preprocessing, model specification, and learning. We demonstrate that the reuse problem can be cast as a Max-Flow problem, while the caching problem is NP-Hard; we develop effective lightweight heuristics for the latter. Empirical evaluation shows that Helix is not only able to handle a wide variety of use cases within a single unified workflow, but is also much faster, providing run-time reductions of up to 19x over state-of-the-art systems such as DeepDive and KeystoneML on four real-world applications in natural language processing, computer vision, and the social and natural sciences.
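To make the cache/reuse/recompute trade-off concrete, here is a toy version of the per-intermediate decision; note that Helix's actual DSL is Scala and its reuse decision is solved globally via the Max-Flow reduction, so this sketch only captures the local cost intuition.

```python
# Toy reuse-vs-recompute decision for a single intermediate: load it from
# cache when that is cheaper than recomputing it. (Helix decides this
# globally across the whole workflow DAG via a Max-Flow reduction.)

def plan_node(load_cost, compute_cost):
    # load_cost is None when the intermediate was never materialized
    if load_cost is not None and load_cost < compute_cost:
        return "load"
    return "compute"

print(plan_node(load_cost=2.0, compute_cost=10.0))   # load
print(plan_node(load_cost=None, compute_cost=10.0))  # compute
```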
Cloud computing is a recent advancement wherein IT infrastructure and applications are provided as 'services' to end-users under a usage-based payment model. It can leverage virtualized services even on the fly based on requirements (workload patterns and QoS) varying with time. The application services hosted under Cloud computing model have complex provisioning, composition, configuration, and deployment requirements. Evaluating the performance of Cloud provisioning policies, application workload models, and resources performance models in a repeatable manner under varying system and user configurations and requirements is difficult to achieve. To overcome this challenge, we propose CloudSim: an extensible simulation toolkit that enables modeling and simulation of Cloud computing systems and application provisioning environments. The CloudSim toolkit supports both system and behavior modeling of Cloud system components such as data centers, virtual machines (VMs) and resource provisioning policies. It implements generic application provisioning techniques that can be extended with ease and limited effort. Currently, it supports modeling and simulation of Cloud computing environments consisting of both single and inter-networked clouds (federation of clouds). Moreover, it exposes custom interfaces for implementing policies and provisioning techniques for allocation of VMs under inter-networked Cloud computing scenarios. Several researchers from organizations, such as HP Labs in U.S.A., are using CloudSim in their investigation on Cloud resource provisioning and energy-efficient management of data center resources. The usefulness of CloudSim is demonstrated by a case study involving dynamic provisioning of application services in the hybrid federated clouds environment. The result of this case study proves that the federated Cloud computing model significantly improves the application QoS requirements under fluctuating resource and service demand patterns.
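CloudSim is a Java toolkit, so the following is only a conceptual Python sketch of the kind of model it supports: a discrete-event simulation of tasks scheduled onto VMs, with none of CloudSim's actual classes or provisioning policies.

```python
# Toy discrete-event simulation of tasks on VMs, in the spirit of what
# CloudSim models (conceptual only; CloudSim's real API is Java).
import heapq

def simulate(tasks, num_vms):
    """tasks: list of (arrival_time, duration). Returns completion times."""
    vms = [0.0] * num_vms             # next-free time of each VM
    heapq.heapify(vms)
    finish = []
    for arrival, duration in sorted(tasks):
        free_at = heapq.heappop(vms)  # earliest-available VM
        end = max(arrival, free_at) + duration
        heapq.heappush(vms, end)
        finish.append(end)
    return finish

print(simulate([(0, 5), (1, 3), (2, 4)], num_vms=2))  # [5, 4, 8]
```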
The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists as they are able to express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these ML algorithms on distributed, data-parallel frameworks by applying cost-based compilation techniques to generate efficient, low-level execution plans with in-memory single-node and large-scale distributed operations. This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics. We also share lessons learned from porting SystemML to Spark and declarative ML in general. Finally, SystemML is open-source, which allows the database community to leverage it as a testbed for further research.
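The custom algorithms in question are a few lines of linear algebra; a NumPy rendering of batch gradient descent for linear regression, the sort of script a data scientist would write declaratively in DML (SystemML's R-like language) and have compiled to single-node or distributed plans, is sketched below.

```python
# NumPy stand-in for the kind of linear-algebra script SystemML's DML
# expresses declaratively (DML's syntax is R-like; SystemML chooses the
# execution plan, single-node or distributed, automatically).
import numpy as np

def linreg_gd(X, y, lr=0.5, iters=2000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

X = np.random.rand(100, 3)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
print(linreg_gd(X, y))   # approaches w_true
```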
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment. In this paper, we introduce Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks and applications. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluate Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, and throughput demands of online serving applications. Finally, we compare Clipper to the TensorFlow Serving system and demonstrate that we are able to achieve comparable throughput and latency while enabling model composition and online learning to improve accuracy and render more robust predictions.
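The batching technique is easy to sketch: buffer individual requests briefly and hand them to the model as a group, trading a bounded queuing delay for throughput. The code below is an illustrative toy, not Clipper's implementation, which additionally adapts batch sizes to latency objectives.

```python
# Toy dynamic batching in the spirit of Clipper: collect requests into a
# batch and invoke the model once per batch (Clipper also adds caching,
# adaptive batch sizing against latency SLOs, and model selection).
import queue
import threading

requests = queue.Queue()

def model_batch_predict(inputs):          # stand-in for a framework call
    return [x * 2 for x in inputs]

def serving_loop(max_batch=4, timeout_s=0.01):
    while True:
        batch = [requests.get()]          # block until one request arrives
        try:
            while len(batch) < max_batch:
                batch.append(requests.get(timeout=timeout_s))
        except queue.Empty:
            pass                          # flush a partial batch on timeout
        inputs, replies = zip(*batch)
        for out, reply in zip(model_batch_predict(list(inputs)), replies):
            reply.put(out)

threading.Thread(target=serving_loop, daemon=True).start()
reply = queue.Queue()
requests.put((21, reply))
print(reply.get())                        # 42
```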
This report describes 18 projects that explored how commercial cloud computing services can be used for scientific computing at national laboratories. The demonstrations ranged from deploying proprietary software in a cloud environment to leveraging established cloud-based analytics workflows to process scientific datasets. Overall, the projects were highly successful, and together they suggest that cloud computing can be a valuable computational resource for scientific computing at national laboratories.
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
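A small PySpark example shows the mix of declarative and procedural styles described above; it uses the modern SparkSession entry point and assumes a local people.json file with age and city fields.

```python
# Spark SQL: the same data queried through the DataFrame API and through SQL;
# Catalyst optimizes both. Assumes people.json has "age" and "city" fields.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.read.json("people.json")            # schema inferred from JSON
df.filter(df["age"] >= 18).groupBy("city").count().show()

df.createOrReplaceTempView("people")           # expose the same data to SQL
spark.sql(
    "SELECT city, COUNT(*) AS n FROM people WHERE age >= 18 GROUP BY city"
).show()
```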
Advances in machine learning, databases, and hardware design have fueled a data revolution. Programmable accelerators are making their way into each of these areas independently. As such, there is a void of solutions that enable hardware acceleration at the intersection of these disjoint fields. This paper takes the first step towards a unifying solution for in-Database Acceleration of Advanced Analytics (DAnA). Deploying specialized hardware, such as FPGAs, for in-database analytics currently requires hand-designing the hardware and manually routing the data. Instead, DAnA automatically maps a high-level specification of advanced analytics queries to an FPGA accelerator. The accelerator implementation is generated for a User Defined Function (UDF), expressed as part of an SQL query using a Python-embedded Domain-Specific Language (DSL). To realize efficient in-database integration, DAnA accelerators contain a novel hardware structure, Striders, that directly interfaces with the database's buffer pool. Striders extract, cleanse, and process the training data tuples consumed by a multi-threaded FPGA engine that executes the analytics algorithm. We integrate DAnA with PostgreSQL to generate hardware accelerators for a range of real-world and synthetic datasets running diverse ML algorithms. Results show that DAnA-enhanced PostgreSQL provides, on average, an 8.3x end-to-end speedup on real datasets, with a maximum of 28.2x. Moreover, DAnA-enhanced PostgreSQL is, on average, 4.0x faster than multi-threaded Apache MADlib running on Greenplum. DAnA provides these benefits while hiding the complexity of hardware design from data scientists, allowing them to express their algorithms in roughly 30-60 lines of Python.
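The paper's exact Python-embedded DSL is not reproduced here, so the snippet below is a hypothetical illustration of the interface's shape: an update rule written in a few lines of Python and invoked from SQL as a UDF. All names (sgd_update, train_model) are invented for illustration.

```python
# Hypothetical sketch of the shape of DAnA's interface: a learning update
# rule in Python, exposed to SQL as a UDF. Names are invented; in DAnA the
# specification is compiled to an FPGA engine fed by Striders that read
# training tuples directly from the database buffer pool.
import numpy as np

def sgd_update(model, features, label, lr=0.1):
    # one stochastic-gradient step for linear regression
    error = float(np.dot(model, features)) - label
    return model - lr * error * features

# hypothetical SQL invocation registered against PostgreSQL:
query = "SELECT train_model('sgd_update', 'training_data');"

print(sgd_update(np.zeros(2), np.array([1.0, 2.0]), label=3.0))  # [0.3 0.6]
```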
Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text, and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, and more. Competing frameworks for building these networks, such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet, and Theano, explore different tradeoffs between usability and expressiveness, research or production orientation, and supported hardware. They operate on a DAG of computational operators, wrapping high-performance libraries such as CUDNN (for NVIDIA GPUs) or NNPACK (for various CPUs), and automate memory allocation, synchronization, and distribution. Custom operators are needed where the computation does not fit existing high-performance library calls, usually at a high engineering cost; this is frequently required when researchers invent new operators, and such operators suffer a severe performance penalty that limits the pace of innovation. Furthermore, even where an existing runtime call is available, it often does not deliver optimal performance for the user's particular network architecture and dataset, missing optimizations between operators as well as optimizations that can be done knowing the sizes and shapes of the data. Our contributions include (1) a language close to the mathematics of deep learning, called Tensor Comprehensions; (2) a polyhedral just-in-time compiler that converts the mathematical description of a deep learning DAG into CUDA kernels with delegated memory management and synchronization, while providing optimizations such as operator fusion and specialization for specific sizes; and (3) a compilation cache populated by an autotuner. [abstract truncated]
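The language expresses an operator in Einstein-like notation; a matrix-multiply comprehension via the early tensor_comprehensions Python binding looked roughly like this (the binding's API changed between 0.x releases, so treat the Python calls as approximate; the comprehension syntax itself follows the paper).

```python
# A Tensor Comprehension for matrix multiply, JIT-compiled to a CUDA kernel
# by the polyhedral compiler (binding API approximately as in the 0.x
# tensor_comprehensions releases; requires a CUDA-capable GPU).
import tensor_comprehensions as tc
import torch

lang = """
def matmul(float(M, K) A, float(K, N) B) -> (C) {
    C(m, n) +=! A(m, k) * B(k, n)
}
"""
matmul = tc.define(lang, name="matmul")
A = torch.randn(32, 64).cuda()
B = torch.randn(64, 16).cuda()
C = matmul(A, B)   # first call compiles and caches the kernel
```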
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
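The graph-then-execute model is visible in a minimal TF1-style program: build a dataflow graph containing operations and mutable state, then run it in a session. The snippet uses the compat.v1 shim so it runs on current TensorFlow installations.

```python
# TF1-style dataflow: a graph of operations plus mutable shared state
# (tf.Variable) is constructed first, then executed by a session, which can
# map nodes across devices and machines.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.placeholder(tf.float32, shape=[None, 3])
w = tf.Variable(tf.zeros([3, 1]))          # shared, mutable state
y = tf.matmul(x, w)                        # a node in the dataflow graph

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))  # [[0.]]
```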
The World Wide Web has grown to be a primary source of information for millions of people. Due to the size of the Web, search engines have become the major access point for this information. However, "commercial" search engines use hidden algorithms that put the integrity of their results in doubt, collect user data that raises privacy concerns, and target the general public, thus failing to serve the needs of specific search users. Open source search, like open source operating systems, offers alternatives. The goal of the Open Source Information Retrieval Workshop (OSIR) is to bring together practitioners developing open source search technologies in the context of a premier IR research conference to share their recent advances, and to coordinate their strategy and research plans. The intent is to foster community-based development, to promote distribution of transparent Web search tools, and to strengthen the interaction with the research community in IR. A workshop about Open Source Web Information Retrieval was held last year in Compiègne, France as part of WI 2005. The focus of this workshop is broadened to the whole open source information retrieval community. We want to thank all the authors of the submitted papers, the members of the program committee, and the several reviewers whose contributions have resulted in these high quality proceedings. ABSTRACT There has been a resurgence of interest in index maintenance (or incremental indexing) in the academic community in the last three years. Most of this work focuses on how to build indexes as quickly as possible, given the need to run queries during the build process. This work is based on a different set of assumptions than previous work. First, we focus on latency instead of throughput: we aim to reduce index latency (the amount of time between when a new document is available to be indexed and when it is available to be queried) and query latency (the amount of time that an incoming query must wait because of index processing). Additionally, we assume that users are unwilling to tune parameters to make the system more efficient. We show how this set of assumptions has driven the development of the Indri index maintenance strategy, and describe the details of our implementation.
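To ground the latency terminology: index latency is the gap between a document arriving and becoming queryable. A toy in-memory incremental inverted index makes that gap a single dictionary update; this is only an illustration of the concept, not Indri's on-disk strategy.

```python
# Toy incremental inverted index: a document is queryable the moment
# add_document returns, so index latency is one in-memory update.
# (Indri's real strategy coordinates in-memory and on-disk structures.)
from collections import defaultdict

postings = defaultdict(set)   # term -> set of document ids

def add_document(doc_id, text):
    for term in text.lower().split():
        postings[term].add(doc_id)

def query(term):
    return sorted(postings[term.lower()])

add_document(1, "open source search")
add_document(2, "incremental index maintenance")
print(query("search"))   # [1]
```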
Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational "vertices" with communication "channels" to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs. The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources. Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers. The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
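A Dryad vertex is just a sequential program, with concurrency supplied by the runtime; a toy two-vertex pipeline with a queue standing in for a channel conveys the shape (conceptual only; Dryad's real interface is C++ graph composition, and its channels are files, TCP pipes, or shared-memory FIFOs).

```python
# Toy Dryad-style dataflow: two sequential vertices connected by a channel
# and run concurrently (threads here; Dryad schedules vertices across cores
# and machines).
import queue
import threading

channel = queue.Queue()
DONE = object()

def producer_vertex():            # a plain sequential program, no locking
    for i in range(5):
        channel.put(i)
    channel.put(DONE)

def consumer_vertex():
    total = 0
    while (item := channel.get()) is not DONE:
        total += item
    print("sum =", total)         # sum = 10

for vertex in (producer_vertex, consumer_vertex):
    threading.Thread(target=vertex).start()
```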
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
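The paper's motivating log-mining pattern translates directly to PySpark; persist() marks the RDD for in-memory reuse, and a lost partition is rebuilt from its lineage (the filter over the source file) rather than from replicated state. The file name below is illustrative.

```python
# RDDs in PySpark: coarse-grained transformations build a lineage graph;
# persist() keeps the dataset in memory for iterative reuse, and lost
# partitions are recomputed from lineage instead of restored from replicas.
from pyspark import SparkContext

sc = SparkContext("local", "rdd-example")
lines = sc.textFile("server.log")
errors = lines.filter(lambda l: "ERROR" in l).persist()   # cache in memory

print(errors.count())                                     # first action computes
print(errors.filter(lambda l: "timeout" in l).count())    # reuses cached data
```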
Performance unpredictability is a major roadblock to cloud adoption, with performance, cost, and revenue ramifications. Predictable performance becomes even more critical as cloud services shift from monolithic designs to microservices. Detecting QoS violations after they occur in systems with microservices leads to long recovery times, as hotspots propagate and amplify across dependent services. We present Seer, an online cloud performance debugging system that leverages deep learning and the massive amount of tracing data cloud systems collect to learn spatial and temporal patterns that translate to QoS violations. Seer combines lightweight distributed RPC-level tracing with detailed low-level hardware monitoring to signal an upcoming QoS violation and diagnose the source of unpredictable performance. Once an imminent QoS violation is detected, Seer notifies the cluster manager to take action and avoid the performance degradation altogether. We evaluate Seer both on local clusters and on large-scale deployments of end-to-end applications built with microservices, with hundreds of users. We show that Seer correctly anticipates QoS violations 91% of the time and avoids the QoS violation altogether in 84% of cases. Finally, we show that Seer can identify application-level design bugs and provide insights on how to better architect microservices to achieve predictable performance.
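Seer's models are deep networks over RPC-level traces; purely to make "spatial and temporal patterns" concrete, the toy below flattens a sliding window of per-service latencies (services by time steps) into features for an off-the-shelf classifier. This is a conceptual stand-in on synthetic data, not Seer's architecture.

```python
# Conceptual stand-in for Seer's learning task: predict an upcoming QoS
# violation from a window of per-service latencies, where "spatial" is the
# service dimension and "temporal" the time dimension. (Seer uses deep
# networks over real RPC traces; this uses synthetic data and a linear model.)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_windows, n_services, n_steps = 200, 5, 10

X = rng.normal(10.0, 1.0, size=(n_windows, n_services, n_steps))
labels = X[:, 2, -3:].mean(axis=1) > 11.0   # toy ground truth: service 2 heats up
X_flat = X.reshape(n_windows, -1)           # flatten the spatial x temporal window

clf = LogisticRegression(max_iter=1000).fit(X_flat, labels)
print(clf.score(X_flat, labels))            # training accuracy on the toy data
```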
While the machine learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the "learning systems" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant system that provides both model repository and serving capabilities, with a focus on science applications. DLHub addresses two significant shortcomings in current systems. First, its self-service model repository allows users to share, publish, verify, reproduce, and reuse models, and addresses concerns related to model reproducibility by packaging and distributing models together with all of their constituent components. Second, it implements scalable and low-latency serving capabilities that can leverage parallel and distributed computing resources to democratize access to published models through a simple web interface. Unlike other model serving frameworks, DLHub can store and serve any Python 3-compatible model or processing function, as well as multi-function pipelines. We show that, relative to other model serving systems such as TensorFlow Serving, SageMaker, and Clipper, DLHub provides greater capabilities, comparable performance in the absence of memoization and batching, and significantly better performance when the latter two techniques can be employed. We also describe early uses of DLHub for scientific applications.
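The serving half of this design boils down to wrapping an arbitrary Python 3 function with an HTTP endpoint; a minimal Flask sketch of that pattern follows. This is conceptual; DLHub's actual SDK, packaging, and multi-tenant serving infrastructure are not shown.

```python
# Minimal sketch of the "serve any Python 3 function" pattern that DLHub
# generalizes (conceptual only; not DLHub's SDK or REST API).
from flask import Flask, jsonify, request

app = Flask(__name__)

def servable(inputs):                     # any Python 3-compatible function
    return [x ** 2 for x in inputs]

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    return jsonify({"outputs": servable(data["inputs"])})

if __name__ == "__main__":
    app.run(port=8000)
# e.g.: curl -X POST localhost:8000/predict -H 'Content-Type: application/json' \
#            -d '{"inputs": [1, 2, 3]}'   ->   {"outputs": [1, 4, 9]}
```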